# Search a file with findstr



## Squashman (Apr 4, 2003)

My next order of business is to do a profanity search. I have a file with a list of profanity words in it, one word per line. I need to search a data file and output the lines that match any of the words in my profanity file to one file, and the lines that don't match to another file. I think I could do that with the For, Type, Echo and Findstr commands: get the errorlevel of findstr and echo the line to the appropriate output file. Something like

for /f %%a in (type somefile.txt) do echo %%a | findstr /g:profanitywords.txt | if errorlevel==0 echo %%a >> profanity_output.txt else echo %%a>>not_profanity.txt

Something like that. I know the syntax isn't correct. Just trying to convey what I am thinking in a more technical manner.


----------



## TheOutcaste (Aug 8, 2007)

Try this:

```
@echo off
Set _WL=Profanitywords.txt
Set _OF1=profanity_output.txt
Set _OF2=Not_Profanity.txt
Set _SF=file 1.txt
If EXIST %_OF1% del %_OF1%
If EXIST %_OF2% del %_OF2%
For /f "usebackq tokens=*" %%a In ("%_SF%") Do (
   echo.%%a|findstr /g:"%_WL%">>%_OF1%
   echo.%%a|findstr /v /g:"%_WL%">>%_OF2%)
For %%A In (WL OF1 OF2 SF) Do Set _%%A=
```
Jerry


----------



## TheOutcaste (Aug 8, 2007)

This one is much faster, as it only does the findstr once per line


```
@echo off
Echo.%time%>time1.txt
setlocal enabledelayedexpansion
Set _WL=search.txt
Set _OF0=Profanity.txt
Set _OF1=Not_Profanity.txt
Set _SF=file 1.txt
If EXIST %_OF0% del %_OF0%
If EXIST %_OF1% del %_OF1%
For /f "usebackq tokens=*" %%a In ("%_SF%") Do (
  rem Quoted assignment so no trailing space ends up in _T0
  Set "_T0=%%a" & echo.%%a|findstr /g:"%_WL%">nul & call :_Output!errorlevel!
  )
Echo.%time%>>time1.txt
Goto:EOF
:_Output0
echo.!_T0!>>%_OF0%
Goto:EOF
:_Output1
echo.!_T0!>>%_OF1%
Goto:EOF
```
But your original idea using *if then else* is the fastest:

```
@echo off
setlocal enabledelayedexpansion
Set _WL=search.txt
Set _OF0=Profanity.txt
Set _OF1=Not_Profanity.txt
Set _SF=file 1.txt
If EXIST %_OF0% del %_OF0%
If EXIST %_OF1% del %_OF1%
For /f "usebackq tokens=*" %%a In ("%_SF%") Do (
  echo.%%a|findstr /g:"%_WL%">nul
  If !errorlevel!==0 (echo.%%a>>%_OF0%) else echo.%%a>>%_OF1%
  )
```
On a relative scale, the run times for the 1st, 2nd, and 3rd versions are 1.000, 0.556, and 0.500.

Not too surprising that the first one takes about twice as long, as it runs findstr twice per line.
Using the /v switch on findstr doesn't seem to make any difference to the timing.

Jerry


----------



## Squashman (Apr 4, 2003)

I think I should probably throw in a /I so it is not case sensitive.


----------



## Squashman (Apr 4, 2003)

And I forgot to say Thanks again!!!!


----------



## TheOutcaste (Aug 8, 2007)

Squashman said:


> I think I should probably throw in a /I so it is not case sensitive.


Good catch! Seems I did forget to do that.

And You're Welcome!

BTW, I did finally convert the other batch to an exe; didn't make a noticeable difference in speed, still 35 minutes. It was a freeware converter - a paid version might make a difference, but probably won't come close to beating the *nix utilities.


----------



## Squashman (Apr 4, 2003)

I found a really nice bat2exe that is free. I took the batch file I wrote and it let me compile it with all my ported unix utilities. Glad you mentioned that. Makes things much easier.

Now, I just need to see how long it will take to run your profanity suppression on something like 1 million lines.


----------



## devil_himself (Apr 7, 2007)

you can also do it like this

```
for /f %%a in (myfile.txt) do (find /i  "%%a" "Profanitywords.txt">nul)&& echo %%a>>pr.txt || echo %%a>>npr.txt
```


----------



## Squashman (Apr 4, 2003)

devil_himself said:


> you can also do it like this
> 
> ```
> for /f %%a in (myfile.txt) do (find /i  "%%a" "Profanitywords.txt">nul)&& echo %%a>>pr.txt || echo %%a>>npr.txt
> ```


Thanks, I will test that out when I get back to work. Need to mock up a file with a couple million records to see what happens.


----------



## Squashman (Apr 4, 2003)

Trying to use the ConGetFile Utility to prompt me for a filename. Can't get it to work.


```
@echo off
setlocal enabledelayedexpansion
Set _WL=search.txt
Set _OF0=Profanity.txt
Set _OF1=Not_Profanity.txt
If EXIST %_OF0% del %_OF0%
If EXIST %_OF1% del %_OF1%
For /f "usebackq tokens=*" %%a In ('ConGetFile') Do (
  set InFile=%%a
  echo.%InFile%|findstr /I /g:"%_WL%">nul
  If !errorlevel!==0 (echo.%%a>>%_OF0%) else echo.%%a>>%_OF1%
  )
```


----------



## devil_himself (Apr 7, 2007)

```
@for /f "tokens=*" %%a in ('ConGetFile') do set file=%%a
```


----------



## devil_himself (Apr 7, 2007)

oops .. in full code


```
@echo off
setlocal enabledelayedexpansion
Set _WL=search.txt
Set _OF0=Profanity.txt
Set _OF1=Not_Profanity.txt
If EXIST %_OF0% del %_OF0%
If EXIST %_OF1% del %_OF1%
for /f "tokens=*" %%a in ('ConGetFile') do (
  for /f "usebackq tokens=*" %%b in ("%%a") do (  
  echo.%%b|findstr /I /g:"%_WL%">nul
  If !errorlevel!==0 (echo.%%b>>%_OF0%) else echo.%%b>>%_OF1%
 )
)
```


----------



## Squashman (Apr 4, 2003)

Never mind. I got it.


```
@echo off
setlocal enabledelayedexpansion
Set _WL=search.txt
Set _OF0=Profanity.txt
Set _OF1=Not_Profanity.txt
If EXIST %_OF0% del %_OF0%
If EXIST %_OF1% del %_OF1%
FOR /F "tokens=*" %%A in ('ConGetFile.exe') Do Set _Input=%%A
For /f "usebackq tokens=*" %%a In ("%_Input%") Do (
  echo.%%a|findstr /I /g:"%_WL%">nul
  If !errorlevel!==0 (echo.%%a>>%_OF0%) else echo.%%a>>%_OF1%
  )
```


----------



## TheOutcaste (Aug 8, 2007)

You've already figured it out, but just to clarify: when using the *usebackq* option, you have to put an accent grave (*`*, usually under the tilde ~) around the command instead of the apostrophe/single quote (*'*).

So, instead of this:

``For /f "usebackq tokens=*" %%a In ('ConGetFile') Do``

you have to use this:

``For /f "usebackq tokens=*" %%a In (`ConGetFile`) Do``

But the only time you need the *usebackq* option is when the IN part of the For command contains a file name in quotes.
Some examples:

``For /f "tokens=*" %a In (copy of search.txt) Do @echo Variable=%a``

gives an error:

The system cannot find the file copy.

If you put quotes around the file name:

``For /f "tokens=*" %a In ("copy of search.txt") Do @echo Variable=%a``

the quoted text is treated as a string rather than a file name, so the output is:

Variable=copy of search.txt

If you use this:

``For /f "usebackq tokens=*" %a In ("copy of search.txt") Do @echo Variable=%a``

the output will be the contents of the file "copy of search.txt":

Variable=This is line one
Variable=This is line two

Jerry


----------



## Squashman (Apr 4, 2003)

Thanks again for all the great input!


----------



## Squashman (Apr 4, 2003)

Another stupid question.
echo.%%a>>%_OF0%

I thought echo. sends a blank line to output. Why do we need the period?


----------



## devil_himself (Apr 7, 2007)

I use the period to avoid the potential "ECHO is off." message, which is what echo prints when the variable after it is empty.

check out these two examples

With Period

```
@echo off
  setlocal
  set var=This is a test
  echo.%var%
  set var=
  echo.%var%
  echo Test Over
  endlocal
```
Without Period


```
@echo off
  setlocal
  set var=This is a test
  echo %var%
  set var=
  rem var is empty here, so this line prints ECHO is off.
  echo %var%
  echo Test Over
  endlocal
```


----------



## Squashman (Apr 4, 2003)

TheOutcaste said:


> This one is much faster, as it only does the findstr once per line
> 
> 
> ```
> ...


Well, I finally got around to testing it on a rather large file. It has over 2 million records in it. The batch file immediately chokes on it. I am not sure why.



> C:\8343>profanity.bat
> Not enough storage is available to process this command.
> Out of memory.
> C:\8343>


Now I know this will run if I use grep. I had originally used a simple one line of code using grep, but it would only output the records that had profanity in them. I would then run grep with a reverse search to give me the non-profanity records. That is what brought me to start this thread: was there a way to do this natively with a batch file? Apparently it can't handle the large file size or something. My grep batch file that I tested with did over 10 million records, each record being 900 bytes long.

I got the same error using Devil's code as well.

So now I am back to square one again.


----------



## TheOutcaste (Aug 8, 2007)

Yeah, I don't think the command prompt was really designed to handle large files. I just did a test with a file that was only 411094 KB in size (a list of addresses) using just two search strings. Watching memory usage for cmd.exe in Task Manager, it jumped to 827 MB while the batch file was running, and it paused for quite a bit before it actually did the echo|findstr part. I'm guessing it's loading the entire file into memory to work on, so your 9GB+ file would choke it (unless you have, say, 32 or 64 GB of RAM on your system).

Running it now on a 51387 KB file, and memory usage is 106804 KB. Seems to be about 2x the file size plus roughly 4 MB (2 x 51387 KB = 102774 KB). When I first open a prompt, it's using about 2480 KB.

And after running on this 51387KB file for an hour, I'm guessing it will take another 22 hours to finish.
That's using this version:

```
@echo off
Echo.%time%>time.txt
Set _WL=search1.txt
Set _OF1=Profanity.txt
Set _OF2=Not_Profanity.txt
Set _SF=big 1.txt
If EXIST %_OF1% del %_OF1%
If EXIST %_OF2% del %_OF2%
For /f "usebackq tokens=*" %%a In ("%_SF%") Do (
  echo.%%a|findstr /i /g:"%_WL%">>%_OF1%
  echo.%%a|findstr /i /v /g:"%_WL%" >>%_OF2%)
For %%A In (WL OF1 OF2 SF) Do Set _%%A=
Echo.%time%>>time.txt
```
The version using the *IF THEN ELSE* format shows the same memory usage.

Surprisingly, just using 2 findstr statements on the whole file took 45.5 seconds for the 411094 KB file, and memory usage for cmd.exe peaked at 2776 KB. Increasing the search strings to 10 didn't make much difference.
However, findstr.exe itself peaked at about 411000 KB. I then created a 4,110,938 KB file, and findstr.exe peaked at 2288 KB, so it seems it will use RAM if available; if not, it uses a buffer and reads in chunks.

But it seems to have crashed. CPU usage was about 20-25% (40-50% of one core), then usage hit 49% (98% on one core) and while Task Manager still shows I/O reads, I/O writes have stopped, and the output file is not growing. Wasn't paying attention to see when this happened though, checked it about 10 minutes after I started it.
Hit CTRL+C, got prompted to Terminate, chose NO, and it started on the 2nd portion, so it seems it hung after finishing the first findstr statement.
2nd portion finished after about 2-3 minutes, and hung again. CTRL+C, said no, and the batch finished. Not sure why.
Ran findstr with a junk search string just to count the number of records in one of the result files (28,799,994) and it hung when it finished. Had to CTRL+C to get the prompt back. Seems findstr doesn't like large files, so that probably won't be a solution for you either, unless someone has an idea how to stop it from hanging -- course that may be something quirky with my system as well, as it does have issues with long running processes, so you might want to test it on yours.

Much Much faster than a For loop though.

So, try this:

```
@echo off
Echo.%time%>time4.txt
Set _WL=search1.txt
Set _OF1=Profanity.txt
Set _OF2=Not_Profanity.txt
Set _SF=big3.txt
If EXIST %_OF1% del %_OF1%
If EXIST %_OF2% del %_OF2%
findstr /i /g:"%_WL%" "%_SF%">>%_OF1%
findstr /i /v /g:"%_WL%" "%_SF%">>%_OF2%
For %%A In (WL OF1 OF2 SF) Do Set _%%A=
Echo.%time%>>time4.txt
```
Jerry


----------



## Squashman (Apr 4, 2003)

That was much faster. Kind of sad that we have to pass the data twice though. Your new script took:
23:46:49.24
23:50:18.27
That was on a 1.88GB file with 2.5 million records.
Weird thing is that I can't figure out why some of the records ended up in my Profanity output file.

I may have found a work around using SED though. Will test it to see if it is faster than yours.

Why do you have that for statement in your last set of code?
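
On passing the data twice: a single pass is possible when the matching is done outside findstr. Below is a minimal Python sketch, not part of the original scripts, assuming every entry in the word list is a literal string matched case-insensitively anywhere in the line (roughly what `findstr /i /g:` is used for in this thread). The function names `build_matcher`, `classify`, and `split_file` are illustrative, and `latin-1` is only an assumed encoding that accepts any byte value.

```python
import re


def build_matcher(words):
    """Compile one case-insensitive pattern matching any word as a substring."""
    cleaned = [w.strip() for w in words if w.strip()]  # drop blank entries
    if not cleaned:
        raise ValueError("word list is empty")
    return re.compile("|".join(re.escape(w) for w in cleaned), re.IGNORECASE)


def classify(lines, matcher):
    """Yield (matched, line) for each line; the input is consumed only once."""
    for line in lines:
        yield bool(matcher.search(line)), line


def split_file(source, word_file, hit_path, miss_path, encoding="latin-1"):
    """Stream source once: matching lines go to hit_path, the rest to miss_path."""
    with open(word_file, encoding=encoding) as wf:
        matcher = build_matcher(wf)
    with open(source, encoding=encoding) as src, \
         open(hit_path, "w", encoding=encoding) as hit, \
         open(miss_path, "w", encoding=encoding) as miss:
        for matched, line in classify(src, matcher):
            (hit if matched else miss).write(line)
```

A call such as `split_file("big3.txt", "search1.txt", "Profanity.txt", "Not_Profanity.txt")` would fill both output files in one read of the input. Because the file is read one line at a time, memory use should not grow with the file size.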


----------



## Squashman (Apr 4, 2003)

Well the SED script isn't working much better.
5 minutes in and it has only processed about 400MB of the file. 

I think I will test it using 2 GREPS.


----------



## Squashman (Apr 4, 2003)

I ran it with a double grep and the output record counts were the same, which is a good thing.
Grep was a bit faster.
0:59:29.08
1:02:35.54
Then Findstr.
1:06:49.53
1:10:19.66

The weird thing about both was that I can't figure out why they chose some records. There are plenty of records that I am looking at that don't seem to have any profanity in them. I am not sure what they are matching.

My SED script took almost 24 minutes, and for some reason the match output came out looking funny. I am not sure what happened, but sed tacked on an extra linefeed at the end. It didn't do that on my smaller test files, but for some reason it did this time.


----------



## TheOutcaste (Aug 8, 2007)

About 11% faster. *nix wins again
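
The percentage can be checked from the two pairs of timestamps posted above. This is a small Python sketch, assuming `%time%` prints as H:MM:SS.cc with a period as the decimal separator (other locales may differ); `elapsed_seconds` is an illustrative helper name.

```python
from datetime import datetime, timedelta


def elapsed_seconds(start, end):
    """Seconds between two %time%-style stamps, allowing one midnight wrap."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)  # the run crossed midnight
    return delta.total_seconds()


grep_s = elapsed_seconds("0:59:29.08", "1:02:35.54")
findstr_s = elapsed_seconds("1:06:49.53", "1:10:19.66")
saving = (findstr_s - grep_s) / findstr_s
print(f"grep {grep_s:.2f}s, findstr {findstr_s:.2f}s, grep faster by {saving:.1%}")
```

With the times posted above this works out to roughly 186 seconds for grep against 210 seconds for findstr, a little over 11%.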


Squashman said:


> Why do you have that for statement in your last set of code?


*For %%A In (WL OF1 OF2 SF) Do Set _%%A=*

This is just to clean up the variables used. Equivalent to 4 separate Set statements:
*Set _WL=*
*Set _OF1=*
*Set _OF2=*
*Set _SF=*

Just a habit in case I may want to use the value in a variable in a different batch file. I'd just have to remove that one variable name from the For loop.
This also lets me comment out the For statement and then be able to use set after the batch ends to look at the variable values. You can use setlocal to do the same, but changes to a variable after the setlocal statement are not saved.

Another variation is to use numbered variables like _t*X*.
Then you can use *For /L %%A in (0,1,5) do Set _t%%A=* to clear them (the order is start, step, end), just adjusting the start and end numbers.

I'm guessing you have more than 1.88GB of RAM. Still wonder if findstr will hang if you test a file that is larger than your physical RAM.

As for why it would match some records with no obvious words, you might try putting some of the records it found in a file by themselves and running the findstr again just to see if they match again. This will verify that it's not some weird glitch (not likely).
I'd also look closely for typos. For example, if you transpose the *f* and *t* in *shifting*, it would be flagged but might be real hard to spot by eye.
Or put those records in a word processor and search for the words in your list. That will help find matches embedded in the middle of words.
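
One way to narrow down what a puzzling record matched is to report the search word responsible and where it sits in the line. Here is a short Python sketch along those lines, assuming plain case-insensitive substring matching; `find_hits` and the sample words are made up for illustration.

```python
def find_hits(line, words):
    """Return (word, position) for every list word found in line, ignoring case."""
    lowered = line.lower()
    hits = []
    for word in words:
        needle = word.strip().lower()
        if not needle:
            continue  # skip blank entries in the word list
        pos = lowered.find(needle)  # first occurrence only
        if pos != -1:
            hits.append((word.strip(), pos))
    return hits


print(find_hits("Michelle works at Shell", ["hell", "damn"]))
# → [('hell', 3)]  (the first match is inside "Michelle")
```

Running the suspicious records through something like this shows whether the match is a real hit, a typo, or a word embedded inside a longer one.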

To avoid matching on embedded strings, add a space before or after the search word in the search.txt file (or search1.txt, just noticed I changed the name somewhere along the line).
{space}profanity
profanity{space}
Can't say I can think of any "proper" words that have an embedded profanity though.

The problem with this is that the first won't match if it's the first word on a line, and the second won't match if it's the last, so some words might be missed. Might be able to use regular expressions with findstr to work around that though.
I've not played with findstr and its regular expression features much, as I haven't found much documentation on it yet. My quick test with a 51MB file shows that just adding the /R switch makes the search take about 1.5 times as long. Adding checks for beginning and end of line, or beginning and end of word, ups that to nearly 6 times as long: 3.6 seconds with a normal search, 5.1 seconds with /R, 18 seconds with checking beginning of line. If you have multiple searches specifying beginning or end positions I suspect it will take even longer.
{space}profanity
profanity{space}
^profanity (to catch at start of line)
profanity$ (to catch at end of line)

If you need to do something like that, might be better to search the Not_Profanity.txt file for just the beginning/ending matches. Would only be a help if the Not_profanity file is much smaller than the original though.
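
For comparison, regex engines with a word-boundary anchor can express the space-before, space-after, start-of-line, and end-of-line cases as one pattern. This is a Python sketch, assuming each list entry begins and ends with a letter or digit so that `\b` behaves as expected; `whole_word_pattern` is an illustrative name, and the syntax is Python's `re`, not findstr's.

```python
import re


def whole_word_pattern(words):
    """Match any listed word only where it stands alone as a whole word."""
    alternatives = "|".join(re.escape(w) for w in words)
    # \b matches between a word character and a non-word character,
    # or at the very start or end of the line.
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


pattern = whole_word_pattern(["hell"])
for sample in ["hell froze over", "go to Hell", "Shell station", "Michelle"]:
    print(sample, "->", bool(pattern.search(sample)))
```

Here the word on its own matches at the start or end of a line, while "Shell" and "Michelle" do not, so this only helps when embedded matches are not wanted.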

Jerry


----------



## Squashman (Apr 4, 2003)

512MB of RAM on a Pentium 4, 2.4GHz.

We kind of need to match on embedded strings. Some of our clients' data is really bad because they have a web sign-up form. So I need to make sure I match things like
C
U
N
T
L
Y
Peniston.

Had to get around the forum filters to post that.


----------



## TheOutcaste (Aug 8, 2007)

Figured it was something like that.

Strangely though, the filters still caught it on the email notice, it came through as ****ly. Unless you edited it quick enough; edits made in the first minute or two don't always flag the post as having been edited.

It's obviously matching on something, just a matter of spotting it to see if it's valid, or something you need to work around.

If you can zip your search file and a few of the records it's found that don't appear obvious I'd be happy to look at them. Fresh pair of eyes might help.

Assuming of course that the records aren't confidential. I can PM an email addy if it's too large to post, or you don't want it public.

Jerry


----------



## TheOutcaste (Aug 8, 2007)

And seems findstr hanging may be an issue with just my PC. Have to try one with a file just barely over the 2GB size of my RAM.

Jerry


----------



## Squashman (Apr 4, 2003)

I was thinking of just pulling a few of those records out and running them by themselves or manually searching those records with find/replace in notepad or something. The last couple of records in the file are really suspicious. I can just tail the last 10 records from the file and rerun it. I unfortunately can't send the data to you. But thanks for the offer to help.


----------



## ghostdog74 (Dec 7, 2005)

Squashman said:


> I may have found a work around using SED though. Will test it to see if it is faster than yours.


download GNU grep for windows instead. Then on the command prompt

```
c:\test> grep -f profanity file
```
something like that. read the docs for GNU grep for more info.


----------



## Squashman (Apr 4, 2003)

ghostdog74 said:


> download GNU grep for windows instead. Then on the command prompt
> 
> ```
> c:\test> grep -f profanity file
> ...


I am glad you read this whole thread.


----------

