# Solved: Batch Find and Replace (Unicode)



## Anandajoti (Sep 26, 2010)

I am completely new to batch file work, but I need to find characters written in a legacy font and change them to Unicode.


It is only for html files but must cover all the files in a directory and all its sub-directories.
The sort of change needed is 



find à replace to &#257; 
find ã replace to &#299; 



These characters will look funny, but they are two actual examples. If anyone can give me a working example I can build on it, and of course any help would be very much appreciated.


----------



## TheOutcaste (Aug 8, 2007)

Welcome to TSG!

A batch file can't manipulate Unicode text; it would convert the whole file to ANSI. And the <> symbols used in HTML tags can be difficult to work with.

Be much easier to use something like Notepad++, it can do a find and replace for files on disk.


----------



## Anandajoti (Sep 26, 2010)

Thanks for the clarification. But I really need a script that I could run automatically after updating files regularly on disk. Is there some other script programming language that may be possible?


----------



## TheOutcaste (Aug 8, 2007)

VBScript might work; it depends on whether the files are actually encoded as Unicode, or just using Unicode characters.
This page shows the basics:
How Can I Find and Replace Text in a Text File
Just need to add the format option to the Open and Save commands so it will treat them as Unicode (the -1 I've added):

```
Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Test\Unitext.txt", ForReading,, -1)

strText = objFile.ReadAll
objFile.Close
strNewText = Replace(strText, "à", "&#257;")
strText = Replace(strNewText, "ã", "&#299;")

Set objFile = objFSO.OpenTextFile("C:\Test\UnitextNew.txt", ForWriting, True, -1)
objFile.Write strText
objFile.Close
```
You can write back to the same file, I used a different output file name just for testing.
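
For comparison, here is the same read/replace/write step as a minimal Python sketch (not VBScript). The only point it makes is that the encoding is named explicitly on both the read and the write, instead of being picked with a numeric format flag. The function name `replace_in_file` and the `cp1252` (ANSI) default are assumptions for illustration, not something taken from the script above.

```python
from pathlib import Path

# Legacy-font characters and their replacements (the two examples from this thread).
REPLACEMENTS = {"à": "&#257;", "ã": "&#299;"}


def replace_in_file(src, dst, encoding="cp1252"):
    """Read src using the given encoding, apply REPLACEMENTS, write the result to dst."""
    text = Path(src).read_text(encoding=encoding)
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    Path(dst).write_text(text, encoding=encoding)
    return text
```

If the files turn out to be UTF-8 rather than ANSI, passing `encoding="utf-8"` should be the only change needed.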

Use a For loop in a batch file to pass the filenames to the Replace.vbs script, or you could use VBScript to get the list of names as well.
Haven't tested this, but it should get all the files and process them. I'd use the New file name for testing:

```
Const ForReading = 1
Const ForWriting = 2

Set objFSO = CreateObject("Scripting.FileSystemObject")
strComputer = "."
StrFolderName = "C:\Test"
Set objWMIService = GetObject("winmgmts:\\" & strComputer & "\root\CIMV2")
Set colItems = objWMIService.ExecQuery("Associators of {Win32_Directory.Name='" & strFolderName & "'} Where ResultClass = CIM_DataFile")
For Each objItem in colItems
  strFileName = objItem.Drive & objItem.Path & objItem.FileName & "." & objItem.Extension
  strNewFileName = objItem.Drive & objItem.Path & objItem.FileName & "New." & objItem.Extension
  Set objFile = objFSO.OpenTextFile(strFileName, ForReading,,-1)
  
  strText = objFile.ReadAll
  objFile.Close
  strNewText = Replace(strText, "à", "&#257;")
  strText = Replace(strNewText, "ã", "&#299;")
  
  Set objFile = objFSO.OpenTextFile(strNewFileName, ForWriting,True,-1)
  objFile.Write strText
  objFile.Close
Next
```


----------



## Anandajoti (Sep 26, 2010)

I just tried it out. All the characters changed to Chinese, which completely scrambled the text. I then made sure it was encoded as UTF, and tried again. Same result.

I now just tried it with this code:

strNewText = Replace(strText, "à", "& #257;") //I have added a space here else it gets interpreted.
strText = Replace(strNewText, "ã", "& #259;")

Same result.

One other problem (maybe fortunately): it is not recursive.

A 3rd problem is that the filenames changed to lowercase.

This is the top part of the original html:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Original Texts and Studies - Home Page

This is what it looks like now:

㼼浸⁬敶獲潩㵮ㄢ〮㼢ാ㰊䐡䍏奔䕐栠浴⁬啐䱂䍉∠⼭圯䌳⼯呄⁄䡘䵔⁌⸱‰牔湡楳楴湯污⼯久•栢瑴㩰⼯睷⹷㍷漮杲启⽒桸浴ㅬ䐯䑔砯瑨汭ⴱ牴湡楳楴湯污搮摴㸢਍格浴⁬浸湬㵳栢瑴㩰⼯睷⹷㍷漮杲ㄯ㤹⼹桸浴≬ാഊ㰊敨摡ാഊ㰊敭慴栠瑴⵰煥極㵶䌢湯整瑮吭灹≥挠湯整瑮∽整瑸栯浴㭬挠慨獲瑥甽晴㠭•㸯਍਍琼瑩敬伾楲楧慮⁬敔瑸⁳湡⁤瑓摵敩⁳*潈敭倠条㱥琯瑩敬ാ㰊慢敳琠牡敧㵴弢潴≰⼠ാഊ㰊敭慴渠浡㵥欢祥潷摲≳挠湯整瑮∽畂摤楨浳‬畂摤楨瑳吠硥獴‬慐楬‬慐楬吠硥獴‬慓獮牫瑩‬慓獮牫瑩吠硥獴‬畂摤楨瑳䠠批楲⁤慓獮牫瑩‬潃灭牡瑡癩⁥敔瑸㭳䐠慨浭灡摡㭡唠慤慮慶杲㭡唠慤慮※瑉癩瑵慴慫※慃畴桢湡癡牡灡污㭩倠瑡浯歯桫㭡䬠畨摤慨慫慰桴≡⼠ാ㰊敭慴渠浡㵥搢獥牣灩楴湯•潣瑮湥㵴吢楨⁳敳瑣潩⁮潣瑮楡獮䈠摵桤獩⁴整瑸⁳湩琠敨漠楲楧慮⁬慬杮慵敧ⱳ琠杯瑥敨⁲楷桴猠畴楤獥漠⁦桴⁥整瑸ⱳ攠灳捥慩汬⁹湩爠来牡⁤潴琠敨物挠浯楰楬瑡潩⁮湡⁤牰獯摯⹹•㸯਍洼瑥⁡慮敭∽慲楴杮•潣瑮湥㵴䜢湥牥污•㸯਍洼瑥⁡慮敭∽硥楰敲≳挠湯整瑮∽敎敶≲⼠ാ

Um ah, um ah


----------



## TheOutcaste (Aug 8, 2007)

Anandajoti said:


> charset=utf-8"


That's why: it's not encoded in Unicode, which uses 16 bits per character; it's using UTF-8, which uses 8-bit characters.

You might try treating the file as ASCII text (change the -1 to 0, or leave it off). When I do that it sees *ā* as *a* and *ī* as *i*, but that may depend on Regional Settings, so it might work on your system, either as-is or using the *& #XXX;* format for the replacement characters.
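
The "Chinese" text quoted earlier is consistent with UTF-8 bytes being read two at a time as 16-bit characters. A small Python sketch (not VBScript) that reproduces the effect on the first few characters of the page; the function name `misread_utf8_as_utf16` is made up for this illustration:

```python
def misread_utf8_as_utf16(text):
    """Encode text as UTF-8, then decode those bytes as UTF-16 LE (the wrong way round)."""
    data = text.encode("utf-8")
    # Pad to an even number of bytes so every byte pair forms one 16-bit unit.
    if len(data) % 2:
        data += b"\x00"
    return data.decode("utf-16-le", errors="replace")


# "<?xm" is the start of the XML declaration at the top of the page.
scrambled = misread_utf8_as_utf16("<?xm")
print(scrambled)
```

The two characters printed should match the first two characters of the scrambled page above, which suggests the file itself was fine and only the format flag used to open it was wrong.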


----------



## Anandajoti (Sep 26, 2010)

UTF-8 is a form of Unicode, 8-bit as opposed to 16-bit, and is now the dominant form of Unicode, see http://en.wikipedia.org/wiki/Utf-8

I tried implementing both suggestions: changing -1 to 0 or leaving it off produces no result (i.e. the file is not changed); this is when using the &#xxx; format. Including -1 always ends up scrambled.

I also don't know how to do a recursive find and replace at the moment, or how to preserve case.

By the way I must say I truly appreciate the time you are taking to look into this.


----------



## TheOutcaste (Aug 8, 2007)

OK, this searches recursively, and seems to do the replacing OK.
The .vbs and .html files I used for testing were encoded as "UTF-8 without BOM".
The replace function is case sensitive by default. If you want to replace "À" with "Ā" for example, use this:
Replace(strText, "À", "& #256;")

Remove the space between the *&* and *#*


```
Set objFSO = CreateObject("Scripting.FileSystemObject")
Const ForReading = 1
Const ForWriting = 2
objStartFolder = "C:\Temp Dir\test\Test Source"

Set objFolder = objFSO.GetFolder(objStartFolder)
Wscript.Echo "Current Folder - " & objFolder.Path
Set colFiles = objFolder.Files
For Each objFile in colFiles
  Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForReading)
  strText = objCurrentFile.ReadAll
  objCurrentFile.Close
  strText = Replace(strText, "à", "& #257;")
  strText = Replace(strText, "ã", "& #299;")
  Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForWriting,True)
  objCurrentFile.Write strText
  objCurrentFile.Close
Next

DoSubfolders objFSO.GetFolder(objStartFolder)

Sub DoSubfolders(Folder)
  For Each Subfolder in Folder.SubFolders
    Wscript.Echo "Current Folder - " & Subfolder.Path
    Set objFolder = objFSO.GetFolder(Subfolder.Path)
    Set colFiles = objFolder.Files
    For Each objFile in colFiles
      Wscript.Echo "Current File - " & objFile.Path
      Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForReading)
      strText = objCurrentFile.ReadAll
      objCurrentFile.Close
      strText = Replace(strText, "à", "& #257;")
      strText = Replace(strText, "ã", "& #299;")
      Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForWriting,True)
      objCurrentFile.Write strText
      objCurrentFile.Close
    Next
    DoSubfolders Subfolder
  Next
End Sub
```


----------



## Anandajoti (Sep 26, 2010)

Hi, thanks for the new script. Now the script is definitely recursive, but it picks up other things, not only html.

And for some reason I can't figure out, it doesn't do any replacements. It is reading and writing, but there are no replacements. 

When I use plain text (i.e. characters up to code 127) it works. 

eg: strText = Replace(strText, "aa", "& #257;") is OK

When I use VB code, writing eg: 

strText = Replace(strText, Chr(224), ChrW(257))

again it doesn't work. 

This last would be a better solution for me, as I already have the codes from VBA scripting.


----------



## TheOutcaste (Aug 8, 2007)

This will filter out all files that don't have *.html* in their name. This doesn't check the extension specifically, so it would check a file named *test.html.txt*.

I don't know why it's not doing the replacements with your files. I'm guessing it's dependent on the encoding the files are saved with. This is separate from the encoding declared in the html code in the file.
The script works for me with the script saved as *ANSI* or *UTF-8*, but I have to re-enter the characters, as they change when the encoding is changed. For example, *ã* becomes *Ã£*.
I've tested with the html file saved as *ANSI*, *UTF-8*, and *UTF-8 without BOM*
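
A short Python sketch (not VBScript) of why the characters change when the encoding is switched, and why a search can then find nothing: the same character is stored as different bytes in ANSI and in UTF-8. `cp1252` is assumed here as the ANSI code page, and the helper name `view_utf8_as_ansi` is made up for illustration.

```python
def view_utf8_as_ansi(text):
    """Show how UTF-8 bytes look when an editor interprets them as Windows-1252 (ANSI)."""
    return text.encode("utf-8").decode("cp1252")


# The search character as stored in an ANSI file vs. a UTF-8 file.
ansi_bytes = "ã".encode("cp1252")   # one byte: E3
utf8_bytes = "ã".encode("utf-8")    # two bytes: C3 A3
print(ansi_bytes.hex(), utf8_bytes.hex(), view_utf8_as_ansi("ã"))
```

So if the script holds the one-byte form and the html file holds the two-byte form (or the other way round), the replace has nothing to match, which would fit the "reads and writes but changes nothing" behaviour.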

```
Set objFSO = CreateObject("Scripting.FileSystemObject")
Const ForReading = 1
Const ForWriting = 2
objStartFolder = "C:\Temp Dir\test\Test Source"

Set objFolder = objFSO.GetFolder(objStartFolder)
Wscript.Echo "Current Folder - " & objFolder.Path
Set colFiles = objFolder.Files
For Each objFile in colFiles
  If InStr(1, objFile.Name, ".html", vbTextCompare) > 0 Then
    Wscript.Echo "Processing " & objFile.Name
    Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForReading)
    strText = objCurrentFile.ReadAll
    objCurrentFile.Close
    strText = Replace(strText, "à", "& #257;")
    strText = Replace(strText, "ã", "& #299;")
    Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForWriting,True)
    objCurrentFile.Write strText
    objCurrentFile.Close
  End If
Next

DoSubfolders objFSO.GetFolder(objStartFolder)

Sub DoSubfolders(Folder)
  For Each Subfolder in Folder.SubFolders
    Wscript.Echo "Current Folder - " & Subfolder.Path
    Set objFolder = objFSO.GetFolder(Subfolder.Path)
    Set colFiles = objFolder.Files
    For Each objFile in colFiles
      If InStr(1, objFile.Name, ".html", vbTextCompare) > 0 Then
        Wscript.Echo "Processing " & objFile.Path
        Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForReading)
        strText = objCurrentFile.ReadAll
        objCurrentFile.Close
        strText = Replace(strText, "à", "& #257;")
        strText = Replace(strText, "ã", "& #299;")
        Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForWriting,True)
        objCurrentFile.Write strText
        objCurrentFile.Close
      End If
    Next
    DoSubfolders Subfolder
  Next
End Sub
```
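
The `InStr` test above matches *.html* anywhere in the name. If only the real extension should count, the check could look at the final extension instead. A minimal Python sketch of that idea (not VBScript; the name `is_html_file` is hypothetical). In VBScript, `objFSO.GetExtensionName(objFile.Path)` would probably be the closest counterpart.

```python
from pathlib import Path

HTML_SUFFIXES = {".htm", ".html"}


def is_html_file(name):
    """True only when the final extension is .htm or .html, ignoring case."""
    return Path(name).suffix.lower() in HTML_SUFFIXES
```

With this check a file named *test.html.txt* is skipped, while *Index.HTM* is still processed.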


----------



## Anandajoti (Sep 26, 2010)

Now the coding is great and only picks up the html files. The coding for my files is ANSI with the declaration as utf-8, but still it doesn't make any changes. 

It does read and write, I have one of the files open in Notepad++ and it has to be reloaded as it has changed on disk, but the characters have not changed, no matter whether I use chr(224) or à. It DOES work when (for testing purposes) I change one of the characters to aa. 

Very frustrating, but I don't see where we go from here, cause it's working for you and not for me!


----------



## TheOutcaste (Aug 8, 2007)

Are these html files online by any chance? You can PM me a link if you don't want to post the link in public.
Or if there is no personal/confidential info you can zip one up and attach it.


----------



## Anandajoti (Sep 26, 2010)

The files are online at the following address: http://www.ancient-buddhist-texts.net/Buddhist-Texts/BT-index.htm, but I think easier if I just attach a small subset here, as there are about 2,500 online.

The ones I send here are saved as ascii with utf-8 encoding without the BOM, but I have others with the BOM. I am unsure how that will affect everything.


----------



## Anandajoti (Sep 26, 2010)

I don't see the attachment in the thread, so I will try again.


----------



## TheOutcaste (Aug 8, 2007)

All of these htm files are encoded as plain ANSI according to Notepad++. I saved my script as ANSI, and it replaced the characters just fine. (To avoid the text being changed in the .vbs file when you change the encoding, first select all the text then copy, change encoding, then paste).

I downloaded the BT-index.htm page directly from the web site, opened it in Notepad++, and it shows it's encoded as ANSI

I'm wondering if your system default is different than mine. The script will open the files as ASCII. You may want to change that so it opens them with your system default encoding.
See the reference for the OpenTextFile Method

Might try adding the *TristateUseDefault* value to the open commands:
Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForReading,,-2)
Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForWriting,True,-2)

Also, save two copies of the script, one encoded as ANSI, one as UTF-8, see if the encoding of the script makes a difference.


----------



## Anandajoti (Sep 26, 2010)

Now on my system when I open BT-index in Notepad++ it shows the encoding as utf-8 without BOM. If I change it to ansi, the script will convert it. 

On the other hand I have now changed the script to utf-8 without BOM, and it converts both the utf-8 with and without BOM, which is what the files are encoded with (at least here).

So in a way that's great, but... the text, when it is encoded with html entities (&#xxx;), becomes unreadable and so basically unusable for me, because a simple word like Udānapāḷi becomes Ud&#xxx;nap&#xxx;&#xxx;i.

This also increases the size of the files enormously, as there are hundreds, sometimes thousands of these changes on each page.

I don't understand why a line like this won't work, as this is the encoding that I use for various operations in VBA:

strText = Replace(strText, chr(252), chrW(257))

So that the files will all go into a readable state.


----------



## TheOutcaste (Aug 8, 2007)

It will make the source larger as you are replacing 1 character with 6, and will make the source hard to read, but viewing in a browser should be fine.

I'm not sure why the ChrW functions aren't working properly, but I suspect it's because the file is saved with ANSI encoding. Using WScript.Echo ChrW(257) displays the correct character when using WScript.exe, but when saved in the file, it's converted to a lowercase "a".
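
A rough Python illustration (not VBScript) of the storage problem: the ANSI code page, assumed here to be `cp1252`, simply has no slot for ā, so it can only be stored as UTF-8 bytes or as an ASCII entity. The lowercase "a" is presumably Windows substituting the closest ANSI letter; Python refuses instead of substituting. Both helper names are made up for this sketch.

```python
def storage_options(ch):
    """Return how one character is stored as UTF-8 bytes and as an ASCII-only HTML entity."""
    as_utf8 = ch.encode("utf-8")
    as_entity = ch.encode("ascii", errors="xmlcharrefreplace")
    return as_utf8, as_entity


def fits_in_ansi(ch):
    """True if Windows-1252 has a code for this character."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False
```

For ā this gives two bytes as UTF-8 against six as an entity, which is where the extra file size comes from, and `fits_in_ansi("ā")` is false while `fits_in_ansi("à")` is true.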

I don't know what else to try as far as directly editing the file with a script. You might want to look into using something like AutoIt or AutoHotKey to automate using Notepad++ to do the editing, or see if using a clip program in Notetab, or a macro in some other word processor can do this.


----------



## Anandajoti (Sep 26, 2010)

That is interesting about the Echo working but the saving not. I wonder why that is? I use the chr and chrW replacements all the time in Word and of course they work, including in the very htm files we are looking at. Anyway I will try and pursue this a little further and see if I can get it working, as it is really the only option. 

I have already tried doing the same thing with autohotkey, which has inadequate unicode support, and notepad, which has none. I hear that the latter are in fact working on introducing unicode, but when that is due I am not sure.

One other thing which is probably easy to implement is this: I only want to write back to the file if something has changed, otherwise I will be unable to distinguish between files that need updating online and those that are to all intents and purposes the same.


----------



## Anandajoti (Sep 26, 2010)

Now I have managed to get it all up and working by putting the final pieces in place. Once the encoding was changed in the script file itself, it was able to do Unicode replacements on the htm files.

The thing I really learned from this is how important the encoding is and that it must match if we are going to have success.

Finally I wrote a little if statement to make sure only files that have changed are written to. I might say that although I have been coding in VBA for years I had never extended it to VB Script, so that has opened up a new world of possibilities.

I could not have got anywhere, of course, without this great forum and its generous participants who are willing to share their knowledge so freely.

A very big thank you indeed to TheOutcaste, who in my books anyway is really a true Brahmin (in Buddhism it doesn't mean someone of a particular caste, but a truly good man). Sadhu sadhu sadhu - well done!

Here is the script, it's functional, but it needs greatly extending:

```
Set objFSO = CreateObject("Scripting.FileSystemObject")
Const ForReading = 1
Const ForWriting = 2
objStartFolder = "D:\OutputFolder"

Set objFolder = objFSO.GetFolder(objStartFolder)
'Wscript.Echo "Current Folder - " & objFolder.Path
Set colFiles = objFolder.Files
For Each objFile in colFiles
  If InStr(1, objFile.Name, ".htm", vbTextCompare) > 0 Then
    ' Wscript.Echo "Processing " & objFile.Name
    Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForReading,,-2)
    strTextOriginal = objCurrentFile.ReadAll
    strText = strTextOriginal
    objCurrentFile.Close
    strText = Replace(strText, "à", "ā")
    strText = Replace(strText, "ã", "ī")
    If strText <> strTextOriginal Then
      Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForWriting,True,-2)
      objCurrentFile.Write strText
      objCurrentFile.Close
    End If
  End If
Next

DoSubfolders objFSO.GetFolder(objStartFolder)

Sub DoSubfolders(Folder)
  For Each Subfolder in Folder.SubFolders
    ' Wscript.Echo "Current Folder - " & Subfolder.Path
    Set objFolder = objFSO.GetFolder(Subfolder.Path)
    Set colFiles = objFolder.Files
    For Each objFile in colFiles
      If InStr(1, objFile.Name, ".htm", vbTextCompare) > 0 Then
        ' Wscript.Echo "Processing " & objFile.Path
        Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForReading,,-2)
        strTextOriginal = objCurrentFile.ReadAll
        strText = strTextOriginal
        objCurrentFile.Close
        strText = Replace(strText, "à", "ā")
        strText = Replace(strText, "ã", "ī")
        If strText <> strTextOriginal Then
          Set objCurrentFile = objFSO.OpenTextFile(objFile.Path, ForWriting,True,-2)
          objCurrentFile.Write strText
          objCurrentFile.Close
        End If
      End If
    Next
    DoSubfolders Subfolder
  Next
End Sub
```
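
For anyone who wants the same logic outside VBScript, here is a rough Python sketch of the approach: walk the folder tree, filter on .htm/.html, apply the replacements, and write back only when the text changed. The function name `convert_tree`, the `utf-8` default, and the idea of keeping the pairs in a dictionary are assumptions for illustration, not part of the script above.

```python
import os

# Assumed mapping; extend with the remaining pairs.
REPLACEMENTS = {"à": "ā", "ã": "ī"}
HTML_EXTENSIONS = (".htm", ".html")


def convert_tree(root, encoding="utf-8"):
    """Apply REPLACEMENTS to every .htm/.html file under root; return the paths rewritten."""
    changed = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(HTML_EXTENSIONS):
                continue
            path = os.path.join(folder, name)
            # newline="" keeps the original line endings untouched.
            with open(path, encoding=encoding, newline="") as handle:
                original = handle.read()
            text = original
            for old, new in REPLACEMENTS.items():
                text = text.replace(old, new)
            # Only touch the file when something actually changed.
            if text != original:
                with open(path, "w", encoding=encoding, newline="") as handle:
                    handle.write(text)
                changed.append(path)
    return changed
```

Calling `convert_tree("D:/OutputFolder")` would return the list of files that were rewritten, which doubles as the list of files that need uploading again.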


----------



## PatrickMc (Jun 5, 2009)

Excellent vbscript.

This may be a possible alternative script.


```
# ReplaceChars.txt
var string dir, pattern, string1, string2
var string list, file, content
# Go to directory.
cd $dir
# Collect a list of all files matching pattern.
lf -r -n $pattern > $list
# Go thru files one by one.
while ($list <> "")
do
    # Get next file
    lex "1" $list > $file
    # Get contents of file into a string variable.
    cat $file > $content
    # Replace $string1 with $string2 - all occurrences.
    while ( { sen ("^"+$string1+"^") $content } > 0 )
        sal ("^"+$string1+"^") $string2 $content  > null
    # Write updated $content back to file.
    echo $content > { echo $file }
done
```
This is a biterscripting script. Save it to a file, say C:/ReplaceChars.txt, and run it with this command.

```
script "C:/ReplaceChars.txt" dir("C:/test") pattern("*.html") string1("à") string2("ã")
```
This will replace from à to ã in all *.html files in folder C:/test and its subfolders. Please test first in a test directory.

I have used scripts like this a lot. It has handled unicode characters that I came across. I am not sure if it will handle all unicode characters though.


----------



## Anandajoti (Sep 26, 2010)

Hi PatrickMc, that's a very interesting script and I must confess I had never heard of biterscript before, it certainly seems like a language worth investigating.

The problem for me is that I will actually need to make around 100 searches, and I haven't even set up the vbs properly yet, owing to other commitments. 

I would think though that there would be a problem with the biterscript as to run the commands I would need a batch file, and I am told batch files can't handle Unicode.

Am I missing something here?


----------



## PatrickMc (Jun 5, 2009)

To run biterscript you only need biterscript interpreter, no other batch software or executables. Biterscript will handle unicode characters. Just test this command on your computer (in biterscripting).

sal "^à^" "ã" "àbcdef"

It will replace the first string with second string in third string (input string). It will show you the input string with the replaced portion.

If you need to do 100s of searches, put the list of them in a C:/list.txt file, one pair per line, with the search string and replace string separated by a tab. Then run this master script.


```
# Script ReplaceList.txt
var str dir, pattern
set $wsep = "\t"
var str list, line, from, to
cat "C:/list.txt" > $list
# Get the first pair of strings.
lex "1" $list > $line
# Process string pairs one by one.
while ($line <> "")
do
    # Get the from string and to string.
    wex -p "1" $line > $from
    wex -p "2" $line > $to
    # Call our replace script with $from and $to.
    script "C:/ReplaceChars.txt" dir($dir) pattern($pattern) string1($from) string2($to)
    # Get the next line.
    lex "1" $list > $line
done
```
Run this master script with


```
script "C:/ReplaceList.txt" dir("C:/test") pattern("*.html")
```
This will go thru each pair of strings in the C:/list.txt file and replace them in all .html files under directory C:/test.
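
The same "list of pairs in a tab-separated file" idea, sketched in Python rather than biterscripting for comparison. The function names `load_pairs` and `apply_pairs` are hypothetical, and the list file is assumed to be saved as UTF-8 (with or without a BOM).

```python
def load_pairs(path, encoding="utf-8-sig"):
    """Read (search, replace) pairs from a file with one tab-separated pair per line."""
    pairs = []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            # A line without a tab raises ValueError here.
            search, replace = line.split("\t", 1)
            pairs.append((search, replace))
    return pairs


def apply_pairs(text, pairs):
    """Apply each pair in file order; earlier replacements can feed later ones."""
    for search, replace in pairs:
        text = text.replace(search, replace)
    return text
```

Because the pairs are applied in order, a replacement character that also appears as a later search string would be converted twice, so the order of lines in the list file matters.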

For bulk replacing strings, this is pretty standard operating procedure. I don't think you even have to learn biterscripting for simple things like this.

Good luck with your project.


----------

