# Crowling through PDF files



## nesr (Nov 5, 2008)

I have a bunch of PDF files, these files are downloaded research papers af a journal that I subscribed to.
the problem is that the files are downloaded with the similar name with counter ddd0002, ddd0003, .....ddd0125, etc.
I want to rename each file according to the year of publication, issue number, and page range
this information are usually in the first line of the pdf file

Can this be done programatically?
How?
note: VB is preferrable.

thanks everybody for help


----------



## draceplace (Jun 8, 2001)

Yes very doable!! below is vbScript that will rename all files in a directory based on the file name. What you want to do is open the file, parse the first line (build your filename), close, then rename. many of the elements are here except opening and reading the file. let me know if you need that...

Beware..there is no Yes/No option (easily added) once you start the script. I would make a test directory to develop in.....not screw up the orginal files. Also you mentioned some files might not have the info in first line...if there arent to many just fix them manually. You might test by writing new file names to txt file instead of renaming..
Note the 'file extension" must be part of the name
____________________________________________


strComputer = "."
Path = "C:\Ascripts\TestRename"

Set objWMIService = GetObject("winmgmts:\\" & strComputer & "\root\cimv2")

Set colFiles = objWMIService.ExecQuery _
("ASSOCIATORS OF {Win32_Directory.Name='"& Path &"'} Where " _
& "ResultClass = CIM_DataFile")

WScript.Echo "Rename With Agency:" & vbCrLf & Path

'If Msgbx = 6 Then

For Each objFile In colFiles
strAgency1 = objFile.FileName
strAG = (right(strAgency1, 3))
strProgram = (left(strAgency1, 6))
'rename agency 
Select Case(strAG)

Case "A01"
strAgency2= "ARVAC"
Case "A02"
strAgency2 = "BRAD"
Case "A03"
strAgency2= "CADC"
 Case "A04"
strAgency2 = "CAPCA"
End Select

'strExtension = Replace(strExtension, "txt", "")
strExtension = "txt"
IF strProgram = "HU0664" Then
strFname1 = " TEA_HU0664"
Else
strFname1 = " SSI_HU0665"
End IF 
strNewName = objFile.Drive & objFile.Path & strAgency2 & "_" & strFname1 & _
"_" & strAG & "." & strExtension
errResult = objFile.Rename(strNewName)
Next

WScript.Echo "Finished!"

WScript.quit()


----------



## nesr (Nov 5, 2008)

Thank you draceplace very much for your efforts:up:

I read the code you,kindly, wrote
you said it contains no Opening code for the file
I think this is a problem for me as I know that VB opens only txt files for editing and the pdf is encrypted 
I attached two sample pdf files (only the first page of each one to compact size) that were sample of more than 5000 file in the path C:\MyJournal\
thank you very much in advance...


----------



## draceplace (Jun 8, 2001)

nesr, I've never opened a pdf with a script. As you pointed out that it may be an adventure!! lf you open these with notepad you can see the file as 'text'. You will need to build a 'parser' to build a file name with your script.


----------



## draceplace (Jun 8, 2001)

The closest thing I saw to a 'file name' looks like this...

R/Title(Soil water balance trial involving capacitance and neutron probe measurements)>

That was from ddd0001, ddd145 doesn't have the "R/Title" string...I wonder how many will have the r/title??


----------



## nesr (Nov 5, 2008)

Thank you for your trials


> I wonder how many will have the r/title??


Actually I don't know but it may depends on the year of publication 
The new style appeared from 1999 I think...

I know that many crowlers like Google desktop and Copernic having the capability to search pdf files
Is the code of them available?

Thanks again


----------



## draceplace (Jun 8, 2001)

Didn't get to look at this again. I'll be out till mid next week, might have time to look in...
Here's the script I was working on ..This was going to open the files and look for string and write a file of results...NOT FINNISHED untested...probably needs to stop and close file if it finds title or some that indicates no title (200 lines??)

'---------------------------------------------------------------
'-- Dan McCracken 08/04/2008----------------------------------------
'--Open a folder pdfs and find the title writ to file . 
' ---------------------------------------------------------------
strComputer = "."
Set objWMIService = GetObject("winmgmts:" _
& "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")

'set up for write file
Const ForAppending = 8
Const ForWriting = 2
Const ForReading = 1
Public strFileOwner 'get owner sub routine'
'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objTextFile = objFSO.OpenTextFile _
("C:\MyJournal\PDFnames_040809.txt", ForWriting, True)'Output file here<---'
'enumerate file props
objStartFolder = "C:\MyJournal" 'Folder name here<----
'objStartFolder = "C:\ascripts" 'Folder name here<-----

Set objFolder = objFSO.GetFolder(objStartFolder)
Wscript.Echo "Starting Here - " & objFolder.Path
'--------------Write the tab headers-must be insync with output---
objTextFile.WriteLine ("File_Name" & vbTab & "Path" & vbTab& "25 bytes R/Title")

Set colFiles = objFolder.Files
For Each objFile in colFiles

'for file owner
strFile = objFSO.GetAbsolutePathName(objFile)
GetFileNewName(strFile) '----------call subroutine'

'write the file info' 
objTextFile.WriteLine (objFile.Name & vbTab _
& objFSO.GetAbsolutePathName(objFile)& vbTab & strFile)
Next

objTextFile.Close
Wscript.Echo "Wow! It Finnished - " & now()

'---------------SubRoutines---------------------------------------
'-----------------Get the File Owner-------------------------------
Sub GetFileNewName(strFile)
If IsNull(strFile) or IsEmpty(strFile) Then 
strFileOwner = "Null Value"
Else
Set f = objFSO.OpenTextfile(strFile, ForWriting, True)
f.WriteLine "Ping Report" & " " & DateTime 
End If
'****************************************************************
'********Loop til end of input file******************************
Do While objTextFile.AtEndOfStream <> True
strLine = objtextFile.ReadLine
If inStr(strLine, "R/title") Then 
csvPCRecord = split(strLine, ",")
' Wscript.Echo vbCrLf & "PC Name: " & arrPCRecord(0)

Else
NotMonitored_cnt = NotMonitored_cnt + 1 
End If
End If
Loop
'****End of Job Routine ***********************************************
Next
End Sub


----------



## IMM (Feb 1, 2002)

I'd guess that you want to get a name like *Agricultural Water Management 96 (2009) 905-911.pdf* from ddd0001.pdf ?

I don't usu. do any vb (and I'm not good at it) -- but here's a possible method 
This is MUCH slower than writing in C (or similar) with zlib inflate functions - but if this is a one-off thing ...
Download the package _xpdf-3.02pl2-win32.zip_ from foolabs and extract the _pdftotext.exe_ file from it.
Place that exe file and the vbs file (the code box below) in the folder containing the pdf files (C:\myjournal).
When you run the vbs file - it should create a _renPDF.bat_ file for you - which you can then check with an editor to make sure it's what you want before running the batch file

The following file could probably use some tuning 

```
Option Explicit
Dim workingDir, outputBatchFileName, cmdstr, cmdln, PdfFileList, TitleLine
Dim PdfFolder, TestFile, fname, TestFileList, tname, tempfile, objBatFile ,RegEx
Dim tempbasename
Dim objFSO, dso, tmpFSO

Const ForAppending = 8
Const ForWriting = 2
Const ForReading = 1
Const TemporaryFolder = 2

'*********************
'********************
' Items you need to configure
workingDir = "C:\\MyJournal"
outputBatchFileName = "renPDF.bat"
cmdstr = "pdftotext.exe -f 1 -l 1  -eol dos -enc Latin1 "
'********************

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objBatFile = objFSO.CreateTextFile (outputBatchFileName, ForWriting, False)
Set dso = CreateObject("Scripting.FileSystemObject")
Set PdfFolder = dso.GetFolder(workingDir)
Set TestFileList = PdfFolder.files
Set tmpFSO = CreateObject("Scripting.FileSystemObject")
Set regEx = New RegExp
regEx.Pattern = "\xB1|\xAD"

For Each TestFile In TestFileList
fname = dso.GetAbsolutePathName(TestFile)
If (UCase(dso.GetExtensionName(TestFile)) = "PDF") Then
	PdfFileList = PdfFileList & fname & VbCrLf
	tempbasename = dso.GetBaseName(TestFile)
	cmdln = cmdstr & fname
	Run("%ComSpec% /c  " & cmdln )
	Set tempfile = tmpFSO.OpenTextFile(tempbasename & ".txt", ForReading, True, 0)
	TitleLine = tempfile.ReadLine
	TitleLine = regEx.Replace(TitleLine, "-")
	objBatFile.WriteLine("copy /y " & chr(34) & fname & chr(34) & "  " _
	&  chr(34) & PdfFolder & chr(92) & TitleLine & ".pdf" & chr(34))
	tempfile.Close
	' pdftotext defaults to basename + .txt if no output file given on cmdln
	tmpFSO.DeleteFile(tempbasename & ".txt")
End If
Next
objBatFile.Close

'************ Fcns
Function Run (ByVal cmd)
Dim sh: Set sh = CreateObject("WScript.Shell")
Dim wsx: Set wsx = Sh.Exec(cmd)
If wsx.ProcessID = 0 And wsx.Status = 1 Then
	Err.Raise vbObjectError,,"WshShell.Exec failed."
End If
Do
	Dim Status: Status = wsx.Status
	WScript.StdOut.Write wsx.StdOut.ReadAll()
	'WScript.StdErr.Write wsx.StdErr.ReadAll()
	If Status <> 0 Then Exit Do
		WScript.Sleep 10
	Loop
	Run = wsx.ExitCode
End Function
```


----------



## nesr (Nov 5, 2008)

dear draceplace


draceplace said:


> Didn't get to look at this again. I'll be out till mid next week, might have time to look in...
> Here's the script I was working on ..This was going to open the files and look for string and write a file of results...NOT FINNISHED untested...probably needs to stop and close file if it finds title or some that indicates no title (200 lines??)


I understand that the code you submitted still under development, and will be completed next week, Thank you very much for your efforts,
till next week, have a nice time...

=====================================================

Dear IMM


IMM said:


> I'd guess that you want to get a name like *Agricultural Water Management 96 (2009) 905-911.pdf* from ddd0001.pdf ?


No only the last portion Like : *A2009905911.pdf*



> I don't usu. do any vb (and I'm not good at it) -- but here's a possible method
> Download the package _xpdf-3.02pl2-win32.zip_ from foolabs and extract the _pdftotext.exe_ file from it.
> Place that exe file and the vbs file (the code box below) in the folder containing the pdf files (C:\myjournal).
> When you run the vbs file - it should create a _renPDF.bat_ file for you - which you can then check with an editor to make sure it's what you want before running the batch file
> The following file could probably use some tuning


I will do so tomoro morning and return to youy the results
thank you very much for the nice code


----------



## IMM (Feb 1, 2002)

nesr said:


> No only the last portion Like : *A2009905911.pdf*


w/o seeing what comes out of the first one I posted, I'm guessing a bit -- you could always use something like the following vbs.
It takes the output bat file from the first one and produces another bat file 
There is a commented line in it if you want to rename rather than copy.
It kinda grew organically so the regex style isn't good. 

```
'///////  redo_bat.vbs
Const ForWriting = 2
Const ForReading = 1
'********************
inputName  = "renPDF.bat"
outputName = "fixedBat.bat"
LocString  = "C:\\MyJournal\\"
'********************
OrigFname = ""
NewFname = ""
Set outFSO = CreateObject("Scripting.FileSystemObject")
Set outFile = outFSO.CreateTextFile (outputName, ForWriting, False)
Set inFSO = CreateObject("Scripting.FileSystemObject")
Set inFile = inFSO.OpenTextFile (inputName, ForReading, False)

Qt = Chr(34)
Do While inFile.AtEndOfStream <> True
	inputLine = inFile.ReadLine
	' remove the copy instruction
	inputLine = RegReplaceIt("copy /y ", "", inputLine, True)
	'remove the paths
	inputLine = RegReplaceIt(LocString, "", inputLine, True)
	'strip out the quotes
	inputLine = RegReplaceIt(Qt, "", inputLine, True)
	'break into substrings in an array (space as delimiter)
	StrArray = Split(inputLine, " ", -1, 1)
	OrigFname = StrArray(0)
	CharPos = Instr(Len(OrigFname), inputLine, "(", 1)
	inputLine = Mid(inputLine, CharPos)
	NewFname = RegReplaceIt("\(|\)|-|\s", "", inputLine, True)
	outFile.WriteLine("copy /y  "  & OrigFname & "  " & "A" & NewFname)
	'outFile.WriteLine("ren  "  & OrigFname & "  " & "A" & NewFname)
Loop

inFile.Close
outFile.Close

'*********  Fcns
Function RegReplaceIt(patrn, replStr, inputStr, glbl)
Dim regEx, str1
Set regEx = New RegExp
regEx.Pattern = patrn
regEx.Global = glbl
regEx.IgnoreCase = True
RegReplaceIt = regEx.Replace(inputStr, replStr)
End Function
```


----------



## IMM (Feb 1, 2002)

Make sure you are backed up 
As an aside - the first one might give issues if there isn't a title line as the first line in all the pdfs which has a similar format to the one's which were posted as examples.
I've rather assumed that what is in the folder is 8.3 named pdf files without any spaces in the names.
There's likely to be issues if that isn't the case.


----------



## draceplace (Jun 8, 2001)

IMM, Your stuff very interesting and perhaps better that what I came up with...Does it find titles for all?.just working on the raw pdfs, the one provided a 'Title' that I could find..perhaps another project would be to manually add the desired filename to this file (for the ones the code doesn't figure out i.e non 8.3?) and rename them based on the file...

Anyway this Script will go thru the folder and write a file with the old file name and what it could be renamed to (I'll add the rename piece tomorrow) this will allow you to check what's going to happen before we do the rename...

'---------------------------------------------------------------
'-- Dan McCracken 08/04/2009----------------------------------------
'--Open a folder pdfs and find the title-> write to file . 
'--'
' ---------------------------------------------------------------
strComputer = "."
Set objWMIService = GetObject("winmgmts:" _
& "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")

'xxxxxxxxxxxxx--set up for write file--xxxxxxxxxxxxxxxxxx
Const ForAppending = 8
Const ForWriting = 2
Const ForReading = 1
Public strFileOwner 'get owner sub routine'
'xxxxxxxxx--Opens output file---xxoxxxxxxxxxxxxxxxxxx'
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objTextFile = objFSO.OpenTextFile _
("C:\MyJournal\PDFnames_040809.txt", ForWriting, True)'Output file here<---'
'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
'--we are going to open every file in this folder except the output file-----
objStartFolder = "C:\MyJournal" 'Folder name here<----

Set objFolder = objFSO.GetFolder(objStartFolder)
Wscript.Echo "Starting Here - " & objFolder.Path
'--------------Write the tab headers-must be insync with output---
objTextFile.WriteLine ("File_Name" & vbTab & "Path" & vbTab& "25 bytes R/Title")

'xxxxxxxxxxxx---Get the Name of the every File and loope---xxxxxxxxxxxx
Set colFiles = objFolder.Files
For Each objFile in colFiles

strFile = objFSO.GetAbsolutePathName(objFile)
'Wscript.Echo strFile
If IsNull(strFile) or IsEmpty(strFile) Then 
strFileOwner = "Null Value"
Else
If objFile.Name <> "PDFnames_040809.txt" Then
strTitle = GetFileNewName(strFile) '----------call Function'

'write the file info' 
objTextFile.WriteLine (objFile.Name & vbTab _
& objFSO.GetAbsolutePathName(objFile)& vbTab & strFile)
End If 
End If 
Next

objTextFile.Close
Wscript.Echo "Wow! It Finnished - " & now()

'---------------Function returns new name-----------------------------------
'-----------------Read the pdf-------------------------------
Function GetFileNewName(strFile)

Set f = objFSO.OpenTextfile(strFile, ForReading, True)
'****************************************************************
'********Loop til end of input file******************************
'*******If we don't find Title in 1st 500 line quit***********************
cntPdfLoop = 0
Do While f.AtEndOfStream <> True and cntPdfLoop < 500
strLine = f.ReadLine
'Wscript.Echo vbCrLf & "The string: " & strLine 
startNewName = inStr(strLine, "R/Title")
If startNewName > 0 Then 
strNewName = Mid(strLine,startNewname+8)
endName= inStr(strNewName,")")	
strNewName = Left(strNewName,endName-1)
strFile = strNewName & ".pdf"
'Wscript.Echo cntPdfLoop & vbCrLf & startNewName & vbCrLf & endName & vbCrLf & _
'" The string: " & strNewName 
f.close
Exit Function '---to be safe lets get out---'
End If
cntPdfLoop = cntPdfLoop + 1
Loop
f.Close
End Function


----------

