# Solved: Extract data from XML file



## JTS123 (Dec 4, 2012)

This relates to my earlier query, in which the XML files were generated (http://forums.techguy.org/dos-other/1079516-solved-batch-file-question-search.html)

What I thought I could accomplish in Excel with macros proved to be more trouble than I had anticipated. And judging by the expertise from the earlier post, I hope someone can help me out again (last time!).

The XML files created contain two very important pieces of information for me which I need to extract. They _should_ only contain _one_ instance each, but I may be mistaken. And some files may not have these items in at all (both or none, never one). These files can be skipped.

What I'm looking for in the XML file is <latitude>123</latitude> (very long number) and <longitude>456</longitude> (another very long number). Both are encompassed by the tags shown.

For the output, the standard results.txt file would be perfect. On a single line, it would need to contain FILENAME LATVALUE LONGVALUE, delimited however is convenient.

If you followed the earlier post, you know there are a lot of XML files to run through. Luckily, all of these XML files are in a single (writeable) directory. The batch file can be run in this directory, and the results.txt file can be created in this directory as well.

The lat / long values can be on any line in the file (so not always 12th & 13th down).

Summary:
Long list of XML files in a single folder
Write *FILENAME* <latitude>*LATVALUE*</latitude> <longitude>*LONGVALUE*</longitude> in *results.txt* (<> values not needed, just displaying where the values are hidden. Filename extension not needed, but if it's trouble to remove just leave it in)
Repeat for all XML files

Thanks again for reading and for the help. This is the final step in my process and won't have another followup question in another thread.


----------



## foxidrive (Oct 20, 2012)

This should work. It expects the latitude and longitude to exist in the same order in each file and that is the order it writes it out as.

I was even more lazy and so I process each file, even if they don't have the data. You can probably write an Excel macro to delete the lines that don't have 3 items in it.


```
@echo off
set results=results.csv
del "%results%" 2>nul

for %%a in (*.xml) do (
<nul >> "%results%" set /p "=%%~na,"
for /f "tokens=2 delims=><" %%b in ( ' type "%%a" ^|findstr /i "latitude longitude" ' ) do (
<nul >> "%results%" set /p "=%%b,"
)
echo.>> "%results%"
)
pause
```


----------



## foxidrive (Oct 20, 2012)

It wasn't so hard to be more precise. This echos the file,lat,lon in the correct order and should only output data where both items are present.

It does expect the <longitude>11.23</longitude> etc to be the first item in the line.


```
@echo off
set "results=results.csv"
setlocal enabledelayedexpansion
del "%results%" 2>nul

for %%a in (*.xml) do (
set "lat="
set "lon="
for /f "tokens=2 delims=>< " %%b in ( ' type "%%a" ^|findstr /i "<latitude>" ' ) do set lat=%%b
for /f "tokens=2 delims=>< " %%b in ( ' type "%%a" ^|findstr /i "<longitude>" ' ) do set lon=%%b
if defined lat if defined lon >> "%results%" echo %%~na,!lat!,!lon!
)
pause
```


----------



## JTS123 (Dec 4, 2012)

Thanks again for the quick reply. Sorry for the delay in mine.

I ran both batch files, and both seem to run and find lat / long, but not for all instances. I get a lot of errors when I run it:

```
FINDSTR: Line 682687 is too long.
```
The number changes depending on the line, and I can't seem to make sense of where it's coming from. It varies in length from a 4 digit to an 8 digit IIRC, and each error repeats multiple times before moving to the next error.

The .csv it spits out looks like this:

```
71135 25.7326 -80.2565
71192 latitude longitude
71225 latitude longitude
71390 51.04897 -115.061
```
I would assume that the middle two instances are errors (or it doesn't have a value). Of 1537 files, 1504 of them have that error. If the error is from lat/long not being present in the file, I can easily nix it myself with a filter in excel.

This is a piece of a file I'm scanning (red added). <Placemark> starts at line 95. Sometimes it starts at line 20... 1121... anywhere.

```
<Placemark>
  <name>1300633</name>
  <LookAt>
   <longitude>[COLOR=red]-122.4141660654396[/COLOR]</longitude>
   <latitude>[COLOR=red]37.77929741501927[/COLOR]</latitude>
   <altitude>0</altitude>
   <heading>2.428894193593244e-010</heading>
   <tilt>13.20532735504014</tilt>
   <range>141.016182805692</range>
   <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
  </LookAt>
  <styleUrl>#m_ylw-pushpin0</styleUrl>
  <Point>
   <coordinates>-122.4143354316815,37.77949755379234,0</coordinates>
  </Point>
 </Placemark>
```
The more decimal places I get in this, the better it would be in the end. If that's possible.


----------



## foxidrive (Oct 20, 2012)

The XML has long lines which go over Findstr line length limit. That may be a harmless error as long as those lines don't have the long/lat in them.

Check those files listed here and search for <longitude>
71192 latitude longitude
71225 latitude longitude
It seems that there is something else that has longitude in it. Or something else is before it on the line.
Let me know and provide a sample please.

The batch is designed to grab the entire lat/log and doesn't truncate the decimal places - those files must only have 4 decimal places listed in them.


----------



## JTS123 (Dec 4, 2012)

It's failing from the 3 tabs before <lat / long>. If I edit an xml file to nix the tabs, the batch file works and gives me the result without being truncated. I'm not sure which files are giving me the length errors though or I would try to run it on that again.

Search for <latitude> and <longitude> yields results for both. Single instance in each.

If I add tabs to the batch file, it still won't pick up the 3tab<lat>, but still works for the other xml file without tabs. Both are attached as .txt files - .xml wasn't valid on these forums. Just swap the extension to .xml and you'll be seeing what I see.

Is it possible to add 3 tabs in the batch file?


----------



## foxidrive (Oct 20, 2012)

In this delims section *in two places* 
delims=>< "

Add a tab character before the space. Ensure that your editor doesn't convert tabs to spaces.

Then try it again - the long lines may not have relevant info and the errors can be ignored I think.

Another option is to use another tool like GnuSED or Grep if the long lines do turn out to be an issue.


----------



## JTS123 (Dec 4, 2012)

Once again you have created a masterpiece, foxi. Thank you again for your help. It worked.

Final batch file to find filename + tags latitude and longitude in an XML file:

```
@echo off
set "results=results.csv"
setlocal enabledelayedexpansion
del "%results%" 2>nul
for %%a in (*.xml) do (
set "lat="
set "lon="
for /f "tokens=2 delims=><  " %%b in ( ' type "%%a" ^|findstr /i "<latitude>" ' ) do set lat=%%b
for /f "tokens=2 delims=><  " %%b in ( ' type "%%a" ^|findstr /i "<longitude>" ' ) do set lon=%%b
if defined lat if defined lon >> "%results%" echo %%~na,!lat!,!lon!
)
pause
```
Lots of errors were coming out due to length, but out of ~2000 files, I have ~1500 results. Which sounds about right, knowing that not all files had lat/long in them.

One more question though, just for my own knowledge. Is this code going through each individual line of the XML file, searching for lat / long?


----------



## foxidrive (Oct 20, 2012)

JTS123 said:


> Thank you again for your help. It worked.
> 
> Lots of errors were coming out due to length, but out of ~2000 files, I have ~1500 results. Which sounds about right, knowing that not all files had lat/long in them.
> 
> One more question though, just for my own knowledge. Is this code going through each individual line of the XML file, searching for lat / long?


You're welcome.

The two lines with findstr use that tool and search the entire file for the search terms, yes.

JFTR I don't think it's possible to post TAB characters in the forum so your post above doesn't have them either. 
The TABs get converted into spaces.


----------

