# Solved: Text in jpeg to text in document. How to?



## aarhus2004 (Jan 10, 2004)

Hello,

I created a document. Then I used a screenshot freeware to convert the doc to a jpeg.

I 'lost' the original doc but have the jpeg.

Can I convert the jpeg image back to a text document? If so how?

Thanks.

Ben.

P.S. I hate typing!


----------



## ~Candy~ (Jan 27, 2001)

Ouch, I highly doubt it....sounds like you need to get your typing fingers ready.


----------



## stantley (May 22, 2005)

You can do it with Optical Character Recognition software. If you don't have any, download a free trial of On Screen OCR: http://www.screenocr.com/


----------



## aarhus2004 (Jan 10, 2004)

AcaCandy said:


> Ouch, I highly doubt it....sounds like you need to get your typing fingers ready.


Hmmm, AcaCandy, have you no sympathy for an old guy with nobbly knuckles, next-to-no eyesight and grumpy to boot? And I hate cats - so there.



stantley said:


> You can do it with Optical Character Recognition software. If you don't have any, download a free trial of On Screen OCR: http://www.screenocr.com/


OCR sounds right up my alley :up: Now I just need to know whether they will allow a senior a serious discount when the trial is over!! Thanks, stantley.

Ben.


----------



## stantley (May 22, 2005)

You're welcome, let me know how it works out for you.


----------



## ChuckE (Aug 30, 2004)

I don't hold much hope for your success with OCR. OCR will work poorly on a JPG file, since the images of the letters in a screen capture will probably not have much detail captured per letter. OCR typically likes a few hundred dots (pixels) per character to really get a good idea of what each character is. I will bet that a screen capture of your document has a LOT less than that.

Also, the JPG edges will probably be fuzzier, and thus less machine-recognizable. A human can discern fuzzy characters and actually interpret meaningful context, as you often see on those websites that ask you to enter, along with your password, the scrambled, messed-up characters shown in a box. The type of OCR we have currently is not yet at that level. (Humans are still good at something!)

Spend the time and keystroke your page in again, or find a fast keyboardist and ask them to do it for you.


----------



## ~Candy~ (Jan 27, 2001)

Hope it works, I've never tried OCR on a jpeg 

Always text documents.............

And you be careful, my cat appears in the middle of the night and does things to computers


----------



## stantley (May 22, 2005)

It worked OK for me on most JPG files I tried, although sometimes not perfectly. Results varied with the size and type of font.


----------



## aarhus2004 (Jan 10, 2004)

ChuckE said:


> I don't hold much hope for your success with OCR. OCR will work poorly on a JPG file, since the images of the letters in a screen capture will probably not have much detail captured per letter. OCR typically likes a few hundred dots (pixels) per character to really get a good idea of what each character is. I will bet that a screen capture of your document has a LOT less than that.
> 
> Also, the JPG edges will probably be fuzzier, and thus less machine-recognizable. A human can discern fuzzy characters and actually interpret meaningful context, as you often see on those websites that ask you to enter, along with your password, the scrambled, messed-up characters shown in a box. The type of OCR we have currently is not yet at that level. (Humans are still good at something!)
> 
> Spend the time and keystroke your page in again, or find a fast keyboardist and ask them to do it for you.


*ChuckE*,

You are right. (Or else it's that damned cat).

Thanks for the wisdom.

Ben.


----------



## ~Candy~ (Jan 27, 2001)

Yeah, yeah, blame it on the poor helpless cat with the AK-47


----------



## Noyb (May 25, 2005)

Here's another OCR that's somewhat affordable ... 
http://www.abbyy.com/scantooffice/
It'll convert to a M$ Word document and has a trial.

Candy .... How about sounds with an Avatar ???


----------



## lister (Aug 10, 2004)

AcaCandy said:


> Hope it works, I've never tried OCR on a jpeg


Isn't that what OCR is for? Creating editable text from scanned (raster) images?

Though lower res than a scanned image, I would imagine a screenshot of a document to be far cleaner than a scanned image.


----------



## Noyb (May 25, 2005)

lister said:


> Though lower res than a scanned image, I would imagine a screenshot of a document to be far cleaner than a scanned image.


All depends on the saved format and amount of compression (if Applicable) ... But you're probably right.

Aarhus2004 .... I have the professional ABBYY OCR version and can edit the image in Photoshop (if needed)
If you want me to see what I can do ... Email me the Image.
Or attach it here if you want.


----------



## aarhus2004 (Jan 10, 2004)

Noyb said:


> All depends on the saved format and amount of compression (if Applicable) … But you're probably right.
> 
> Aarhus2004 …. I have the professional ABBYY OCR version and can edit the image in Photoshop (if needed)
> If you want me to see what I can do … Email me the Image.
> Or attach it here if you want.


God, I loved that sound - even if it scared the hell out of me!

If the pro stuff would work I might invest...

I attach the first paragraph for you to try it for me. Thanks, *Noyb*.

Ben.:up:

BTW I can only handle .doc or .txt formats.


----------



## ChuckE (Aug 30, 2004)

Here's where I think OCR is going to have trouble, and why.
1) The JPG file, if you look at it magnified, has all kinds of artifacts around the letters. That is just what happens with JPG: it does not give clean edges around sharply defined text. A better format to use would be GIF, which still gives you good file compression and adds no artifacts. There is a 256 color limit to GIF, but it would not matter for your purposes here. You probably do not have the luxury of having a GIF capture of your now long-gone document. And making a GIF file of your existing JPG will not work either, since the artifacts are already there. You'd end up only with a GIF file of the JPG artifacts.

You can largely eliminate the artifacts by simply reducing the color depth of the image to 1-bit (2 colors).

2) A screen capture of a document has substantially fewer dots per character than a typical real scan of the actual paper document. Consider the fairly typical scan resolution of just 300 dpi (dots per inch), which is very low considering the much higher scan resolutions that modern scanners are capable of. If you have a printed font of size 10 points, you may see approximately 10 characters along one inch of a horizontal line (that is just a seat-of-the-pants guess, since it really all depends upon the font, style and kerning). With 10 characters per inch of line and a (low-end) scan resolution of 300 dpi, you will have approximately 30 dots of width per average character. And please consider that 300 dpi is low (600 or even 1200 would be much better and more detailed for OCR) and that 10 point type is fairly small.

Most OCR packages that I've played with recommend a larger font to scan. Sure, they'll work with smaller, but that just means you really ought to scan at a higher dpi. The cleaner the character definition you have, the better the chance your OCR package will have of being accurate. BUT, now look at the picture you have attached of your document. You have only about 11 dots per character, which is just about a third of what you would have had from a real scan of the hard copy.

3) Look at the font you have in your attached image. The font is in italics, which does complicate the OCR deciphering, and it is also something other than a plain, clean, easily distinguished font like Arial or Univers.
For example, the lowercase "s" has a stroke closing the loop in the bottom part of the "s".
Also the uppercase "M" is nearly all closed in.
All these types of character oddities confuse the OCR process. The OCR process can "learn" the oddities, and eventually your accuracy can get higher than whatever your initial accuracy was (and I would expect a low accuracy to begin with), but all that takes time.

4) And time is my final point. In the time it has taken me to write all this, I could have re-typed your text about 3 times already - and I am a SLOW typist! I use essentially just two fingers, one on each hand, and yet I could have done it. You probably could have done it too. Save your time, just keystroke it back in. Sure you hate it (at least I do). But the more you do, the better and faster you'll get.
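
To put rough numbers on point 2, here is a minimal Python sketch of the dots-per-character arithmetic. The 10 characters per inch is the seat-of-the-pants guess made above, the 11-dot figure is the estimate for the attached screenshot, and the function name `dots_per_character` is made up purely for illustration.

```python
def dots_per_character(dpi: float, chars_per_inch: float) -> float:
    """Approximate horizontal dots (pixels) available to each character."""
    if chars_per_inch <= 0:
        raise ValueError("chars_per_inch must be positive")
    return dpi / chars_per_inch


# Rough guess from the post: about 10 characters per inch for 10-point type.
CHARS_PER_INCH = 10

for dpi in (300, 600, 1200):
    dots = dots_per_character(dpi, CHARS_PER_INCH)
    print(f"{dpi} dpi scan: about {dots:.0f} dots per character")

# The attached screenshot has roughly 11 dots per character, which at this
# type size is equivalent to scanning at only about 110 dpi.
screenshot_dots = 11
print(f"Screenshot: about {screenshot_dots * CHARS_PER_INCH} dpi equivalent")
```

On these assumptions the screenshot sits at roughly a third of the detail of even a low-end 300 dpi scan, which is the gap described in point 2.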

I wish *Noyb* the best of luck, if he tries the OCR on your test image. And if he actually does get good results, in a reasonable amount of time, then I will most humbly beg his, and your, forgiveness, but I don't think that is going to be needed.

I really do hope and wish you and him the best, though.
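
For anyone who wants to try the 1-bit idea from point 1, here is a rough sketch of thresholding in plain Python. It assumes the image has already been decoded into rows of 0-255 grayscale values (the decoding step is not shown), and both the cut-off of 128 and the sample pixel values are illustrative assumptions, not measurements from Ben's image.

```python
def to_one_bit(gray_rows, threshold=128):
    """Map each 0-255 grayscale value to 0 (black) or 255 (white)."""
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in gray_rows
    ]


# Hypothetical 1 x 8 strip crossing the edge of a black letter on white.
# The in-between greys stand in for the JPG halo around the edge.
noisy_edge = [[0, 12, 40, 97, 180, 231, 248, 255]]

clean_edge = to_one_bit(noisy_edge)
print(clean_edge)  # [[0, 0, 0, 0, 255, 255, 255, 255]]
```

Thresholding removes the grey halo, but it can also thin or break strokes when there are only a few dots per character, so the cut-off would probably need tuning for each image.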


----------



## Noyb (May 25, 2005)

I had two major problems ..
1: I never got an Email from TSG saying that you had responded ... this seems to be happening lately.
2: TSG won't let me save the attachment(s) ... but luckily, I can have TSG email it to me.

The rest was easy ... just told ABBYY to read it .. and save it to a word format .... took about 15 seconds.
Attached is the output from ABBYY ... All I've done is increase the font size, which came out at 5 points, and change the line spacing.

I see some spell checking is needed ... I left it the way it came out of ABBYY
*************************************************************
The Record-Maker from Down-Under.
Gosford, Mew South Wales, is today a thriving centre, about an hour's travel north of its more famous neighbour, Sydney. If its Council webpages can tell us anything it is that Gosford has it 'covered' and is working hard at 'making it better". There are 32 "sporting fields" in the city and the list of sports played on them includes baseball but not cricket. In fact the results on entering 'cricket' as a search-term made me question whether I was requesting spurious information; information about something which is played on every piece of grass everywhere and needs no mention. Hard to say, but the Council seems to be playing-down what the Nation is crazy about. It was here that Steven Jesperson was born.
***************************************************************

Did you hear the Brass falling on the Concrete ... I love that part.
Got anymore work for me ????



----------



## aarhus2004 (Jan 10, 2004)

Hello Noyb,

I was able to download your zip without problem and save as my jpeg too - wonder why you couldn't?

But I got the error message " Wordpad has caused an error in MSWRD832.CNV..." and its Properties tell me it is a Microsoft Office Word doc, which I don't have - I didn't realise that .doc was used by both MS Office Word and my Wordpad. So Wordpad can't read it.

I see I can trial the software and don't recall the system requirements as including MS Office Word. I will have a go.

Thanks for doing it for me, Noyb, I am encouraged. The spelling is Brit.  

Ben.


----------



## Noyb (May 25, 2005)

OK ... Here's the text output (from the OCR converted RTF file) ...
from the more affordable ($49) ... http://www.abbyy.com/scantooffice/

Since you don't have M$ Word ... I used a RTF file conversion.
The line spacing is messed up a little in the RTF WordPad file (very small font) so I copied the text from Wordpad .. then pasted to a Text Document .. to fix the spacing and font size.
This "trick" removes some of the weird RTF formatting.

Had the same problem previously ... but WORD can fix it.

You might want to give the Trial D/L a try.

I don't have the problem with zips .... Just getting the "original" Image uploads.

************************************************************
The Record-Maker front Down-Under.
Gosford, New South Wales, is today a thriving centre, about an hour's travel north of its more famous neighbour, Sydney. If its Council webpages can tell us anything it is that Gosford has it 'covered' and is working hard at 'making it better: There are 32 "sporting fields" in the city and the list of sports played on them includes baseball but not cricket. In fact the results on entering 'cricket' as a search-term made me question whether I was requesting spurious information; information about something which is played on every piece of grass everywhere and needs no mention. Hard to say, but the Council seems to be playing-down what the Nation is crazy about. It was here that Steven Jesperson
**************************************************************


----------



## aarhus2004 (Jan 10, 2004)

That's fabulous, Noyb. Really appreciate your tackling it for me. It is everything I had hoped for when I first posted.

And yes I will trial it and get to work on the whole document.:up: 

Ben.


----------



## Noyb (May 25, 2005)

I'm curious how you make out ... let me know, or holler if you have a problem

Jay


----------



## aarhus2004 (Jan 10, 2004)

Will do, Jay. Thanks again.


----------



## aarhus2004 (Jan 10, 2004)

*The Record-Maker from Down-Under.*

*Gosford, New South Wales, is today a thriving centre, about an hour's travel north of its more famous neighbour, Sydney. If its Council webpages can tell us anything it is that Gosford has it 'covered' and is working hard at 'making it better'. There are 32 "sporting fields" in the city and the list of sports played on them includes baseball but not cricket. In fact the results on entering 'cricket' as a search-term made me question whether I was requesting spurious information; information about something which is played on every piece of grass everywhere and needs no mention. Hard to say, but the Council seems to be playing-down what the Nation is crazy about. It was here that Steven Jesperson was born.*

This is great software. I mean, I am very tentative about using new stuff, but this is so easy to use. At first it looked goofy (see gif) in Wordpad, but a C & P to Notepad made it clear enough, and then a C & P back to Wordpad and a change of fonts etc. did the trick.

Well done, Jay. :up:


----------



## Noyb (May 25, 2005)

Amazing ... That's what I saw also ... which required the Trick to remove the weird line spacing formatting.
Maybe editing the Image could work around this ... but didn't seem necessary for this application.
Abbyy had to do the upsizing ... apparently was good enough.

Since you don't have M$ WORD ... may I mention ...
There's a freeware alternative that I've seen mentioned here on TSG, by a few Wizards ...
http://www.openoffice.org/

I haven't played with it, but it may give you the ability to work on an OCR conversion to WORD.
As I understand ... It'll open a .doc file.
Even with your reluctance to load new stuff ... I think it'd be a safe experiment.

The OCR trial is limited to 15 conversions ... If you have more ... Email me a few.


----------



## Noyb (May 25, 2005)

Also ... if you have a lot of OCR conversions ... here's a FREE Text to Speech program.
http://www.readplease.com/

I've used it to help proofread the OCR conversions for errors that a spell checker won't catch ....
It reads the text conversion to me ... while I check the original image.
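
Another cross-check, when you happen to have two OCR passes of the same page (as with the two ABBYY outputs posted earlier in this thread), is to compare them word by word and look only at the places where they disagree. This is not something either ABBYY product is known to do; it is just a minimal sketch using Python's standard `difflib` module, with a made-up helper name `word_differences` and the opening words of the two posted outputs as sample data.

```python
import difflib


def word_differences(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs where the two texts disagree."""
    a_words, b_words = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(a_words[a_start:a_end]), " ".join(b_words[b_start:b_end]))
            )
    return differences


# Opening words of the two OCR outputs posted in this thread.
first_pass = "The Record-Maker from Down-Under. Gosford, Mew South Wales, is today a thriving centre"
second_pass = "The Record-Maker front Down-Under. Gosford, New South Wales, is today a thriving centre"

for first, second in word_differences(first_pass, second_pass):
    print(f"{first!r} vs {second!r}")
```

Both "front" and "Mew" are real words, so a spell checker may well let them through, while the comparison flags them straight away. You still have to check against the original image to decide which reading is right.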


----------



## aarhus2004 (Jan 10, 2004)

Jay,

This gets better and better. Not only was my original document editing (the whole thing) sloppy but I am seeing what it lacks. Folk are too polite to tell me, it seems. And thanks for the offer - I'm hoping to have enough shots left to complete it. I lost the doc because I had to do a format/install and it wasn't in my back-ups. It appears on my website here:

http://www.geocities.com/cowichancricket/stevenlife

Once done I shall try the Word-type freebie. And let you know how I get on.

Cheers, Jay.

Ben.

Somebody to read back to me!!!? Beats getting married or whatever! I wonder if AcaCandy's cat is a talker? If so, he, she or it is probably one hell of a frustrated feline. (Sorry AcaC I can't help myself!)


----------



## Noyb (May 25, 2005)

Why didn't you say so ... earlier.
My pro version can batch process all 6 pages at once .. did I miss any ??
May I suggest ... Use gif formatted Images for this stuff.

Using M$ WORD makes the editing easier ... but that weird language drives the spell checker nuts.

Can't fast forward, rewind or Mute a Wife either.

Can you read this in WordPad ??


----------



## ChuckE (Aug 30, 2004)

I am impressed *Noyb*, and I do offer my apologies for doubting OCR's ability to read the captured screen images of Ben's document. I stand very much corrected, and I am so glad, too. _Abbyy_ looks like a fine product.


----------



## ChuckE (Aug 30, 2004)

With regard to the "goofy" or "weird" line spacing of your imported text in WordPad, all you really should need to do is select all the text and set it to a font size that WordPad now knows about. (It seems that the process you are using circumvents WordPad's automatic line-height adjustment for the text on those lines.)

What appears to have happened is that the text that came into WordPad is larger than 10pt, so the larger font is being placed on lines that WordPad still 'thinks' are 10pt, and the tops of the lines are being cut off by the lines above. If you just force the font to a larger size first (for example 18pt) and then put it all back to your desired size (perhaps 10pt), that should force WordPad to readjust the line heights properly. You should not have to copy the text out to something else to overcome this issue.

Hint: In WordPad, you might try selecting some, or all, of your text, and press Ctrl+1, or Ctrl+2, or Ctrl+5 to change the line spacing (not really the line height, but it may do what you want) to single, double, or line-and-a-half line spacing, respectively.


----------



## aarhus2004 (Jan 10, 2004)

Noyb said:


> Why didn't you say so ... earlier.
> My pro version can batch process all 6 pages at once .. did I miss any ??
> May I suggest ... Use gif formatted Images for this stuff.
> 
> ...


Yes, yes I can, and edit it and so on. Anyway I plan on trying to substitute a gif of your copy for the 6 jpegs. A different font may help with the clarity.

I chose jpegs over gifs originally cos the results were sharper (on Geocities) and the sizes the same.

Glad to have the help, Jay. I am beginning to understand that the Puritans laid a different foundation for English in the US. A lot of the dichotomy can be explained by this. My reading of Sam Pepys' Diary (1660 - 1669) reveals it. In a sense English, USA style, is an older form than is the current mode (UK). Merriam Webster often confirms my spelling but as a "variant". I have reached the point (in Canada and in age) where I hardly know how to spell anything or, if I do, whether it's my native spelling or Canada/USA style.

When I disembarked from the first class section of CP Air in 1960, in Vancouver as an immigrant, two impressions were formed that have lingered down the years. One, I cannot forget the Canadian who resented my presence in 1st Class (it was all they had to offer since I was last onboard) and two, the 'untidiness' of the locals. I hesitate to write it, but I also found, and find still, that N.American ladies ruled and rule the roosts!

Cheers. Jay.

Ben.


----------



## Noyb (May 25, 2005)

aarhus2004 said:


> .... I chose jpegs over gifs originally cos the results were sharper....


It's my understanding (and limited experience) that gifs are usually cleaner/smaller because the jpeg compression adds artifacts around sharp, high contrast, edges such as text .. and there's no need for a huge range of colors and the extra file size needed to represent them.

I just noticed that some of your Images were jpeg ... others were gifs.
Don't worry ... Only my spell checker noticed 

Did you see that OCR has troubles with numbers ??


----------



## aarhus2004 (Jan 10, 2004)

Hello Jay,

Yes I did notice the number difficulty but what surprised me more was it mistaking an h for a b but only in some three letter words. But that's the cheaper edition ($50). Maybe one day they will duplicate the younger human eye. I haven't installed the Word freeware yet. I have to be fully alert to even contemplate such a thing and today has been hectic in several ways. I am zonked. I sometimes think using WinME has caused my brain to mimic it because if I use too many resources I get slower and dafter by the minute.

I do ramble on - given the chance!

Cheers, Jay.

Ben.


----------



## ChuckE (Aug 30, 2004)

Noyb said:


> It's my understanding (and limited experience) that gifs are usually cleaner/smaller because the jpeg compression adds artifacts around sharp, high contrast, edges such as text


Well, somewhat. GIFs are not always smaller, but they are nearly always cleaner. The reason, as you correctly say, is that JPGs usually add artifacts to sharply defined, high-contrast edges. That is a consequence of the process the creators of the JPG compression scheme used to compress images more highly: approximating color transitions. In photographs, which usually have "tons" of color transitions and bits of spurious color spots, a few blurry edges do not matter much, except at very HIGH compression levels (which are adjustable) or at high magnifications of the image. That is, trading off some picture degradation for file-size compactness can usually be a fair trade.

If you want clean image edges, AND good file compression AND the 256 color limitation is not an issue, then GIFs are a highly recommended file format.

GIFs do not create image artifacts; what you see is what you get, within the color range. You must know, though, that if you convert an image which already contains artifacts (as a JPG will) into a GIF, the GIF only captures the artifacts as well. It will not clean up an already artifact-riddled image.

When I need to make screen captures that are absolutely as clean and as compact as I can make them, I always use the GIF format. There are some combinations of colors and color changes for which a JPG output file will be smaller than a GIF of the same image, but a JPG will never be as sharp as the GIF of the same screen. The size difference, one way or the other, is not much anyway.
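
A quick way to check whether the 256 color limit matters for a given capture is to count its distinct colors. Here is a minimal Python sketch, assuming the image has already been decoded into rows of (R, G, B) tuples (the decoding step is not shown); the function names and the sample data are hypothetical.

```python
def distinct_colors(pixels):
    """Count the unique (R, G, B) values in a grid of pixels."""
    return len({tuple(pixel) for row in pixels for pixel in row})


def fits_gif_palette(pixels, limit=256):
    """True if the image needs no color reduction to fit one 256-color palette."""
    return distinct_colors(pixels) <= limit


# Black text on white with a couple of in-between greys: only 4 colors.
text_capture = [
    [(255, 255, 255), (128, 128, 128), (0, 0, 0)],
    [(255, 255, 255), (64, 64, 64), (0, 0, 0)],
]

# A synthetic 20 x 20 gradient in which every pixel is a different color.
gradient = [[(r * 12, g * 12, 0) for r in range(20)] for g in range(20)]

print(fits_gif_palette(text_capture))  # True
print(fits_gif_palette(gradient))      # False
```

A screen capture of plain text would normally land well under the limit, which is why the 256 color restriction is rarely a problem for this kind of image.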


----------

