Some assistance/advice for OCRing?

hedge@beehaw.org · 1 year ago

MasterBuilder@lemmy.one · 1 year ago

I use ocrmypdf, after being a bit frustrated with gscan2pdf. There is a simple UI available, but I just created a tiny script that does the OCR, deskew, etc. in one operation, with wildcard file selection.
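For anyone wanting a starting point, a script along those lines might look like the sketch below. It is not MasterBuilder's actual script. It assumes `ocrmypdf` is on the PATH, and the choice of `--rotate-pages` alongside `--deskew`, the `ocr/` output folder, and the name `build_command` are all illustrative assumptions.

```python
import glob
import subprocess
import sys
from pathlib import Path


def build_command(src: Path, dst: Path) -> list[str]:
    """Build one ocrmypdf invocation (the flag selection is an assumption)."""
    return [
        "ocrmypdf",
        "--deskew",        # straighten slightly crooked scans
        "--rotate-pages",  # fix pages scanned sideways or upside down
        str(src),
        str(dst),
    ]


if __name__ == "__main__":
    # Each argument is a wildcard pattern; quote it so Python expands it,
    # e.g.  python ocr_batch.py "scans/*.pdf"
    for pattern in sys.argv[1:]:
        for name in sorted(glob.glob(pattern)):
            src = Path(name)
            # Hypothetical layout: results go into an "ocr" folder next to the input.
            out_dir = src.parent / "ocr"
            out_dir.mkdir(exist_ok=True)
            result = subprocess.run(build_command(src, out_dir / src.name))
            if result.returncode != 0:
                print(f"{src}: ocrmypdf exited with {result.returncode}",
                      file=sys.stderr)
```

By default ocrmypdf refuses to process a PDF that already contains text, so files like that will show up as non-zero exits here; adding `--skip-text` to the list is one way to handle them.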

I also installed a JBIG2 compressor that really shrinks images. My processed docs are generally 40% to 80% smaller, and it seems to get better tesseract output than gscan2pdf does.

donio@beehaw.org · edit-2 1 year ago

OCRmyPDF is what I use as well. I've had good luck with it on boardgame rulebooks, which sometimes come with missing or partial embedded text. Combined with recoll and the Emacs pdf-tools mode, I have it all indexed and at my fingertips.

hedge@beehaw.org · 11 months ago

@MasterBuilder@lemmy.one & @donio@beehaw.org, revisiting the subject, if I may: I’ve now put ocrmypdf through its paces and am pretty impressed with the results. One thing, though: I would like to be able to edit the OCR text that it generates, usually to join hyphenated words, remove line breaks, preserve ¶ breaks, and correct the rare spelling error. Is there a way to do this? (I believe there is a way to do this in gImageReader, but I don’t think it will let you save the OCR’d text to the PDF!) Looking at ocrmypdf on GitHub, it looks like there might be a way to do this, but darned if I can figure out how. I wasn’t able to find anything about this in the documentation either. I’d be much obliged for any suggestions you might have.
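A partial workaround, offered as a rough sketch rather than an answer to the question: ocrmypdf can write the recognized text to a separate plain-text file with `--sidecar`, and that file can be cleaned up with a few regular expressions. This does not change the text layer embedded in the PDF, and the function name `clean_ocr_text` is made up for illustration. It assumes paragraphs in the sidecar text are separated by blank lines.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Join hyphenated words and unwrap lines, keeping paragraph breaks."""
    text = text.replace("\r\n", "\n")
    # Assumption: a blank line marks a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # "docu-\nment" -> "document". Caveat: this also merges genuine
        # compounds that happen to break at the hyphen ("well-\nknown").
        para = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", para)
        # Any remaining line break inside the paragraph becomes a space.
        para = re.sub(r"\s*\n\s*", " ", para).strip()
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)


sample = "The docu-\nment was scanned\nyesterday.\n\nSecond para-\ngraph here.\n"
print(clean_ocr_text(sample))
```

Spelling fixes would still have to be made by hand in the cleaned text.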