OCR Tips: Difference between revisions
Sgottschalk (talk | contribs) |
Revision as of 13:57, 2 October 2012
FineReader tips
SG: Consider image size/color:
Full-size color images are about 10 MB each and take about 2 minutes each to process
Full-size grayscale images are about 1 MB each and take about 1 minute each to process
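The figures above can be turned into a rough batch estimate. The following is a minimal sketch that assumes storage and processing time scale linearly with the number of images; `PROFILES` and `estimate_batch` are hypothetical names used only for illustration.

```python
# Rough per-image figures quoted above; linear scaling is an assumption.
PROFILES = {
    "color": {"mb": 10, "minutes": 2},
    "grayscale": {"mb": 1, "minutes": 1},
}


def estimate_batch(image_count, mode):
    """Return (total_megabytes, total_hours) for a batch of full-size images."""
    profile = PROFILES[mode]
    total_mb = image_count * profile["mb"]
    total_hours = image_count * profile["minutes"] / 60
    return total_mb, total_hours


if __name__ == "__main__":
    for mode in PROFILES:
        mb, hours = estimate_batch(1000, mode)
        print(f"{mode}: about {mb} MB, about {hours:.1f} hours")
```

Actual throughput will depend on the hardware and FineReader settings, so treat the result as a planning figure only.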
Recognition Server
ABBYY Effective Practices & Hints for users of Recognition Server - what works best in our experience.
EH: start with fewer languages selected, since each additional language adds to the processing time (specimens could be sorted geographically prior to OCR). We are currently processing specimens from SW Asia and the Middle East, a large number of them from Turkey, so we run ABBYY with Turkish and English selected.
EH: we select high quality rather than speed
PL: OCR quality can be enhanced when a large image is cropped - which also reduces page count.
PL: Images can be ingested from a shared folder, a scan station, FTP/FTPS, or the API. Hot folder ingestion can be further controlled by including an optional XML ticket. XML tickets control workflow and output, and allow metadata to be ingested along with the image to be processed.
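The idea of sorting specimens geographically before OCR can be sketched as a simple grouping step, so that each batch runs with as few languages as possible. This is only an illustration: the country-to-language mapping, the default language, and the function name are assumptions, and the step that hands each batch to Recognition Server is deliberately left out.

```python
from collections import defaultdict

# Hypothetical mapping from collection country to OCR languages.
# Only the Turkish + English pairing comes from the notes above.
COUNTRY_LANGUAGES = {
    "Turkey": ("Turkish", "English"),
}
DEFAULT_LANGUAGES = ("English",)  # assumed fallback


def batch_by_languages(records):
    """Group (image_name, country) pairs into batches keyed by language set."""
    batches = defaultdict(list)
    for image_name, country in records:
        languages = COUNTRY_LANGUAGES.get(country, DEFAULT_LANGUAGES)
        batches[languages].append(image_name)
    return dict(batches)


if __name__ == "__main__":
    sample = [("a.tif", "Turkey"), ("b.tif", "Syria"), ("c.tif", "Turkey")]
    for languages, images in batch_by_languages(sample).items():
        print(languages, images)
```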
- what to look out for (with examples).
EH: we find that running the whole image increases the page count; each image ends up as 4-6 pages. This is not necessarily a problem, but it is good to be aware of if page count is an issue.
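Cropping to the label before OCR, as suggested above, needs a crop rectangle. The sketch below computes one under the assumption that the label sits in the lower-right corner of the sheet image; the default percentages and the function name are hypothetical and should be adjusted to the actual label placement.

```python
def label_crop_box(width, height, x_percent=50, y_percent=70):
    """Return (left, top, right, bottom) for the lower-right region of an image.

    x_percent and y_percent give where the crop starts, as a percentage of
    the image width and height. The defaults are illustrative assumptions.
    """
    if not (0 <= x_percent < 100 and 0 <= y_percent < 100):
        raise ValueError("percentages must be in the range 0-99")
    left = width * x_percent // 100
    top = height * y_percent // 100
    return left, top, width, height


if __name__ == "__main__":
    print(label_crop_box(4000, 6000))
```

The tuple order matches the box convention used by Pillow's `Image.crop`, so the result could be passed to it directly if that library is used for the actual cropping.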
Tesseract Tips
Tesseract Effective Practices & Hints.
What works best:
Resolution: an x-height (pixel height of a lowercase letter) of 20-40 pixels is ideal
Switching to grayscale, increasing contrast, and other image treatments can improve output at times
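To make the grayscale and contrast treatments concrete, here is a minimal pure-Python sketch operating on a flat list of pixel values. It uses the common ITU-R BT.601 luma weights and a simple min-max stretch; these are illustrative choices, not necessarily the treatments used by the contributors, and an imaging library would normally do this work on real files.

```python
def to_grayscale(rgb_pixels):
    """Convert (r, g, b) tuples to 0-255 luma values (ITU-R BT.601 weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def stretch_contrast(gray_pixels):
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 255."""
    lo, hi = min(gray_pixels), max(gray_pixels)
    if hi == lo:
        return list(gray_pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in gray_pixels]


if __name__ == "__main__":
    faded = [(120, 110, 100), (150, 140, 130), (200, 195, 190)]
    print(stretch_contrast(to_grayscale(faded)))
```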
What to look out for:
Resolution: an x-height below 8-12 pixels will produce very poor OCR return
Using a black background for package labels (e.g. lichens, bryophytes) creates a black border that can significantly reduce OCR return
Form labels can interfere with OCR output
Faded labels or images with poor lighting can be problematic
Old fonts can be problematic. However, it is possible to train Tesseract on new fonts
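The x-height guidance can be expressed as a quick check once the x-height has been measured in pixels. This sketch assumes 12 pixels as the cutoff for "very poor" (the upper end of the 8-12 range) and a 30-pixel resize target (the middle of the 20-40 range); both values and the function names are assumptions for illustration.

```python
IDEAL_MIN, IDEAL_MAX = 20, 40  # ideal x-height range in pixels, from the tips
POOR_BELOW = 12                # assumed cutoff: upper end of the 8-12 px range


def assess_x_height(x_height_px):
    """Classify a measured x-height against the guidance above."""
    if x_height_px < POOR_BELOW:
        return "poor"
    if IDEAL_MIN <= x_height_px <= IDEAL_MAX:
        return "ideal"
    return "outside_ideal"


def scale_to_ideal(x_height_px, target_px=30):
    """Return the resize factor that would bring the x-height to target_px."""
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    if IDEAL_MIN <= x_height_px <= IDEAL_MAX:
        return 1.0  # already in the ideal range, leave the image alone
    return target_px / x_height_px


if __name__ == "__main__":
    for px in (10, 15, 30, 60):
        print(px, assess_x_height(px), scale_to_ideal(px))
```

Upscaling cannot restore detail that was never captured, so for very small text re-imaging at a higher resolution is likely the better option where it is possible.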
Misc notes:
Will recognize vertical text
Image input can be TIFF, JPEG, or GIF
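For batch work, Tesseract can be driven from a script through its command-line form `tesseract imagename outputbase -l lang`. The sketch below assumes the `tesseract` executable is on the PATH and that the requested language data is installed; the helper names and the suffix check are hypothetical conveniences, not part of Tesseract itself.

```python
import subprocess
from pathlib import Path

# Input formats listed in the notes above (assumed file suffixes).
SUPPORTED_SUFFIXES = {".tif", ".tiff", ".jpg", ".jpeg", ".gif"}


def build_tesseract_command(image_path, languages=("eng",)):
    """Build the argument list for one Tesseract run."""
    path = Path(image_path)
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported image type: {path.suffix}")
    output_base = str(path.with_suffix(""))  # Tesseract appends .txt itself
    return ["tesseract", str(path), output_base, "-l", "+".join(languages)]


def run_ocr(image_path, languages=("eng",)):
    """Run Tesseract on one image; raises CalledProcessError on failure."""
    subprocess.run(build_tesseract_command(image_path, languages), check=True)


if __name__ == "__main__":
    print(build_tesseract_command("label.tif", ("eng", "tur")))
```

Whether a given format is actually readable depends on how the local Tesseract and its image library were built, so it is worth testing one file of each type first.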