OCR Resources: Difference between revisions
Revision as of 12:07, 4 December 2013
OCR Software used by ADBC projects
- ABBYY FineReader - high-performing proprietary OCR software provided by the ABBYY software company. The Professional and Corporate Editions are designed specifically for Microsoft Windows operating systems.
- ABBYY Recognition Server - extends the features of FineReader and places them in a server-based scalable platform.
- GOCR (or JOCR) is a free optical character recognition program, initially written by Jörg Schulenburg. It can be used to convert or scan image files (portable pixmap or PCX) into text files.
- OCRopus - free document analysis and optical character recognition (OCR) system released under the Apache License, Version 2.0 with a very modular design through the use of plugins.
- OmniPage - high-performing proprietary OCR software from Nuance Communications. The Professional and Standard Editions are designed specifically for Microsoft Windows operating systems.
- Tesseract - Open source optical character recognition engine available under the Apache License, Version 2.0. The software runs on various operating systems and is considered one of the more accurate OCR engines available under a free software license.
- An Overview of the Tesseract OCR Engine by Ray Smith at Google Inc.
- Tesseract tips
- Xerox OCR engine
- List of other OCR software: http://en.wikipedia.org/wiki/List_of_optical_character_recognition_software
Biodiversity Informatics Tools Incorporating OCR Technology
- Apiary Project - High-throughput workflow for computer-assisted human parsing of biological specimen label data
- HerbIS (Erudite Recorded Botanical Information Synthesizer) - Software algorithms that process and present herbarium label data in a machine-understandable format through the use of natural language processing (NLP). Created at the Yale Peabody Museum of Natural History.
- Symbiota - Specimen-based virtual flora/fauna software with a built-in module for specimen digitization that incorporates OCR technology
- SALIX - Semi-automatic Label Information eXtraction system is designed to capture herbarium specimen label data with the use of optical character recognition technologies and transfer those data into a database.
Coding Outcomes from the aOCR Hackathon (Feb 2013)
- HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that sorts images into three sets (no handwriting; little handwriting, i.e. mostly typed or printed text; and lots of handwriting) based on the noise generated by the OCR software.
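The idea behind HandwritingDetection can be illustrated with a minimal sketch. This is not the hackathon code: the token pattern, the two thresholds, and the function names below are all assumptions chosen for illustration. The sketch takes text already produced by an OCR engine, measures the share of tokens that do not look like ordinary words or numbers (the "noise"), and assigns the image to one of the three sets.

```python
import re

# Hypothetical thresholds; the values used by the hackathon project
# are not stated on this page.
LITTLE_THRESHOLD = 0.15
LOTS_THRESHOLD = 0.45

# Assumption: a token is "clean" if it is a word of two or more letters
# or a number, optionally followed by one punctuation mark.
CLEAN_TOKEN = re.compile(r"^[A-Za-z]{2,}[.,;:]?$|^\d+[.,;:]?$")


def noise_ratio(ocr_text):
    """Return the fraction of whitespace-separated tokens that look like OCR noise."""
    tokens = ocr_text.split()
    if not tokens:
        # No tokens means no measurable noise in this sketch.
        return 0.0
    noisy = sum(1 for token in tokens if not CLEAN_TOKEN.match(token))
    return noisy / len(tokens)


def classify(ocr_text):
    """Assign OCR output to one of the three handwriting sets."""
    ratio = noise_ratio(ocr_text)
    if ratio < LITTLE_THRESHOLD:
        return "no handwriting"
    if ratio < LOTS_THRESHOLD:
        return "little handwriting"
    return "lots of handwriting"


if __name__ == "__main__":
    print(classify("Plants of Colorado collected by Smith 1987"))
    print(classify("~|_ ;;' x%q ]]/ ^^ l|I"))
```

In practice the thresholds would need tuning against a set of labels that have been sorted by hand, and an image that yields no OCR text at all would probably deserve separate handling rather than falling into the "no handwriting" set as it does here.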
Sample Images
- Bryophyte Images from LBCC project (10,500 image URLs)
- Lichen Images from LBCC project (10,500 image URLs)
- NYBG plant herbarium sheets
- BRIT plant herbarium images
- Herbarium Sheet sample
- OCR output from above Herbarium Sheet
- Herbarium Sheet sample
- Insect images
Museum Specimen Label Examples
- See sample herbarium label and content defined from the San Diego County Natural History Museum Plant Atlas Project FAQ.
- Sample herbarium label from University of Colorado (COLO)
- Sample bryophyte packet label from New York Botanical Garden (NYBG)
- Entomology labels from an Essig Museum specimen of Cerceris compacta