OCR Resources
		
		
		
		
		
Latest revision as of 13:33, 25 August 2014
OCR Software used by ADBC projects
- ABBYY FineReader - high-performing proprietary OCR software provided by the ABBYY software company. The Professional and Corporate Editions are designed specifically for Microsoft Windows operating systems.
- ABBYY Recognition Server - extends the features of FineReader and places them in a server-based scalable platform.
- GOCR (or JOCR) is a free optical character recognition program, initially written by Jörg Schulenburg. It can be used to convert or scan image files (portable pixmap or PCX) into text files.
- OCRopus - free document analysis and optical character recognition (OCR) system released under the Apache License, Version 2.0 with a very modular design through the use of plugins.
- Omnipage - high-performing proprietary OCR software provided by Nuance. The Professional and Standard Editions are designed specifically for Microsoft Windows operating systems.
- Tesseract - Open source optical character recognition engine available under the Apache License, Version 2.0. The software runs on various operating systems and is considered one of the more accurate OCR engines available under a free software license.
- An Overview of the Tesseract OCR Engine by Ray Smith at Google Inc.
- Tesseract tips
 
- Xerox OCR engine -
- List of other OCR software: http://en.wikipedia.org/wiki/List_of_optical_character_recognition_software
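For readers who want to try one of the free engines listed above, the following is a minimal sketch of calling the Tesseract command-line program from Python. It assumes the `tesseract` executable is installed and on the PATH; the function names `build_tesseract_command` and `ocr_image` are hypothetical helpers made up for this illustration and are not part of any ADBC project's workflow.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for a Tesseract run that prints text to stdout."""
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing it to a .txt file.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # Usage (assumed file name): python ocr_sketch.py label.jpg
    for name in sys.argv[1:]:
        print(ocr_image(Path(name)))
```

The command is built in a separate function so that it can be inspected or logged without running the engine.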
Webinars / Demos
- Calendar Announcement: Visualize Your Text Data Using OCR Output Wednesday 10 AM EST 22 January 2014
- Webinar Demo Report and Recording: Visualize Your Text Data Using OCR Output
 
- Calendar Announcement: Strategies for an OCR-directed workflow Monday 11 AM EST 25 August 2014
- Webinar Recording: Strategies for an OCR-directed workflow
- Group Notes for Webinar
 
Biodiversity Informatics Tools Incorporating OCR Technology
- Apiary Project - High-throughput workflow for computer-assisted human parsing of biological specimen label data
- HerbIS - (Erudite Recorded Botanical Information Synthesizer) - Software algorithms that process and present herbarium label data in a machine-understandable format through the use of natural language processing (NLP). Created at the Yale Peabody Museum of Natural History.
- Symbiota - Specimen-based virtual flora/fauna software with a built-in module for specimen digitization that incorporates OCR technology
- SALIX - Semi-automatic Label Information eXtraction system is designed to capture herbarium specimen label data with the use of optical character recognition technologies and transfer those data into a database.
- ScioTR - A new touch-enabled Windows 8 app that integrates Optical Character Recognition (OCR), Natural Language Parsing (NLP), and Machine Learning (ML) to provide an efficient workflow for capturing highly structured data from images. ScioTR lets the user parse or excavate the image for regions of interest using a touch-screen interface, which makes the OCR, NLP, and ML strategies more effective and reduces the human interaction needed later in the workflow. ScioTR works best in concert with a commercially available OCR engine, includes some NLP and ML modules of its own, and allows a custom field set to be configured. ScioTR was presented at the SPNHC DemoCamp in 2013; the PowerPoint given at the conference and a video of the demo are available at ScioTR.com. More technical information is on our software development blog, ScioChronicle, and on the ScioQualis YouTube Channel. We hope to have ScioTR in the Windows 8 store around the end of Jan 2014. For an outline of the current features and user experience, see the ScioTR Help Documentation, which is still being compiled.
Coding Outcomes from the aOCR Hackathon (Feb 2013)
- HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that separates images into three sets (no handwriting, little handwriting with mostly typed or printed text, and lots of handwriting) based on the noise generated by the OCR software. Read more at Ben's blog
- Image OCR and result retrieval service (https://github.com/idigbio-aocr/RESTAPI): REST services to (a) accept a request for an OCR job, returning an identifier of the job, (b) process the image with available OCR engines (simulated), and (c) return OCR results. Read more at: REST API Documentation by Paul Schroeder
- LABELX (https://github.com/BryanHeidorn/LABELX): parses OCR output and classifies the data as user-defined Darwin Core fields.
- Results using LABELX (https://github.com/idigbio-aocr/label-data) which parses OCR results.
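The idea behind HandwritingDetection, sorting images by how much noise the OCR engine produces, can be illustrated with a rough sketch. This is not the hackathon code: the token pattern, the thresholds `low` and `high`, and the function names are all assumptions chosen for illustration, and a real collection would need them tuned against labeled samples.

```python
import re

# Assumed heuristic: a token is "clean" if it is a word of two or more
# letters or a run of digits, optionally followed by one punctuation mark.
# Everything else (stray symbols, single letters, mixed junk) counts as noise.
CLEAN_TOKEN = re.compile(r"^[A-Za-z]{2,}[.,;:]?$|^\d+[.,;:]?$")


def noise_ratio(ocr_text):
    """Return the fraction of whitespace-separated tokens that look like noise."""
    tokens = ocr_text.split()
    if not tokens:
        return 0.0
    noisy = sum(1 for token in tokens if not CLEAN_TOKEN.match(token))
    return noisy / len(tokens)


def classify_handwriting(ocr_text, low=0.15, high=0.5):
    """Bucket OCR output as 'none', 'little', or 'lots' of handwriting.

    The thresholds are illustrative guesses, not values from the hackathon.
    """
    ratio = noise_ratio(ocr_text)
    if ratio < low:
        return "none"
    if ratio < high:
        return "little"
    return "lots"


if __name__ == "__main__":
    samples = [
        "Collected near Austin Texas 1934",
        "Flora of Texas ~|l ;: Austin",
        "~|l f^ .,; }{ x",
    ]
    for text in samples:
        print(classify_handwriting(text), "<-", text)
```

Note that this simple rule treats legitimate single-letter words and hyphenated names as noise, and that empty OCR output falls into the "none" set, so it is only a starting point.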
Sample Images
- Bryophyte Images from LBCC project (10,500 image URLs)
- Lichen Images from LBCC project (10,500 image URLs)
- NYBG plant herbarium sheets
- BRIT plant herbarium images
- Herbarium Sheet sample
- OCR output from above Herbarium Sheet
- Insect images
Museum Specimen Label Examples
- See the sample herbarium label with its content defined, from the San Diego County Natural History Museum Plant Atlas Project FAQ.
- Sample herbarium label from University of Colorado (COLO)
- Sample bryophyte packet label from New York Botanical Garden (NYBG)
- Entomology labels from an Essig Museum specimen of Cerceris compacta