iConference 2013 iDigBio AOCR WG Wiki
Latest revision as of 13:59, 29 May 2013
Welcome to the iConference 2013 iDigBio AOCR Wiki
- Short URL to this iConference 2013 wiki: http://tinyurl.com/aocriConference2013
- Note: This wiki page is undergoing frequent updates. Some participants have wiki edit permissions and will add to, update, and edit these pages before, during, and after iConference 2013.
- AOCR Working Group Wiki
- 2013 AOCR Hackathon Wiki
- AOCR October 2012 Working Group Meeting Presentations
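Day-By-Day Schedule
- Download the schedule to have active hyperlinks: https://www.dropbox.com/s/p8zjm0ajaj838tw/HackathonIconfAgendaFinal.docx
- Day-By-Day Online in GoogleDoc Format: https://docs.google.com/document/d/1dNbEOdL1stn-ztKDWTzfL1T4VPzg20ZZ76zV5ueFrnY/edit#
  - NOTE: this version is editable by participants.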
 
Links to Logistics, Communication, and Participant Information
- Participant List
- Participant Related Projects
- Travel, Food, Lodging, Connectivity Logistics
- 2013 Hackathon Listserv, a mailing list for Hackathon Participants at aocr-hackathon-l@lists.ufl.edu
iConference 2013 AOCR WG Participation
Panel & Workshop
- Paul, D., Heidorn, P. B., Best, J., Gilbert, E., Neill, A., Nelson, G., & Ulate, W. (2013). Help iDigBio reveal hidden data - iDigBio Augmenting OCR working group needs you. iConference 2013 Proceedings (pp. 1019-1021). doi 10.9776/13471
 
- Integrated Digitized Biocollections (iDigBio) is an initiative funded under the National Science Foundation's (NSF) Advancing Digitization of Biodiversity Collections (ADBC) program, set up to help natural history museums get specimen data for hundreds of millions of specimens out of drawers, off labels, out of field notebooks, out of old publications, and into integrated databases for everyone's use. The iDigBio Augmenting OCR Working Group needs your wisdom, knowledge, and collaboration as part of our multi-faceted approach to improving the OCR strategies and natural language processing (NLP) algorithms used in digitization. Our workshop panelists, five members of our working group, are eager to introduce the iSchools community to our challenges and get your input in our break-out sessions. Our research areas of interest include image segmentation, autocorrection of typographical errors, semantic autocorrection, autonormalization, automated text segmentation, generating consensus records, and user interfaces for these tasks. We seek your insights, collective experiences, and partnership in finding ways to improve the digitization process and create a national, searchable, online specimen-based data set that is fit for use by scientists and the public. Some ideas generated in this session may be implemented at the iDigBio hackathon being held at the Botanical Research Institute of Texas (BRIT) during the iConference. http://hdl.handle.net/2142/42502
 
 
iConference2013 Workshop Audio
- Unedited panel recording: http://idigbio.adobeconnect.com/p6463wkhtx1/
Take Notes!
Group notes for iConference2013 Workshop: http://tinyurl.com/iConf2013aocrws
Five iConference2013 Talks
- Introducing iDigBio and the Augmenting OCR Working Group: Deborah Paul
- Digitization of biocollections -- a grand challenge in scope, scale, and significance: Amanda Neill
- The Apiary Project -- a workflow for text extraction and parsing for herbarium specimens: Jason Best
- Symbiota -- Creating an OCR and NLP enabled user interface and workflow to efficiently digitize 2.3 million lichen and bryophyte specimens: Edward Gilbert
- HERBIS/LABELX -- Machine Learning Approach to Parsing OCR Text: Bryan Heidorn
- Linking Data -- Biodiversity Heritage Library -- supporting knowledge discovery from digitized content: John Mignault
iConference2013 Papers, Poster and Presentations in the iDigBio Bibliography: http://tinyurl.com/iconferencebiblio
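Workshop Deliverables from Panel + Break Out Groups + Report Back Sessions
- Summary Wiki Page
  - talks: https://www.idigbio.org/wiki/index.php/Five_iConference2013_Talks
  - collected conversations iConference2013 and hackathon (Google Doc): http://tinyurl.com/AOCRhackNotes
  - collaborations / contacts with our group or members of our group from iConference2013
    - Andrea Thomer
    - Suzanne Westbrook
    - Kevin Crowston
  - iconference2013 photos on facebook: http://www.facebook.com/media/set/?set=a.513727365337881.110781.215120891865198&type=3
  - hackathon photographs on facebook: http://www.facebook.com/media/set/?set=a.513738458670105.110787.215120891865198&type=3
  - summary report / white paper on the experience, methods, outcomes (in progress)
  - hackathon results paper for publication (in progress)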
 
Poster
- Anglin, R., Best, J., Figueiredo, R., Gilbert, E., Gnanasambandam, N., Gottschalk, S., Haston, E., Heidorn, P. B., Lafferty, D., Lang, P., Nelson, G., Paul, D. L., Ulate, W., Watson, K., Zhang, Q. (2013). Improving the Character of Optical Character Recognition (OCR) - iDigBio Augmenting OCR Working Group Seeks Collaborators and Strategies to Improve OCR Output and Parsing of OCR Output for Faster, More Efficient, Cheaper Natural History Collections Specimen Label Digitization. iConference 2013 Proceedings (pp. 957-964). doi 10.9776/13493
 
- There are an estimated 2–3 billion museum specimens worldwide (OECD 1999, Ariño 2010). In an effort to increase the research value of their collections, institutions across the U.S. have been seeking new ways to cost-effectively transcribe the label information associated with these specimen collections. Current digitization methods are still relatively slow, labor-intensive, and therefore expensive. New methods, such as optical character recognition (OCR), natural language processing, and human-in-the-loop assisted parsing, are being explored to reduce these costs. The National Science Foundation (NSF), through the Advancing Digitization of Biodiversity Collections (ADBC) program, funded Integrated Digitized Biocollections (iDigBio) in 2011 to create a Home Uniting Biodiversity Collections (HUB) cyberinfrastructure that aggregates and collectively integrates specimen data, to find ways to digitize specimen data faithfully and faster, and to disseminate the knowledge of how to achieve this. The iDigBio Augmenting OCR Working Group is part of this national effort. http://hdl.handle.net/2142/42089
 
 
Notes (short paper)
- Paul, D., & Heidorn, P. B. (2013). Augmenting optical character recognition (OCR) for improved digitization - Strategies to access scientific data in natural history collections. iConference 2013 Proceedings (pp. 514-518). doi 10.9776/13266
- The Augmenting OCR Working Group (A-OCR WG) at Integrated Digitized Biocollections (iDigBio) seeks to improve community OCR strategies and algorithms for faster, better parsing of OCR output derived from valuable data on natural history collection specimen labels. This task is exceedingly difficult because museum labels are often annotated and vary in content, form, and font. Under the National Science Foundation's (NSF) Advancing Digitization of Biodiversity Collections (ADBC) program, iDigBio is building a cyberinfrastructure to aggregate quality data from museum specimens housed in collections across the United States for use by researchers, educators, environmentalists, and the public. Since March 2012, the A-OCR WG, formed by community consensus, has begun its role in this endeavor by defining reachable goals, including setting up a hackathon concurrent with iConference 2013. This paper reports on the definition of some key problems identified by the A-OCR WG, because these science problems will drive research and cyberinfrastructure development. http://hdl.handle.net/2142/39427
 
 
 
Alternative Event
- Paul, D., Heidorn, P. B., Best, J., Gilbert, E., Neill, A., & Ulate, W. (2013). Help iDigBio reveal hidden data - iDigBio Augmenting OCR working group needs you - Part II. iConference 2013 Proceedings (pp. 1066-1068). doi 10.9776/13517
- Twitter hashtag: #CNFAE15. Session Abstract: Integrated Digitized Biocollections (iDigBio) is a nationwide effort funded by the National Science Foundation (NSF) to digitize data from hundreds of millions of natural history museum specimens. In a concerted five-part outreach effort, the iDigBio Augmenting Optical Character Recognition Working Group (A-OCR WG) coordinated a 2013 iConference Workshop, Poster, Notes submission, Alternative Event, and a concurrent Hackathon hosted by the Botanical Research Institute of Texas (BRIT). The Workshop, titled "Help iDigBio Reveal Hidden Data: iDigBio Augmenting OCR Working Group Needs You," introduces the iSchools community to iDigBio and to the A-OCR WG's mission and challenges in improving digitization efficiency. This related Alternative Event gives the A-OCR WG an opportunity to report back to iConference Workshop attendees about our first experience using a Hackathon model to work on parsing and user interface design issues specific to our needs. We anticipate a lively, open discussion with event attendees and future collaborators. http://hdl.handle.net/2142/42515
 
 
 
Overview of the related iDigBio - BRIT Hackathon Challenge
- 2013 AOCR Hackathon Wiki
- 2013 iDigBio AOCR Hackathon Challenge
- overall description of The Challenge
- The Specific Task: parse OCR output to find values for these 2013 hackathon data elements
- Metrics and Evaluation to be used
- Three Data Sets
- There are three data sets, that is, three different sets of images of museum specimen labels. Participants, working alone or in groups, may work on one or more data sets as they choose. The sets are ranked easy, medium, and hard, as an estimate of how difficult it might be to get good parsed data from the OCR output of each data set.
 
- Accessing the Data