Transcription Hackathon
Latest revision as of 15:35, 3 February 2015
Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels 
December 16–20 at the University of Florida, Gainesville
| Digitizing the Past and Present for the Future |
|---|
| Quick Links for Transcription Hackathon Workshop |
| Transcription Hackathon Workshop Agenda |
| Transcription Hackathon Workshop Biblio Entries |
| Transcription Hackathon Workshop Report |
Agenda and Logistics
- Hackathon Advertisement
- Agenda
- Logistics Document
- Participants List
- AdobeConnect room for remote connection to the workshop and for collaboration after the hackathon (Send an email to Austin Mast if you'd like to use the room for additional collaboration after the hackathon.)
Media
- Citscribe Hackathon Facebook Album
- Twitter: @iDigBio, @NfromN; hashtag #CITScribe
Report
- Citscribe Hackathon Report (https://www.idigbio.org/content/citscribe-hackathon)
Coordination
- Interoperability Track
- OCR Integration Track
- QA/QC and Reconciliation of Replicates Track
- User Engagement Track
- Participants Interest in Tracks
Presentations
- Yonggang Liu, iDigBio: Image Ingestion at iDigBio
- Austin Mast, iDigBio: Public Participation
- Yun Ling Yim, UC Berkeley: Calbug Digitization, CalBug California Arthropod Collections
- Miao Chen, Indiana U.: Using OCR
- Cody Meche, UF: Agile Scrum
- Julie Allen, INHS: Gamification
- Edward Gilbert, Symbiota Developer: Symbiota: a specimen-based biodiversity portal platform
- Deborah Paul, iDigBio Augmenting OCR WG: What's new in using OCR output in a Citizen Science Workflow
- Andrea Matsunaga, iDigBio: Herbarium Labels Transcription Crowdsourcing & OCR
- Joshua Campbell, iDigBio: Herbarium Labels Transcription Crowdsourcing Consensus
- Yonggang Liu, ACIS iDigBio: iDigBio Image Ingestion Appliance
- Paul Kimberly, Smithsonian: Smithsonian Transcription Center
- William Ulate, Missouri Botanical Garden: Purposeful Gaming and BHL
Development Resources
- GitHub organization for this Transcription Hackathon
- Four existing crowdsourcing datasets from Notes From Nature, containing transcriptions of different types of collection labels. Read more at https://docs.google.com/document/d/1UCz5WblnNIvqBErX-XeWgS9mf69qFhycHqntQOGnPp4/edit?usp=sharing. Once anonymized, the datasets were shared only with the hackathon participants through Dropbox. They will be made public when we get definitive approval from NfN.
  - Calbug dataset
  - Herbarium labels: the filenames with "USAM_" represent a nearly complete set of recent transcriptions from one collection (the University of South Alabama Herbarium), with four replicates for most specimens (I think).
  - Macrofungi labels
  - Ornithological dataset
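
Because most specimens in these datasets have several replicate transcriptions, a first experiment is often a simple per-field vote. The sketch below is only an illustration of that idea, not a method adopted at the hackathon: the function names and the `min_agreement` threshold are assumptions, and the real datasets may need field-specific normalization.

```python
from collections import Counter


def normalize(value):
    """Collapse whitespace and case so trivially different entries match."""
    return " ".join(value.split()).casefold()


def consensus(replicates, min_agreement=0.5):
    """Return (value, agreement) for one field's replicate transcriptions.

    value is the most common original spelling among the entries that share
    the winning normalized form, or None when the share of votes for the
    winner does not exceed min_agreement (a hypothetical threshold).
    """
    # Ignore replicates where the transcriber left the field empty.
    filled = [r for r in replicates if r and r.strip()]
    if not filled:
        return None, 0.0
    counts = Counter(normalize(r) for r in filled)
    winner, votes = counts.most_common(1)[0]
    agreement = votes / len(filled)
    if agreement <= min_agreement:
        return None, agreement
    spellings = Counter(r.strip() for r in filled if normalize(r) == winner)
    return spellings.most_common(1)[0][0], agreement


# Four made-up replicates of a "county" field for one specimen.
value, agreement = consensus(["Mobile Co.", "mobile  co.", "Mobile Co.", "Baldwin Co."])
print(value, agreement)
```

Fields with no clear winner come back as `None`, which makes them easy to route to an expert or to another round of transcription.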
 
- For those interested in experimenting with the images that have been used for public participation in transcription:
  - Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at USAM Herbarium Images. This is several GB of image files. To get them, you could use the DownloadThemAll Firefox plugin.
 
- Notes From Nature web interface:
  - Code available at https://github.com/zooniverse/notesFromNature
  - Forked version for the hackathon available at https://github.com/idigbio-citsci-hackathon/notesFromNature
  - Install Vagrant from http://downloads.vagrantup.com/tags/v1.3.5 and VirtualBox from https://www.virtualbox.org/wiki/Downloads
  - Vagrant script to build a VM with the Notes From Nature web interface: https://github.com/idigbio-citsci-hackathon/nfn-vagrant
  - Go to the location of the Vagrant script and type "vagrant up" at your command prompt to build a VM with Notes From Nature running on localhost:9294.
  - API Calls
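
Once "vagrant up" has finished, a quick way to confirm that something is answering on localhost:9294 is a plain TCP connection test. This is a minimal sketch using only the Python standard library; the function name `is_listening` is made up for illustration, and a successful connection only shows that the port is open, not that the Notes From Nature interface rendered correctly.

```python
import socket


def is_listening(host="localhost", port=9294, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Calling `is_listening()` with no arguments checks the default port mentioned above; pass another port if the VM forwards it differently.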
 
- CYWG iDigBio Image Ingestion Appliance:
  - The appliance can be used to ingest the images for the crowdsourcing service into iDigBio storage and make them publicly accessible through HTTP. The appliance can export the relationship between image filenames and URLs in CSV format.
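
A crowdsourcing tool would typically load that CSV export into a lookup table keyed by filename. The sketch below assumes a header row with columns named `filename` and `url`; those column names and the sample rows are hypothetical, since the appliance's actual export layout is not described here, so adjust them to match a real file.

```python
import csv
import io


def load_url_map(csv_file, name_column="filename", url_column="url"):
    """Build {image filename: public URL} from a CSV export.

    The default column names are assumptions, not the appliance's
    documented format.
    """
    reader = csv.DictReader(csv_file)
    return {row[name_column]: row[url_column] for row in reader}


# Made-up sample standing in for a real export.
sample = io.StringIO(
    "filename,url\n"
    "USAM_0001.jpg,http://example.org/images/USAM_0001.jpg\n"
    "USAM_0002.jpg,http://example.org/images/USAM_0002.jpg\n"
)
url_map = load_url_map(sample)
print(url_map["USAM_0002.jpg"])
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` sample.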
 
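- Gold Images from aOCR Hackathon:
  - CSV files with URLs for the images on the iDigBio beta server (uploaded by the Image Ingestion Appliance): ent (http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/ent.csv), herb (http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/herb.csv), lichens (http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/lichens.csv).
- Code from the aOCR Hackathon:
  - HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that separates images into sets with no handwriting, little handwriting (mostly typed or printed text), or lots of handwriting, based on the noise generated by the OCR software. Read more at Ben's blog (http://manuscripttranscription.blogspot.com/2013/02/detecting-handwriting-in-ocr-text.html). This could be used to rank which images are most in need of human transcription.
- Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
- (Some botanists: RDF and tab-delimited.)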
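
To make the noise-based idea concrete, here is a rough sketch of how OCR output could be bucketed by how garbled it looks. It is not the code in the idigbio-aocr repository: the token pattern, the function names, and the 0.1 and 0.4 thresholds are all assumptions chosen for illustration.

```python
import re

# A token counts as "clean" if it is a word of two or more letters or a
# number, optionally followed by one punctuation mark. Anything else is
# treated as OCR noise (an assumed heuristic).
CLEAN_TOKEN = re.compile(r"[A-Za-z]{2,}[.,;:]?|\d+[.,;:]?")


def noise_ratio(ocr_text):
    """Return the fraction of whitespace-separated tokens that look noisy."""
    tokens = ocr_text.split()
    if not tokens:
        return 0.0
    noisy = sum(1 for t in tokens if not CLEAN_TOKEN.fullmatch(t))
    return noisy / len(tokens)


def handwriting_bucket(ocr_text, little=0.1, lots=0.4):
    """Classify OCR text as 'none', 'little', or 'lots' of handwriting.

    Note that empty OCR output scores 0.0 and lands in 'none', so images
    with no recognized text at all need separate handling.
    """
    ratio = noise_ratio(ocr_text)
    if ratio < little:
        return "none"
    if ratio < lots:
        return "little"
    return "lots"


print(handwriting_bucket("Flora of Alabama, Mobile County 1987"))
print(handwriting_bucket("~j| ;'l fl)v ,.-^ Quercus"))
```

Sorting images by `noise_ratio` in descending order gives one possible ranking of which labels to send to human transcribers first.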
 
 
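- From Paul Flemons: I have uploaded a number of files:
  - A description of OpenRefine procedures used for matching BVP fields to EMu EVENTS
  - Detailed process of preparing BVP data for EMu
  - Overview of preparing BVP data for EMu
  - Diagram of the process of preparing data from BVP for EMu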
- From Steve Raden: some background on Zooniverse's design
Hackathon Products
- Brainstorming Documents from the Thursday Mix Ups
  - Group 1 Mix Up Discussion Summary (google doc)
  - Group 2
  - Group 3 MixUp google doc
  - Group 4 Mix Up Discussion Summary (google doc)

- Some groups used the Coordination pages above to summarize products
  - Group 1 Target File Format
