Transcription Hackathon: Difference between revisions

Revision as of 18:51, 1 January 2014

Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels

December 16–20 at the University of Florida, Gainesville

[[Transcription Hackathon Interoperability Planning| Interoperability Track]]
[[Transcription Hackathon OCR Integration Planning| OCR Integration Track]]
[[Transcription Hackathon Reconciliation of Replicates Planning| QA/QC and Reconciliation of Replicates Track]]
[[Transcription Hackathon User Engagement Planning| User Engagement Track]]
Participants' Interest in Tracks

GitHub organization for this Transcription Hackathon
Four existing crowdsourcing datasets from Notes From Nature. The datasets contain transcriptions of different types of collection labels. Read more here. Once anonymized, the datasets were shared only with the hackathon participants through Dropbox. They will be made public when we get definitive approval from NfN.
- Calbug dataset
- Herbarium labels: the filenames with "USAM_" represent a nearly complete set of recent transcriptions from one collection (the University of South Alabama Herbarium), with four replicates for most specimens (I think).
- Macrofungi labels
- Ornithological dataset
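
Because most specimens in these datasets were transcribed several times, a simple starting point for reconciling replicates is a majority vote per field. The following is a minimal sketch under stated assumptions, not the method used by Notes From Nature: the function names, the normalization rule, and the sample values are all hypothetical.

```python
from collections import Counter


def normalize(value):
    """Collapse whitespace and case so trivially different transcriptions match."""
    return " ".join(value.split()).casefold()


def consensus(replicates):
    """Return (winning_value, agreement_fraction) for one field of one specimen.

    `replicates` is a list of transcribed strings; blank entries are ignored.
    """
    cleaned = [r for r in replicates if r and r.strip()]
    if not cleaned:
        return None, 0.0
    counts = Counter(normalize(r) for r in cleaned)
    top_key, top_count = counts.most_common(1)[0]
    # Report the first original spelling that normalizes to the winning key.
    winner = next(r for r in cleaned if normalize(r) == top_key)
    return winner, top_count / len(cleaned)


# Hypothetical replicate transcriptions of a collector field.
collector = ["J. Smith", "j. smith", "J.  Smith", "T. Smith"]
value, agreement = consensus(collector)
print(value, agreement)
```

A low agreement fraction could be used to flag a field for expert review rather than accepting the majority value.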

For those interested in experimenting with the images that have been used for public participation in transcription:
- Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at USAM Herbarium Images. This is several GB worth of image files. To get them, you could use the DownloadThemAll Firefox plugin.

CYWG iDigBio Image Ingestion Appliance:
- The appliance can be used to ingest the images needed by the crowdsourcing service into iDigBio storage, where they are made publicly accessible over HTTP. The appliance can export the mapping between image filenames and URLs in CSV format.
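
A crowdsourcing tool could load the exported CSV to look up the public URL for each image file. The sketch below assumes the export has header columns named `filename` and `url`; the actual column names produced by the appliance are not given here, and the sample rows are invented for illustration.

```python
import csv
import io


def load_url_map(csv_file, name_col="filename", url_col="url"):
    """Build a {filename: url} dict from an open CSV file with a header row.

    The column names are assumptions; pass the real ones if they differ.
    """
    reader = csv.DictReader(csv_file)
    return {row[name_col]: row[url_col] for row in reader}


# Hypothetical export contents, standing in for a real file on disk.
sample = io.StringIO(
    "filename,url\n"
    "USAM_0001.jpg,http://example.org/images/abc123\n"
    "USAM_0002.jpg,http://example.org/images/def456\n"
)
url_map = load_url_map(sample)
print(url_map["USAM_0001.jpg"])
```

With a real export, replace the `io.StringIO` sample with `open(path, newline="")`.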

Gold Images from aOCR Hackathon:
- CSV files with URLs for the images on the iDigBio beta server (uploaded by the Image Ingestion Appliance): ent, herb, lichens.

Code from the aOCR Hackathon:
- HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that separates images into three sets: no handwriting, little handwriting (mostly typed or printed text), and lots of handwriting. The separation is based on the noise generated by the OCR software. Read more at Ben's blog. This could be used to rank which images are most in need of human transcription.
- Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
  - (Some botanists: RDF and tab-delimited.)
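
The two ideas in this list can be illustrated with a short sketch. It is not the HandwritingDetection code: `noise_score` is a stand-in heuristic that treats OCR tokens that look like neither words nor numbers as noise, and `snap_to_dictionary` uses the standard library's `difflib` to pull a transcription toward a known name. All function names, the cutoff, and the sample data are assumptions made for illustration.

```python
import difflib
import re

# A token counts as "clean" if it is a word of 2+ letters or a number,
# optionally followed by one punctuation mark. This rule is an assumption.
CLEAN_TOKEN = re.compile(r"^[A-Za-z]{2,}[.,;:]?$|^\d+[.,;:]?$")


def noise_score(ocr_text):
    """Fraction of whitespace-separated OCR tokens that do not look clean."""
    tokens = ocr_text.split()
    if not tokens:
        return 0.0
    noisy = sum(1 for t in tokens if not CLEAN_TOKEN.match(t))
    return noisy / len(tokens)


def rank_for_transcription(ocr_by_image):
    """Image ids ordered from noisiest OCR output to cleanest."""
    return sorted(ocr_by_image, key=lambda k: noise_score(ocr_by_image[k]), reverse=True)


def snap_to_dictionary(value, dictionary, cutoff=0.8):
    """Return the closest dictionary entry, or the value unchanged if none is close."""
    matches = difflib.get_close_matches(value, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else value


# Hypothetical OCR output for a typed label and a handwritten label.
ocr = {
    "typed.jpg": "Quercus alba collected 1952 Alabama",
    "handwritten.jpg": "Q~rc|s a1b@ ;; l9S2 ~~",
}
print(rank_for_transcription(ocr))

# Hypothetical dictionary of scientific names.
names = ["Quercus alba", "Quercus rubra"]
print(snap_to_dictionary("Quercus albaa", names))
```

Images at the front of the ranked list would go to volunteers first, and dictionary snapping could be applied to each replicate before a consensus is computed.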

Brainstorming Documents from the Thursday Mix Ups
- Group 1 Mix Up Discussion Summary (google doc)
- Group 2
- Group 3 Mix Up (google doc)
- Group 4 Mix Up Discussion Summary (google doc)

@@ Line 8: / Line 8: @@
 *[https://www.idigbio.org/content/hackathon-enable-public-participation-online-transcription-biodiversity-specimen-labels Hackathon Advertisement]
-*[https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Draft Agenda| Agenda]]
+*[[Transcription Hackathon Draft Agenda| Agenda]]
 *[https://www.idigbio.org/wiki/images/8/8a/IDigBio_Public_Participation_in_Digitization_Workshop_Logistics_4Dec13.pdf Logistics Document]
 *[https://www.idigbio.org/wiki/images/8/83/Transcription_Hackathon_Participant_List_23Dec13.pdf Participants List]