Transcription Hackathon
Edited by Austinmast
Revision as of 19:30, 4 December 2013
Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels
December 16–20 at the University of Florida, Gainesville
Agenda and Logistics
Coordination
- Interoperability Track
- OCR Integration Track
- Reconciliation of Replicates Track
- User Engagement Track
Development Resources
- Existing crowdsourcing datasets from Notes From Nature, with transcriptions of different types of collection labels:
  - Herbarium labels: link to be provided
  - Entomology labels: link to be provided
  - Field notebooks: link to be provided
- Existing solution datasets to assess the quality of crowdsourcing consensus:
  - Herbarium labels ideal response: link to be provided
  - Entomology labels ideal response: link to be provided
  - Field notebooks ideal response: link to be provided
- CYWG iDigBio Image Ingestion Appliance:
  - The appliance can be used to ingest the images used by the crowdsourcing service into iDigBio storage and make them publicly accessible over HTTP. The appliance can export the mapping between image filenames and URLs in CSV format.
- Code from the aOCR Hackathon:
  - HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that separates images into three sets (no handwriting, little handwriting with mostly typed or printed text, and lots of handwriting) based on the noise generated by the OCR software. Read more at Ben's blog (http://manuscripttranscription.blogspot.com/2013/02/detecting-handwriting-in-ocr-text.html). This could be used to rank images by how much they need human transcription.
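The CSV export from the Image Ingestion Appliance mentioned above could be loaded into a lookup table along the following lines. This is only a sketch: the column names `filename` and `url`, the helper `load_filename_to_url`, and the sample rows are assumptions for illustration, since the actual layout of the appliance's CSV is not described here.

```python
import csv
import io


def load_filename_to_url(csv_file, filename_col="filename", url_col="url"):
    """Return a dict mapping image filename -> public URL.

    The column names are hypothetical defaults; adjust them to match
    the header row of the appliance's real CSV export.
    """
    reader = csv.DictReader(csv_file)
    mapping = {}
    for row in reader:
        mapping[row[filename_col].strip()] = row[url_col].strip()
    return mapping


# Made-up sample data standing in for an exported CSV file.
sample = io.StringIO(
    "filename,url\n"
    "sheet_0001.jpg,http://example.org/images/sheet_0001.jpg\n"
    "sheet_0002.jpg,http://example.org/images/sheet_0002.jpg\n"
)
urls = load_filename_to_url(sample)
print(urls["sheet_0001.jpg"])
```

With a real export, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="")`, as the `csv` module documentation recommends.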
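To make the idea of ranking images by OCR noise concrete, here is a minimal sketch. It is not the HandwritingDetection code from the aOCR Hackathon: the token rule, the thresholds `low` and `high`, and the function names are all assumptions chosen for illustration. It scores each image by the fraction of OCR tokens that do not look like ordinary words or numbers, sorts images into the three sets, and orders them so the noisiest come first.

```python
import re

# Assumption: a token is "clean" if it is a word of two or more letters
# or a plain number, optionally followed by one punctuation mark.
CLEAN_TOKEN = re.compile(r"^(?:[A-Za-z]{2,}|\d+)[.,;:]?$")


def noise_score(ocr_text):
    """Fraction of whitespace-separated tokens that look like OCR noise."""
    tokens = ocr_text.split()
    if not tokens:
        # Simplification: empty OCR output is scored 0.0 here, although
        # in practice it might deserve separate handling.
        return 0.0
    noisy = sum(1 for token in tokens if not CLEAN_TOKEN.match(token))
    return noisy / len(tokens)


def classify(score, low=0.2, high=0.5):
    """Bucket a noise score; the thresholds are illustrative guesses."""
    if score < low:
        return "none"
    if score < high:
        return "little"
    return "lots"


def rank_for_transcription(ocr_by_image):
    """Return image names ordered from noisiest OCR output to cleanest."""
    return sorted(
        ocr_by_image,
        key=lambda name: noise_score(ocr_by_image[name]),
        reverse=True,
    )


# Made-up OCR outputs for three hypothetical images.
ocr_by_image = {
    "a.jpg": "Flora of Florida. Alachua County, 12 May 1931",
    "b.jpg": "F|~ra ;' o1 Al~c_ua ,,12 M4y",
    "c.jpg": "Herbarium of the University ~|x 1j; coll. J. Smith",
}
labels = {name: classify(noise_score(text)) for name, text in ocr_by_image.items()}
ranking = rank_for_transcription(ocr_by_image)
print(labels)
print(ranking)
```

A dictionary lookup against a word list would likely separate noise from real words better than the regular expression used here; the regular expression just keeps the sketch self-contained.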