Transcription Hackathon
		
		
		
		
		
Revision as of 15:22, 16 December 2013
Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels 
December 16–20 at the University of Florida, Gainesville
Agenda and Logistics
- Hackathon Advertisement
- Tentative Agenda
- Logistics Document
- Participants List
- Adobe Connect room for planning before the hackathon and for connecting to the workshop remotely. (Email Austin Mast if you'd like to use the room for planning before the hackathon.)
Coordination
- Interoperability Track
- OCR Integration Track
- QA/QC and Reconciliation of Replicates Track
- User Engagement Track
- Participants' Interest in Tracks
Presentations
- Yonggang Liu, iDigBio: [https://docs.google.com/presentation/d/1-R6r_kDnf6IyxSHg1J3M-oRy_wGngLFCwUJg_w4oBAU/edit?usp=sharing Image Ingestion at iDigBio]
- Yun Ling Yim, UC Berkeley: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Calbug_idigbio_Jun.pdf Calbug Digitization California Arthropod Collections]
Development Resources
- GitHub organization for this Transcription Hackathon
- Four existing crowdsourcing datasets from Notes From Nature. The datasets contain transcriptions of different types of collection labels. Read more here. Once anonymized, the datasets were shared only with the hackathon participants through Dropbox. They will be made public when we get definitive approval from NfN.
- Calbug dataset
- Herbarium labels: the filenames with "USAM_" represent a nearly complete set of recent transcriptions from one collection (the University of South Alabama Herbarium), with four replicates for most specimens (I think).
- Macrofungi labels
- Ornithological dataset
 
- Existing solution datasets to assess the quality of crowdsourcing consensus (we are working to get "gold standard" data for some of these):
- Herbarium labels ideal response: link to be provided by Austin
- Entomology labels ideal response: link to be provided by Austin
- Field notebooks ideal response: link to be provided by Austin
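One way the replicate transcriptions could be scored against a "gold standard" response is sketched below. This is only an illustration under assumptions: the specimen identifiers, the field values, the normalization (lowercase, collapsed whitespace) and the simple majority vote are all hypothetical choices, not the method used by Notes From Nature or by the hackathon tracks.

```python
from collections import Counter


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not split votes."""
    return " ".join(value.lower().split())


def consensus(replicates):
    """Return (most common normalized value, share of replicates that agree)."""
    counts = Counter(normalize(r) for r in replicates)
    value, votes = counts.most_common(1)[0]
    return value, votes / len(replicates)


def accuracy(consensus_by_id, gold_by_id):
    """Share of specimens whose consensus value equals the normalized gold value."""
    shared = consensus_by_id.keys() & gold_by_id.keys()
    if not shared:
        return 0.0
    hits = sum(
        1 for key in shared if consensus_by_id[key] == normalize(gold_by_id[key])
    )
    return hits / len(shared)


# Hypothetical data: four replicate transcriptions of one field per specimen.
replicates_by_id = {
    "spec-1": ["Mobile County", "mobile  county", "Mobile Co.", "Mobile County"],
    "spec-2": ["J. Smith", "J Smith", "J. Smith", "I. Smith"],
}
gold_by_id = {"spec-1": "Mobile County", "spec-2": "J. Smith"}

results = {key: consensus(reps) for key, reps in replicates_by_id.items()}
consensus_by_id = {key: value for key, (value, _) in results.items()}
score = accuracy(consensus_by_id, gold_by_id)
print(results)
print(score)
```

The agreement share returned by consensus could also be used to flag specimens where the replicates disagree too much to trust the majority value.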
 
- For those interested in experimenting with the images that have been used for public participation in transcription:
- Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at USAM Herbarium Images. This is several GB of image files. To get them, you could use the DownThemAll Firefox plugin.
 
- Notes From Nature web interface:
- Code available at https://github.com/zooniverse/notesFromNature
- Forked version for the Hackathon available at: https://github.com/idigbio-citsci-hackathon/notesFromNature
- Vagrant script to build a VM with Notes From Nature web interface: link to be provided by Alex
- Install Vagrant from http://downloads.vagrantup.com/tags/v1.3.5 and VirtualBox from https://www.virtualbox.org/wiki/Downloads
 
- CYWG iDigBio Image Ingestion Appliance:
- The appliance can ingest the images to be used by the crowdsourcing service into iDigBio storage and make them publicly accessible over HTTP. The appliance can export the mapping between image filenames and URLs in CSV format.
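A crowdsourcing service could load the exported CSV into a filename-to-URL lookup roughly as follows. This is a minimal sketch: the column names "filename" and "url" and the example.org addresses are assumptions, since the actual header of the appliance's export is not given here.

```python
import csv
import io


def load_filename_to_url(csv_file, filename_col="filename", url_col="url"):
    """Build a {filename: url} dict from an open CSV file object.

    The column names are assumptions; adjust them to match the header
    of the file the appliance actually exports.
    """
    reader = csv.DictReader(csv_file)
    mapping = {}
    for row in reader:
        mapping[row[filename_col]] = row[url_col]
    return mapping


# Hypothetical export, inlined so the sketch runs without a real file.
sample = io.StringIO(
    "filename,url\n"
    "USAM_0001.jpg,http://example.org/images/USAM_0001.jpg\n"
    "USAM_0002.jpg,http://example.org/images/USAM_0002.jpg\n"
)
urls = load_filename_to_url(sample)
print(urls["USAM_0001.jpg"])
```

With a real export, the same function can be called on a file opened with open(path, newline="").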
 
- Gold Images from aOCR Hackathon:
- Code from the aOCR Hackathon:
- HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that separates images into three sets (no handwriting, little handwriting with mostly typed or printed text, and lots of handwriting) based on the noise generated by the OCR software. Read more at Ben's blog. This could be used to rank which images are most in need of human transcription.
- Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
- (Some botanists: RDF and tab-delimited.)
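The idea of ranking images by OCR noise, as described for HandwritingDetection above, might look something like the sketch below. It is not the HandwritingDetection code: the noise measure (share of OCR tokens that are neither plain words nor plain numbers), the thresholds, the image names and the sample OCR strings are all assumptions made for illustration.

```python
import re

# A token counts as "clean" if it is a word of 2+ letters or a number,
# optionally followed by one punctuation mark. This rule is an assumption.
CLEAN_TOKEN = re.compile(r"^(?:[A-Za-z]{2,}|\d+)[.,;:]?$")


def noise_score(ocr_text):
    """Fraction of whitespace-separated OCR tokens that do not look clean."""
    tokens = ocr_text.split()
    if not tokens:
        return 1.0  # no OCR output at all is treated as maximally noisy
    noisy = sum(1 for token in tokens if not CLEAN_TOKEN.match(token))
    return noisy / len(tokens)


def bucket(score, low=0.2, high=0.5):
    """Map a noise score to one of three sets; thresholds are arbitrary."""
    if score < low:
        return "no handwriting"
    if score < high:
        return "little handwriting"
    return "lots of handwriting"


# Hypothetical OCR output for three label images.
ocr_outputs = {
    "typed.jpg": "Flora of Alabama Mobile County collected near river",
    "mixed.jpg": "Flora of Alabama ~|l 3x; Mobile County",
    "handwritten.jpg": "~l| ;x 3f .,' q# |~",
}

scores = {name: noise_score(text) for name, text in ocr_outputs.items()}
# Noisiest first: these images are the best candidates for human transcription.
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(name, round(scores[name], 2), bucket(scores[name]))
```

Any real use would need thresholds tuned against images whose handwriting content is already known.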
 
 
- Hi all - (Paul Flemons).
- I have uploaded a number of files:
- https://www.idigbio.org/wiki/index.php/File:OpenRefine_procedures_for_EVENTS_1212a.pdf - a description of the OpenRefine procedures used for matching BVP fields to EMu EVENTS
- https://www.idigbio.org/wiki/index.php/File:Preparing_BVP_data_for_import_into_EMu_-_process_1212a.pdf - detailed process of preparing BVP data for EMu
- https://www.idigbio.org/wiki/index.php/File:Preparing_BVP_data_for_import_into_EMu_-_overview.pdf - overview of preparing BVP data for EMu
- https://www.idigbio.org/wiki/index.php/File:VisioDiagramofProcess.JPG - diagram of the process of preparing data from BVP for EMu
 
 
- From Steve Raden: some background on Zooniverse's design