Transcription Hackathon: Difference between revisions

From iDigBio
No edit summary
m (Reverted edits by Snomelf (talk) to last revision by Dpaul)
Line 1: Line 1:
<p><br />
[[Category:Transcription Hackathon]]
<b>Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels</b>
 
</p><p>December 16–20 at the University of Florida, Gainesville  
'''Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels'''
</p>
 
<h2> Agenda and Logistics  </h2>
December 16–20 at the University of Florida, Gainesville  
<ul><li><a href="https://www.idigbio.org/content/hackathon-enable-public-participation-online-transcription-biodiversity-specimen-labels">Hackathon Advertisement</a>
 
</li><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Draft_Agenda">Agenda</a>
== Agenda and Logistics  ==
</li><li><a href="https://www.idigbio.org/wiki/images/8/8a/IDigBio_Public_Participation_in_Digitization_Workshop_Logistics_4Dec13.pdf">Logistics Document</a>
 
</li><li><a href="https://www.idigbio.org/wiki/images/a/a2/Transcription_Hackathon_Participant_List_8Dec13.pdf">Participants List</a>
*[https://www.idigbio.org/content/hackathon-enable-public-participation-online-transcription-biodiversity-specimen-labels Hackathon Advertisement]
</li><li><a href="http://idigbio.adobeconnect.com/citscribe">AdobeConnect room for planning prior to the hackathon, then for connection to the workshop remotely</a> (Send an email to Austin Mast, if you'd like to use the room for planning prior to the hackathon.)
*[https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Draft_Agenda Agenda]
</li></ul>
*[https://www.idigbio.org/wiki/images/8/8a/IDigBio_Public_Participation_in_Digitization_Workshop_Logistics_4Dec13.pdf Logistics Document]
<h2> Coordination  </h2>
*[https://www.idigbio.org/wiki/images/a/a2/Transcription_Hackathon_Participant_List_8Dec13.pdf Participants List]
<ul><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Interoperability_Planning">Interoperability Track</a>
*[http://idigbio.adobeconnect.com/citscribe AdobeConnect room for planning prior to the hackathon, then for connection to the workshop remotely] (Send an email to Austin Mast, if you'd like to use the room for planning prior to the hackathon.)
</li><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_OCR_Integration_Planning">OCR Integration Track</a>
 
</li><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Reconciliation_of_Replicates_Planning">QA/QC and Reconciliation of Replicates Track</a>
== Coordination  ==
</li><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_User_Engagement_Planning">User Engagement Track</a>
 
</li><li><a href="https://docs.google.com/document/d/1ns_10ZMBRMOZX1DzfRBdALjhKtr_x8yYaAZLJH6YHyI/edit?usp=sharing">Participants Interest in Tracks</a>
*[https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Interoperability_Planning Interoperability Track]
</li></ul>
*[https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_OCR_Integration_Planning OCR Integration Track]
<h2> Presentations </h2>
*[https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Reconciliation_of_Replicates_Planning QA/QC and Reconciliation of Replicates Track]
<ul><li>Yonggang Liu, iDigBio: <a href="https://docs.google.com/presentation/d/1-R6r_kDnf6IyxSHg1J3M-oRy_wGngLFCwUJg_w4oBAU/edit?usp=sharing">Image Ingestion at iDigBio</a>
*[https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_User_Engagement_Planning User Engagement Track]
</li><li>Austin Mast, iDigBio: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Mast_Lightning_Talk.pdf">Public Participation</a>
*[https://docs.google.com/document/d/1ns_10ZMBRMOZX1DzfRBdALjhKtr_x8yYaAZLJH6YHyI/edit?usp=sharing Participants Interest in Tracks]
</li><li>Yun Ling Yim, UC Berkeley: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Calbug_idigbio_Jun.pdf">Calbug Digitization, CalBug California Arthropod Collections</a>
 
</li><li>Miao Chen, Indiana U.: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/lightningtalk-miaochen.pdf">Using OCR</a>
== Presentations ==
</li><li>Cody Meche, UF: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Agile.pdf">Agile Scrum</a>
*Yonggang Liu, iDigBio: [https://docs.google.com/presentation/d/1-R6r_kDnf6IyxSHg1J3M-oRy_wGngLFCwUJg_w4oBAU/edit?usp=sharing Image Ingestion at iDigBio]
</li><li>Julie Allen, INHS: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Allen.pdf">Gamification</a>
*Austin Mast, iDigBio: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Mast_Lightning_Talk.pdf Public Participation]
</li><li>Edward Gilbert, Symbiota Developer: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Symbiota_2013-12-16.pdf">Symbiota: a specimen-based biodiversity portal platform</a>
*Yun Ling Yim, UC Berkeley: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Calbug_idigbio_Jun.pdf Calbug Digitization, CalBug California Arthropod Collections]
</li><li>Deborah Paul, iDigBio Augmenting OCR WG: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/aOCRLightning.pptx">What's new in using OCR output in a Citizen Science Workflow</a>
*Miao Chen, Indiana U.: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/lightningtalk-miaochen.pdf Using OCR]
</li><li>Andrea Matsunaga, iDigBio: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/MatsunagaiDigBioCrowdsourcingHackathon2013.pdf">Herbarium Labels Transcription Crowdsourcing &amp; OCR</a>
*Cody Meche, UF: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Agile.pdf Agile Scrum]
</li><li>Joshua Campbell, iDigBio: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/CampbelliDigBioCrowdsourcingHackathon2013.pdf">Herbarium Labels Transcription Crowdsourcing Consensus</a>
*Julie Allen, INHS: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Allen.pdf Gamification]
</li><li>Yonggang Liu, ACIS iDigBio: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Yonggang_image_ingestion_appliance.pdf">iDigBio Image Ingestion Appliance</a>
*Edward Gilbert, Symbiota Developer: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Symbiota_2013-12-16.pdf Symbiota: a specimen-based biodiversity portal platform]
</li><li>Paul Kimbereley, Smithsonian: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/SI_Center.pdf">Smithsonian Transcription Center</a>
*Deborah Paul, iDigBio Augmenting OCR WG: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/aOCRLightning.pptx What's new in using OCR output in a Citizen Science Workflow]
</li></ul>
*Andrea Matsunaga, iDigBio: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/MatsunagaiDigBioCrowdsourcingHackathon2013.pdf Herbarium Labels Transcription Crowdsourcing & OCR]
<h2> Development Resources  </h2>
*Joshua Campbell, iDigBio: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/CampbelliDigBioCrowdsourcingHackathon2013.pdf Herbarium Labels Transcription Crowdsourcing Consensus]
<ul><li> <a href="https://github.com/idigbio-citsci-hackathon">GitHub organization for this Transcription Hackathon</a>
*Yonggang Liu, ACIS iDigBio: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Yonggang_image_ingestion_appliance.pdf iDigBio Image Ingestion Appliance]
</li><li> 4 existing crowdsourcing datasets from Notes From Nature. The datasets contain transcriptions of different types of collection labels. Read more <a href="https://docs.google.com/document/d/1UCz5WblnNIvqBErX-XeWgS9mf69qFhycHqntQOGnPp4/edit?usp=sharing">here</a>. The datasets were shared only with the hackathon participants through Dropbox once anonymized. They will be made public when we get definitive approval from NfN.
*Paul Kimbereley, Smithsonian: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/SI_Center.pdf Smithsonian Transcription Center]
<ul><li> Calbug dataset
 
</li><li> Herbarium labels—The filenames with "USAM_" represent a nearly complete set of recent transcriptions from a collection (the University of South Alabama Herbarium), four replicates for most specimens (I think).
== Development Resources  ==
</li><li> Macrofungi labels
* [https://github.com/idigbio-citsci-hackathon GitHub organization for this Transcription Hackathon]
</li><li> Ornithological dataset
* 4 existing crowdsourcing datasets from Notes From Nature. The datasets contain transcriptions of different types of collection labels. Read more [https://docs.google.com/document/d/1UCz5WblnNIvqBErX-XeWgS9mf69qFhycHqntQOGnPp4/edit?usp=sharing here]. The datasets were shared only with the hackathon participants through Dropbox once anonymized. They will be made public when we get definitive approval from NfN.
</li></ul>
** Calbug dataset
</li></ul>
** Herbarium labels—The filenames with "USAM_" represent a nearly complete set of recent transcriptions from a collection (the University of South Alabama Herbarium), four replicates for most specimens (I think).
<ul><li> Existing solution datasets to assess quality of crowdsourcing consensus (we are working to get "gold standard" data for some of these):
** Macrofungi labels
<ul><li> Herbarium labels ideal response: link to be provided by Austin
** Ornithological dataset
</li><li> Entomology labels  ideal response: link to be provided by Austin
 
</li><li> Field notebooks  ideal response: link to be provided by Austin
* Existing solution datasets to assess quality of crowdsourcing consensus (we are working to get "gold standard" data for some of these):
</li></ul>
** Herbarium labels ideal response: link to be provided by Austin
</li></ul>
** Entomology labels  ideal response: link to be provided by Austin
<ul><li> For those interested in experimenting with the images that have been used for public participation in transcription:
** Field notebooks  ideal response: link to be provided by Austin
<ul><li> Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at <a href="http://www.specimenimaging.com/images/USAM/">USAM Herbarium Images</a>.  This is several GB worth of image files.  To get them, you could use the DownThemAll Firefox plugin.
 
</li></ul>
* For those interested in experimenting with the images that have been used for public participation in transcription:
</li></ul>
** Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at [http://www.specimenimaging.com/images/USAM/ USAM Herbarium Images].  This is several GB worth of image files.  To get them, you could use the DownThemAll Firefox plugin.
<ul><li> <a href="http://www.notesfromnature.org/">Notes From Nature</a> web interface:
 
<ul><li> Code available at https://github.com/zooniverse/notesFromNature
* [http://www.notesfromnature.org/ Notes From Nature] web interface:
</li><li> Forked version for the Hackathon available at: https://github.com/idigbio-citsci-hackathon/notesFromNature
** Code available at https://github.com/zooniverse/notesFromNature
</li><li> Install Vagrant from http://downloads.vagrantup.com/tags/v1.3.5 and VirtualBox from https://www.virtualbox.org/wiki/Downloads
** Forked version for the Hackathon available at: https://github.com/idigbio-citsci-hackathon/notesFromNature
</li><li> Vagrant script to build a VM with Notes From Nature web interface:  https://github.com/idigbio-citsci-hackathon/nfn-vagrant
** Install Vagrant from http://downloads.vagrantup.com/tags/v1.3.5 and VirtualBox from https://www.virtualbox.org/wiki/Downloads
</li><li> Go to the location of the Vagrant script and type "vagrant up" in your command prompt to build a VM with Notes From Nature running on localhost:9294.
** Vagrant script to build a VM with Notes From Nature web interface:  https://github.com/idigbio-citsci-hackathon/nfn-vagrant
</li><li> API Calls
** Go to the location of the Vagrant script and type "vagrant up" in your command prompt to build a VM with Notes From Nature running on localhost:9294.
<ul><li> https://api.zooniverse.org/projects/notes_from_nature/groups/
** API Calls
</li><li> https://api.zooniverse.org/projects/notes_from_nature/groups/5170103b3ae74027cf000002
*** https://api.zooniverse.org/projects/notes_from_nature/groups/
</li></ul>
*** https://api.zooniverse.org/projects/notes_from_nature/groups/5170103b3ae74027cf000002
</li></ul>
 
</li></ul>
* [[CYWG iDigBio Image Ingestion Appliance]]:
<ul><li> <a _fcknotitle="true" href="CYWG iDigBio Image Ingestion Appliance">CYWG iDigBio Image Ingestion Appliance</a>:
** The appliance can be used to ingest the images to be used by the crowdsourcing service into the iDigBio storage, and made publicly accessible through HTTP. The relationship between the image filenames and the URL can be exported by the appliance in CSV format.
<ul><li> The appliance can be used to ingest the images to be used by the crowdsourcing service into the iDigBio storage, and made publicly accessible through HTTP. The relationship between the image filenames and the URL can be exported by the appliance in CSV format.
 
</li></ul>
* Gold Images from aOCR Hackathon:
</li></ul>
** CSV files with URLs for the Images on iDigBio beta server (Uploaded by Image Ingestion Appliance): [http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/ent.csv ent], [http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/herb.csv herb], [http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/lichens.csv lichens].
<ul><li> Gold Images from aOCR Hackathon:
 
<ul><li> CSV files with URLs for the Images on iDigBio beta server (Uploaded by Image Ingestion Appliance): <a href="http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/ent.csv">ent</a>, <a href="http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/herb.csv">herb</a>, <a href="http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/lichens.csv">lichens</a>.
* Code from the aOCR Hackathon:
</li></ul>
** HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that separates images into sets with no handwriting, little handwriting (mostly typed or printed text), and lots of handwriting, based on the noise generated by the OCR software. [http://manuscripttranscription.blogspot.com/2013/02/detecting-handwriting-in-ocr-text.html Read more at Ben's blog]. This could be used to rank which images are most in need of human transcription.
</li></ul>
** Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
<ul><li> Code from the aOCR Hackathon:
*** (Some [http://webprojects.huh.harvard.edu/authority_files/ botanists]: RDF and tab-delimited.)
<ul><li> HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that separates images into sets with no handwriting, little handwriting (mostly typed or printed text), and lots of handwriting, based on the noise generated by the OCR software. <a href="http://manuscripttranscription.blogspot.com/2013/02/detecting-handwriting-in-ocr-text.html">Read more at Ben's blog</a>. This could be used to rank which images are most in need of human transcription.
 
</li><li> Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
 
<ul><li> (Some <a href="http://webprojects.huh.harvard.edu/authority_files/">botanists</a>: RDF and tab-delimited.)
* Hi all - (Paul Flemons).
</li></ul>
**I have uploaded a number of files:
</li></ul>
***https://www.idigbio.org/wiki/index.php/File:OpenRefine_procedures_for_EVENTS_1212a.pdf - a description of Open Refine procedures used for matching BVP fields to EMu EVENTS
</li></ul>
***https://www.idigbio.org/wiki/index.php/File:Preparing_BVP_data_for_import_into_EMu_-_process_1212a.pdf Detailed process of preparing BVP data for EMu
<p><br />
***https://www.idigbio.org/wiki/index.php/File:Preparing_BVP_data_for_import_into_EMu_-_overview.pdf Overview of preparing BVP data for EMu
</p>
***https://www.idigbio.org/wiki/index.php/File:VisioDiagramofProcess.JPG Diagram of the process of preparing data from BVP for EMu
<ul><li> Hi all - (Paul Flemons).
 
<ul><li>I have uploaded a number of files:
*From Steve Raden: some background on Zooniverse's design
<ul><li>https://www.idigbio.org/wiki/index.php/File:OpenRefine_procedures_for_EVENTS_1212a.pdf - a description of Open Refine procedures used for matching BVP fields to EMu EVENTS
**http://arfon.org/how-the-zooniverse-works-tools-and-technologies
</li><li>https://www.idigbio.org/wiki/index.php/File:Preparing_BVP_data_for_import_into_EMu_-_process_1212a.pdf Detailed process of preparing BVP data for EMu
**http://arfon.org/how-the-zooniverse-works-keeping-it-personal
</li><li>https://www.idigbio.org/wiki/index.php/File:Preparing_BVP_data_for_import_into_EMu_-_overview.pdf Overview of preparing BVP data for EMu
**http://arfon.org/how-the-zooniverse-works-the-domain-model
</li><li>https://www.idigbio.org/wiki/index.php/File:VisioDiagramofProcess.JPG Diagram of the process of preparing data from BVP for EMu
 
</li></ul>
== Hackathon Products  ==
</li></ul>
*Brainstorming Documents from the Thursday Mix Ups
</li></ul>
**Group 1 [https://docs.google.com/document/d/1aMVXG3GzTznYBs9R6lQ13Tny_CyBIcMJ_LPLj1zlz7U/edit Mix Up Discussion Summary] (google doc)
<ul><li>From Steve Raden: some background on Zooniverse's design
**Group 2
<ul><li>http://arfon.org/how-the-zooniverse-works-tools-and-technologies
**Group 3 [https://docs.google.com/document/d/1B6kvLFw_Mzhrsx4xPgJm29w5j75TpSihdDXFauyt2YM/edit MixUp google doc]
</li><li>http://arfon.org/how-the-zooniverse-works-keeping-it-personal
**Group 4 [https://docs.google.com/document/d/1-Z-oiwjZZiCh-nVHGZHhUBphY5Z-rcJBnZzn6vnJtBs/edit Mix Up Discussion Summary] (google doc)
</li><li>http://arfon.org/how-the-zooniverse-works-the-domain-model
</li></ul>
</li></ul>
<h2> Hackathon Products  </h2>
<ul><li>Brainstorming Documents from the Thursday Mix Ups
<ul><li>Group 1 <a href="https://docs.google.com/document/d/1aMVXG3GzTznYBs9R6lQ13Tny_CyBIcMJ_LPLj1zlz7U/edit">Mix Up Discussion Summary</a> (google doc)
</li><li>Group 2
</li><li>Group 3 <a href="https://docs.google.com/document/d/1B6kvLFw_Mzhrsx4xPgJm29w5j75TpSihdDXFauyt2YM/edit">MixUp google doc</a>
</li><li>Group 4 <a href="https://drive.google.com/?tab=wo&authuser=0#folders/0Bygk4TdWUfiXczg5dnlvb1NrbFk">Presentations and Discussions</a> <a href="https://docs.google.com/document/d/1-Z-oiwjZZiCh-nVHGZHhUBphY5Z-rcJBnZzn6vnJtBs/edit">Mix Up Discussion Summary</a> (google doc)
</li></ul>
</li></ul>
<a _fcknotitle="true" href="Category:Transcription_Hackathon">Transcription_Hackathon</a>

Revision as of 21:22, 21 December 2013


Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels

December 16–20 at the University of Florida, Gainesville

Agenda and Logistics

Coordination

Presentations

Development Resources

  • GitHub organization for this Transcription Hackathon
  • 4 existing crowdsourcing datasets from Notes From Nature. The datasets contain transcriptions of different types of collection labels. Read more here. The datasets were shared only with the hackathon participants through Dropbox once anonymized. They will be made public when we get definitive approval from NfN.
    • Calbug dataset
    • Herbarium labels—The filenames with "USAM_" represent a nearly complete set of recent transcriptions from a collection (the University of South Alabama Herbarium), four replicates for most specimens (I think).
    • Macrofungi labels
    • Ornithological dataset
  • Existing solution datasets to assess quality of crowdsourcing consensus (we are working to get "gold standard" data for some of these):
    • Herbarium labels ideal response: link to be provided by Austin
    • Entomology labels ideal response: link to be provided by Austin
    • Field notebooks ideal response: link to be provided by Austin
  • For those interested in experimenting with the images that have been used for public participation in transcription:
    • Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at USAM Herbarium Images. This is several GB worth of image files. To get them, you could use the DownThemAll Firefox plugin.
  • CYWG iDigBio Image Ingestion Appliance:
    • The appliance can be used to ingest the images to be used by the crowdsourcing service into the iDigBio storage, and made publicly accessible through HTTP. The relationship between the image filenames and the URL can be exported by the appliance in CSV format.
  • Gold Images from aOCR Hackathon:
    • CSV files with URLs for the Images on iDigBio beta server (Uploaded by Image Ingestion Appliance): ent, herb, lichens.
  • Code from the aOCR Hackathon:
    • HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that separates images into sets with no handwriting, little handwriting (mostly typed or printed text), and lots of handwriting, based on the noise generated by the OCR software. Read more at Ben's blog. This could be used to rank which images are most in need of human transcription.
    • Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
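
The idea of ranking images by OCR noise, mentioned under HandwritingDetection above, can be sketched roughly as follows. This is not the HandwritingDetection code itself: the scoring rule (share of tokens that do not look like plain words or numbers), the function names, and the two thresholds are all assumptions chosen for illustration, and real OCR output would need tuning against known images.

```python
import re

# Hypothetical rule: a token is "clean" if it is purely letters or purely
# digits, optionally followed by one punctuation mark. Anything else is
# counted as OCR noise.
_CLEAN_TOKEN = re.compile(r"^(?:[A-Za-z]+|\d+)[.,;:]?$")


def ocr_noise_score(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look noisy."""
    tokens = text.split()
    if not tokens:
        return 0.0
    noisy = sum(1 for token in tokens if not _CLEAN_TOKEN.match(token))
    return noisy / len(tokens)


def handwriting_bucket(score: float, low: float = 0.2, high: float = 0.5) -> str:
    """Map a noise score to one of three sets; thresholds are assumed values."""
    if score < low:
        return "none"
    if score < high:
        return "little"
    return "lots"


def rank_for_transcription(ocr_by_image: dict) -> list:
    """Order image names so the noisiest OCR output comes first."""
    return sorted(
        ocr_by_image,
        key=lambda name: ocr_noise_score(ocr_by_image[name]),
        reverse=True,
    )
```

Given a mapping from image filename to OCR text, rank_for_transcription returns the filenames with the noisiest output first, which is one way to decide which images to send to human transcribers before the others.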


Hackathon Products