Transcription Hackathon: Difference between revisions

From iDigBio
 
(23 intermediate revisions by 7 users not shown)
Line 1: Line 1:
<p><br />
[[Category:Transcription Hackathon]][[Category:Workshop]]
<b>Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels</b>
 
</p><p>December 16–20 at the University of Florida, Gainesville  
'''Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels'''
</p>
 
<h2> Agenda and Logistics  </h2>
December 16–20 at the University of Florida, Gainesville  
<ul><li><a href="https://www.idigbio.org/content/hackathon-enable-public-participation-online-transcription-biodiversity-specimen-labels">Hackathon Advertisement</a>  
{| class="wikitable" style="float:right;"
</li><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Draft_Agenda">Agenda</a>
! colspan="2" style="background:#D58B28;width:200px;font-size:10pt" | Digitizing the Past and Present for the Future
</li><li><a href="https://www.idigbio.org/wiki/images/8/8a/IDigBio_Public_Participation_in_Digitization_Workshop_Logistics_4Dec13.pdf">Logistics Document</a>
|-
</li><li><a href="https://www.idigbio.org/wiki/images/a/a2/Transcription_Hackathon_Participant_List_8Dec13.pdf">Participants List</a>
| colspan="2" style="text-align:center;font-size:7pt" | <!--YOU CAN INSERT A NEW IMAGE FOR THE LOGO BETWEEN THE COLON AND THE PIPE-->[[Image:IDigBio Logo RGB.png|center|300px|iDigBio Logo RGB.png]]<br />
</li><li><a href="http://idigbio.adobeconnect.com/citscribe">AdobeConnect room for planning prior to the hackathon, then for connection to the workshop remotely</a> (Send an email to Austin Mast, if you'd like to use the room for planning prior to the hackathon.)
|-
</li></ul>
!colspan="2" style="background:#D58B28;text-align:center;font-size:9pt" | Quick Links for Transcription Hackathon Workshop
<h2> Coordination  </h2>
|-
<ul><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Interoperability_Planning">Interoperability Track</a>
|[https://docs.google.com/document/d/1TyluwM1rMcq7O_nidy8CLJFMW4FrOPjsHkrLVho5cVU/edit?usp=sharing Transcription Hackathon Workshop Agenda]
</li><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_OCR_Integration_Planning">OCR Integration Track</a>
|-
</li><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_Reconciliation_of_Replicates_Planning">QA/QC and Reconciliation of Replicates Track</a>
|[https://www.idigbio.org/biblio?f%5bkeyword%5d=274 Transcription Hackathon Workshop Biblio Entries]
</li><li><a href="https://www.idigbio.org/wiki/index.php/Transcription_Hackathon_User_Engagement_Planning">User Engagement Track</a>
|-
</li><li><a href="https://docs.google.com/document/d/1ns_10ZMBRMOZX1DzfRBdALjhKtr_x8yYaAZLJH6YHyI/edit?usp=sharing">Participants Interest in Tracks</a>
|[https://www.idigbio.org/content/citscribe-hackathon Transcription Hackathon Workshop Report]
</li></ul>
|}
<h2> Presentations </h2>
 
<ul><li>Yonggang Liu, iDigBio: <a href="https://docs.google.com/presentation/d/1-R6r_kDnf6IyxSHg1J3M-oRy_wGngLFCwUJg_w4oBAU/edit?usp=sharing">Image Ingestion at iDigBIo</a>
 
</li><li>Austin Mast, iDigBio: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Mast_Lightning_Talk.pdf">Public Participation</a>
== Agenda and Logistics  ==
</li><li>Yun Ling Yim, UC Berkeley: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Calbug_idigbio_Jun.pdf">Calbug Digitization, CalBug California Arthropod Collections</a>
 
</li><li>Miao Chen, Indiana U.: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/lightningtalk-miaochen.pdf">Using OCR</a>
*[https://www.idigbio.org/content/hackathon-enable-public-participation-online-transcription-biodiversity-specimen-labels Hackathon Advertisement]
</li><li>Cody Meche, UF: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Agile.pdf">Agile Scrum</a>
*[[Transcription Hackathon Draft Agenda| Agenda]]
</li><li>Julie Allen, INHS: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Allen.pdf">Gamification</a>
*[[Media:IDigBio_Public_Participation_in_Digitization_Workshop_Logistics_4Dec13.pdf|Logistics Document]]
</li><li>Edward Gilbert, Symbiota Developer: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Symbiota_2013-12-16.pdf">Symbiota: a specimen-based biodiversity portal platform</a>
*[[Media:Transcription_Hackathon_Participant_List_23Dec13.pdf|Participants List]]
</li><li>Deborah Paul, iDigBio Augmenting OCR WG: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/aOCRLightning.pptx">What's new in using OCR output in a Citizen Science Workflow</a>
*[http://idigbio.adobeconnect.com/citscribe AdobeConnect room for collaboration after the hackathon, then for connection to the workshop remotely] (Send an email to Austin Mast, if you'd like to use the room for additional collaboration after the hackathon.)
</li><li>Andrea Matsunaga, iDigBio: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/MatsunagaiDigBioCrowdsourcingHackathon2013.pdf">Herbarium Labels Transcription Crowdsourcing &amp; OCR</a>
 
</li><li>Joshua Campbell, iDigBio: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/CampbelliDigBioCrowdsourcingHackathon2013.pdf">Herbarium Labels Transcription Crowdsourcing Consensus</a>
== Media ==
</li><li>Yonggang Liu, ACIS iDigBio: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Yonggang_image_ingestion_appliance.pdf">iDigBio Image Ingestion Appliance</a>
*[https://www.facebook.com/media/set/?set=a.645283388848944.1073741833.215120891865198&type=1 Citscribe Hackathon Facebook Album]
</li><li>Paul Kimbereley, Smithsonian: <a href="https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/SI_Center.pdf">Smithsonian Transcription Center</a>
*Twitter: @iDigBio and @NfromN, hashtag #CITScribe
</li></ul>
 
<h2> Development Resources  </h2>
==Report==
<ul><li> <a href="https://github.com/idigbio-citsci-hackathon">GitHub organization for this Transcription Hackathon</a>
*[https://www.idigbio.org/content/citscribe-hackathon Citscribe Hackathon Report]
</li><li> 4 existing crowdsourcing datasets from Notes From Nature. Datasets contain transcriptions of different types of collections labels. Read more <a href="https://docs.google.com/document/d/1UCz5WblnNIvqBErX-XeWgS9mf69qFhycHqntQOGnPp4/edit?usp=sharing">here</a>. The datasets were shared only with the hackaton participants through dropbox once anonymized. It will be made public when we get a definitive approval from NfN.
 
<ul><li> Calbug dataset
== Coordination  ==
</li><li> Herbarium labels—The filenames with "USAM_" represent a nearly complete set of recent transcriptions from a collection (the University of South Alabama Herbarium), four replicates for most specimens (I think).
 
</li><li> Macrofungi labels
*[[Transcription Hackathon Interoperability Planning| Interoperability Track]]
</li><li> Ornithological dataset
*[[Transcription Hackathon OCR Integration Planning| OCR Integration Track]]
</li></ul>
*[[Transcription Hackathon Reconciliation of Replicates Planning| QA/QC and Reconciliation of Replicates Track]]
</li></ul>
*[[Transcription Hackathon User Engagement Planning| User Engagement Track]]
<ul><li> Existing solution datasets to assess quality of crowdsourcing consensus (we are working to get "gold standard" data for some of these:
*[https://docs.google.com/document/d/1ns_10ZMBRMOZX1DzfRBdALjhKtr_x8yYaAZLJH6YHyI/edit?usp=sharing Participants Interest in Tracks]
<ul><li> Herbarium labels ideal response: link to be provided by Austin
 
</li><li> Entomology labels  ideal response: link to be provided by Austin
== Presentations ==
</li><li> Field notebooks  ideal response: link to be provided by Austin
*Yonggang Liu, iDigBio: [https://docs.google.com/presentation/d/1-R6r_kDnf6IyxSHg1J3M-oRy_wGngLFCwUJg_w4oBAU/edit?usp=sharing Image Ingestion at iDigBIo]
</li></ul>
*Austin Mast, iDigBio: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Mast_Lightning_Talk.pdf Public Participation]
</li></ul>
*Yun Ling Yim, UC Berkeley: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Calbug_idigbio_Jun.pdf Calbug Digitization, CalBug California Arthropod Collections]
<ul><li> For those interested in experimenting with the images that have been used for public participation in transcription:
*Miao Chen, Indiana U.: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/lightningtalk-miaochen.pdf Using OCR]
<ul><li> Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at <a href="http://www.specimenimaging.com/images/USAM/">USAM Herbarium Images</a>. This is several GB worth of image files. To get them, you could use the DownloadThemAll Firefox plugin.
*Cody Meche, UF: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Agile.pdf Agile Scrum]
</li></ul>
*Julie Allen, INHS: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Allen.pdf Gamification]
</li></ul>
*Edward Gilbert, Symbiota Developer: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Symbiota_2013-12-16.pdf Symbiota: a specimen-based biodiversity portal platform]
<ul><li> <a href="http://www.notesfromnature.org/">Notes From Nature</a> web interface:
*Deborah Paul, iDigBio Augmenting OCR WG: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/aOCRLightning.pptx What's new in using OCR output in a Citizen Science Workflow]
<ul><li> Code available at https://github.com/zooniverse/notesFromNature
*Andrea Matsunaga, iDigBio: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/MatsunagaiDigBioCrowdsourcingHackathon2013.pdf Herbarium Labels Transcription Crowdsourcing & OCR]
</li><li> Forked version for the Hackathon available at: https://github.com/idigbio-citsci-hackathon/notesFromNature
*Joshua Campbell, iDigBio: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/CampbelliDigBioCrowdsourcingHackathon2013.pdf Herbarium Labels Transcription Crowdsourcing Consensus]
</li><li> Install Vagrant from http://downloads.vagrantup.com/tags/v1.3.5 and VirtualBox from https://www.virtualbox.org/wiki/Downloads
*Yonggang Liu, ACIS iDigBio: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/Yonggang_image_ingestion_appliance.pdf iDigBio Image Ingestion Appliance]
</li><li> Vagrant script to build a VM with Notes From Nature web interface: https://github.com/idigbio-citsci-hackathon/nfn-vagrant
*Paul Kimberly, Smithsonian: [https://www.idigbio.org/sites/default/files/workshop-presentations/citscribe/SI_Center.pdf Smithsonian Transcription Center]
</li><li> Change to the directory containing the Vagrant script and run "vagrant up" at the command prompt to build a VM with Notes From Nature running on localhost:9294.
*William Ulate, Missouri Botanical Garden: [[Media:Purposeful_Gaming_BHL_Dec_2013.pdf|Purposeful Gaming and BHL]]
</li><li> API Calls
 
<ul><li> https://api.zooniverse.org/projects/notes_from_nature/groups/
== Development Resources  ==
</li><li> https://api.zooniverse.org/projects/notes_from_nature/groups/5170103b3ae74027cf000002
* [https://github.com/idigbio-citsci-hackathon GitHub organization for this Transcription Hackathon]
</li></ul>
* Four existing crowdsourcing datasets from Notes From Nature. The datasets contain transcriptions of different types of collection labels. Read more [https://docs.google.com/document/d/1UCz5WblnNIvqBErX-XeWgS9mf69qFhycHqntQOGnPp4/edit?usp=sharing here]. The datasets were anonymized and then shared only with the hackathon participants through Dropbox. They will be made public once NfN gives definitive approval.
</li></ul>
** Calbug dataset
</li></ul>
** Herbarium labels—The filenames with "USAM_" represent a nearly complete set of recent transcriptions from a collection (the University of South Alabama Herbarium), four replicates for most specimens (I think).
<ul><li> <a _fcknotitle="true" href="CYWG iDigBio Image Ingestion Appliance">CYWG iDigBio Image Ingestion Appliance</a>:
** Macrofungi labels
<ul><li> The appliance can ingest the images used by the crowdsourcing service into iDigBio storage and make them publicly accessible over HTTP. The appliance can export the mapping between image filenames and URLs in CSV format.
** Ornithological dataset
</li></ul>
 
</li></ul>
* For those interested in experimenting with the images that have been used for public participation in transcription:
<ul><li> Gold Images from aOCR Hackathon:
** Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at [http://www.specimenimaging.com/images/USAM/ USAM Herbarium Images].  This is several GB worth of image files.  To get them, you could use the DownloadThemAll Firefox plugin.
<ul><li> CSV files with URLs for the images on the iDigBio beta server (uploaded by the Image Ingestion Appliance): <a href="http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/ent.csv">ent</a>, <a href="http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/herb.csv">herb</a>, <a href="http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/lichens.csv">lichens</a>.
 
</li></ul>
* [http://www.notesfromnature.org/ Notes From Nature] web interface:
</li></ul>
** Code available at https://github.com/zooniverse/notesFromNature
<ul><li> Code from the aOCR Hackathon:
** Forked version for the Hackathon available at: https://github.com/idigbio-citsci-hackathon/notesFromNature
<ul><li> HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that sorts images into three sets (no handwriting, little handwriting with mostly typed or printed text, and lots of handwriting) based on the noise generated by the OCR software. <a href="http://manuscripttranscription.blogspot.com/2013/02/detecting-handwriting-in-ocr-text.html">Read more at Ben's blog</a>. This could be used to rank images by how much they need human transcription.
** Install Vagrant from http://downloads.vagrantup.com/tags/v1.3.5 and VirtualBox from https://www.virtualbox.org/wiki/Downloads
</li><li> Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
** Vagrant script to build a VM with Notes From Nature web interface: https://github.com/idigbio-citsci-hackathon/nfn-vagrant
<ul><li> (Some <a href="http://webprojects.huh.harvard.edu/authority_files/">botanists</a>: RDF and tab-delimited.)
** Change to the directory containing the Vagrant script and run "vagrant up" at the command prompt to build a VM with Notes From Nature running on localhost:9294.
</li></ul>
** API Calls
</li></ul>
*** https://api.zooniverse.org/projects/notes_from_nature/groups/
</li></ul>
*** https://api.zooniverse.org/projects/notes_from_nature/groups/5170103b3ae74027cf000002
<p><br />
 
</p>
* [[CYWG iDigBio Image Ingestion Appliance]]:
<ul><li> Hi all - (Paul Flemons).
** The appliance can ingest the images used by the crowdsourcing service into iDigBio storage and make them publicly accessible over HTTP. The appliance can export the mapping between image filenames and URLs in CSV format.
<ul><li>I have uploaded a number of files:
 
<ul><li>https://www.idigbio.org/wiki/index.php/File:OpenRefine_procedures_for_EVENTS_1212a.pdf - a description of Open Refine procedures used for matching BVP fields to EMu EVENTS
* Gold Images from aOCR Hackathon:
</li><li>https://www.idigbio.org/wiki/index.php/File:Preparing_BVP_data_for_import_into_EMu_-_process_1212a.pdf Detailed process of preparing BVP data for EMu
** CSV files with URLs for the images on the iDigBio beta server (uploaded by the Image Ingestion Appliance): [http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/ent.csv ent], [http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/herb.csv herb], [http://www.acis.ufl.edu/~yonggang/idigbio/recordset/gold/lichens.csv lichens].
</li><li>https://www.idigbio.org/wiki/index.php/File:Preparing_BVP_data_for_import_into_EMu_-_overview.pdf Overview of preparing BVP data for EMu
* Code from the aOCR Hackathon:
</li><li>https://www.idigbio.org/wiki/index.php/File:VisioDiagramofProcess.JPG Diagram of the process of preparing data from BVP for EMu
** HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that sorts images into three sets (no handwriting, little handwriting with mostly typed or printed text, and lots of handwriting) based on the noise generated by the OCR software. [http://manuscripttranscription.blogspot.com/2013/02/detecting-handwriting-in-ocr-text.html Read more at Ben's blog]. This could be used to rank images by how much they need human transcription.
</li></ul>
** Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
</li></ul>
*** (Some [http://webprojects.huh.harvard.edu/authority_files/ botanists]: RDF and tab-delimited.)
</li></ul>
 
<ul><li>From Steve Raden: some background on Zooniverse's design
* Hi all - (Paul Flemons).
<ul><li>http://arfon.org/how-the-zooniverse-works-tools-and-technologies
**I have uploaded a number of files:
</li><li>http://arfon.org/how-the-zooniverse-works-keeping-it-personal
***[[Media:OpenRefine_procedures_for_EVENTS_1212a.pdf|a description of Open Refine procedures used for matching BVP fields to EMu EVENTS]]
</li><li>http://arfon.org/how-the-zooniverse-works-the-domain-model
***[[Media:Preparing_BVP_data_for_import_into_EMu_-_process_1212a.pdf|Detailed process of preparing BVP data for EMu]]
</li></ul>
***[[Media:Preparing_BVP_data_for_import_into_EMu_-_overview.pdf|Overview of preparing BVP data for EMu]]
</li></ul>
***[[Media:VisioDiagramofProcess.JPG|Diagram of the process of preparing data from BVP for EMu]]
<h2> Hackathon Products  </h2>
 
<ul><li>Brainstorming Documents from the Thursday Mix Ups
*From Steve Raden: some background on Zooniverse's design
<ul><li>Group 1 <a href="https://docs.google.com/document/d/1aMVXG3GzTznYBs9R6lQ13Tny_CyBIcMJ_LPLj1zlz7U/edit">Mix Up Discussion Summary</a> (google doc)
**http://arfon.org/how-the-zooniverse-works-tools-and-technologies
</li><li>Group 2
**http://arfon.org/how-the-zooniverse-works-keeping-it-personal
</li><li>Group 3 <a href="https://docs.google.com/document/d/1B6kvLFw_Mzhrsx4xPgJm29w5j75TpSihdDXFauyt2YM/edit">MixUp google doc</a>
**http://arfon.org/how-the-zooniverse-works-the-domain-model
</li><li>Group 4 <a href="https://drive.google.com/?tab=wo&authuser=0#folders/0Bygk4TdWUfiXczg5dnlvb1NrbFk">Presentations and Discussions</a><a href="https://docs.google.com/document/d/1-Z-oiwjZZiCh-nVHGZHhUBphY5Z-rcJBnZzn6vnJtBs/edit">Mix Up Discussion Summary</a> (google doc]
 
</li></ul>
== Hackathon Products  ==
</li></ul>
 
<a _fcknotitle="true" href="Category:Transcription_Hackathon">Transcription_Hackathon</a>
*Brainstorming Documents from the Thursday Mix Ups
**Group 1 [https://docs.google.com/document/d/1aMVXG3GzTznYBs9R6lQ13Tny_CyBIcMJ_LPLj1zlz7U/edit Mix Up Discussion Summary] (google doc)  
**Group 2
**Group 3 [https://docs.google.com/document/d/1B6kvLFw_Mzhrsx4xPgJm29w5j75TpSihdDXFauyt2YM/edit MixUp google doc]
**Group 4 [https://docs.google.com/document/d/1-Z-oiwjZZiCh-nVHGZHhUBphY5Z-rcJBnZzn6vnJtBs/edit Mix Up Discussion Summary] (google doc)
*Some groups used the Coordination pages above to summarize products
*Group 1 [[Target File Format]]
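
The two Zooniverse API calls listed under Development Resources share one URL pattern: the groups listing, and a single group addressed by its id. A minimal Python sketch of that pattern follows. The helper names `groups_url` and `fetch_json` are hypothetical, and treating the response body as JSON is an assumption that this page does not confirm.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the API calls listed on this page.
API_ROOT = "https://api.zooniverse.org/projects/notes_from_nature"


def groups_url(group_id=None):
    """Return the groups listing URL, or the URL of one group when an id is given."""
    base = f"{API_ROOT}/groups/"
    if group_id is None:
        return base
    return base + quote(group_id, safe="")


def fetch_json(url, timeout=30):
    """GET a URL and decode the body as JSON.

    Assumption: the endpoint returns JSON; the page only lists the URLs.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_json(groups_url("5170103b3ae74027cf000002"))` would request the second URL listed above; whether that endpoint still responds is not something this page can confirm.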

Latest revision as of 14:35, 3 February 2015


Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels

December 16–20 at the University of Florida, Gainesville

Digitizing the Past and Present for the Future

Quick Links for Transcription Hackathon Workshop
Transcription Hackathon Workshop Agenda
Transcription Hackathon Workshop Biblio Entries
Transcription Hackathon Workshop Report


Agenda and Logistics

Media

Report

Coordination

Presentations

Development Resources

  • GitHub organization for this Transcription Hackathon
  • Four existing crowdsourcing datasets from Notes From Nature. The datasets contain transcriptions of different types of collection labels. Read more here. The datasets were anonymized and then shared only with the hackathon participants through Dropbox. They will be made public once NfN gives definitive approval.
    • Calbug dataset
    • Herbarium labels—The filenames with "USAM_" represent a nearly complete set of recent transcriptions from a collection (the University of South Alabama Herbarium), four replicates for most specimens (I think).
    • Macrofungi labels
    • Ornithological dataset
  • For those interested in experimenting with the images that have been used for public participation in transcription:
    • Herbarium label images: the set of ca. 16,000 "USAM" images used for some of the herbarium transcriptions is available at USAM Herbarium Images. This is several GB worth of image files. To get them, you could use the DownloadThemAll Firefox plugin.
  • CYWG iDigBio Image Ingestion Appliance:
    • The appliance can ingest the images used by the crowdsourcing service into iDigBio storage and make them publicly accessible over HTTP. The appliance can export the mapping between image filenames and URLs in CSV format.
  • Gold Images from aOCR Hackathon:
    • CSV files with URLs for the images on the iDigBio beta server (uploaded by the Image Ingestion Appliance): ent, herb, lichens.
  • Code from the aOCR Hackathon:
    • HandwritingDetection (https://github.com/idigbio-aocr): an algorithm that sorts images into three sets (no handwriting, little handwriting with mostly typed or printed text, and lots of handwriting) based on the noise generated by the OCR software. Read more at Ben's blog. This could be used to rank images by how much they need human transcription.
    • Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names): link to be provided by aOCR?
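
The replicate transcriptions and the dictionaries mentioned above suggest a simple reconciliation step. The Python sketch below shows one possible approach: a majority vote over the replicates of a single field, then snapping the winner to the closest dictionary entry. It is not the method used by any hackathon track. The function names, the normalization rule, and the 0.85 similarity cutoff are all assumptions chosen for illustration.

```python
from collections import Counter
from difflib import get_close_matches


def normalize(text):
    """Collapse whitespace and case so trivially different replicates compare equal."""
    return " ".join(text.split()).casefold()


def consensus(replicates):
    """Return (value, agreement) for one field transcribed several times.

    value is the most common spelling within the largest group of matching
    replicates; agreement is the fraction of non-empty replicates in that group.
    """
    filled = [r for r in replicates if r and r.strip()]
    if not filled:
        return None, 0.0
    groups = Counter(normalize(r) for r in filled)
    key, votes = groups.most_common(1)[0]
    spellings = Counter(" ".join(r.split()) for r in filled if normalize(r) == key)
    return spellings.most_common(1)[0][0], votes / len(filled)


def snap_to_dictionary(value, dictionary, cutoff=0.85):
    """Replace value with its closest dictionary entry if one is similar enough.

    The cutoff is an illustrative assumption, not a tuned threshold.
    """
    matches = get_close_matches(value, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else value
```

With four replicates per specimen, an agreement score below 0.5 could flag a field for expert review, although that threshold is also only a suggestion.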

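The idea behind HandwritingDetection, as described above, is that OCR output gets noisier as the amount of handwriting grows. The Python sketch below illustrates that idea only. It is not the algorithm in the idigbio-aocr repository: the token pattern, the two thresholds, and the function names are assumptions made for this example.

```python
import re

# A token counts as "clean" if it is letters only or digits only,
# optionally followed by one punctuation mark (illustrative rule).
CLEAN_TOKEN = re.compile(r"(?:[A-Za-z]+|\d+)[.,;:]?")


def noise_ratio(ocr_text):
    """Fraction of whitespace-separated tokens that do not look like a word or number."""
    tokens = ocr_text.split()
    if not tokens:
        # Assumption: no recognized text at all is treated as maximally noisy.
        return 1.0
    noisy = sum(1 for token in tokens if not CLEAN_TOKEN.fullmatch(token))
    return noisy / len(tokens)


def handwriting_class(ocr_text, low=0.15, high=0.5):
    """Map the noise ratio to one of three classes; thresholds are hypothetical."""
    ratio = noise_ratio(ocr_text)
    if ratio < low:
        return "none"
    if ratio < high:
        return "little"
    return "lots"


def rank_for_transcription(ocr_by_image):
    """Return image names ordered from noisiest OCR output to cleanest."""
    return sorted(ocr_by_image, key=lambda name: noise_ratio(ocr_by_image[name]), reverse=True)
```

Images at the front of the ranked list would be the first candidates for human transcription, under the assumption that noisy OCR output indicates handwriting rather than a poor scan.
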
Hackathon Products

  • Some groups used the Coordination pages above to summarize products
  • Group 1 Target File Format