Transcription Hackathon Reconciliation of Replicates Planning: Difference between revisions

Latest revision as of 16:13, 20 December 2013

We worked on tools to help with reconciling and interpreting crowd-sourced data. One possible workflow might go like this:

   Start with crowd-sourced transcriptions.
   → reconcile ( → filter out irreconcilables?)
   if locality:
       → place name matching
       → geocoding
   if names:
       → name splitting
       → name list lookup

Reconciliation: There is a range of approaches:

Get a super-user to finalize / approve transcriptions, instead of trying to resolve multiple submissions.
Or, given multiple transcriptions, pick the one that minimizes some edit distance to the others.
Or, use sequence alignment tools to find the best transcription of subregions in a larger string. (The GitHub code does this.)
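
The edit-distance approach can be sketched as follows. This is a minimal illustration, not the StringTools code: it assumes plain Levenshtein distance as the metric and picks the submission whose summed distance to all the other submissions is smallest. The function names `levenshtein` and `pick_consensus` are made up for this example.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pick_consensus(transcriptions: Sequence[str]) -> str:
    """Return the submission with the smallest total distance to all submissions.

    Ties are broken in favour of the earliest submission (an arbitrary choice).
    """
    if not transcriptions:
        raise ValueError("need at least one transcription")
    return min(
        transcriptions,
        key=lambda t: sum(levenshtein(t, other) for other in transcriptions),
    )


# Illustrative replicates of one label, invented for this example.
replicates = [
    "Santa Cruz Co., California",
    "Santa Cruz Co, California",
    "Sants Cruz Co., Califorma",
]
best = pick_consensus(replicates)
```

A large minimum total distance could serve as a signal that the replicates are irreconcilable and should be filtered out, although the threshold would have to be chosen empirically.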

Locality: Again, there is a range of approaches, but you probably want to clean up the transcribed string (for example, with spelling correction) before sending it to a geocoding service.
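
One way to picture the clean-up and place name matching steps is the sketch below. It is only an assumption about how such a step might look: it normalizes whitespace and then uses fuzzy matching from Python's standard `difflib` module against a list of known place names. The list `known_places`, the cutoff value, and the function names are hypothetical.

```python
import difflib
import re
from typing import Optional, Sequence


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim stray commas/semicolons at the ends."""
    text = re.sub(r"\s+", " ", text).strip()
    return text.strip(" ,;")


def match_place_name(
    transcribed: str,
    known_places: Sequence[str],
    cutoff: float = 0.8,  # assumed similarity threshold; tune on real data
) -> Optional[str]:
    """Return the closest known place name, or None if nothing is similar enough."""
    cleaned = normalize(transcribed).lower()
    by_lowercase = {place.lower(): place for place in known_places}
    hits = difflib.get_close_matches(cleaned, list(by_lowercase), n=1, cutoff=cutoff)
    return by_lowercase[hits[0]] if hits else None


# Hypothetical gazetteer; a real one would come from an authority file.
known_places = ["Gainesville", "Tallahassee", "Jacksonville"]
matched = match_place_name("  Gainsville, ", known_places)
unmatched = match_place_name("Xyzzy", known_places)
```

Strings that come back as `None` would go to the geocoding service unchanged or be flagged for review, depending on how much manual checking the project can afford.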

Names: Processing will depend on the target database structure: maybe you just want one string, or maybe you want to try to separate the names. If the names are separated, they could be compared/linked to an outside list of collectors. (That could also be part of a larger QA process: does the collection date make sense, given the life span of the collector?) (The GitHub code tries to do this.)
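
The name splitting and life span check might look something like the sketch below. It is not taken from the GitHub code: the separators (`;`, `&`, and the word "and"), the `Collector` record, and the example entry are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Collector:
    """Hypothetical entry in an outside list of collectors."""
    name: str
    born: int
    died: Optional[int] = None  # None if unknown or still living


def split_names(collectors: str) -> List[str]:
    """Split a collector string on ';', '&' or the word 'and'.

    Commas are deliberately not treated as separators here, because
    "Smith, J." style names would be broken apart.
    """
    parts = re.split(r"\s*(?:;|&|\band\b)\s*", collectors)
    return [part.strip() for part in parts if part.strip()]


def date_is_plausible(collector: Collector, collection_year: int) -> bool:
    """Check that the collection year falls within the collector's life span."""
    if collection_year < collector.born:
        return False
    if collector.died is not None and collection_year > collector.died:
        return False
    return True


names = split_names("A. Example; B. Sample & C. Person")
example = Collector(name="A. Example", born=1850, died=1920)  # invented data
```

A stricter check could also require the collector to have reached some minimum age at the collection date, but that cutoff would be a project-specific decision.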

Older documents

Back to Transcription_Hackathon

@@ Line 1: / Line 1: @@
-== Coordination Tools  ==
+[[Category:Transcription Hackathon]]
+We worked on tools to help with reconciling and interpreting crowd-sourced data. One possible workflow might go like this:
-*[https://docs.google.com/document/d/1AOsU-lcQpzzzibXculxbpGUlLct3VS2coe0v62t2FvY/edit?usp=sharing GoogleDoc for Coordination]
+    Start with crowd-sourced transcriptions.
+    → '''reconcile''' ( → filter out irreconcilables?)
+    if locality:
+        → '''place name matching'''
+        → geocoding
+    if names:
+        → '''name splitting'''
+        → name list lookup
-'''Add your name and interests to the GoogleDoc, if this is a track that interests you!'''
+Reconciliation: Range of approaches:
+* Get a super-user to finalize / approve transcriptions, instead of trying to resolve multiple submissions
+* Or, given multiple transcriptions, pick one which minimizes some edit distance.
+* Or, use sequence alignment tools to find the best transcription of subregions in a larger string. (GitHub code does this.)
+Locality: Again, a range, but probably want to try to [http://norvig.com/spell-correct.html clean up] the transcribed string before going to geocoding service.
+Names: Processing will depend on target database structure: Maybe you just want one string, or maybe you want to try to separate names. If the names are separated, they could be compared/linked to an outside list of collectors. (... and that could be part of a larger QA process: Does the collection date make sense, given the life span of the collector?) (GitHub code tries to do this.)
+* [https://docs.google.com/presentation/d/1KqIprcRvAEqbKMmVmEqqEkbyg7DtxLk4RYgMGD15c4M Final presentation]
+* [https://github.com/idigbio-citsci-hackathon/StringTools GitHub]
+== Older documents  ==
+*[https://docs.google.com/document/d/1AOsU-lcQpzzzibXculxbpGUlLct3VS2coe0v62t2FvY GoogleDoc for Coordination]
+* [https://docs.google.com/document/d/1VxGU5sq2n0s9Ox84l7WSDUv4SKILgk7VKQewn3Zb5v0 GoogleDoc for presentation planning]
+Back to [[Transcription_Hackathon]]
