Transcription Hackathon Reconciliation of Replicates Planning: Difference between revisions
Austinmast (talk | contribs) (Created page with "== Coordination Tools == *[https://docs.google.com/document/d/1AOsU-lcQpzzzibXculxbpGUlLct3VS2coe0v62t2FvY/edit?usp=sharing GoogleDoc for Coordination] '''Add your name and in...") |
[[Category:Transcription Hackathon]]
We worked on tools to help with reconciling and interpreting crowd-sourced data. One possible workflow might go like this:
 Start with crowd-sourced transcriptions.
 → '''reconcile''' ( → filter out irreconcilables?)
 if locality:
   → '''place name matching'''
   → geocoding
 if names:
   → '''name splitting'''
   → name list lookup
Reconciliation: There is a range of approaches:
* Get a super-user to finalize or approve transcriptions, instead of trying to resolve multiple submissions.
* Or, given multiple transcriptions, pick the one that minimizes some edit distance to the others.
* Or, use sequence alignment tools to find the best transcription of subregions in a larger string. (The GitHub code does this.)
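The edit-distance option can be sketched as follows. This is a minimal illustration, not the code in the GitHub repository: it assumes plain Levenshtein distance as the metric and picks the replicate with the smallest summed distance to all the others. The function names <code>levenshtein</code> and <code>pick_consensus</code> are made up for this example.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pick_consensus(transcriptions: Sequence[str]) -> str:
    """Return the transcription with the smallest summed distance to all the others.

    Assumption: ties are broken by submission order (the first one wins).
    """
    if not transcriptions:
        raise ValueError("need at least one transcription")
    return min(
        transcriptions,
        key=lambda t: sum(levenshtein(t, other) for other in transcriptions),
    )


if __name__ == "__main__":
    replicates = [
        "Gainesville, Alachua Co.",
        "Gainsville, Alachua Co.",
        "Gainesville, Alachua Co",
    ]
    print(pick_consensus(replicates))
```

Note that this only ever returns one of the submitted strings. If every replicate contains a different error, none of them is fully correct, which is the case the alignment-based approach is meant to handle.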
Locality: Again, there is a range of approaches, but you probably want to [http://norvig.com/spell-correct.html clean up] the transcribed string before sending it to a geocoding service.
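One way the clean-up and place name matching steps might look is sketched below. It is a simpler stand-in for the linked spelling-correction approach, using fuzzy matching from Python's standard <code>difflib</code> module. The place list, the <code>cutoff</code> value, and the function names are assumptions made for illustration; a real list would come from a gazetteer.

```python
import difflib
import re
from typing import Optional, Sequence

# Hypothetical place-name list, for illustration only.
KNOWN_PLACES = ["Gainesville", "Tallahassee", "Alachua County", "Leon County"]


def normalize(text: str) -> str:
    """Collapse runs of whitespace and strip stray punctuation from both ends."""
    return re.sub(r"\s+", " ", text).strip(" .,;:")


def match_place(raw: str,
                places: Sequence[str] = KNOWN_PLACES,
                cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known place name, or None if nothing is similar enough."""
    cleaned = normalize(raw).lower()
    lookup = {place.lower(): place for place in places}
    hits = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


if __name__ == "__main__":
    print(match_place("  Gainsville, "))
    print(match_place("Xyzzy"))
```

Strings that come back as <code>None</code> could be passed to the geocoding service unchanged or flagged for review, depending on how much the service tolerates misspellings.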
Names: Processing will depend on the target database structure: maybe you just want one string, or maybe you want to separate the names. If the names are separated, they could be compared or linked to an outside list of collectors. (That could be part of a larger QA process: does the collection date make sense, given the life span of the collector?) (The GitHub code tries to do this.)
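A rough sketch of name splitting, list lookup, and the life-span check is given below. It is not the repository's code: the separators, the <code>Collector</code> record, and the single reference entry are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Collector:
    name: str
    born: int
    died: int


# Hypothetical reference list; the entry below is invented for illustration.
COLLECTORS: Dict[str, Collector] = {
    "j. doe": Collector("J. Doe", born=1850, died=1920),
}


def split_names(raw: str) -> List[str]:
    """Split a collector string on ';', '&', and the standalone word 'and'."""
    parts = re.split(r"\s*(?:;|&|\band\b)\s*", raw)
    return [part.strip() for part in parts if part.strip()]


def date_is_plausible(name: str, year: int) -> Optional[bool]:
    """True or False if the collector is in the list, None if the name is unknown."""
    collector = COLLECTORS.get(name.lower())
    if collector is None:
        return None
    return collector.born <= year <= collector.died


if __name__ == "__main__":
    for name in split_names("J. Doe & A. Smith"):
        print(name, date_is_plausible(name, 1900))
```

Commas are deliberately not treated as separators here, because "Doe, J." is a common way to write a single name. A real list lookup would probably also need fuzzy matching, since transcribed names vary in spelling and initials.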
* [https://docs.google.com/presentation/d/1KqIprcRvAEqbKMmVmEqqEkbyg7DtxLk4RYgMGD15c4M Final presentation]
* [https://github.com/idigbio-citsci-hackathon/StringTools GitHub]
== Older documents ==
* [https://docs.google.com/document/d/1AOsU-lcQpzzzibXculxbpGUlLct3VS2coe0v62t2FvY GoogleDoc for Coordination]
* [https://docs.google.com/document/d/1VxGU5sq2n0s9Ox84l7WSDUv4SKILgk7VKQewn3Zb5v0 GoogleDoc for presentation planning]
Back to [[Transcription_Hackathon]]
Latest revision as of 16:13, 20 December 2013