2013 hackathon data elements
Target Data Elements
Primary scoring for critical items:
dwc:catalogNumber
dwc:recordedBy
dwc:recordNumber
dwc:verbatimEventDate
aocr:verbatimScientificName

Secondary scoring for other key items:
aocr:verbatimInstitution
dwc:datasetName
dwc:verbatimLocality
dwc:country
dwc:stateProvince
dwc:county
dwc:verbatimLatitude
dwc:verbatimLongitude

Lastly, scoring for optional items:
dwc:eventDate
dwc:scientificName
dwc:decimalLatitude
dwc:decimalLongitude
dwc:fieldNotes
dwc:sex
dwc:dateIdentified
dwc:identifiedBy
Evaluation
We will attempt to provide services that can validate the outcomes of hackathon deliverables. This hackathon is not structured as a competition, but we felt it would be beneficial for participants to have some baseline to evaluate the effectiveness of their methods.
OCR Text Evaluation
Evaluation of OCR output will be based on a comparison with gold-standard hand-typed output, using confusion-matrix-like criteria that assess word presence, word correctness, and the avoidance of non-text garbage regions. We will attempt to avoid penalizing attempts at text recognition in barcode and handwritten regions.
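To make the idea of word presence and word correctness concrete, here is a minimal sketch in Python. It is not the scoring service itself. It assumes a simple bag-of-words comparison that ignores word position and treats any OCR word missing from the gold text as garbage; the function name word_scores and the sample label text are made up for illustration.

```python
from collections import Counter


def word_scores(gold_text, ocr_text):
    """Compare OCR output to hand-typed gold text as bags of words.

    Assumption: whitespace tokenization and exact, case-sensitive matching.
    Returns (precision, recall):
      precision = share of OCR words that also occur in the gold text
      recall    = share of gold words that the OCR output recovered
    """
    gold = Counter(gold_text.split())
    ocr = Counter(ocr_text.split())
    # Multiset intersection: each gold word can be matched at most once.
    matched = sum((gold & ocr).values())
    gold_total = sum(gold.values())
    ocr_total = sum(ocr.values())
    precision = matched / ocr_total if ocr_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    return precision, recall


# Illustrative, made-up label text: one misread word and one garbage token.
gold_label = "Flora of Texas Travis County"
ocr_label = "Flora of Texos Travis County ||~"
precision, recall = word_scores(gold_label, ocr_label)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A real evaluation would also need some way to exclude barcode and handwritten regions before counting, which this sketch does not attempt.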
Parsed Field Evaluation
The effectiveness of parsing will be evaluated with a confusion matrix. Each row is named for one of the possible element names for parts of a label (the correct element), and the columns carry the same names (the element that was assigned). Counts along the diagonal represent the number of items that were tagged correctly. For example, a county that is correctly labeled as a county adds one to the diagonal. If a county is incorrectly marked as a stateProvince, a 1 is added to the “county” row under the stateProvince column. This format therefore provides a count of correct classifications as well as counts of false positives and false negatives. We will calculate precision, recall, F-score, and potentially other measures.
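The confusion matrix and the derived measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the (gold, predicted) pair representation, and the sample data are hypothetical, and the F-score shown is the balanced F1 variant.

```python
from collections import Counter


def confusion_matrix(pairs):
    """Count (gold, predicted) label pairs.

    The first label plays the role of the row (correct element),
    the second the column (element assigned by the parser).
    """
    return Counter(pairs)


def scores(matrix, label):
    """Return (precision, recall, F1) for one element name."""
    tp = matrix[(label, label)]  # diagonal cell
    # Column total minus the diagonal: other elements tagged as `label`.
    fp = sum(n for (gold, pred), n in matrix.items()
             if pred == label and gold != label)
    # Row total minus the diagonal: `label` items tagged as something else.
    fn = sum(n for (gold, pred), n in matrix.items()
             if gold == label and pred != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up example: one county was mistakenly tagged as a stateProvince.
pairs = [
    ("county", "county"),
    ("county", "county"),
    ("county", "stateProvince"),
    ("stateProvince", "stateProvince"),
    ("country", "country"),
]
matrix = confusion_matrix(pairs)
for name in ("county", "stateProvince", "country"):
    p, r, f = scores(matrix, name)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this example the single mislabeled county lowers recall for county (a false negative) and precision for stateProvince (a false positive), which is how one off-diagonal cell feeds both error counts.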
Depending on discussion in the broader community, we may change our minds about which fields belong in each of the categories above. For now, the fields above should be seen as the ones of general interest; we can be flexible and will discuss our evaluation strategy further with regard to the primary / secondary / optional tiers.
Note that extra credit will be given to those who manage to convert their data from CSV to XML format. Extra credit may also be given to those who arrange their CSV columns in the order in which the fields appear in the image.
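As a rough sketch of the CSV-to-XML conversion, the Python below turns each CSV row into one XML element using only the standard library. The target XML layout is not specified here, so the records and record element names are assumptions, as are the function name csv_to_xml and the sample row. The dwc namespace URI is the Darwin Core terms namespace; the aocr URI is a placeholder, not the real one.

```python
import csv
import io
import xml.etree.ElementTree as ET

NAMESPACES = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "aocr": "http://example.org/aocr/",  # placeholder URI (assumption)
}


def csv_to_xml(csv_text):
    """Convert CSV text with prefixed headers (e.g. dwc:county) to XML.

    Assumption: every column header has the form prefix:name with a
    prefix listed in NAMESPACES. Empty cells are skipped.
    """
    for prefix, uri in NAMESPACES.items():
        ET.register_namespace(prefix, uri)
    root = ET.Element("records")  # hypothetical wrapper element
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        # DictReader keeps the column order of the CSV header.
        for column, value in row.items():
            if not value:
                continue
            prefix, _, local = column.partition(":")
            field = ET.SubElement(record, f"{{{NAMESPACES[prefix]}}}{local}")
            field.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample row for illustration only.
sample = "dwc:catalogNumber,dwc:recordedBy\nTEX0001,A. Collector\n"
print(csv_to_xml(sample))
```

Because the fields are written in header order, a CSV whose columns already follow the order of the fields on the label would carry that order over into the XML.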