Data Quality Toolkit 2024
Revision as of 15:21, 14 February 2024
Overview
This page was created to aggregate common data quality issues and potential solutions to those issues in collection management systems and CMS-agnostic tools. Data quality issues are grouped into data categories, and tutorials are provided for (1) identifying and (2) fixing the issues.
This page was inspired by Bob Mesibov's Data Cleaner's Cookbook (https://www.datafix.com.au/cookbook/).
Catalog Numbers and Other Identifiers
Duplicate Catalog Numbers
Problem: The same catalog number is used multiple times within your dataset. (This duplication may or may not be intentional, depending on your collection's policies. It is generally best to avoid duplicating catalog numbers when possible.)
How to FIND this Problem in Your Dataset:
- Arctos:
- Excel:
- OpenRefine:
- Specify:
- Symbiota:
- TaxonWorks:
How to FIX this Problem in Your Dataset:
- Arctos:
- Excel:
- OpenRefine:
- Specify:
- Symbiota:
- TaxonWorks:
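For collections working from an exported file rather than inside one of the tools above, duplicate catalog numbers can also be found with a short script. The following is a minimal, CMS-agnostic sketch in Python. It assumes the records have already been loaded as dictionaries (for example with `csv.DictReader`) and that the catalog number lives in a column named `catalogNumber`; both the column name and the function name are assumptions for illustration, not part of any particular system.

```python
from collections import Counter

def find_duplicate_catalog_numbers(records, field="catalogNumber"):
    """Return {catalog number: count} for values that appear more than once."""
    # Strip surrounding whitespace so " ABC-1" and "ABC-1" count as the same value.
    values = [(rec.get(field) or "").strip() for rec in records]
    # Blank catalog numbers are ignored here; they are a separate problem.
    counts = Counter(v for v in values if v)
    return {value: n for value, n in counts.items() if n > 1}

# Hypothetical example records.
records = [
    {"catalogNumber": "ABC-1"},
    {"catalogNumber": "ABC-2"},
    {"catalogNumber": " ABC-1"},
    {"catalogNumber": ""},
]
duplicates = find_duplicate_catalog_numbers(records)
print(duplicates)
```

Whether a flagged value is a true error still depends on the collection's policies, so the output is best treated as a review list rather than a list of records to change automatically.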
Dates
Identified Date Earlier than Collected Date
Problem: The date the specimen was identified (the dateIdentified field) is earlier than the date the specimen was collected (the eventDate field).
How to FIND this Problem in Your Dataset:
- Arctos:
- Excel:
- OpenRefine:
- Specify:
- Symbiota:
- TaxonWorks:
How to FIX this Problem in Your Dataset:
- Arctos:
- Excel:
- OpenRefine:
- Specify:
- Symbiota:
- TaxonWorks:
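Outside of a specific system, the comparison described in the problem statement can be sketched in a few lines of Python. This sketch assumes that both dateIdentified and eventDate hold a single full ISO 8601 day (YYYY-MM-DD). Records with blank values, partial dates, or date ranges are skipped rather than guessed at, so they would need a separate check. The function names and the sample records are hypothetical.

```python
from datetime import date

def parse_iso_day(text):
    """Return a date for a full ISO day string such as 2020-05-17, else None."""
    if not text:
        return None
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        # Blank strings, ranges, and free-text dates end up here.
        return None

def identified_before_collected(records):
    """Return the records whose dateIdentified is earlier than their eventDate."""
    flagged = []
    for rec in records:
        identified = parse_iso_day(rec.get("dateIdentified"))
        collected = parse_iso_day(rec.get("eventDate"))
        # Only compare when both dates could be parsed.
        if identified and collected and identified < collected:
            flagged.append(rec)
    return flagged

# Hypothetical example records.
records = [
    {"catalogNumber": "A-1", "eventDate": "2020-05-17", "dateIdentified": "2021-01-10"},
    {"catalogNumber": "B-2", "eventDate": "2020-05-17", "dateIdentified": "2019-03-02"},
    {"catalogNumber": "C-3", "eventDate": "2020-05-01/2020-05-03", "dateIdentified": "2019-01-01"},
    {"catalogNumber": "D-4", "eventDate": "", "dateIdentified": "2019-01-01"},
]
flagged = identified_before_collected(records)
print([rec["catalogNumber"] for rec in flagged])
```

Skipping unparseable values is a deliberate choice in this sketch: it keeps false positives out of the review list, at the cost of missing problems hidden in ranges or partial dates.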
Geography
Misspelled Geographic Unit Names
Problem: The geographic units (e.g., country, state, county) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
How to FIND this Problem in Your Dataset:
- Arctos:
- Excel:
- OpenRefine:
- Specify:
- Symbiota: Use the Geography Cleaning Tools (https://biokic.github.io/symbiota-docs/coll_manager/data_cleaning/geography/)
- TaxonWorks:
How to FIX this Problem in Your Dataset:
- Arctos:
- Excel:
- OpenRefine:
- Specify:
- Symbiota: Use the Geography Cleaning Tools (https://biokic.github.io/symbiota-docs/coll_manager/data_cleaning/geography/)
- TaxonWorks:
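As a CMS-agnostic illustration of matching geographic unit names against an existing list, the sketch below uses Python's standard `difflib` module to find values that are missing from a reference list and to suggest the closest reference name. The reference list, the similarity cutoff of 0.8, and the function name are all assumptions chosen for the example; a real reference list would come from whatever geographic authority the collection uses.

```python
import difflib

def suggest_geography_fixes(values, reference_names, cutoff=0.8):
    """Map each value missing from the reference list to its closest reference name, or None."""
    # Compare case-insensitively so "new mexico" still matches "New Mexico".
    by_lower = {name.lower(): name for name in reference_names}
    suggestions = {}
    for value in values:
        key = value.strip().lower()
        if not key or key in by_lower:
            continue  # blank, or already a known name
        close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
        suggestions[value] = by_lower[close[0]] if close else None
    return suggestions

# Hypothetical reference list and observed values.
reference_states = ["Arizona", "California", "New Mexico"]
observed = ["Arizona", "Californa", "new mexico", "Atlantis"]
suggestions = suggest_geography_fixes(observed, reference_states)
print(suggestions)
```

A value mapped to `None` has no close match in the reference list; it may be a valid name that the list lacks rather than a misspelling, so suggestions should be reviewed before any values are changed.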
Taxonomy
Misspelled Taxonomic Names
Problem: Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
How to FIND this Problem in Your Dataset:
- Arctos:
- Excel:
- OpenRefine:
- Specify:
- Symbiota: Use the Taxonomic Cleaning Tool (https://biokic.github.io/symbiota-docs/coll_manager/data_cleaning/taxonomy/)
- TaxonWorks:
How to FIX this Problem in Your Dataset:
- Arctos:
- Excel:
- OpenRefine:
- Specify:
- Symbiota: Use the Taxonomic Cleaning Tool (https://biokic.github.io/symbiota-docs/coll_manager/data_cleaning/taxonomy/)
- TaxonWorks:
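When no taxonomic database is at hand, one rough way to surface likely misspellings is to look inside the dataset itself: a name that appears only once and is very similar to a name that appears more often is a candidate typo. The Python sketch below illustrates that heuristic with the standard `difflib.SequenceMatcher`. The thresholds (`min_ratio`, `max_rare_count`) and the function name are assumptions for illustration, and this heuristic is a stand-in for, not a replacement of, matching against a taxonomic authority.

```python
from collections import Counter
from difflib import SequenceMatcher

def flag_rare_spelling_variants(names, min_ratio=0.85, max_rare_count=1):
    """Map each rarely used name to the most similar, more frequently used name."""
    counts = Counter(name.strip() for name in names if name and name.strip())
    flagged = {}
    for rare, rare_count in counts.items():
        if rare_count > max_rare_count:
            continue  # used often enough to be treated as an established spelling
        best_name, best_ratio = None, 0.0
        for common, common_count in counts.items():
            if common_count <= rare_count:
                continue  # only compare against names that are used more often
            ratio = SequenceMatcher(None, rare.lower(), common.lower()).ratio()
            if ratio >= min_ratio and ratio > best_ratio:
                best_name, best_ratio = common, ratio
        if best_name:
            flagged[rare] = best_name
    return flagged

# Hypothetical example names.
names = [
    "Quercus alba", "Quercus alba", "Quercus alba",
    "Quercus abla",
    "Quercus rubra", "Quercus rubra",
]
variants = flag_rare_spelling_variants(names)
print(variants)
```

Valid names can legitimately differ by only a letter or two, so every flagged pair needs a manual check before anything is corrected.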