Arctos Data Quality Toolkit
Revision by Jegelewicz; edit summary: https://github.com/ArctosDB/Arctos-Presentations/issues/77#issuecomment-2030181890
Latest revision as of 10:24, 2 April 2024
Overview
This toolkit contains Arctos-specific resources for the Data Quality Toolkit 2024. For a detailed overview of data quality tools in Arctos, see Arctos Data Quality Checks, Reports, and Tools (https://handbook.arctosdb.org/documentation/data_quality).
Catalog Numbers and Other Identifiers
Duplicate Catalog Numbers
Problem: The same catalog number is used multiple times within your dataset. (This duplication may or may not be intentional, depending on your collection's policies. It is generally best not to duplicate catalog numbers when possible.)
Solution: Catalog numbers must match the expected format for the collection and may not already exist in Arctos. Duplicate catalog numbers are not allowed; any duplicate of an existing number will generate an error and fail to upload.
https://handbook.arctosdb.org/documentation/data_quality#catalog-numbers
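A batch could be pre-screened for internal duplicates before upload. The following is a minimal sketch, not Arctos code: the function name and sample values are hypothetical, and it only checks the batch against itself, not against numbers already in Arctos.

```python
from collections import Counter

def find_duplicate_catalog_numbers(catalog_numbers):
    """Return the catalog numbers that occur more than once, sorted."""
    # Strip surrounding whitespace so "A-7 " and "A-7" count as the same value.
    counts = Counter(number.strip() for number in catalog_numbers)
    return sorted(number for number, count in counts.items() if count > 1)

# Hypothetical sample batch.
print(find_duplicate_catalog_numbers(["1001", "1002", "1001", "A-7", "A-7 "]))
```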
Dates
Date Hasn't Happened Yet
Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is in the future.
Solution: Future dates of collection (dates that fall after the current date) are not allowed.
https://handbook.arctosdb.org/documentation/data_quality#dates
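The rule above amounts to comparing a date with the current date. As a hedged illustration (the function name is hypothetical, and the sketch assumes complete YYYY-MM-DD dates rather than partial ones):

```python
from datetime import date

def is_future_date(iso_date, today=None):
    """Return True if an ISO 8601 date (YYYY-MM-DD) falls after today."""
    # Allow the reference date to be passed in so the check is repeatable.
    today = today or date.today()
    return date.fromisoformat(iso_date) > today

print(is_future_date("2999-01-01"))
```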
Date is Suspiciously Old
Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is outside the expected historical date range. The expected date range depends on the institution, but it is unlikely that most collections have specimens with dates prior to 1600.
Solution: Many legitimate very old dates exist. However, a date of collection or identification that falls before the birth date of the collector or determiner will trigger a data quality notification in Arctos.
https://handbook.arctosdb.org/documentation/data_quality#dates-1
Identified Date Earlier than Collected Date
Problem: The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).
Solution: Arctos supports more than collecting, so something may legitimately be identified (as in an observation) before it is collected. A curatorial report nevertheless flags this situation for review.
https://handbook.arctosdb.org/documentation/data_quality#dates-1
Year, Month, and Day Values Do Not Match Date
Problem: The event year, month, and day values do not match the provided event date. The event date is often the date of collection for preserved specimens.
Solution: Dates are always entered as a single value. Components (year, month, day, time) are extracted at the time of request, never stored as separate terms.
https://handbook.arctosdb.org/documentation/data_quality#dates
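Deriving the components from one stored value means they cannot disagree with the date. A small sketch of the idea, assuming a complete ISO 8601 date; the function name is hypothetical and this is not how Arctos itself is implemented:

```python
from datetime import date

def date_components(iso_date):
    """Derive year, month, and day from a single ISO 8601 date string."""
    parsed = date.fromisoformat(iso_date)
    # The components are computed on request rather than stored separately.
    return {"year": parsed.year, "month": parsed.month, "day": parsed.day}

print(date_components("2024-04-02"))
```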
Geography
Coordinates are Zero
Problem: The provided latitude and longitude values are 0.
Solution: Such a place exists, so these coordinates are acceptable. However, if they do not fall inside the associated higher geography polygon, a data quality report will be generated.
https://handbook.arctosdb.org/documentation/data_quality#locality
Coordinates Do Not Fall Within Named Geographic Unit
Problem: The provided coordinates do not fall within the geographic boundaries of the named country, state, and/or county.
Solution: When the assigned coordinates plus error for a locality do not fall within its higher geography polygon, a data quality report is generated for all collections using that locality. This highlights improper negation as well as coordinate/geography mismatches.
https://handbook.arctosdb.org/documentation/data_quality#locality
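A point-in-polygon test is one way to picture this check. The sketch below uses the standard ray-casting technique on a flat plane; it ignores the coordinate error radius and spherical geometry, the function name and sample polygon are hypothetical, and it is not a description of how Arctos performs the comparison.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex connects
    back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular "higher geography" in the western hemisphere.
square = [(-110.0, 35.0), (-100.0, 35.0), (-100.0, 45.0), (-110.0, 45.0)]
print(point_in_polygon(-105.0, 40.0, square))  # True
print(point_in_polygon(105.0, 40.0, square))   # False: longitude sign flipped
```

The second call shows how an improperly negated longitude lands outside the polygon and would be caught by the same test.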
Georeference Metadata with no Associated Georeference
Problem: Metadata fields regarding coordinates, such as coordinateUncertaintyInMeters, georeferenceProtocol, georeferenceSources, georeferencedBy, georeferenceRemarks, and geodeticDatum are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferenceRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., decimalLatitude, decimalLongitude, or verbatimCoordinates).
Solution: Datum must be supplied with coordinates, but cannot be supplied without them. In addition, georeference protocol and georeference error cannot be supplied without coordinates, although coordinates can be supplied without them. All spatial data are converted to WGS84 and datum is explicitly provided. Input datum is also retained.
https://handbook.arctosdb.org/documentation/data_quality#georeference
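These dependencies between coordinates and their metadata can be written as a short set of rules. This is a sketch only: the dictionary keys and the function name are hypothetical and do not correspond to actual Arctos field names.

```python
def georeference_problems(record):
    """Return rule violations for a dict using hypothetical georeference keys."""
    has_coordinates = (
        record.get("latitude") is not None and record.get("longitude") is not None
    )
    problems = []
    # Datum is required whenever coordinates are present.
    if has_coordinates and record.get("datum") in (None, ""):
        problems.append("datum is required when coordinates are supplied")
    # Datum, protocol, and error are meaningless without coordinates.
    if not has_coordinates:
        for field in ("datum", "georeference_protocol", "coordinate_error"):
            if record.get(field) not in (None, ""):
                problems.append(f"{field} cannot be supplied without coordinates")
    return problems

print(georeference_problems({"datum": "WGS84"}))
```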
Elevation is Unlikely
Problem: Elevation values are either too high (>17000 m) or too low (<-11000 m) to occur on Earth.
Solution: Elevation values are constrained to avoid elevations or depths not possible on Earth.
https://handbook.arctosdb.org/documentation/data_quality#elevation-and-depth
Improperly Negated Latitudes/Longitudes
Problem: The sign of the latitude (decimalLatitude) or longitude (decimalLongitude) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.
Solution: When the assigned coordinates plus error for a locality do not fall within its higher geography polygon, a data quality report is generated for all collections using that locality. This highlights improper negation as well as coordinate/geography mismatches.
https://handbook.arctosdb.org/documentation/data_quality#locality
Invalid Coordinates
Problem: Coordinates deviate from accepted ranges or formats, such as decimalLatitude values outside -90 to 90 or decimalLongitude values outside -180 to 180. verbatimCoordinates must be valid coordinates in decimal degrees, degrees decimal minutes, or degrees minutes seconds.
Solution: Coordinate values are datatyped to disallow invalid entries.
https://handbook.arctosdb.org/documentation/data_quality#georeference
Lower Geography Values are Provided, but No Higher Geography
Problem: Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
Solution: Higher geography in Arctos is a controlled vocabulary of data objects associated with spatial polygons. Components are extracted on demand, never stored as separate terms.
https://handbook.arctosdb.org/documentation/data_quality#locality
Minimum and Maximum Elevation Values Mismatched
Problem: The minimum elevation (minimumElevationInMeters) has a greater value than the maximum elevation (maximumElevationInMeters).
Solution: Lowest elevation or depth cannot be more than highest.
https://handbook.arctosdb.org/documentation/data_quality#elevation-and-depth
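Both elevation rules in this toolkit (a plausible range, and minimum not exceeding maximum) can be checked together. In this sketch the default bounds are taken from the "Elevation is Unlikely" problem statement above and are an assumption, not the limits Arctos actually enforces; the function name is hypothetical.

```python
def elevation_problems(minimum_m, maximum_m, lowest_allowed=-11000, highest_allowed=17000):
    """Return a list of problems for a minimum/maximum elevation pair in meters."""
    problems = []
    # Each value must fall inside the assumed plausible range.
    for label, value in (("minimum", minimum_m), ("maximum", maximum_m)):
        if not lowest_allowed <= value <= highest_allowed:
            problems.append(f"{label} elevation {value} m is outside the plausible range")
    # The lowest value may not exceed the highest.
    if minimum_m > maximum_m:
        problems.append("minimum elevation is greater than maximum elevation")
    return problems

print(elevation_problems(250, 100))
```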
Mismatched Country and CountryCode Values
Problem: The provided value for country and countryCode do not match.
Solution: countryCode is not part of Arctos because adding it would in many cases introduce unnecessary ambiguity.
https://handbook.arctosdb.org/documentation/data_quality#locality
Mismatched Geographic Terms
Problem: A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
Solution: Higher geography in Arctos is a controlled vocabulary composed of terms from GADM and IHO World Seas supported by shapes. Higher geography must match a term in this vocabulary, so any apparent “misspelling” is an intentional match to the relevant authority.
https://handbook.arctosdb.org/documentation/data_quality#higher-geography
Missing Geodetic Datum
Problem: Geodetic datum is a key piece of a properly georeferenced specimen, but is usually left blank. Although the datum is commonly assumed to be ‘WGS84’, it should be recorded explicitly.
Solution: Datum must be supplied with coordinates.
https://handbook.arctosdb.org/documentation/data_quality#georeference
Missing Latitudes/Longitudes
Problem: A record has a latitude value, but not a longitude value, or vice versa.
Solution: Latitude and longitude must either both be NULL or both include a value.
https://handbook.arctosdb.org/documentation/data_quality#georeference
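The pairing rule, together with the coordinate ranges described under "Invalid Coordinates" above, can be sketched as follows. The function name is hypothetical and None stands in for NULL; this illustrates the rules rather than the Arctos implementation.

```python
def coordinate_pair_problems(latitude, longitude):
    """Check that decimal latitude and longitude are paired and within range."""
    # Exactly one of the two being empty breaks the pairing rule.
    if (latitude is None) != (longitude is None):
        return ["latitude and longitude must both be empty or both have a value"]
    if latitude is None:
        return []  # both empty is acceptable
    problems = []
    if not -90 <= latitude <= 90:
        problems.append("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        problems.append("longitude must be between -180 and 180")
    return problems

print(coordinate_pair_problems(64.8, None))
```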
Misspelled Geographic Unit Names
Problem: The geographic units (e.g., country, state/province, county) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
Solution: Higher geography in Arctos is a controlled vocabulary composed of terms from GADM and IHO World Seas supported by shapes. Higher geography must match a term in this vocabulary, so any apparent “misspelling” is an intentional match to the relevant authority.
https://handbook.arctosdb.org/documentation/data_quality#higher-geography
Taxonomy
Misspelled or Invalid Taxonomic Names
Problem: Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
Solution: Identifications in Arctos can be made in several formats; however, they all must include a reference to at least one term from the Taxon Name Table. This table is maintained by Arctos Operators with manage_taxonomy permissions and is not guaranteed to exclude misspellings or errors. When these are discovered, there are paths for linking poorly formatted names to the correct version and/or quarantining such names from use while still allowing them to be present for the purposes of search and discoverability.
https://handbook.arctosdb.org/documentation/data_quality#identification-taxon-names
Taxon pages in Arctos include external validation through comparisons with select taxonomic authorities, including the World Register of Marine Species (WoRMS), Encyclopedia of Life (EOL), the Global Biodiversity Information Facility (GBIF), and Wikipedia, among others. This tool is also engaged whenever a new name is added to the taxonomic name table to help avoid the addition of misspellings and misused names.
https://handbook.arctosdb.org/documentation/data_quality#taxonomy
Unknown Higher Taxonomy
Problem: Species may be missing higher taxonomic information.
Solution: The Taxonomy Gap tool (https://arctos.database.museum/Reports/flat_taxonomy_gap.cfm) in Arctos allows for review of taxonomic classifications with missing terms (Family, Order, etc.) or with no associated local classification. Arctos also pulls data from GlobalNames, so records are generally still discoverable even when local taxonomic sources are missing terms or entire classifications.
https://handbook.arctosdb.org/documentation/data_quality#taxonomy
Other Issues
Incorrect Character Encodings
Problem: Data inconsistencies arise when incorrect character encodings are used during data manipulation or transfer. This issue occurs when datasets are opened, downloaded, or imported across different software platforms, leading to misinterpretation and garbled text. For instance, special characters like accents or symbols (e.g., the é in Carl Linné) may be rendered incorrectly, affecting the readability and accuracy of the data.
Solution: No fields may include a non-printing character, leading spaces, or trailing spaces.
https://handbook.arctosdb.org/documentation/data_quality#nonprinting-characters
Incorrect Line Endings
Problem: When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
Solution: No fields may include a non-printing character, leading spaces, or trailing spaces.
https://handbook.arctosdb.org/documentation/data_quality#nonprinting-characters
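A stray carriage return or other invisible character can be detected before upload. The sketch below treats any character in a Unicode "C" (control, format, and related) category as non-printing; that definition and the function name are assumptions and may differ from the rule Arctos applies.

```python
import unicodedata

def field_problems(value):
    """Report whitespace and non-printing character problems in one field value."""
    problems = []
    # Leading or trailing whitespace, including a stray CR or LF at either end.
    if value != value.strip():
        problems.append("leading or trailing whitespace")
    # Unicode general categories beginning with "C" are not printable characters.
    if any(unicodedata.category(ch).startswith("C") for ch in value):
        problems.append("non-printing character")
    return problems

print(field_problems("Ursus arctos\r"))
print(field_problems("Carl Linné"))
```

Accented letters such as é are ordinary printable characters, so a correctly encoded name passes this check.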
Invalid Individual Count
Problem: individualCount values may not make sense as a positive integer.
Solution: This is a curatorial assertion; Arctos applies no constraints.
https://handbook.arctosdb.org/documentation/data_quality#curatorial
Non-standardized BasisOfRecord Values
Problem: Values in the BasisOfRecord field do not match the recommended controlled vocabulary. While using standardized terms in this field is not strictly necessary, doing so does improve the discoverability and interoperability of your data.
The currently accepted values for BasisOfRecord include: MaterialEntity, PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
Note that even punctuation and capitalization differences in these values (e.g., Preserved Specimen) are discouraged.
Solution: Basis of record is required in Arctos and must match a controlled vocabulary (https://arctos.database.museum/info/ctDocumentation.cfm?table=ctcataloged_item_type) that includes the terms expected in the DarwinCore Archive prepared for GBIF. Collections can select a preferred value, and if the field is left blank during data entry, the preferred value is used automatically.
https://handbook.arctosdb.org/documentation/data_quality#basis-of-record
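Values that differ from an accepted term only in capitalization, spacing, or punctuation can be matched back to the standard form. This sketch uses the BasisOfRecord values listed in the problem statement above, not the Arctos controlled vocabulary itself; the function name is hypothetical.

```python
ACCEPTED_BASIS_OF_RECORD = (
    "MaterialEntity", "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
    "MaterialSample", "Event", "HumanObservation", "MachineObservation",
    "Taxon", "Occurrence", "MaterialCitation",
)

def suggest_basis_of_record(value):
    """Return the accepted term that matches value once case, spacing,
    and punctuation are ignored, or None if there is no match."""
    # Keep only letters and digits, then compare case-insensitively.
    key = "".join(ch for ch in value if ch.isalnum()).lower()
    lookup = {term.lower(): term for term in ACCEPTED_BASIS_OF_RECORD}
    return lookup.get(key)

print(suggest_basis_of_record("Preserved Specimen"))
```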