Symbiota Data Quality Toolkit: Difference between revisions

From iDigBio
(Replaced content with "Symbiota has created a Data Quality Toolkit on their [https://biokic.github.io/symbiota-docs/editor/quality/ Documentation Site].")
Tag: Replaced
 
[[Category:Data Quality]]
Symbiota has created a Data Quality Toolkit on their [https://biokic.github.io/symbiota-docs/editor/quality/ Documentation Site].
[[Category:Workshop]]
 
= Overview =
 
This toolkit contains Symbiota-specific resources for the [[Data Quality Toolkit 2024]]. Symbiota is the open-source software behind many popular portals such as SEINet, the Bryophyte Portal, the Lichen Portal, and InvertEBase. More information about Symbiota can be found on [https://symbiota.org/ their homepage].
 
[[File:LogoSymbiotaPNG400.png|300px]]
 
== Catalog Numbers and Other Identifiers ==
 
=== Duplicate Catalog Numbers ===
'''Problem:''' The same catalog number is used multiple times within your dataset. (This duplication may or may not be intentional, depending on your collection's policies. It is generally best to avoid duplicate catalog numbers when possible.)
 
'''Solution:''' Symbiota includes a built-in tool for identifying and resolving duplicate catalog numbers, described [https://biokic.github.io/symbiota-docs/coll_manager/data_cleaning/dupes/ in this tutorial]. This tool shows a list of all duplicates and allows the collection administrator to merge records if necessary.
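
For data held outside a portal (for example, a CSV export), the same kind of check can be sketched in Python. This is only an illustration, not the Symbiota tool: the function name is hypothetical, the sample catalog numbers are made up, and records are assumed to be dictionaries keyed by Darwin Core term names.

```python
from collections import Counter

def find_duplicate_catalog_numbers(records):
    """Return catalog numbers that appear more than once, with their counts."""
    counts = Counter(
        rec["catalogNumber"].strip()
        for rec in records
        if (rec.get("catalogNumber") or "").strip()  # skip blank values
    )
    return {number: n for number, n in counts.items() if n > 1}

# Made-up sample data; the trailing space on the third value is ignored.
records = [
    {"catalogNumber": "ASU0001"},
    {"catalogNumber": "ASU0002"},
    {"catalogNumber": "ASU0001 "},
    {"catalogNumber": ""},
]
print(find_duplicate_catalog_numbers(records))  # {'ASU0001': 2}
```

Whether a flagged duplicate should be merged remains a curatorial decision.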
 
== Dates ==
 
=== Date Hasn't Happened Yet ===
'''Problem:''' The date the specimen was [https://dwc.tdwg.org/terms/#dwc:dateIdentified identified], collected (often designated using the [https://dwc.tdwg.org/terms/#dwc:eventDate eventDate] field), or [https://dwc.tdwg.org/terms/#dwc:georeferencedDate georeferenced] is in the future.
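
The check described above can be sketched in Python for an exported dataset. This is a minimal sketch under assumptions: the function name is hypothetical, records are dictionaries keyed by Darwin Core term names, and dates are assumed to start with an ISO 8601 calendar date (YYYY-MM-DD). Values in other formats are skipped rather than judged.

```python
from datetime import date

DATE_FIELDS = ("eventDate", "dateIdentified", "georeferencedDate")

def future_date_fields(record, today=None):
    """Return the names of date fields whose value is later than today."""
    today = today or date.today()
    flagged = []
    for field in DATE_FIELDS:
        value = (record.get(field) or "").strip()
        try:
            parsed = date.fromisoformat(value[:10])
        except ValueError:
            continue  # blank, partial, or non-ISO values are not judged here
        if parsed > today:
            flagged.append(field)
    return flagged

record = {"eventDate": "2999-01-01", "dateIdentified": "1990-05-04"}
print(future_date_fields(record, today=date(2024, 3, 21)))  # ['eventDate']
```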
 
'''Solution:'''
 
=== Date is Suspiciously Old ===
 
'''Problem:''' The date the specimen was [https://dwc.tdwg.org/terms/#dwc:dateIdentified identified], collected (often designated using the [https://dwc.tdwg.org/terms/#dwc:eventDate eventDate] field), or [https://dwc.tdwg.org/terms/#dwc:georeferencedDate georeferenced] is outside the expected historical date range. The expected date range depends on the institution, but most collections are unlikely to hold specimens dated before 1600.
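
As an illustration, the year threshold can be applied to exported records in a few lines of Python. The function name and the constant are hypothetical, records are assumed to be dictionaries keyed by Darwin Core term names, and date values are assumed to begin with a four-digit year. The 1600 cutoff follows the rule of thumb above and should be adjusted to each collection.

```python
import re

EARLIEST_PLAUSIBLE_YEAR = 1600  # assumption: adjust to your collection's history

def suspiciously_old_fields(record, fields=("eventDate", "dateIdentified", "georeferencedDate")):
    """Return the names of date fields whose leading year precedes the cutoff."""
    flagged = []
    for field in fields:
        match = re.match(r"\s*(\d{4})", record.get(field) or "")
        if match and int(match.group(1)) < EARLIEST_PLAUSIBLE_YEAR:
            flagged.append(field)
    return flagged

# A year typed as 0190 instead of 1990 is a typical case this catches.
print(suspiciously_old_fields({"eventDate": "0190-06-12", "dateIdentified": "1901-01-01"}))
```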
 
'''Solution:'''
 
=== Identified Date Earlier than Collected Date ===
'''Problem:''' The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).
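
The comparison above can be sketched as follows. This assumes records are dictionaries keyed by Darwin Core term names and that both dates start with an ISO 8601 calendar date; the function name is hypothetical. Records where either date is missing or unparseable are not flagged.

```python
from datetime import date

def identified_before_collected(record):
    """True when dateIdentified precedes eventDate; False if either is unparseable."""
    try:
        collected = date.fromisoformat((record.get("eventDate") or "")[:10])
        identified = date.fromisoformat((record.get("dateIdentified") or "")[:10])
    except ValueError:
        return False
    return identified < collected

print(identified_before_collected(
    {"eventDate": "2001-07-04", "dateIdentified": "1999-01-01"}
))  # True
```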
 
'''Solution:'''
 
=== Year, Month, and Day Values Do Not Match Date ===
'''Problem:''' The event [https://dwc.tdwg.org/terms/#dwc:year year], [https://dwc.tdwg.org/terms/#dwc:month month], and [https://dwc.tdwg.org/terms/#dwc:day day] values do not match the provided [https://dwc.tdwg.org/terms/#dwc:eventDate event date]. The event date is often the date of collection for preserved specimens.
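
A minimal Python sketch of this consistency check is shown here. It assumes records are dictionaries keyed by Darwin Core term names and that eventDate begins with a single ISO 8601 calendar date (date ranges are not handled); the function name is hypothetical.

```python
from datetime import date

def ymd_matches_event_date(record):
    """Return True/False for a comparable record, or None when it cannot be compared."""
    try:
        event = date.fromisoformat((record.get("eventDate") or "")[:10])
        parts = (int(record["year"]), int(record["month"]), int(record["day"]))
    except (KeyError, TypeError, ValueError):
        return None
    return parts == (event.year, event.month, event.day)

print(ymd_matches_event_date(
    {"eventDate": "2015-07-04", "year": "2015", "month": "7", "day": "14"}
))  # False
```

Returning None for incomplete records keeps "cannot be checked" separate from "checked and mismatched".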
 
'''Solution:'''
 
== Geography ==
 
=== Coordinates are Zero ===
'''Problem:''' The provided latitude and longitude values are 0.
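
For exported data, this condition can be tested with a short Python function. The function name is hypothetical and records are assumed to be dictionaries keyed by Darwin Core term names. Values are compared numerically so that "0", "0.0", and "0.000" are all caught.

```python
def has_zero_coordinates(record):
    """True when both decimalLatitude and decimalLongitude are numerically zero."""
    try:
        lat = float(record.get("decimalLatitude"))
        lon = float(record.get("decimalLongitude"))
    except (TypeError, ValueError):
        return False  # missing or non-numeric coordinates are a different problem
    return lat == 0.0 and lon == 0.0

print(has_zero_coordinates({"decimalLatitude": "0", "decimalLongitude": "0.000"}))  # True
```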
 
'''Solution:'''
 
=== Coordinates Do Not Fall Within Named Geographic Unit ===
'''Problem:''' The provided coordinates do not fall within the geographic boundaries of the named country, state, and/or county.
 
'''Solution:'''
 
=== Georeference Metadata with no Associated Georeference ===
'''Problem:''' Metadata fields regarding coordinates, such as [https://dwc.tdwg.org/terms/#dwc:coordinateUncertaintyInMeters coordinateUncertaintyInMeters], [https://dwc.tdwg.org/terms/#dwc:georeferenceProtocol georeferenceProtocol], [https://dwc.tdwg.org/terms/#dwc:georeferenceSources georeferenceSources], [https://dwc.tdwg.org/terms/#dwc:georeferencedBy georeferencedBy], [https://dwc.tdwg.org/terms/#dwc:georeferenceRemarks georeferenceRemarks], and [https://dwc.tdwg.org/terms/#dwc:geodeticDatum geodeticDatum] are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferenceRemarks are used to indicate that a record was purposefully not georeferenced. However, the other metadata fields are rarely meaningful without associated coordinates (i.e., [https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude], [https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or [https://dwc.tdwg.org/terms/#dwc:verbatimCoordinates verbatimCoordinates]).
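
The rule above can be sketched in Python for an exported dataset. The function names are hypothetical and records are assumed to be dictionaries keyed by Darwin Core term names. The function reports which metadata fields are filled so that a reviewer can decide whether the case is intentional.

```python
METADATA_FIELDS = (
    "coordinateUncertaintyInMeters", "georeferenceProtocol", "georeferenceSources",
    "georeferencedBy", "georeferenceRemarks", "geodeticDatum",
)
COORDINATE_FIELDS = ("decimalLatitude", "decimalLongitude", "verbatimCoordinates")

def _filled(record, field):
    """True when the field holds something other than whitespace."""
    return bool((record.get(field) or "").strip())

def orphaned_georeference_metadata(record):
    """Return the filled metadata fields when no coordinate field is filled."""
    if any(_filled(record, f) for f in COORDINATE_FIELDS):
        return []
    return [f for f in METADATA_FIELDS if _filled(record, f)]

print(orphaned_georeference_metadata(
    {"geodeticDatum": "WGS84", "coordinateUncertaintyInMeters": "500"}
))  # ['coordinateUncertaintyInMeters', 'geodeticDatum']
```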
 
'''Solution:'''
 
=== Elevation is Unlikely ===
'''Problem:''' Elevation values are either too high (> 17000 m) or too low (< -11000 m) to occur on Earth.
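
These thresholds translate directly into a range check. In this sketch the function name and constants are hypothetical, records are assumed to be dictionaries keyed by Darwin Core term names, and the two Darwin Core elevation fields are assumed to be where elevation is stored.

```python
MAX_ELEVATION_M = 17000   # upper limit given above
MIN_ELEVATION_M = -11000  # lower limit given above

def unlikely_elevation_fields(record):
    """Return the elevation fields whose numeric value falls outside the limits."""
    flagged = []
    for field in ("minimumElevationInMeters", "maximumElevationInMeters"):
        try:
            value = float(record.get(field))
        except (TypeError, ValueError):
            continue  # blank or non-numeric values are not judged here
        if value > MAX_ELEVATION_M or value < MIN_ELEVATION_M:
            flagged.append(field)
    return flagged

print(unlikely_elevation_fields(
    {"minimumElevationInMeters": "1200", "maximumElevationInMeters": "120000"}
))  # ['maximumElevationInMeters']
```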
 
'''Solution:'''
 
=== Improperly Negated Latitudes/Longitudes ===
'''Problem:''' The sign of the latitude ([https://dwc.tdwg.org/terms/#dwc:decimalLatitude decimalLatitude]) or longitude ([https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude]) does not match the sign/hemisphere of the given country. For example, all longitudes in the contiguous United States should be negative.
 
'''Solution:'''
 
=== Invalid Coordinates ===
'''Problem:''' Coordinates fall outside accepted ranges or formats. decimalLatitude must be between -90 and 90, and decimalLongitude must be between -180 and 180. verbatimCoordinates must be valid coordinates in decimal degrees, degrees decimal minutes, or degrees minutes seconds.
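
The numeric part of this check (decimalLatitude within -90 to 90, decimalLongitude within -180 to 180) can be sketched in Python. The function name is hypothetical and records are assumed to be dictionaries keyed by Darwin Core term names. Parsing verbatimCoordinates is not attempted here.

```python
def coordinate_range_errors(record):
    """Return a list of messages describing invalid decimal coordinates."""
    errors = []
    for field, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        raw = (record.get(field) or "").strip()
        if not raw:
            continue  # missing coordinates are handled as a separate issue
        try:
            value = float(raw)
        except ValueError:
            errors.append(f"{field} is not a number: {raw!r}")
            continue
        if not -limit <= value <= limit:
            errors.append(f"{field} out of range: {value}")
    return errors

print(coordinate_range_errors({"decimalLatitude": "95.2", "decimalLongitude": "-111.9"}))
```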
 
'''Solution:'''
 
=== Lower Geography Values are Provided, but No Higher Geography ===
'''Problem:''' Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
 
'''Solution:'''
 
=== Minimum and Maximum Elevation Values Mismatched ===
'''Problem:''' The minimum elevation ([https://dwc.tdwg.org/terms/#dwc:minimumElevationInMeters minimumElevationInMeters]) has a greater value than the maximum elevation ([https://dwc.tdwg.org/terms/#dwc:maximumElevationInMeters maximumElevationInMeters]).
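
This comparison is simple to express in code. A minimal sketch, assuming records are dictionaries keyed by Darwin Core term names (the function name is hypothetical):

```python
def elevation_range_inverted(record):
    """True when minimumElevationInMeters is greater than maximumElevationInMeters."""
    try:
        low = float(record.get("minimumElevationInMeters"))
        high = float(record.get("maximumElevationInMeters"))
    except (TypeError, ValueError):
        return False  # one or both values missing or non-numeric
    return low > high

print(elevation_range_inverted(
    {"minimumElevationInMeters": "2100", "maximumElevationInMeters": "1200"}
))  # True
```

Flagged records often just need the two values swapped, but the label or field notes should be consulted before changing them.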
 
'''Solution:'''
 
=== Mismatched Country and CountryCode Values ===
'''Problem:''' The provided values for [https://dwc.tdwg.org/terms/#dwc:country country] and [https://dwc.tdwg.org/terms/#dwc:countryCode countryCode] do not match.
 
'''Solution:'''
 
=== Mismatched Geographic Terms ===
'''Problem:''' A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
 
'''Solution:'''
 
=== Missing Geodetic Datum ===
'''Problem:''' The geodetic datum is a key part of a properly georeferenced specimen record, but it is usually left blank. Although the datum is commonly assumed to be ‘WGS84’, it should be recorded explicitly rather than left to assumption.
 
'''Solution:'''
 
=== Missing Latitudes/Longitudes ===
'''Problem:''' A record has a latitude value, but not a longitude value, or vice versa.
 
'''Solution:''' To identify records with this problem in your dataset, use the [https://biokic.github.io/symbiota-docs/editor/edit/ Record Search form]. For Custom Field 1, select Decimal Latitude IS NULL. For Custom Field 2, select Decimal Longitude IS NOT NULL. Then conduct a similar search with Decimal Latitude IS NOT NULL and Decimal Longitude IS NULL. To fix this problem, you will need to review the records and either add lat/long values or remove the orphaned lat/long values.
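
For a dataset that has been exported from the portal, the two searches above amount to asking whether exactly one of the two fields is filled. A hedged Python sketch, with a hypothetical function name and records assumed to be dictionaries keyed by Darwin Core term names:

```python
def has_orphaned_coordinate(record):
    """True when exactly one of decimalLatitude / decimalLongitude is filled."""
    has_lat = bool((record.get("decimalLatitude") or "").strip())
    has_lon = bool((record.get("decimalLongitude") or "").strip())
    return has_lat != has_lon

print(has_orphaned_coordinate({"decimalLatitude": "33.42", "decimalLongitude": ""}))  # True
```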
 
=== Misspelled Geographic Unit Names ===
'''Problem:''' The geographic units (e.g., [https://dwc.tdwg.org/terms/#dwc:country country], [https://dwc.tdwg.org/terms/#dwc:stateProvince state/province], [https://dwc.tdwg.org/terms/#dwc:county county]) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
 
'''Solution:''' Use the [https://biokic.github.io/symbiota-docs/coll_manager/data_cleaning/geography/ Geography Cleaning Tools].
 
== Taxonomy ==
 
=== Misspelled or Invalid Taxonomic Names ===
'''Problem:''' Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
 
'''Solution:''' Use the [https://biokic.github.io/symbiota-docs/coll_manager/data_cleaning/taxonomy/ Taxonomic Cleaning Tool].
 
=== Unknown Higher Taxonomy ===
'''Problem:''' Records identified to species (or another lower rank) may be missing higher taxonomic information.
 
'''Solution:'''
 
== Other Issues ==
 
=== Incorrect Character Encodings ===
'''Problem:''' Data inconsistencies arise when incorrect character encodings are used during data manipulation or transfer. This issue occurs when datasets are opened, downloaded, or imported across different software platforms, leading to misinterpretation and garbled text. For instance, special characters like accents or symbols (e.g., the é in Carl Linné) may be rendered incorrectly, affecting the readability and accuracy of the data.
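
To make the failure concrete, the following Python sketch reproduces one common case: text saved as UTF-8 and then read as Latin-1. The function names are hypothetical, and the repair step only works when the text was garbled in exactly this way; other encoding mix-ups need a different pair of encodings.

```python
def simulate_misdecoding(text):
    """Encode as UTF-8, then decode as Latin-1, as mismatched software might."""
    return text.encode("utf-8").decode("latin-1")

def undo_misdecoding(text):
    """Reverse the mistake; may raise a UnicodeError if the text was not garbled this way."""
    return text.encode("latin-1").decode("utf-8")

garbled = simulate_misdecoding("Carl Linné")
print(garbled)                    # Carl LinnÃ©
print(undo_misdecoding(garbled))  # Carl Linné
```

Declaring the encoding explicitly whenever a file is opened or exported (for example, `open(path, encoding="utf-8")` in Python) avoids the problem at the source.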
 
'''Solution:'''
 
=== Incorrect Line Endings ===
'''Problem:''' When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
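
One way to remove the mismatch before import is to convert every line ending to LF. The sketch below uses a hypothetical function name and operates on text already read into memory. Note that it also converts line breaks embedded inside quoted CSV fields, which is usually acceptable but worth knowing.

```python
def normalize_line_endings(text):
    """Convert CRLF (Windows) and bare CR line endings to LF (Unix)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

windows_text = "catalogNumber,country\r\n1001,Canada\r\n"
print(repr(normalize_line_endings(windows_text)))
# 'catalogNumber,country\n1001,Canada\n'
```

When reading a file for this purpose in Python, opening it with `newline=""` keeps the original line endings intact so they can be inspected before conversion.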
 
'''Solution:'''
 
=== Invalid Individual Count ===
'''Problem:''' Values in the individualCount field are not positive integers (for example, they are negative, fractional, or contain non-numeric text).
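
A positive-integer test can be sketched as follows. The function name is hypothetical. Blank values are reported as invalid here, so a caller that treats individualCount as optional should skip blanks before calling it.

```python
import re

def is_positive_integer(value):
    """True when the value is written as a whole number greater than zero."""
    text = str(value).strip()
    return re.fullmatch(r"[0-9]+", text) is not None and int(text) > 0

for sample in ("3", "0", "-2", "1.5", "several"):
    print(sample, is_positive_integer(sample))
```

Only the first sample passes; the others are zero, negative, fractional, or non-numeric.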
 
'''Solution:'''
 
=== Non-standardized BasisOfRecord Values ===
'''Problem:''' Values in the [https://dwc.tdwg.org/terms/#dwc:basisOfRecord BasisOfRecord] field do not match the recommended controlled vocabulary. While using standardized terms in this field is not strictly necessary, doing so does improve the discoverability and interoperability of your data.
 
The currently accepted values for BasisOfRecord include: MaterialEntity, PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
 
Note that even punctuation and capitalization differences in these values (e.g., Preserved Specimen) are discouraged.
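
Checking values against the list above is a simple set lookup. In this sketch the function name is hypothetical; the suggestion step ignores spaces, punctuation, and capitalization to propose the closest accepted term, and it returns no suggestion when nothing matches.

```python
import re

ACCEPTED_BASIS_OF_RECORD = {
    "MaterialEntity", "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
    "MaterialSample", "Event", "HumanObservation", "MachineObservation",
    "Taxon", "Occurrence", "MaterialCitation",
}

def check_basis_of_record(value):
    """Return (is_valid, suggestion); suggestion is None when valid or unmatched."""
    if value in ACCEPTED_BASIS_OF_RECORD:
        return True, None
    key = re.sub(r"[^a-z]", "", value.lower())  # drop spaces, punctuation, case
    lookup = {term.lower(): term for term in ACCEPTED_BASIS_OF_RECORD}
    return False, lookup.get(key)

print(check_basis_of_record("Preserved Specimen"))  # (False, 'PreservedSpecimen')
```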
 
'''Solution:'''

Latest revision as of 13:34, 21 March 2024

Symbiota has created a Data Quality Toolkit on their Documentation Site.