Excel Data Quality Toolkit: Difference between revisions
| Line 147: | Line 147: | ||
| # In the Sort By row under "Column", select the column/field that contains your individualCount values. | # In the Sort By row under "Column", select the column/field that contains your individualCount values. | ||
| # Keep the value "Cell Values" under "Sort On", then under "Order", select the "Largest to Smallest" option. This will bring all the non-numeric values in this field to the top of your spreadsheet, and you can review and fix them as necessary. | # Keep the value "Cell Values" under "Sort On", then under "Order", select the "Largest to Smallest" option. This will bring all the non-numeric values in this field to the top of your spreadsheet, and you can review and fix them as necessary. | ||
| #  | # Sort again with the "Smallest to Largest" option under Order to find any erroneous negative values. | ||
| === Non-standardized BasisOfRecord Values === | === Non-standardized BasisOfRecord Values === | ||
Revision as of 16:37, 25 March 2024
Overview
This toolkit contains Excel-specific resources for the Data Quality Toolkit 2024.
Catalog Numbers and Other Identifiers
Duplicate Catalog Numbers
Problem: The same catalog number is used multiple times within your dataset. (This problem may or may not be intentional, depending on your collection's policies. It is generally best to not duplicate catalog numbers, when possible).
Solution: A simple way to identify duplicate catalog numbers is to use Conditional Formatting.
- Click the letter above the column that contains catalog numbers to highlight the entire column.
- From the Home toolbar, click "Conditional Formatting, then "Highlight Cells Rules", then "Duplicate Values...".
- Click OK on the pop-up window.
- Duplicate values for catalog number will now be highlighted in red. To view these records together, you can sort the spreadsheet by first clicking outside the top left corner of the spreadsheet (on the small triangle icon) to select the ENTIRE spreadsheet.
- Click the "Sort & Filter" button in the "Editing" section of the Home toolbar.
- Click Custom Filter.
- In the pop-up window, select the column containing your catalog numbers from dropdown menu under "Column". (If your data has header rows, make sure that they "My data has headers" box is checked.)
- In the second dropdown menu labeled Sort On, select Cell Color. Note: Depending on the size of your spreadsheet, it might take a moment for the window to load after you have selected this option.
- In the third dropdown menu that will appear after you select Cell Color, select the formatting color (likely red) that you want to appear at the top of your spreadsheet.
- Click "OK" and wait until the spreadsheet re-loads (this may take awhile). Once done, the rows with duplicate catalog numbers will show up at the top of your spreadsheet.
Dates
Date Hasn't Happened Yet
Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is in the future.
Solution:
Date is Suspiciously Old
Problem: The date the specimen was identified, collected (often designated using the eventDate field), or georeferenced is outside the expected historical date range. The expected date range depends on the institution, but it is unlikely that most collections have specimens with dates prior to 1600.
Solution:
Identified Date Earlier than Collected Date
Problem: The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).
Solution:
Year, Month, and Day Values Do Not Match Date
Problem: The event year, month, and day values do not match the provided event date. The event date is often the date of collection for preserved specimens.
Solution:
Geography
Coordinates are Zero
Problem: The provided latitude and longitude values are 0.
Solution:
Coordinates Do Not Fall Within Named Geographic Unit
Problem: The provided coordinates do not fall within the geographic boundaries of the named country, state, and/or county.
Solution:
Georeference Metadata with no Associated Georeference
Problem: Metadata fields regarding coordinates, such as coordinateUncertaintyInMeters, georeferenceProtocol, georeferenceSources, georeferencedBy, georeferenceRemarks, and geodeticDatum are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferencedRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., decimalLatitude, [ https://dwc.tdwg.org/terms/#dwc:decimalLongitude decimalLongitude], or verbatimCoordinates).
Solution:
Elevation is Unlikely
Problem: Elevation values are either too high (>17000 m) or too low (-11000 m) to occur on Earth.
Solution:
Improperly Negated Latitudes/Longitudes
Problem: The sign of the latitude (decimalLatitude) or longitude (decimalLongitude) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.
Solution:
Invalid Coordinates
Problem: Coordinates deviate from accepted ranges or formats, like decimalLatitude and decimalLongitude exceeding -90 to 90 and -180 to 180, respectively. verbatimCoordinates have to be valid values for coordinates in decimal degrees, degrees decimal minutes, degrees minutes second.
Solution:
Lower Geography Values are Provided, but No Higher Geography
Problem: Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
Solution:
Minimum and Maximum Elevation Values Mismatched
Problem: The minimum elevation (minimumElevationInMeters) has a greater value than the maximum elevation (maximumElevationInMeters).
Solution:
Mismatched Country and CountryCode Values
Problem: The provided value for country and countryCode do not match.
Solution:
Mismatched Geographic Terms
Problem: A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
Solution:
Missing Geodetic Datum
Problem: Geodetic datum is a key piece of a properly georeferenced specimen, but is usually left blank. Although it is commonly assumed to be in ‘WGS84’, this should be added and noted as such.
Solution:
Missing Latitudes/Longitudes
Problem: A record has a latitude value, but not a longitude value, or vice versa.
Solution:
Misspelled Geographic Unit Names
Problem: The geographic units (e.g., country, state/province, county) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
Solution:
Taxonomy
Misspelled or Invalid Taxonomic Names
Problem: Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
Solution:
Unknown Higher Taxonomy
Problem: Species may be missing higher taxonomic information.
Solution:
Other Issues
Incorrect Character Encodings
Problem: Data inconsistencies arise when incorrect character encodings are used during data manipulation or transfer. This issue occurs when datasets are opened, downloaded, or imported across different software platforms, leading to misinterpretation and garbled text. For instance, special characters like accents or symbols may be rendered incorrectly, affecting the readability and accuracy of the data. (e.g., Carl Linné).
Solution:
Incorrect Line Endings
Problem: When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
Solution:
Invalid Individual Count
Problem: individualCount values may not make sense as a positive integer.
Solution:
- Click the triangle icon just above and to the left of the first cell in your spreadsheet. This should select all the cells in your spreadsheet.
- In the "Editing" section of the Home menu, click Sort & Filter, then Custom Sort.
- Make sure the "My data has headers" box is checked.
- In the Sort By row under "Column", select the column/field that contains your individualCount values.
- Keep the value "Cell Values" under "Sort On", then under "Order", select the "Largest to Smallest" option. This will bring all the non-numeric values in this field to the top of your spreadsheet, and you can review and fix them as necessary.
- Sort again with the "Smallest to Largest" option under Order to find any erroneous negative values.
Non-standardized BasisOfRecord Values
Problem: Values in the BasisOfRecord field do not match the recommended controlled vocabulary. While using standardized terms in this field is not strictly necessary, doing so does improve the discoverability and interoperability of your data.
The currently accepted values for BasisOfRecord include: MaterialEntity, PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
Note that even punctuation and capitalization differences in these values (e.g., Preserved Specimen) are discouraged.
Solution: