CYWG iDigBio DwC-A Pull Ingestion
Revision as of 12:11, 15 June 2015
iDigBio DwC-A Pull Ingestion Through RSS feed
Batches of provider data are ingested into the iDigBio Data Portal v1 API on a pull basis, from either:
- a single dataset export in CSV format, or
- a Really Simple Syndication (RSS) feed that links to Darwin Core Archive (DwC-A) or Comma-Separated Values (CSV) files.
Accepted formats are:
- Comma-separated values file, or "CSV",
- Zipped single-file CSV (.csv.zip), or "CSV-ZIP", and
- DwC-A, or "DWCA" (an Occurrence core and the Audubon Core extension are currently handled).
E-mail the link to the RSS feed or dataset file to iDigBio.
To facilitate creation of a custom RSS feed containing the fields "title", "id", "type", "recordtype", "description", "link", "ipt:eml", and "pubDate", iDigBio provides a simple PHP script on GitHub that generates output like the following:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:ipt="http://ipt.gbif.org/">
  <channel>
    <title>iDigBio Feeder RSS Feed</title>
    <link>http://feeder.idigbio.org/rss.php</link>
    <description>RSS Feed for iDigBio CSV Datasets.</description>
    <language>en-us</language>
    <item>
      <title>Archbold Biological Station</title>
      <id>http://feeder.idigbio.org/datasets/ABS_iDigBio</id>
      <type>CSV</type>
      <recordtype>occurrence</recordtype>
      <description>Example of CSV dataset only with specimens.</description>
      <link>http://feeder.idigbio.org/datasets/ABS_iDigBio.csv</link>
      <ipt:eml>http://feeder.idigbio.org/eml/ABS_iDigBio.xml</ipt:eml>
      <pubDate>Wed, 14 May 2014 11:31:45 -0400</pubDate>
    </item>
    <item>
      <title>Invertnet Images</title>
      <id>http://feeder.idigbio.org/datasets/idigbio-invertnet</id>
      <type>CSV</type>
      <recordtype>multimedia</recordtype>
      <description>Example of a CSV dataset only with images.</description>
      <link>http://feeder.idigbio.org/datasets/idigbio-invertnet.csv</link>
      <ipt:eml>http://feeder.idigbio.org/eml/idigbio-invertnet.xml</ipt:eml>
      <pubDate>Fri, 18 Apr 2014 10:16:42 -0400</pubDate>
    </item>
    <item>
      <title>ASU-ASUHIC DwC-Archive</title>
      <id>98d9b8ed-08d6-47fc-b324-2853e44d75d1</id>
      <type>DWCA</type>
      <recordType>DWCA</recordType>
      <image>http://symbiota4.acis.ufl.edu/scan/portal/images/collicons/asu.jpg</image>
      <description>Darwin Core Archive for Arizona State University Hasbrouck Insect Collection</description>
      <link>http://symbiota4.acis.ufl.edu/scan/portal/collections/datasets/dwc/ASU-ASUHIC_DwC-A.zip</link>
      <ipt:eml>http://symbiota4.acis.ufl.edu/scan/portal/collections/datasets/dwc/ASU-ASUHIC_DwC-A.eml</ipt:eml>
      <pubDate>Wed, 14 May 2014 09:58:23</pubDate>
    </item>
    <item>
      <title>Test Set ZIP</title>
      <id>http://localhost/datasets/test.csv.zip</id>
      <type>CSV-ZIP</type>
      <description>A Test Dataset</description>
      <link>datasets/test.csv.zip</link>
      <pubDate>Thu, 15 Nov 2012 14:29:45 -0500</pubDate>
    </item>
  </channel>
</rss>
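Providers who do not use PHP can produce the same item structure with any XML library. The following is a minimal Python sketch of that idea, not a port of the iDigBio script: the function name `build_feed`, the dictionary keys `eml` and `updated`, and the choice of the standard-library `xml.etree.ElementTree` module are assumptions made for illustration. The field names follow the sample feed above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Namespace bound to the ipt: prefix in the sample feed.
IPT_NS = "http://ipt.gbif.org/"
ET.register_namespace("ipt", IPT_NS)


def build_feed(title, link, description, datasets):
    """Return an RSS 2.0 document (as a string) with one <item> per dataset.

    `datasets` is a list of dicts; the key names are hypothetical.
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    for tag, text in (("title", title), ("link", link),
                      ("description", description), ("language", "en-us")):
        ET.SubElement(channel, tag).text = text

    for ds in datasets:
        item = ET.SubElement(channel, "item")
        for tag in ("title", "id", "type", "recordtype", "description", "link"):
            ET.SubElement(item, tag).text = ds[tag]
        # Namespaced element, serialized as <ipt:eml>.
        ET.SubElement(item, f"{{{IPT_NS}}}eml").text = ds["eml"]
        # RFC 822 style date; pass a timezone-aware datetime to get an offset.
        ET.SubElement(item, "pubDate").text = format_datetime(ds["updated"])

    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    eastern = timezone(timedelta(hours=-4))
    print(build_feed(
        "iDigBio Feeder RSS Feed",
        "http://feeder.idigbio.org/rss.php",
        "RSS Feed for iDigBio CSV Datasets.",
        [{
            "title": "Archbold Biological Station",
            "id": "http://feeder.idigbio.org/datasets/ABS_iDigBio",
            "type": "CSV",
            "recordtype": "occurrence",
            "description": "Example of CSV dataset only with specimens.",
            "link": "http://feeder.idigbio.org/datasets/ABS_iDigBio.csv",
            "eml": "http://feeder.idigbio.org/eml/ABS_iDigBio.xml",
            "updated": datetime(2014, 5, 14, 11, 31, 45, tzinfo=eastern),
        }],
    ))
```

Letting the XML library write the document means that special characters in titles and descriptions are escaped automatically, which avoids the most common cause of malformed hand-built feeds.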
Choosing how to provide RSS
Data providers should choose one of the following publishing options, depending on their expertise and infrastructure:
- A provider who already has their data in Symbiota, or who would otherwise benefit from joining one of the Symbiota portals, should use Symbiota's RSS publishing mechanism.
- Intermediate users with the ability to run a server can run GBIF's IPT software, which includes RSS publishing capability.
- Novice users, or users who simply have no server access, can serve datasets via the iDigBio RSS Feeder or via an existing IPT installation (e.g., VertNet). Data mobilizers are available to help get your data on board.
- An expert user with the ability to run a server and an existing DwC-A/CSV generation mechanism, who is comfortable handling both XML and character encoding issues, can produce a custom RSS feed. The iDigBio RSS Feeder codebase is available to use as a template. For example, such a user could take just the rss.php file and replace the CSV config files with database calls.
Requirements
Software providers who wish to build a new data publishing mechanism should follow the RSS pattern of one of the four well-known RSS feed generators shown in the examples below. This ensures that RSS feed consumers can reliably parse the fields needed to access the data files and the metadata about the datasets being provided.
TABLE WILL GO HERE
Software providers who wish to build a new data publishing mechanism should implement one of the known "feed" patterns from the examples below. The feed should comply with either the RSS or the Atom XML publishing specification.
The publishing feed must include all of the following pieces of information for each dataset:
- a GUID for the dataset (the feed URL plus a unique item identifier is sufficient)
- a link to the Darwin Core Archive dataset file that contains the actual occurrence records and any DwC extensions
- a link to the EML file that contains metadata about the collection / provider
- the publication date of the most recent dataset update
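A feed can be checked against this list before it is submitted. The following Python sketch is one way to do so and is not iDigBio's own validation code: the function name `check_feed` and the `REQUIRED` mapping are hypothetical, and the candidate element names are simply those that appear in the example feeds on this page (`guid` or `id`, `ipt:dwca` or `link`, `ipt:eml` or `emllink`, and `pubDate`).

```python
import xml.etree.ElementTree as ET

IPT = "{http://ipt.gbif.org/}"  # namespace used by the ipt: prefix

# Element names that can carry each required piece of information.
# Assumption: these alternatives are taken from the example feeds on this
# page and may not cover every feed dialect.
REQUIRED = {
    "guid": ["guid", "id"],
    "dataset_link": [IPT + "dwca", "link"],
    "eml_link": [IPT + "eml", "emllink"],
    "pub_date": ["pubDate"],
}


def check_feed(xml_text):
    """Return a list of (item title, missing pieces) for incomplete items."""
    root = ET.fromstring(xml_text)
    problems = []
    for item in root.iter("item"):
        missing = []
        for piece, tags in REQUIRED.items():
            # A piece counts as present if any candidate element has text.
            values = [(item.findtext(tag) or "").strip() for tag in tags]
            if not any(values):
                missing.append(piece)
        if missing:
            title = (item.findtext("title") or "").strip()
            problems.append((title, missing))
    return problems


if __name__ == "__main__":
    feed = """<rss version="2.0" xmlns:ipt="http://ipt.gbif.org/"><channel>
      <item>
        <title>Test Set ZIP</title>
        <id>http://localhost/datasets/test.csv.zip</id>
        <link>datasets/test.csv.zip</link>
        <pubDate>Thu, 15 Nov 2012 14:29:45 -0500</pubDate>
      </item>
    </channel></rss>"""
    for title, missing in check_feed(feed):
        print(f"{title}: missing {', '.join(missing)}")
```

The check only tests for presence. It does not confirm that the links resolve or that the date is well formed.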
Examples of RSS feeds from various systems
RSS generated by iDigBio Feeder (http://feeder.idigbio.org/rss.php)
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:ipt="http://ipt.gbif.org/" version="2.0">
  <channel>
    <title>iDigBio Feeder RSS Feed</title>
    <link>http://feeder.idigbio.org/rss.php</link>
    <description>RSS Feed for iDigBio CSV Datasets.</description>
    <language>en-us</language>
    <item>
      <title>Archbold Biological Station</title>
      <id>http://feeder.idigbio.org/datasets/ABS_iDigBio</id>
      <type>CSV</type>
      <recordtype>occurrence</recordtype>
      <description/>
      <link>http://feeder.idigbio.org/datasets/ABS_iDigBio.csv</link>
      <ipt:eml>http://feeder.idigbio.org/eml/ABS_iDigBio.xml</ipt:eml>
      <pubDate>Wed, 14 May 2014 11:31:45 -0400</pubDate>
    </item>
    <item>
      <title>Carnegie Museum of Natural History Vertebrate Paleontology</title>
      <id>http://feeder.idigbio.org/datasets/Carnegie_VertPaleo</id>
      <type>CSV</type>
      <recordtype>occurrence</recordtype>
      <description/>
      <link>http://feeder.idigbio.org/datasets/Carnegie_VertPaleo.csv</link>
      <ipt:eml>http://feeder.idigbio.org/eml/Carnegie_VertPaleo.xml</ipt:eml>
      <pubDate>Tue, 01 Jul 2014 10:34:40 -0400</pubDate>
    </item>
  </channel>
</rss>
RSS generated by IPT (http://hymfiles.biosci.ohio-state.edu:8080/ipt/rss.do)
<?xml version="1.0"?>
<rss version="2.0" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:ipt="http://ipt.gbif.org/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <channel>
    <title>xBioD IPT in the Museum of Biological Diversity at the Ohio State University</title>
    <link>http://xbiod.osu.edu/ipt</link>
    <description>Resource metadata of xBioD IPT in the Museum of Biological Diversity at the Ohio State University</description>
    <language>en-us</language>
    <!-- RFC-822 date-time / Wed, 02 Oct 2010 13:00:00 GMT -->
    <pubDate>Mon, 16 Dec 2013 09:51:00 -0500</pubDate>
    <lastBuildDate>Fri, 05 Jun 2015 17:07:07 -0400</lastBuildDate>
    <generator>GBIF IPT 2.1.1-r4640</generator>
    <webMaster>cora.1@osu.edu () ()</webMaster>
    <docs>http://cyber.law.harvard.edu/rss/rss.html</docs>
    <ttl>15</ttl>
    <geo:Point>
      <geo:lat>39.9971388</geo:lat>
      <geo:long>-83.0439822</geo:long>
    </geo:Point>
    <item>
      <title>C.A. Triplehorn Insect Collection (OSUC), Ohio State University - Version 82</title>
      <link>http://xbiod.osu.edu/ipt/resource.do?r=osuc</link>
      <description>Vouchered occurrence records for insects from the C.A. Triplehorn Insect Collection at the Ohio State University. &lt;a href="http://xbiod.osu.edu/ipt/eml.do?r=osuc"&gt;EML&lt;/a&gt;</description>
      <author>cora.1@osu.edu</author>
      <ipt:eml>http://xbiod.osu.edu/ipt/eml.do?r=osuc</ipt:eml>
      <dc:publisher>Norman Johnson Ohio State University&lt;johnson.2@osu.edu&gt;</dc:publisher>
      <dc:creator>Norman Johnson Ohio State University&lt;johnson.2@osu.edu&gt;</dc:creator>
      <ipt:dwca>http://xbiod.osu.edu/ipt/archive.do?r=osuc</ipt:dwca>
      <pubDate>Fri, 05 Jun 2015 17:10:38 -0400</pubDate>
      <guid isPermaLink="false">84ab7b76-f762-11e1-a439-00145eb45e9a/v82</guid>
    </item>
  </channel>
</rss>
RSS generated by Symbiota (http://portal.neherbaria.org/portal/webservices/dwc/rss.xml)
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>CNH portal Darwin Core Archive rss feed</title>
    <link>http://portal.neherbaria.org/portal/</link>
    <description>CNH portal Darwin Core Archive rss feed</description>
    <language>en-us</language>
    <item collid="27">
      <title>Harvard University-A DwC-Archive</title>
      <image>http://www.huh.harvard.edu/images/huh_logo_bw_100.png</image>
      <description>Darwin Core Archive for Herbarium of the Arnold Arboretum (Harvard University Herbaria)</description>
      <guid>http://portal.neherbaria.org/portal/collections/misc/collprofiles.php?collid=27</guid>
      <guid>80b71fde-2241-4777-bfd3-3bdd075b8ba5</guid>
      <emllink>http://portal.neherbaria.org/portal/collections/datasets/dwc/HarvardUniversity-A_DwC-A.eml</emllink>
      <type>DWCA</type>
      <recordType>DWCA</recordType>
      <link>http://portal.neherbaria.org/portal/collections/datasets/dwc/HarvardUniversity-A_DwC-A.zip</link>
      <pubDate>Thu, 17 Apr 2014 11:49:03</pubDate>
    </item>
  </channel>
</rss>
RSS generated by Arthropod EasyCapture (http://www.amnh.begoniasociety.org/dwc/rss.xml)
<rss version="2.0">
  <channel>
    <title>Arthropod Easy Capture (AMNH)</title>
    <link>https://research.amnh.org/pbi/locality/</link>
    <description>Arthropod Easy Capture rss feed</description>
    <language>en-us</language>
    <item ProjUID="2">
      <title>Plants, herbivores, and parasitoids: A model system for the study of tri-trophic associations project</title>
      <description>Tri-Trophic Thematic Collection Network, 2014 (and updates). Version: 18 Mar 2015. http://tcn.amnh.org/. National Science Foundation grant(s) EF#1115081, EF#1115103, EF#1115080, EF#1115144, EF#1115191, EF#1115104, EF#1115115</description>
      <guid>urn:uuid:f0cec69a-853c-11e4-8259-0026552be7ea</guid>
      <emllink>http://www.amnh.begoniasociety.org/dwc/AEC-TTD-TCN_DwC-A20150318.eml</emllink>
      <type>DWCA</type>
      <recordType>DWCA</recordType>
      <link>http://www.amnh.begoniasociety.org/dwc/AEC-TTD-TCN_DwC-A20150318.zip</link>
      <pubDate>Wed, 18 Mar 2015 14:49:42</pubDate>
    </item>
    <item ProjUID="3">
      <title>Collaborative databasing of North American bee collections within a global informatics network project</title>
      <description>Digital Bee Collections Network, 2014 (and updates). Version: 18 Mar 2015. National Science Foundation grant DBI 0956388</description>
      <guid>urn:uuid:13674fa4-8611-11e4-8259-0026552be7ea</guid>
      <emllink>http://www.amnh.begoniasociety.org/dwc/AEC-DBCNet_DwC-A20150318.eml</emllink>
      <type>DWCA</type>
      <recordType>DWCA</recordType>
      <link>http://www.amnh.begoniasociety.org/dwc/AEC-DBCNet_DwC-A20150318.zip</link>
      <pubDate>Wed, 18 Mar 2015 14:50:54</pubDate>
    </item>
  </channel>
</rss>
What happens when your RSS is ready?
Once the links are received, an iDigBio IT staff member checks that the URLs work as expected and adds them to the dataset manager. On a weekly basis, the dataset manager downloads and hashes each dataset and all of its individual records, pre-validating that IDs are unique within each dataset (the trivial collision case). If the dataset is new, all of its records are staged for ingestion. If the dataset is an update, the true difference between the current and new datasets is computed from the hashes, and the changes are staged for ingestion and committed to the iDigBio specimen API, after which Elasticsearch is reindexed.
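The hash-based comparison can be illustrated with a short Python sketch. This is not the dataset manager's code: the function names `hash_records` and `diff_datasets`, the use of SHA-256, the `id` column, and the CSV input are all assumptions chosen to keep the example small.

```python
import csv
import hashlib
import io
import json
from collections import Counter


def hash_records(csv_text, id_field="id"):
    """Map each record ID to a hash of the record; reject duplicate IDs."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Pre-validation: IDs must be unique within a single dataset.
    counts = Counter(row[id_field] for row in rows)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"IDs are not unique within the dataset: {duplicates}")

    # Serialize with sorted keys so column order does not affect the hash.
    return {
        row[id_field]: hashlib.sha256(
            json.dumps(row, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for row in rows
    }


def diff_datasets(current, new):
    """Compare two {id: hash} maps and list the IDs that differ."""
    shared = new.keys() & current.keys()
    return {
        "added": sorted(new.keys() - current.keys()),
        "changed": sorted(i for i in shared if new[i] != current[i]),
        "removed": sorted(current.keys() - new.keys()),
    }


if __name__ == "__main__":
    old = hash_records("id,scientificName\na1,Apis mellifera\na2,Bombus impatiens\n")
    new = hash_records(
        "id,scientificName\na1,Apis mellifera\na2,Bombus griseocollis\na3,Osmia lignaria\n"
    )
    print(diff_datasets(old, new))
```

Comparing per-record hashes means that only records that actually differ need to be staged, rather than the whole dataset on every update.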
All records in a dataset are expected to have a Globally Unique IDentifier (GUID). Consult our Data Ingestion Guidance for more information on the data requirements. For additional terms not covered by DwC, we recommend consulting the MISC WG about whether an alternative term already exists or a new one needs to be created. A laundry list of terms currently in use can be found at: