Talk:Data Ingestion Guidance: Difference between revisions

From iDigBio
(17 intermediate revisions by 3 users not shown)
--[[User:Dpaul|Dpaul]] 16:49, 9 January 2014 (EST)
TO DO:
#Regarding data@idigbio.org: now that dp has a gator link, does this mean I can be added to the mailing list that "sees" data@idigbio.org?--[[User:Dpaul|Dpaul]]
#Where do users send email if they have a question?
##data@idigbio.org
##the feedback (or both)?--[[User:Dpaul|Dpaul]] ([[User talk:Dpaul|talk]]) 17:48, 9 January 2014 (EST)
#At Morphbank, to keep this straightforward, all email requests for help go to morphbank@scs.fsu.edu
##We do not have 2 separate paths for help / issues requests.
##I am assuming (I think) that clicking to send "feedback" generates a Redmine ticket (efficient and transparent). But, data@idigbio.org is not transparent.--[[User:Dpaul|Dpaul]]
#About this Section: Registering Your Collection in Preparation for Data Ingestion
##I suggest a different order. See next.--[[User:Dpaul|Dpaul]]


When your data are ready for ingestion, please see the next steps.


#Get an iDigBio account for yourself (if you don't have one yet). https://www.idigbio.org/auth/login.php
Add links to the term definitions when they are mentioned.
##These are the only login credentials you will need.
#Log in with your iDigBio account username and password. https://www.idigbio.org/auth/login.php
#Register your collection. http://portal.idigbio.org/register OR
#Register your collection at [http://grbio.org/ GRBIO]
##Repository: http://grbio.org/find-biorepositories OR
##Institutional Collections: http://grbio.org/find-institutional-collections
#If you are already on the portal page, the 'Register A Collection' link is in the menu under your login name in the upper right of the page.
----
--[[User:Dpaul|Dpaul]] ([[User talk:Dpaul|talk]]) 17:22, 9 January 2014 (EST) About this next section
Data Requirements


    For all data records
e.g.


    all specimen records need to have a GUID in each digital record: a persistent globally unique identifier
[http://purl.org/dc/terms/identifier dc:identifier] for "dc:identifier"
    you need to have ownership of the data if you are its source; if you are an aggregator, you need the owner's permission to send it to us.
    we would like the data to be available to our harvester via IPT and RSS if possible; otherwise, a CSV file in Darwin Core format would work too.
    dates in ISO 8601 format, i.e., YYYY-MM-DD
    take care to preserve diacritics in people and place names.
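
The date and diacritics requirements above can be checked before a file is sent. The following is only a minimal sketch in Python; the function names are hypothetical and are not part of any iDigBio tool. It treats "ISO 8601" narrowly as the YYYY-MM-DD form named above, and it assumes that diacritics are most often damaged when a file is saved in an encoding other than UTF-8.

```python
from datetime import date


def is_iso_8601_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    # Check the exact shape first, because date.fromisoformat also accepts
    # other ISO 8601 layouts (such as YYYYMMDD) on recent Python versions.
    if len(value) != 10 or value[4] != "-" or value[7] != "-":
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw file bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_iso_8601_date("2014-01-09"))
print(is_iso_8601_date("01/09/2014"))
print(is_valid_utf8("Müller".encode("utf-8")))
print(is_valid_utf8("Müller".encode("latin-1")))
```

A file that fails the UTF-8 check was probably exported in another encoding, so names with umlauts, tildes, or cedillas in it should be reviewed by eye after re-exporting.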


    For all images/media objects


    each media record needs to have a GUID: a persistent globally unique identifier
    we need an Audubon Core metadata file with one record for each media record, and we can provide coaching to help you create that file. The more you can flesh out the details of the image, the more retrievable it is likely to be.
    just like the ownership of catalog records, the media records need to be provided freely and with permission, and each record needs to have at least Creative Commons permission = "CC BY"


:I would avoid the word ownership, if at all possible, to help the community get around this issue (eventually). It reinforces ideas / misconceptions about data (and copyright, intellectual property, etc.). Something like this for number 2 below.
Some DRAFT changes...
::You have permission to contribute this dataset to iDigBio.


for number 3. do we need to explain or justify? how about
::Data Format choices
:::[http://code.google.com/p/gbif-ecat/wiki/DwCArchive DarwinCore archive format] OR
:::CSV files mapped to [http://rs.tdwg.org/dwc/terms/index.htm Darwin Core] (and other relevant standards, example [http://terms.tdwg.org/wiki/Audubon_Core_Term_List Audubon Core])
::Data Transfer
:::Darwin Core Archive files harvest via IPT and RSS
:::CSV files via (...)


for number 4, please add UTF-8 reference. something like:
<pre>
:UTF-8 encoding preferred (should be required).
  <coreid index="0" />
::validate (or verify) that "special characters" (diacritics like umlauts, tilde, cedilla) are correct in your dataset.
  <field index="1" term="http://purl.org/dc/terms/identifier"/>
  <field index="2" term="http://purl.org/dc/terms/type"/>
  <field index="3" term="http://purl.org/dc/terms/format"/>
  <field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
  <field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
  <field index="7" term="http://purl.org/dc/terms/creator"/>
  <field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
  <field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
  <field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
  <field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
  <field index="13" term="http://purl.org/dc/terms/format"/>


----
</pre>
==Data Requirements==
 
===Data Records===
 
#all specimen records need to have a GUID in each digital record: a persistent globally unique identifier
 
#you need to have <strike>ownership</strike> of the data if you are its source; if you are an aggregator, you need the owner's permission to send it to us.
==Packaging for images / media objects==
#we would like the data to be available to our harvester via IPT and RSS if possible; otherwise, a CSV file in Darwin Core format would work too.
Consult iDigBio's media policy https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 and GBIF's  while preparing your media.
#dates in ISO 8601 format, i.e., YYYY-MM-DD
*Firstly, adding a field in the occurrence file for ''associatedMedia'' is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage will not
#take care to preserve diacritics in people and place names.
*Each media record should have a unique (within the dataset) identifier in the ''identifier'' field.
*If providing media records with specimen data records, here are the important fields to fill in
** sample of fully-populated AC record (taking into account iDigBio, TDWG, and GBIF recommendations)
***'''id (coreid)''' = If media data are being provided via an extension, this is the coreid field in the Audubon Core extension file. This links to one identifier among the related specimen records and is frequently the occurrenceID of the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core. <pre>UUID GOES HERE</pre><pre>urn:catalog:institutionCode:collectionCode:catalogNumber</pre>
***'''identifier  (dcterms:identifier or dc:identifier)''' = id of the media record - needs to be unique within Audubon Core file and uniquely identifies the row. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent.<pre>UUID GOES HERE</pre> <pre>URL goes here</pre>
***'''type  (dcterms:type)''' = .... <pre>StillImage</pre>
***'''format (dc:format)''' = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible) <pre>image/jpeg</pre>
***'''accessURI (ac:accessURI)''' = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI '''must''' link to an image, not a web page.<pre>http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG</pre>
***'''providerManagedID (ac:providerManagedID)''' =  if you have a UUID GUID for your media records, then assign it to the optional ac:providerManagedID field. <pre>urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4  (optional)</pre>
Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI. If ac:accessURI is not present, dcterms:format and dc:type should not be present either. This matters especially where ac:furtherInformationURL is used as an alternative to ac:accessURI.
 
 
 
Here are further recommended fields to fill in:
{| class="wikitable" border="1"
|-
! scope="col" width="15%" | AC  Term
! scope="col" width="45%" class="sortable"| Sample data
! scope="col" width="45%" class="sortable"| Notes
|-
|valign="top"|ac:associatedSpecimenReference
|valign="top"|0e1e12ed-2261-42db-8719-ee98532dab06
|valign="top"|A reference to a specimen associated with this resource.
|-
|valign="top"|dc:rights or dcterms:rights
|valign="top"|dc:rights - "CC BY-NC"<br>
dcterms:rights - http://creativecommons.org/licenses/by-nc/4.0/
|valign="top"|preferred - dcterms:rights
|-
|valign="top"|ac:licenseLogoURL
|valign="top"|http://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-nc.png
|valign="top"|
|-
|valign="top"|xmpRights:Owner
|valign="top"|New York Botanical Garden
|valign="top"|A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
|-
|valign="top"|dc:creator
|valign="top"|"New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden"
|valign="top"|The person or organization responsible for creating the media resource, might be less encompassing than what is in xmpRights:Owner.
|-
|valign="top"|dc:type
|valign="top"| StillImage, Sound, MovingImage
|valign="top"|
|-
|valign="top"|dcterms:title
|valign="top"|herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes
|valign="top"|
|-
|}
*'''Note to aggregators''': In the case where the data are coming from an aggregator, an additional ''recordId'' field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
*'''Terms''': Use Audubon Core terms, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one record for each media record. The more you can flesh out the details of the image, the more likely it will be to be highly retrievable. The best practice is to use the taxonomic and geographic fields to capture as much information as possible when only media are given to iDigBio.
*'''License''': Like catalog records, media records need to be provided freely and with permission, and each record should have a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution to the source of these data or media is maintained. See http://creativecommons.org/licenses/by/4.0/ for a more detailed explanation of the CC BY license. Any combination of the BY, NC, and SA license elements is fine with us, but ND (no derivatives) is not acceptable and will cause the media to be rejected.
Possible licenses:
* CC0: http://creativecommons.org/publicdomain/zero/1.0/
* CC BY: http://creativecommons.org/licenses/by/4.0/
* CC BY-NC: https://creativecommons.org/licenses/by-nc/4.0/
*The media records represent a one-to-one relationship between the media object (the fit-for-display best quality JPG, in the case of images, for example) and the specimen record. There is no need to include links to any other forms of the media, like an enclosing webpage, or thumbnails. Below is some guidance on handling special cases. If none of these media attachment rules make sense to you, please get in touch with us for further assistance.
If you are not using IPT and are delivering only one recordset, generate a meta.xml file by hand and package up the files in a DwC-A-like format. (No eml.xml is required; contact info and the recordset description can be sent in email.)
 
===Best practice for getting Audubon Core images linked to specimen records - special cases===
 
{| class="wikitable unsortable" border="1"
|-
! scope="col" width="25%" class="unsortable" | Relationship
! scope="col" width="25%" class="unsortable" | Supported by
! scope="col" width="25%" class="unsortable" | Core Type
! scope="col" width="25%" class="unsortable" | Extensions
|-
|valign="top"|One-specimen-record-to-many-media files
|valign="top"|IPT 2.1/Custom DwC-A
|valign="top"|Specimen (DwC)
|valign="top"|Audubon Core
|-
|valign="top"|Many-specimen-records-to-one-media file
|valign="top"|IPT 2.2/Custom DwC-A
|valign="top"|Audubon Core
|valign="top"|Specimen (DwC)
|-
|valign="top"|Many-specimen-records-to-many-media files
|valign="top"|IPT 2.1/Custom DwC-A
|valign="top"|Specimen (DwC)
|valign="top"|Audubon Core + Relationship
|-
|}
 
Keep in mind that:
* DwC-A is a set of files: a core type + a number of extensions
* All files/tables (core or extension) need to have a unique identifier

Latest revision as of 16:10, 2 June 2017

TO DO:


Add links to the term definitions when they are mentioned.

e.g.

dc:identifier for "dc:identifier"


Some DRAFT changes...


   <coreid index="0" />
   <field index="1" term="http://purl.org/dc/terms/identifier"/>
   <field index="2" term="http://purl.org/dc/terms/type"/>
   <field index="3" term="http://purl.org/dc/terms/format"/>
   <field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
   <field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
   <field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
   <field index="7" term="http://purl.org/dc/terms/creator"/>
   <field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
   <field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
   <field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
   <field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
   <field index="13" term="http://purl.org/dc/terms/format"/>
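
The draft fragment above reuses index values 6, 7, and 8, and lists the format and WebStatement terms twice. Before a descriptor like this is finalized, a small check can flag such repeats. This is a rough sketch in Python using only the standard library; the function name is hypothetical, and the wrapping `<extension>` element is added only so that the fragment parses as one document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The draft field list from this talk page, copied as-is.
DRAFT = """
<coreid index="0" />
<field index="1" term="http://purl.org/dc/terms/identifier"/>
<field index="2" term="http://purl.org/dc/terms/type"/>
<field index="3" term="http://purl.org/dc/terms/format"/>
<field index="4" term="http://rs.tdwg.org/ac/terms/accessURI"/>
<field index="5" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
<field index="6" term="http://purl.org/dc/terms/rightsHolder"/>
<field index="7" term="http://purl.org/dc/terms/creator"/>
<field index="8" term="http://rs.tdwg.org/ac/terms/metadataLanguage"/>
<field index="6" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
<field index="7" term="http://ns.adobe.com/xap/1.0/rights/UsageTerms"/>
<field index="8" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
<field index="13" term="http://purl.org/dc/terms/format"/>
"""


def duplicated_values(fragment: str, attribute: str) -> list[str]:
    """Return the values of `attribute` that appear on more than one element."""
    # Wrap the fragment so it parses as a single well-formed document.
    root = ET.fromstring(f"<extension>{fragment}</extension>")
    counts = Counter(
        element.get(attribute)
        for element in root
        if element.get(attribute) is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


print(duplicated_values(DRAFT, "index"))  # ['6', '7', '8']
print(duplicated_values(DRAFT, "term"))
```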


Packaging for images / media objects

Consult iDigBio's media policy https://www.idigbio.org/content/idigbio-image-file-format-requirements-and-recommendations-1 and GBIF's while preparing your media.

  • Firstly, adding a field in the occurrence file for associatedMedia is not the way to include media with a specimen record. Media that comes to us via this method, or embedded in a webpage will not
  • Each media record should have a unique (within the dataset) identifier in the identifier field.
  • If providing media records with specimen data records, here are the important fields to fill in
    • sample of fully-populated AC record (taking into account iDigBio, TDWG, and GBIF recommendations)
      • id (coreid) = If media data are being provided via an extension, this is the coreid field in the Audubon Core extension file. This links to one identifier among the related specimen records and is frequently the occurrenceID of the specimen record. "coreid" is not a term defined by Darwin Core or Audubon Core.
        UUID GOES HERE
        urn:catalog:institutionCode:collectionCode:catalogNumber
      • identifier (dcterms:identifier or dc:identifier) = id of the media record - needs to be unique within Audubon Core file and uniquely identifies the row. It may be tempting to use the URL of the media as the identifier. However, we have seen multiple cases where media have moved, making the identifier not persistent.
        UUID GOES HERE
        URL goes here
      • type (dcterms:type) = ....
        StillImage
      • format (dc:format) = Media Type / MIME Type (from http://www.iana.org/assignments/media-types/media-types.xhtml controlling vocabulary if possible)
        image/jpeg
      • accessURI (ac:accessURI) = direct http link to the media file. Note that the media type (format) *must* match the media type of the resource at the target end of this accessURI. For example, if the format is "image/jpeg" then accessURI must link to an image, not a web page.
        http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG
      • providerManagedID (ac:providerManagedID) = if you have a UUID GUID for your media records, then assign it to the optional ac:providerManagedID field.
        urn:uuid:32e5da5d-c747-435c-a368-07d989259bf4   (optional)

Note: dcterms:format and dc:type should match the type of the object returned by ac:accessURI. If ac:accessURI is not present, dcterms:format and dc:type should not be present either. This matters especially where ac:furtherInformationURL is used as an alternative to ac:accessURI.
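
The consistency rule in the note above can be approximated offline. What follows is a sketch only, in Python: the function name and the small extension table are assumptions made for illustration, and the record keys follow the field list above (dc:format, dc:type, ac:accessURI). Judging by the file extension is a heuristic; a definitive check would need to request the URL and inspect the media type the server actually returns.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical, deliberately small table mapping file extensions to media types.
EXTENSION_TO_FORMAT = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def check_media_record(record: dict[str, str]) -> list[str]:
    """Return a list of problems found in one media record (empty if none)."""
    problems = []
    access_uri = record.get("ac:accessURI", "")
    media_format = record.get("dc:format", "")
    media_type = record.get("dc:type", "")

    if not access_uri:
        # Without an accessURI there is nothing for format and type to describe.
        if media_format or media_type:
            problems.append(
                "dc:format and dc:type should be absent when ac:accessURI is absent"
            )
        return problems

    # Compare the declared format with the one implied by the file extension.
    path = urlparse(access_uri).path.lower()
    extension = posixpath.splitext(path)[1]
    expected = EXTENSION_TO_FORMAT.get(extension)
    if expected is None:
        problems.append("ac:accessURI does not look like a direct link to a media file")
    elif media_format != expected:
        problems.append(f"dc:format {media_format!r} does not match {expected!r}")
    return problems


record = {
    "dc:type": "StillImage",
    "dc:format": "image/jpeg",
    "ac:accessURI": "http://bgbasesrvr.univ.edu/DATABASEIMAGES/LONN00000001.JPG",
}
print(check_media_record(record))  # []
```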


Here are further recommended fields to fill in:

{| class="wikitable" border="1"
|-
! AC Term
! Sample data
! Notes
|-
| ac:associatedSpecimenReference
| 0e1e12ed-2261-42db-8719-ee98532dab06
| A reference to a specimen associated with this resource.
|-
| dc:rights or dcterms:rights
| dc:rights - "CC BY-NC"<br>dcterms:rights - http://creativecommons.org/licenses/by-nc/4.0/
| preferred - dcterms:rights
|-
| ac:licenseLogoURL
| http://mirrors.creativecommons.org/presskit/buttons/80x15/png/by-nc.png
|
|-
| xmpRights:Owner
| New York Botanical Garden
| A list of the names of the owners of the copyright (the one in the dc:rights field). 'Unknown' is an acceptable value, but 'Public Domain' is not.
|-
| dc:creator
| "New York Botanical Garden" or "Jane Doe, Digital Media Manager, New York Botanical Garden"
| The person or organization responsible for creating the media resource, might be less encompassing than what is in xmpRights:Owner.
|-
| dc:type
| StillImage, Sound, MovingImage
|
|-
| dcterms:title
| herbarium sheet of Abarema abbottii (Rose & Leonard) Barneby & J.W.Grimes
|
|}
  • Note to aggregators: In the case where the data are coming from an aggregator, an additional recordId field is required (idigbio:recordId). This is the media identifier, distinct from the one given by the provider in the dcterms:identifier field. It is assumed that aggregators are building their own archives, as this is not a Darwin Core term, and is not supported in the IPT.
  • Terms: Use Audubon Core terms, http://terms.tdwg.org/wiki/Audubon_Core_Term_List, with one record for each media record. The more you can flesh out the details of the image, the more likely it will be to be highly retrievable. The best practice is to use the taxonomic and geographic fields to capture as much information as possible when only media are given to iDigBio.
  • License: Like catalog records, media records need to be provided freely and with permission, and each record should have a Creative Commons license. Content providers are required to adopt a Creative Commons license for information they serve through iDigBio. Except for public-domain or CC0 content, the default license is CC BY (Attribution), which allows users to copy, transmit, reuse, remix, and/or adapt data and media, as long as attribution to the source of these data or media is maintained. See http://creativecommons.org/licenses/by/4.0/ for a more detailed explanation of the CC BY license. Any combination of the BY, NC, and SA license elements is fine with us, but ND (no derivatives) is not acceptable and will cause the media to be rejected.

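Possible licenses:

  • CC0: http://creativecommons.org/publicdomain/zero/1.0/
  • CC BY: http://creativecommons.org/licenses/by/4.0/
  • CC BY-NC: https://creativecommons.org/licenses/by-nc/4.0/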
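
The license rule stated above (CC0 or any combination of BY, NC, and SA is accepted; ND is rejected) could be checked against the license URL of each record. This is a minimal sketch in Python; the function name is hypothetical, and it assumes that licenses are given as creativecommons.org URLs of the form used on this page.

```python
from urllib.parse import urlparse

# License elements that may be combined freely under the rule above.
ALLOWED_ELEMENTS = {"by", "nc", "sa"}


def license_is_acceptable(url: str) -> bool:
    """Return True for CC0 or a CC license built only from BY, NC, and SA."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    if host != "creativecommons.org":
        return False
    parts = [part for part in parsed.path.lower().split("/") if part]
    # CC0 lives under /publicdomain/zero/<version>/.
    if parts[:2] == ["publicdomain", "zero"]:
        return True
    # Other licenses look like /licenses/<elements>/<version>/.
    if len(parts) >= 2 and parts[0] == "licenses":
        elements = set(parts[1].split("-"))
        return "by" in elements and elements <= ALLOWED_ELEMENTS
    return False


print(license_is_acceptable("https://creativecommons.org/licenses/by-nc/4.0/"))
print(license_is_acceptable("https://creativecommons.org/licenses/by-nd/4.0/"))
```

The first call accepts CC BY-NC and the second rejects CC BY-ND, since "nd" is not among the allowed elements.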

If you are not using IPT and are delivering only one recordset, generate a meta.xml file by hand and package up the files in a DwC-A-like format. (No eml.xml is required; contact info and the recordset description can be sent in email.)
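
For providers writing meta.xml by hand, the descriptor could also be generated with a short script. The sketch below, in Python, assumes a specimen core with an Audubon Core extension. The file names occurrence.csv and multimedia.csv, the function names, and the CSV settings (comma-separated, double-quoted, one header line) are assumptions for illustration; the rowType for the extension is assumed to be the Audubon Core Multimedia class.

```python
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def table_element(tag, row_type, location, id_tag, terms):
    """Build a <core> or <extension> element describing one CSV file."""
    element = ET.Element(tag, {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",  # written literally as \n in meta.xml
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": row_type,
    })
    files = ET.SubElement(element, "files")
    ET.SubElement(files, "location").text = location
    # Column 0 holds the record id (core) or the link to the core (extension).
    ET.SubElement(element, id_tag, {"index": "0"})
    for index, term in enumerate(terms, start=1):
        ET.SubElement(element, "field", {"index": str(index), "term": term})
    return element


def build_meta_xml(core_terms, media_terms):
    """Return meta.xml text for a specimen core plus an Audubon Core extension."""
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS})
    archive.append(table_element(
        "core", "http://rs.tdwg.org/dwc/terms/Occurrence",
        "occurrence.csv", "id", core_terms))
    archive.append(table_element(
        "extension", "http://rs.tdwg.org/ac/terms/Multimedia",
        "multimedia.csv", "coreid", media_terms))
    return ET.tostring(archive, encoding="unicode")


print(build_meta_xml(
    ["http://rs.tdwg.org/dwc/terms/catalogNumber"],
    ["http://purl.org/dc/terms/identifier",
     "http://rs.tdwg.org/ac/terms/accessURI"],
))
```

Because each field index is taken from the position of its term in the list, the index values cannot repeat, and the order of the terms must match the order of the columns in the CSV file.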

Best practice for getting Audubon Core images linked to specimen records - special cases

{| class="wikitable unsortable" border="1"
|-
! Relationship
! Supported by
! Core Type
! Extensions
|-
| One-specimen-record-to-many-media files
| IPT 2.1/Custom DwC-A
| Specimen (DwC)
| Audubon Core
|-
| Many-specimen-records-to-one-media file
| IPT 2.2/Custom DwC-A
| Audubon Core
| Specimen (DwC)
|-
| Many-specimen-records-to-many-media files
| IPT 2.1/Custom DwC-A
| Specimen (DwC)
| Audubon Core + Relationship
|}

Keep in mind that:

  • DwC-A is a set of files: a core type + a number of extensions
  • All files/tables (core or extension) need to have a unique identifier
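
The uniqueness requirement for identifiers can be tested on each CSV file before packaging. This is a minimal sketch in Python; the function name, the column name, and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def repeated_identifiers(csv_text: str, column: str) -> list[str]:
    """Return the values in `column` that occur on more than one row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    return sorted(value for value, count in counts.items() if count > 1)


# Hypothetical media file in which one identifier was reused by mistake.
SAMPLE = """coreid,identifier,accessURI
occ-1,media-1,http://example.org/images/1.jpg
occ-1,media-2,http://example.org/images/2.jpg
occ-2,media-2,http://example.org/images/3.jpg
"""

print(repeated_identifiers(SAMPLE, "identifier"))  # ['media-2']
```

Note that a repeated coreid is expected in an extension file when one specimen record has several media files, so this check applies to the identifier column of each file, not to coreid.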