NPS EML Scripting

Resources and Guides for using EMLassemblyline to create EML for National Park Service data packages

Editing Templates

Validate the metadata inferred during the templating process and add any missing information. Use spreadsheet and text editors for this. Template-specific guides are listed below.

NOTES:

abstract.txt

Example

Describes the salient features of a dataset in a concise summary, much like an abstract does in a journal article. It should cover what the data are and why they were created. A good rule of thumb is that an abstract should be about 250 words or less. The abstract will become a public-facing piece of text featured on the DataStore reference page as well as sent on to DataCite, data.gov, and Google’s Dataset Search. A thoughtful and well-planned abstract may be usable not only for the data package but also for the Data Release Report (DRR).

If you write your abstract in a word processor (such as MS Word) and paste it into the abstract.txt template file, please pass it through a text editor (such as Notepad) first to make sure it is UTF-8 encoded and does not contain special characters, including stray line-break characters.

Note: editing abstract.txt is best done via a text editor.
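
The checks described above can be sketched in a few lines of Python. This is only an illustration, not part of EMLassemblyline or EMLeditor: the function name check_abstract is hypothetical, and treating every non-ASCII character as "special" is an assumption that is stricter than UTF-8 itself requires.

```python
def check_abstract(text, max_words=250):
    """Return a list of human-readable problems found in an abstract."""
    problems = []
    # Assumption: any character outside plain ASCII is worth a second look,
    # since word processors insert curly quotes, long dashes, and similar.
    odd = sorted({ch for ch in text if ord(ch) > 127})
    if odd:
        problems.append("non-ASCII characters: " + " ".join(repr(c) for c in odd))
    # Carriage returns are a common leftover of pasted line breaks.
    if "\r" in text:
        problems.append("carriage-return characters found")
    n_words = len(text.split())
    if n_words > max_words:
        problems.append(f"{n_words} words (rule of thumb is {max_words} or fewer)")
    return problems


if __name__ == "__main__":
    sample = "Bird counts from 2019\u20132021 in \u201cExample\u201d park.\r\n"
    for problem in check_abstract(sample):
        print(problem)
```

An empty list means none of these particular problems were found; it does not guarantee the abstract is otherwise suitable.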

methods.txt

Example

Describes the data creation methods. Includes enough detail for future users to correctly use the data. Lists instrument descriptions, protocols, etc. Methods sections can include citations. It may be appropriate to cite the Protocol, datasets that were ingested to generate the data package, software (e.g. R), packages (e.g. dplyr, ggplot2) or custom scripts.

Note: editing methods.txt is best done via a text editor.

keywords.txt

Example

Describes the data in a small set of terms. Keywords facilitate search and discovery on scientific terms, as well as names of research groups, field stations, and other organizations. Using a controlled vocabulary or thesaurus vastly improves discovery. We recommend using the LTER Controlled Vocabulary when possible.

Note: editing keywords.txt is best done via a spreadsheet application.

Columns:

personnel.txt

Example

Describes the personnel and funding sources involved in the creation of the data. This facilitates attribution and reporting.

Valid EML requires at least one person with a creator role. Creator is a synonym for Author in DataStore.

DataStore also requires at least one person with the role of contact. If the creator and the contact are the same person, list that person twice (i.e. on two separate rows).

Additional personnel (field technicians, consultants, collaborators, contributors, etc.) may be added to give credit as necessary. Any roles other than “creator” or “contact” will be listed as associatedParties.
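
As a rough illustration of the role requirements above, the following Python sketch reports which of the two required roles are missing from a personnel table. It assumes the template is tab-delimited and has a column named role; the other column names, the sample people, and the function name check_roles are hypothetical.

```python
import csv
import io


def check_roles(personnel_tsv):
    """Return the required roles that no row in the table fills."""
    reader = csv.DictReader(io.StringIO(personnel_tsv), delimiter="\t")
    # Assumption: roles are compared exactly, apart from surrounding spaces.
    roles = {(row.get("role") or "").strip() for row in reader}
    return [r for r in ("creator", "contact") if r not in roles]


if __name__ == "__main__":
    # The same person appears twice: once as creator, once as contact.
    table = (
        "givenName\tsurName\trole\n"
        "Jane\tDoe\tcreator\n"
        "Jane\tDoe\tcontact\n"
        "John\tSmith\tfield technician\n"
    )
    print(check_roles(table))
```

An empty result means both required roles are present; any other role, such as the field technician here, is simply ignored by this check.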

For the purposes of NPS data packages, it is likely that you will not have Principal Investigators (PIs), or information about funding or funding agencies.

Note: editing personnel.txt is best done through a spreadsheet application.

Columns:

intellectual_rights.txt

There is no need to edit the intellectual rights file now.

EMLassemblyline autopopulates an “intellectual_rights.txt” file and uses it to add information to the intellectualRights element in your EML. Once you have finished generating your EML, update your intellectual rights to match NPS guidance using the separate EMLeditor::set_int_rights() function.

attributes_*.txt

Example 1, Example 2

If you have multiple data files (.csv), multiple text files will be generated, each starting with “attributes”, followed by your csv file name, and having the extension “.txt”. These files describe the columns of a data table (classes, units, datetime formats, missing value codes).
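
To make the naming pattern concrete, here is a small Python sketch that predicts the template name for a given data file. It assumes an underscore separates “attributes” from the file name and that the “.csv” extension is dropped before “.txt” is added; check the names against the files actually generated. The function name and the example file names are hypothetical.

```python
from pathlib import Path


def attribute_template_name(csv_file):
    """Build the expected attributes template name for one data file."""
    # Assumption: the ".csv" extension is dropped before ".txt" is added.
    return f"attributes_{Path(csv_file).stem}.txt"


if __name__ == "__main__":
    for name in ["site_visits.csv", "data/bird_counts.csv"]:
        print(name, "->", attribute_template_name(name))
```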

Note: editing attributes_*.txt is best done using a spreadsheet application.

Columns:

custom_units.txt

Example

Describes non-standard units used in a data table attribute template.

Note: custom_units.txt is best edited via a spreadsheet application.

Columns:

catvars_*.txt

Example 1, Example 2

Describes categorical variables of a data table (if any columns are classified as categorical in the table attributes template). If you have multiple data files (.csv), multiple catvars files will be created, one for each csv.

Note: The catvars files are best edited with a spreadsheet application.

Columns:

geographic_coverage.txt

Example

Describes where the data were collected.

If the only geographic coverage information you plan on using are park boundaries, you can skip this step. You can add park unit connections using EMLeditor, which will automatically generate properly formatted GPS coordinates for the park bounding boxes.

If you would like to add additional GPS coordinates (such as for specific site locations, transects, survey plots, or bounding boxes for locations within a park, etc) please do!

Note: Hopefully you won’t have to edit these, but if you do, they are best edited with a spreadsheet application.

Columns:

Coordinates must be in decimal degrees and include a minus sign (-) for latitudes south of the equator and longitudes west of the prime meridian. For points, repeat latitude and longitude coordinates in respective north/south and east/west columns. If you need to convert from UTMs, try using the utm_to_ll() function in the R/QCkit package.
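
The rule for points can be illustrated with a short Python sketch that repeats a latitude and longitude into the paired columns and rejects values that are not plausible decimal degrees. The column names follow the EML bounding-coordinate element names and are an assumption about the template, as are the function name point_row and the example coordinates.

```python
def point_row(description, latitude, longitude):
    """Build one coverage row for a single point in decimal degrees."""
    # Values outside these ranges cannot be decimal degrees; unconverted
    # UTM eastings and northings, for example, fail here.
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    return {
        "geographicDescription": description,
        # A point repeats its latitude in the north and south columns
        # and its longitude in the east and west columns.
        "northBoundingCoordinate": latitude,
        "southBoundingCoordinate": latitude,
        "eastBoundingCoordinate": longitude,
        "westBoundingCoordinate": longitude,
    }


if __name__ == "__main__":
    # Hypothetical site west of the prime meridian, so longitude is negative.
    print(point_row("Example survey plot", 44.4605, -110.8281))
```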

Currently, EML handles points and rectangles well. At the least precise end of the spectrum, you could enter an entire park unit as the geographic coverage. For a convenient way to get these coordinates, see the get_park_polygon() function in the R/EMLeditor package.

We strongly encourage you to be as precise as possible with your geographicCoverage and provide sampling points (e.g. along a transect) whenever possible. This information will (eventually) be displayed on a map on the DataStore Reference page for the data package and these points will also be directly discoverable through DataStore searches.

If you have CUI concerns about the specific locations of your sites, consider fuzzing them rather than completely removing them. One good tool for fuzzing geographic coordinates is the fuzz_location() function in the R/QCkit package.
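
To show the general idea of fuzzing, the Python sketch below replaces a point with the bounding coordinates of the grid cell that contains it. This is a simple stand-in technique chosen for illustration; it is not the algorithm used by fuzz_location() in QCkit, and the function name, cell size, and column names are all assumptions.

```python
import math


def fuzz_to_cell(latitude, longitude, cell_deg=0.1):
    """Return the grid cell (as bounding coordinates) containing a point."""
    # Snap down to the cell's south-west corner. A 0.1 degree cell spans
    # roughly 11 km north to south.
    south = math.floor(latitude / cell_deg) * cell_deg
    west = math.floor(longitude / cell_deg) * cell_deg
    # Rounding hides floating-point noise in the reported corners.
    return {
        "northBoundingCoordinate": round(south + cell_deg, 6),
        "southBoundingCoordinate": round(south, 6),
        "eastBoundingCoordinate": round(west + cell_deg, 6),
        "westBoundingCoordinate": round(west, 6),
    }


if __name__ == "__main__":
    # Two nearby hypothetical sites fall in the same cell, so the
    # published coverage no longer distinguishes them.
    print(fuzz_to_cell(44.46, -110.83))
    print(fuzz_to_cell(44.41, -110.89))
```

The exact site is no longer recoverable from the output, but a user can still see the general area where the data were collected.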

taxonomic_coverage.txt

Example

Describes biological organisms occurring in the data and helps resolve them to authority systems. If matches can be made, then the full taxonomic hierarchy of scientific and common names is automatically rendered in the final EML metadata. This enables future users to search on any taxonomic level of interest across data packages in repositories.

Note: Hopefully you don’t have to edit these.

Columns:

provenance.txt

Example

Describes source datasets. Explicitly listing the DOIs and/or URLs of input data helps future users understand in greater detail how the derived data were created, and may someday make it possible to assign attribution to the creators of referenced datasets.

Provenance metadata can be automatically generated for supported repositories simply by specifying an identifier (i.e. EDI) in the systemID column. For unsupported repositories (e.g. DataStore), the systemID column should be left blank.

For many monitoring protocols, there may not be any input datasets; instead, the data package is based on newly collected, original data. In this case, leave provenance.txt blank.

Columns:

annotations.txt

Example

Adds semantic meaning to metadata (variables, locations, persons, etc.) through links to ontology terms. This enables greater human understanding and machine actionability (linked data) and greatly improves the discoverability and interoperability of data in general.

Columns:

additional_info.txt

Example

Ancillary info not captured by any of the other templates.