Editing .txt Files • EMLeditor

Metadata inferred during the templating process should be validated by the user, and any missing information added. Use spreadsheet and text editors for this process. Template-specific guides are listed below.

NOTES:

Templates can be generated as .docx, .md, or .txt files. Here we focus on .txt. This is the default file format and also the simplest and least error-prone format.
Tabular templates: Leave empty cells blank; don’t fill them with NAs.
Free-text templates: Keep template content simple. Complex formatting can lead to errors.
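
The rule about blank cells can be checked mechanically. The Python sketch below is only an illustration: the function name `find_na_cells` is hypothetical, and it assumes the tabular templates are tab-delimited with a header row. It also assumes that the missingValueCode column of an attributes template may legitimately contain the text NA (as the code used in the data), so that column is skipped by default.

```python
import csv
import io

def find_na_cells(template_tsv: str, skip_columns=("missingValueCode",)) -> list:
    """Return (row_number, column_name) for every cell whose text is exactly "NA"."""
    reader = csv.DictReader(io.StringIO(template_tsv), delimiter="\t")
    hits = []
    # Row 1 is the header, so data rows start at 2 (assumes no embedded newlines).
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column in skip_columns:
                continue  # assumption: NA may be a real missing value code here
            if isinstance(value, str) and value.strip() == "NA":
                hits.append((row_number, column))
    return hits

# Hypothetical keywords template with an NA that should have been left blank.
sample = "keyword\tkeywordThesaurus\nwater quality\tNA\n"
for row_number, column in find_na_cells(sample):
    print(f"row {row_number}, column {column}: replace NA with an empty cell")
```

Any cell reported this way can simply be cleared in a spreadsheet application.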

abstract.txt

Example

Describes the salient features of a dataset in a concise summary, much like an abstract does in a journal article. It should cover what the data are and why they were created. A good rule of thumb is that an abstract should be about 250 words or fewer. The abstract will become a public-facing piece of text featured on the DataStore reference page as well as sent on to DataCite, data.gov, and Google’s Dataset Search. A thoughtful and well-planned abstract may be usable not only for the data package but also for the Data Release Report (DRR).

If you write your abstract in a word processor (such as MS Word) and paste it into the abstract.txt template file, please pass it through a text editor (such as Notepad) to make sure it is UTF-8 encoded and does not contain special characters, including line breaks such as <br>.
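
As a rough aid for that check, the Python sketch below inspects the raw bytes of an abstract. It is an illustration under stated assumptions, not part of EMLeditor: the function name `find_abstract_problems` is hypothetical, and flagging every non-ASCII character (for example curly quotes or en dashes carried over from a word processor) is a deliberately conservative choice, since such characters are not necessarily invalid.

```python
import re

def find_abstract_problems(raw: bytes) -> list:
    """Return human-readable warnings for the bytes of an abstract file."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Nothing else can be checked reliably if decoding fails.
        return [f"not valid UTF-8 (first bad byte at offset {err.start})"]
    problems = []
    if re.search(r"<br\s*/?>", text, flags=re.IGNORECASE):
        problems.append("contains HTML line-break tags such as <br>")
    non_ascii = sorted({ch for ch in text if ord(ch) > 127})
    if non_ascii:
        problems.append("non-ASCII characters to review: " + " ".join(non_ascii))
    word_count = len(text.split())
    if word_count > 250:
        problems.append(f"{word_count} words; the rule of thumb is about 250")
    return problems

# Hypothetical pasted text containing a <br> tag and an en dash.
sample = "Bird counts<br>from 2019 \u2013 2021".encode("utf-8")
for problem in find_abstract_problems(sample):
    print(problem)
```

In practice the bytes would come from reading abstract.txt in binary mode; an empty result means none of these particular checks fired.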

Note: editing abstract.txt is best done via a text editor.

methods.txt

Example

Describes the data creation methods. Includes enough detail for future users to correctly use the data. Lists instrument descriptions, protocols, etc. Methods sections can include citations. It may be appropriate to cite the Protocol, datasets that were ingested to generate the data package, software (e.g. R), packages (e.g. dplyr, ggplot2) or custom scripts.

Note: editing methods.txt is best done via a text editor.

keywords.txt

Example

Describes the data in a small set of terms. Keywords facilitate search and discovery on scientific terms, as well as names of research groups, field stations, and other organizations. Using a controlled vocabulary or thesaurus vastly improves discovery. We recommend using the LTER Controlled Vocabulary when possible.

Note: editing keywords.txt is best done via a spreadsheet application.

Columns:

keyword One keyword per line
keywordThesaurus URI of the vocabulary from which the keyword originates.

personnel.txt

Example

Describes the personnel and funding sources involved in the creation of the data. This facilitates attribution and reporting.

Valid EML requires at least one person with a creator role. Creator is a synonym for Author in DataStore.

DataStore also requires at least one person with the role of contact. If the creator and the contact are the same person, list that person twice (i.e. on two separate rows).

Additional personnel (field technicians, consultants, collaborators, contributors, etc.) may be added to give credit as necessary. Any roles other than “creator” or “contact” will be listed as associatedParties.

For the purposes of NPS data packages, it is likely that you will not have Principal Investigators (PIs) or information about funding or funding agencies.

Note: editing personnel.txt is best done through a spreadsheet application.

Columns:

givenName First name
middleInitial Middle initial
surName Last name
organizationName Organization the person belongs to
electronicMailAddress Email address
userId Person’s research identifier (e.g. ORCID). Links a person’s research profile to a data publication.
role Role of the person with respect to the data. Persons serving more than one role are listed on separate lines (e.g. replicate the person’s info on separate lines but change the role). Valid options:
- creator Author(s) of the data. Will appear in the data citation.
- PI Principal investigator the data were created under. Will appear with project level metadata. It is OK to leave this blank as there will be no PIs for many NPS data packages.
- contact A point of contact for questions about the data. Can be an organization or position (e.g. data manager). To do this, enter the organization or position name under givenName and leave middleInitial and surName empty.
- Other roles (e.g. Field Technician) will be listed as associated parties to the data. Their specific role (e.g. “Field Tech”) will also be listed in the metadata.
Funding information is listed with PIs:
- projectTitle Title of project the data were created under. If ancillary projects were involved, then add as new lines below the primary project with the PI’s info replicated. This can typically be left blank.
- fundingAgency Agency the project was funded by. This can be left blank.
- fundingNumber Grant or award number. Likely leave this blank.
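
The two role requirements above (at least one creator and at least one contact) are easy to verify before building EML. The Python sketch below is illustrative only: `missing_required_roles` is a hypothetical helper, the person in the sample is invented, and the code assumes personnel.txt is tab-delimited with the column names listed above and that roles are matched exactly as typed.

```python
import csv
import io

def missing_required_roles(personnel_tsv: str) -> set:
    """Return whichever of the required roles never appears in the role column."""
    reader = csv.DictReader(io.StringIO(personnel_tsv), delimiter="\t")
    roles = {(row.get("role") or "").strip() for row in reader}
    return {"creator", "contact"} - roles

# Hypothetical table: the same person listed twice, once per role.
header = "\t".join([
    "givenName", "middleInitial", "surName", "organizationName",
    "electronicMailAddress", "userId", "role",
    "projectTitle", "fundingAgency", "fundingNumber",
])
sample = (
    header + "\n"
    "Jane\tQ\tDoe\tExample Park\tjane_doe@example.org\t\tcreator\t\t\t\n"
    "Jane\tQ\tDoe\tExample Park\tjane_doe@example.org\t\tcontact\t\t\t\n"
)
print(missing_required_roles(sample))
```

An empty set means both required roles are present; the funding columns are left blank here, as the guidance above allows.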

intellectual_rights.txt

There is no need to edit the intellectual rights file now.

EMLassemblyline autopopulates an “intellectual_rights.txt” file and uses that file to add information to the intellectualRights element in your EML. Once you have finished generating your EML, you need to update your intellectual rights to coincide with NPS guidance using the separate EMLeditor::set_int_rights() function.

attributes_*.txt

Example 1, Example 2

If you have multiple data files (.csv), multiple text files will be generated, each starting with “attributes”, followed by your csv file name, and having the extension “.txt”. These files describe the columns of a data table (classes, units, datetime formats, missing value codes).

Note: editing attributes_*.txt is best done using a spreadsheet application.

Columns:

attributeName Column name. Make sure that each column in the data file has an attributeName and that the two match exactly (names are case-sensitive).
attributeDefinition Column definition
class Column class. Valid options are:
- numeric Numeric variable
- categorical Categorical variable (i.e. nominal)
- character Free text character strings (e.g. notes)
- Date Date and time variable
unit Column unit. Required for numeric classes. Select from EML’s standard unit dictionary, accessible with view_unit_dictionary(). Use values in the “id” column. If not found, then define as a custom unit (see custom_units.txt).
dateTimeFormatString Format string. Required for Date classes. Valid format string components are:
- Y Year
- M Month
- D Day
- h Hour
- m Minute
- s Second
Common separators of format string components (e.g. - / :) are supported.
missingValueCode Missing value code. Required for columns containing a missing value code.
missingValueCodeExplanation Definition of missing value code.
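
The column rules above (matching names, units for numeric columns, format strings for Date columns) can be checked together. The Python sketch below is a hedged illustration: `check_attributes` is a hypothetical helper, the sample columns are invented, and the code assumes the attributes template is tab-delimited with exactly the column names listed above. It does not check units against the EML unit dictionary.

```python
import csv
import io

def check_attributes(attributes_tsv: str, data_header: list) -> list:
    """Compare an attributes template against the column names of its data file."""
    rows = list(csv.DictReader(io.StringIO(attributes_tsv), delimiter="\t"))
    names = [row.get("attributeName") for row in rows]
    problems = []
    for column in data_header:
        if column not in names:  # exact, case-sensitive comparison
            problems.append(f"{column}: no matching attributeName (check case)")
    for row in rows:
        name, column_class = row.get("attributeName"), row.get("class")
        if column_class == "numeric" and not row.get("unit"):
            problems.append(f"{name}: numeric class needs a unit")
        if column_class == "Date" and not row.get("dateTimeFormatString"):
            problems.append(f"{name}: Date class needs a dateTimeFormatString")
    return problems

# Hypothetical template: "site" differs in case from the data column "Site",
# and the numeric column temp_c has no unit.
sample_attributes = (
    "attributeName\tattributeDefinition\tclass\tunit\tdateTimeFormatString\t"
    "missingValueCode\tmissingValueCodeExplanation\n"
    "site\tSite code\tcategorical\t\t\t\t\n"
    "temp_c\tWater temperature\tnumeric\t\t\t\t\n"
    "date\tSampling date\tDate\t\tYYYY-MM-DD\t\t\n"
)
for problem in check_attributes(sample_attributes, ["Site", "temp_c", "date"]):
    print(problem)
```

The data header would normally be read from the first line of the corresponding csv file.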

custom_units.txt

Example

Describes non-standard units used in a data table attribute template.

Note: custom_units.txt is best edited via a spreadsheet application.

Columns:

id Unit name listed in the unit column of the table attributes template (e.g. feetPerSecond)
unitType Unit type (e.g. velocity)
parentSI SI equivalent (e.g. metersPerSecond)
multiplierToSI Multiplier to SI equivalent (e.g. 0.3048)
description Abbreviation (e.g. ft/s)
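
To make the role of multiplierToSI concrete: a value in the custom unit multiplied by multiplierToSI gives the value in the parentSI unit. The short Python sketch below works through the feetPerSecond example from the column descriptions; the function name `to_parent_si` is hypothetical and the dictionary is just one way to picture a single template row.

```python
def to_parent_si(value: float, multiplier_to_si: float) -> float:
    """Convert a value in the custom unit to its parentSI unit."""
    return value * multiplier_to_si

# Row values taken from the examples in the column descriptions above.
feet_per_second = {
    "id": "feetPerSecond",
    "unitType": "velocity",
    "parentSI": "metersPerSecond",
    "multiplierToSI": 0.3048,
    "description": "ft/s",
}

speed_si = to_parent_si(10.0, feet_per_second["multiplierToSI"])
print(f"10 {feet_per_second['id']} = {speed_si:.3f} {feet_per_second['parentSI']}")
```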

catvars_*.txt

Example 1, Example 2

Describes categorical variables of a data table (if any columns are classified as categorical in table attributes template). If you have multiple data files (csvs), multiple catvars files will be created, one for each csv.

Note: The catvars files are best edited with a spreadsheet application.

Columns:

attributeName Column name
code Categorical variable
definition Definition of categorical variable

geographic_coverage.txt

Example

Describes where the data were collected.

If the only geographic coverage information you plan on using are park boundaries, you can skip this step. You can add park unit connections using EMLeditor, which will automatically generate properly formatted GPS coordinates for the park bounding boxes.

If you would like to add additional GPS coordinates (such as for specific site locations, transects, survey plots, or bounding boxes for locations within a park, etc.), please do!

Note: Hopefully you won’t have to edit these, but if so they are best edited with a spreadsheet application.

Columns:

geographicDescription Brief description of location.
northBoundingCoordinate North coordinate
southBoundingCoordinate South coordinate
eastBoundingCoordinate East coordinate
westBoundingCoordinate West coordinate

Coordinates must be in decimal degrees and include a minus sign (-) for latitudes south of the equator and longitudes west of the prime meridian. For points, repeat latitude and longitude coordinates in respective north/south and east/west columns. If you need to convert from UTMs, try using the utm_to_ll() function in the R/QCkit package.
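
The coordinate rules above can be sketched in Python as follows. Both helpers are hypothetical illustrations rather than EMLeditor or QCkit functions, and the sample location is invented. The check deliberately does not require west to be less than east, on the assumption that a box crossing the 180° meridian could legitimately violate that.

```python
def check_bounding_box(north: float, south: float, east: float, west: float) -> list:
    """Return problems found in one row of decimal-degree bounding coordinates."""
    problems = []
    if not (-90 <= south <= north <= 90):
        problems.append("latitudes must satisfy -90 <= south <= north <= 90")
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        problems.append("longitudes must be between -180 and 180")
    return problems

def point_row(description: str, latitude: float, longitude: float) -> dict:
    """Build a template row for a point: latitude fills N/S, longitude fills E/W."""
    return {
        "geographicDescription": description,
        "northBoundingCoordinate": latitude,
        "southBoundingCoordinate": latitude,
        "eastBoundingCoordinate": longitude,
        "westBoundingCoordinate": longitude,
    }

# Hypothetical sampling plot in the western hemisphere (negative longitude).
row = point_row("Hypothetical plot A", 44.4280, -110.5885)
print(row)
print(check_bounding_box(
    row["northBoundingCoordinate"], row["southBoundingCoordinate"],
    row["eastBoundingCoordinate"], row["westBoundingCoordinate"],
))
```

A common mistake this catches is a longitude entered without its minus sign only when the result also falls out of range, so a visual check on a map is still worthwhile.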

Currently EML handles points and rectangles well. At the least precise end of the spectrum, you could enter an entire park unit as the geographic coverage. For a convenient way to get these coordinates, see the get_park_polygon() function in the R/EMLeditor package.

We strongly encourage you to be as precise as possible with your geographicCoverage and provide sampling points (e.g. along a transect) whenever possible. This information will (eventually) be displayed on a map on the DataStore Reference page for the data package and these points will also be directly discoverable through DataStore searches.

If you have CUI concerns about the specific locations of your sites, consider fuzzing them rather than completely removing them. One good tool for fuzzing geographic coordinates is the fuzz_location() function in the R/QCkit package.

taxonomic_coverage.txt

Example

Describes biological organisms occurring in the data and helps resolve them to authority systems. If matches can be made, then the full taxonomic hierarchy of scientific and common names is automatically rendered in the final EML metadata. This enables future users to search on any taxonomic level of interest across data packages in repositories.

Note: Hopefully you don’t have to edit these.

Columns:

taxa_raw Taxon name as it occurs in the data and as it will be listed in the metadata if no value is listed under the name_resolved column. Can be single word or species binomial.
name_type Type of name. Can be “scientific” or “common”.
name_resolved Taxon’s name as found in an authority system.
authority_system Authority system in which the taxon’s name was found. Can be: “ITIS”, “WORMS”, or “GBIF”.
authority_id Taxon’s identifier in the authority system (e.g. 168469).

provenance.txt

Example

Describes source datasets. Explicitly listing the DOIs and/or URLs of input data helps future users understand in greater detail how the derived data were created, and may someday make it possible to assign attribution to the creators of referenced datasets.

Provenance metadata can be automatically generated for supported repositories simply by specifying an identifier (i.e. EDI) in the systemID column. For unsupported repositories (e.g. DataStore), the systemID column should be left blank.

For many monitoring protocols, there may not be any input datasets; instead, the data package is based on newly collected, original data. In this case, leave provenance.txt blank.

Columns:

dataPackageID Data package identifier. Supplying a valid packageID and systemID (of supported systems) is all that is needed to create a complete provenance record.
systemID System (i.e. data repository) identifier. Currently supported systems are: EDI (Environmental Data Initiative). Leave this column blank unless specifying a supported system.
url URL linking to an online source (i.e. data, paper, etc.). Required when a source can’t be defined by a packageID and systemID.
onlineDescription Description of the data source. Required when a source can’t be defined by a packageID and systemID.
title The source title. Required when a source can’t be defined by a packageID and systemID.
givenName A creator’s or contact’s given name. Required when a source can’t be defined by a packageID and systemID.
middleInitial A creator’s or contact’s middle initial. Required when a source can’t be defined by a packageID and systemID.
surName A creator’s or contact’s last name. Required when a source can’t be defined by a packageID and systemID.
role “creator” and “contact” of the data source. Required when a source can’t be defined by a packageID and systemID. Add both the creator and contact as separate rows within the template, where the information in each row is duplicated except for the givenName, middleInitial, surName (or organizationName), and role fields.
organizationName Name of organization the creator or contact belongs to. Required when a source can’t be defined by a packageID and systemID.
email Email of the creator or contact. Required when a source can’t be defined by a packageID and systemID.

annotations.txt

Example

Adds semantic meaning to metadata (variables, locations, persons, etc.) through links to ontology terms. This enables greater human understanding and machine actionability (linked data) and greatly improves the discoverability and interoperability of data in general.

Columns:

id A unique identifier for the element being annotated.
element The element being annotated.
context The context of the subject (i.e. element value) being annotated (e.g. if the same column name occurs in more than one data table, you will need to know which table it came from).
subject The element value to be annotated.
predicate_label The predicate label (a.k.a. property) describing the relation of the subject to the object. This label should be copied directly from an ontology.
predicate_uri The predicate label URI copied directly from an ontology.
object_label The object label (a.k.a. value) describing the subject. This label should be copied directly from an ontology.
object_uri The object URI copied from an ontology.

additional_info

Example

Ancillary info not captured by any of the other templates.