NPS EML Scripting

Resources and Guides for using EMLassemblyline to create EML for National Park Service data packages

Step-by-step-guide

Summary

This code creates an EML file for a data package. In this case the example is an EVER AA dataset, which consists of two data files (qry_Export_AA_Points.csv and qry_Export_AA_VegetationDetail.csv), both located in the Example_files folder.

The first set of steps involves inputting information about your data package. The second set of steps (“EMLassemblyline Functions”) is a step-by-step process in which each section should be reviewed, edited if necessary, and run one at a time. Not all of the functions apply to all data packages. The EMLassemblyline functions will generate .txt files that need to be manually edited.

The final section has the make_eml() function to put together the full metadata file.

Before you Start

Required Software

You will need to install R and may find RStudio helpful. Both can be installed from Software Center without need for admin rights on your machine. For more information, see the R Advisory Group’s website.

If you’ve never installed EMLassemblyline, you’ll need to install it (as well as some other packages) and then load them into R’s working memory:

#Install necessary packages
install.packages(c("devtools", "lubridate", "tidyverse", "stringr"))

#If you run into errors installing packages from GitHub on NPS computers, you
#may first need to run:
#options(download.file.method = "wininet")
devtools::install_github("EDIorg/EMLassemblyline")

#Load required packages
library(EMLassemblyline)
library(lubridate)
library(tidyverse)
library(stringr)

DataStore

You will need to initiate a draft reference on DataStore. You will also need to know the URLs of your data files. These will be located on DataStore. Go to DataStore, select “Create” from the green drop down menu and then choose “Draft Reference”. For now, select “Tabular Dataset” as the reference type (Data Package Reference Type coming soon!). Make sure to take note of the 7-digit code for the Reference as you will need it later.

Don’t activate the reference yet!

Using the Template

Data Package Info

#Metadata filename - this becomes the filename, so make sure it ends in _metadata to comply with the NPS data package specifications
metadata_id <- "TEST_EVER_AA_metadata"

#Overall package title (replace with your title).
package_title <- "TEST_Title"

#Description of data collection status - choose from 'ongoing' or 'complete'
data_type <- "complete"

Files: Local path

Tell EMLassemblyline where your files are (and what they are).

For vectors with more than one item, keep the order the same (i.e. item #1 should correspond to the same file in each vector).

#Path to data file(s). The default "getwd()" assumes that your working
#directory in R is where your data files for your data package are.

#To check your current working directory use getwd()

#To change your working directory use setwd("path/to/my/datafiles").

#Alternatively you can continue working in your current directory and assign
#the path to your data files to 'working_folder' instead of 'getwd()'
working_folder <- getwd()

#Vector of dataset file names. Tell EMLassemblyline what the names of your 
#data files are. Watch out for spelling and case.  
data_files <- c("qry_Export_AA_Points.csv", 
                "qry_Export_AA_VegetationDetail.csv")

#If the only .csv files in your working_folder are data files for your data
#package, you could instead use:
#data_files <- list.files(pattern = "*.csv")
  
#Vector of dataset names (brief name for each file). Make sure the names are
#in the same order as the file names!
data_names <- c("TEST_AA Point Data",
                "TEST_AA Vegetation Data")
  
#Vector of dataset descriptions (about 10 words describing each file).
#Descriptions will be used in auto-generated tables within the ReadMe and DRR.
#If you need to use more than about 10 words, consider putting that information
#in the abstract, methods, or additional info sections. Again, be sure these
#are in the same order as your data files!
data_descriptions <- c("TEST_Everglades Vegetation Map Accuracy Assessment point data",
                       "Everglades Vegetation Map Accuracy Assessment vegetation data")

Files: Ultimate URL

Tell EMLassemblyline where your files will ultimately be located. Create a vector of dataset URLs - for DataStore I recommend setting this to the main reference page. All data files from a single data package can be accessed from the same page so the URLs are the same.

#The 7-digit code from the draft reference you initiated above (replace
#2293181 with your code)
DSRefCode <- 2293181

#No need to edit this: builds the DataStore reference URL and repeats it once
#per data file, since make_eml() expects one URL for each data table
DSURL <- paste0("https://irma.nps.gov/DataStore/Reference/Profile/", DSRefCode)
data_urls <- rep(DSURL, length(data_files))

Taxonomic Information

Tell EMLassemblyline where to find the table and field that contain scientific names. These will be used to fill the taxonomic coverage metadata. If you don’t have scientific names (e.g. water quality data), skip this step and do not run Function 5 (below).

data_taxa_table <- "qry_Export_AA_VegetationDetail.csv"
data_taxa_field <- "Scientific_Name"

Geographic Coordinates

Tell EMLassemblyline where to find the table and fields that contain the geographic coordinates and site names that can be used to fill the geographic coverage metadata. Comment these out and do not run Function 4 (below) if your data package does not contain geographic information.

If the only geographic coverage information you plan on using are park boundaries, you can also skip this step and comment out Function 4 (below). You can add park unit connections later using EMLeditor, which will automatically generate properly formatted GPS coordinates for the park bounding boxes.

Coordinates must be in decimal degrees and include a minus sign (-) for latitudes south of the equator and longitudes west of the prime meridian. For points, repeat latitude and longitude coordinates in respective north/south and east/west columns. If you need to convert from UTMs, try using the utm_to_ll() function in the R/QCkit package.
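
If you do need to convert UTM coordinates yourself, one option outside this workflow is the sf package. The sketch below is illustrative only: the file name, column names, and EPSG code (32617, UTM zone 17N) are assumptions you would replace with your own values.

library(sf)

#Read points with UTM eastings/northings (hypothetical file and column names)
pts <- read.csv("my_utm_points.csv")
pts_sf <- st_as_sf(pts, coords = c("Easting", "Northing"), crs = 32617)

#Reproject to WGS84 and pull out decimal-degree coordinates
lat_long <- st_coordinates(st_transform(pts_sf, crs = 4326))
pts$decimalLongitude <- lat_long[, "X"]
pts$decimalLatitude <- lat_long[, "Y"]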

The example below will generate a single point for each site. We strongly encourage you to be as precise as possible with your geographicCoverage and provide sampling points (e.g. along a transect) whenever possible. This information will (eventually) be displayed on a map on the DataStore Reference page for the data package and these points will also be directly discoverable through DataStore searches.

If you have CUI concerns about the specific locations of your sites, consider fuzzing them rather than completely removing them. One good tool for fuzzing geographic coordinates is the fuzz_location() function in the R/QCkit package.

data_coordinates_table <- "qry_Export_AA_Points.csv"
data_latitude <- "decimalLatitude"
data_longitude <- "decimalLongitude"
data_sitename <- "Point_ID"

Start and End Dates

The start date and end date should indicate the first and last data points in the data package (across all files) and do not include any planning, pre-, or post-processing time. The format should comply with the International Organization for Standardization's ISO 8601 standard. The recommended format for EML is YYYY-MM-DD, where Y is the four-digit year, M is the two-digit month (for example, January = 01), and D is the two-digit day of the month (01 - 31).

startdate <- ymd("2010-01-26")
enddate <- ymd("2013-01-04")
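
Alternatively, you can derive these dates from the data instead of hard-coding them. This is a minimal sketch; the date column name ("Event_Date") is a placeholder for whatever date field your data files actually contain.

#Derive the temporal coverage from the data (hypothetical date column)
points <- read.csv(file.path(working_folder, "qry_Export_AA_Points.csv"))
startdate <- min(ymd(points$Event_Date), na.rm = TRUE)
enddate <- max(ymd(points$Event_Date), na.rm = TRUE)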

EMLassemblyline Functions

The next set of optional items is meant to be considered one by one and only run if applicable to a particular data package. In the first year you will typically run all of them, but if the data format and protocol stay constant over time it may be possible to skip some or all of them in future years.

Function 1: Core Metadata

The template_core_metadata() function creates blank .txt template files for the abstract, additional information, custom units, intellectual rights, keywords, methods, and personnel. Some of these .txt files are easily edited in text editors (e.g. abstract.txt) but some are best edited in Excel (or another spreadsheet application), for instance the personnel.txt file.

template_core_metadata(path = working_folder, 
                       license = "CC0")

Editing Templates

For specific guidance on editing these templates as well as example templates, see the documentation on editing templates.

Typically these files can be reused between years.

Intellectual Rights

The template_core_metadata() function will generate an “intellectual_rights.txt” file that has already been populated with an EDI-specific CC0 license. This license will need to be updated. However, to ensure that the license meets NPS specifications and properly coincides with CUI designations, the best way to update the license information is during a later step using the EMLeditor::set_int_rights() function. For now, you can leave this template file unaltered.

Function 2: Attributes

Creates an “attributes_datafilename.txt” file for each data file. Open each of these in Excel (we recommend against trying to update them in a text editor) and fill in/adjust the columns for attributeDefinition, class, unit, etc. Refer to the webpage on editing templates for helpful hints and examples. You can use EMLassemblyline::view_unit_dictionary() for a list of potential units (see the example below). This will only need to be run again if the attributes (name, order, or new/deleted fields) are modified from the previous year. NOTE that if this file already exists from a previous run, it is not overwritten.

template_table_attributes(path = working_folder,
                          data.table = data_files,
                          write.file = TRUE)
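
To look up the exact unit names expected in the ‘unit’ column, you can open the standard unit dictionary from the console:

#Browse the EML standard unit dictionary to find valid unit names
EMLassemblyline::view_unit_dictionary()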

Function 3: Categoricals

Creates a “catvars_datafilename.txt” file for each data file that has columns with class = categorical. These .txt files will include each unique ‘code’ and allow input of the corresponding ‘definition’. Note that because the list of codes is harvested from the data itself, additional codes may have been relevant/possible but are not automatically included here. Consider your lookup lists carefully to see whether additional options should be included (e.g. if your dataset quality control flagging values are all set to “Accepted”, this function will not include “Raw” or “Provisional” in the resulting file and you may want to add those manually; see the sketch below). NOTE that if this file already exists from a previous run, it is not overwritten.

template_categorical_variables(path = working_folder,
                               data.path = working_folder, 
                               write.file = TRUE)
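
If you want to add codes that never occur in the data (as in the flagging example above), you can append them to the generated template programmatically rather than in Excel. This is a minimal sketch, assuming the standard tab-delimited catvars layout with attributeName, code, and definition columns; the file name, attribute name, and definitions shown are illustrative.

#Hypothetical example: add "Raw" and "Provisional" flag codes to a catvars file
catvars_file <- file.path(working_folder, "catvars_qry_Export_AA_Points.txt")
catvars <- read.delim(catvars_file, stringsAsFactors = FALSE)
extra_codes <- data.frame(attributeName = "QC_Flag",
                          code = c("Raw", "Provisional"),
                          definition = c("Value has not been reviewed",
                                         "Value reviewed but not yet accepted"))
catvars <- rbind(catvars, extra_codes)
write.table(catvars, catvars_file, sep = "\t", row.names = FALSE,
            quote = FALSE, na = "")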

Function 4: Geography

Creates a geographic_coverage.txt file that lists your sites as points as long as your coordinates are in lat/long. If your coordinates are in UTM it is probably easiest to convert them first or create the geographic_coverage.txt file another way. For instance, try using the utm_to_ll() function in R/QCkit.

template_geographic_coverage(path = working_folder, 
                             data.path = working_folder, 
                             data.table = data_coordinates_table, 
                             lat.col = data_latitude, 
                             lon.col = data_longitude, 
                             site.col = data_sitename, 
                             write.file = TRUE)

Function 5: Taxonomy

Creates a taxonomic_coverage.txt file if you have taxonomic data. In terms of authorities 3 = ITIS, 9 = WORMS, and 11 = GBIF. Turning this function on may dramatically increase the time it takes for the make_eml() function to run, especially if you have many taxa or a slow internet connection. You may want to consider turning this off for testing/development purposes.

template_taxonomic_coverage(path = working_folder, 
                            data.path = working_folder, 
                            taxa.table = data_taxa_table, 
                            taxa.col = data_taxa_field, 
                            taxa.authority = c(3,11), 
                            taxa.name.type = 'scientific', 
                            write.file = TRUE)

Make EML function

Run this (it may take a little while) and see if it validates (you should see ‘Validation passed’). There may also be some issues to review. Run the issues() function at the end of the process to get feedback on items that might be missing.

make_eml(path = working_folder,
         dataset.title = package_title,
         data.table = data_files,
         data.table.name = data_names,
         data.table.description = data_descriptions,
         data.table.url = data_urls,
         temporal.coverage = c(startdate, enddate),
         maintenance.description = data_type,
         package.id = metadata_id)
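
Once make_eml() has finished, call issues() (no arguments) to print any outstanding problems EMLassemblyline noted along the way:

#Report any issues flagged during templating and EML assembly
issues()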

Finished? Not quite…

Now that you have valid EML metadata, you need to add NPS-specific elements and fields. For instance, unit connections, DOIs, referencing a DRR, etc. To do that, use the R/EMLeditor package, which includes instructions for a minimal EMLeditor workflow for creating compliant EML metadata.