NPSdataverse

Summary

The NPSdataverse is a suite of R packages developed to create, document, publish, and access data and metadata in open and machine-readable formats. NPSdataverse is modeled on the tidyverse concept of several packages built with a common goal (Wickham et al. 2019). The NPSdataverse supports Ecological Metadata Language (EML) metadata and .csv data files. Some of the constituent R packages (EML and EMLassemblyline) are general-use and aimed at authoring EML documents. Other R packages (QCkit, EMLeditor, DPchecker and NPSutils) are designed and maintained by the National Park Service (NPS).

Although many functions within the NPSdataverse packages are NPS-specific (particularly some API calls), whenever possible the functions are written so that they can also be used by the general public. Scientists conducting permitted research in NPS units can use the NPSdataverse to efficiently and consistently meet the data delivery requirements of their permits. Additionally, the packages will be useful for data management plans in a wide variety of grant proposals and for anyone who needs to create open data and machine-readable metadata. The ability to swiftly and easily author, edit, and check EML metadata in a reproducible fashion will be useful for data publication at any number of repositories or data journals. Finally, a scripted interface for downloading NPS data and leveraging metadata while loading it into R or other platforms for subsequent analyses and visualizations will be useful to researchers in government, academia, and industry as well as to the public.

Usage

The NPSdataverse package is a meta-package that loads packages within the NPSdataverse into R (Baker, Patterson, and DeVivo 2025). The NPSdataverse provides a convenient way to download, install, and load many of the R packages needed to create and access data packages, which consist of rich Ecological Metadata Language metadata and .csv data files:

pak::pkg_install("nationalparkservice/NPSdataverse")
library(NPSdataverse)

NPSdataverse automatically checks that the latest version of each R package is being loaded, either from the main development branch on GitHub.com or from CRAN. If updates are available, the user is alerted and given instructions on how to update the relevant packages. To avoid exceeding GitHub API rate limits (and to facilitate scripted workflows such as those at High Performance Computing facilities), NPSdataverse only checks for updates from an interactive R session and skips the checks when the system is offline or GitHub.com is not responding.
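The gating behavior described above can be illustrated with a minimal base R sketch. This is not the NPSdataverse implementation: the function names `should_check_updates()`, `can_reach()`, and `needs_update()` are hypothetical, as is the choice of probing `api.github.com` on port 443. The sketch only shows how an update check might be skipped in non-interactive sessions or when the remote host cannot be reached.

```r
# Hypothetical reachability probe using base R only (illustrative, not the
# package's actual connectivity test).
can_reach <- function(host, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host, port = 443, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# Hypothetical gate: run update checks only in an interactive session and
# only when the host responds. Because R evaluates default arguments lazily
# and `&&` short-circuits, the network probe never runs in scripted sessions.
should_check_updates <- function(is_interactive = interactive(),
                                 host_reachable = can_reach("api.github.com")) {
  is_interactive && host_reachable
}

# Compare an installed version string with the latest known version string.
needs_update <- function(installed, latest) {
  utils::compareVersion(installed, latest) < 0
}
```

In a batch job, `should_check_updates()` returns `FALSE` without touching the network, which is the property that keeps scripted workflows from consuming API quota.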

Statement of Need

Following a movement for transparency in scientific research and data accessibility, the U.S. implemented the federal OPEN Government Data Act (“H. R. 4174” 2018). The Act mandates that federal agencies provide data in open formats with metadata. Subsequently, many funding agencies such as the National Science Foundation have required that grant awardees make their data public, often including metadata (The National Science Foundation Open Government Plan 3.5 2016). Multiple publishers have followed suit (Wiley 2022; Springer 2023) and require data availability statements upon publication.

One goal of open science, and a requirement of the recent “Nelson Memo” from the U.S. Office of Science and Technology Policy (Nelson et al. 2022), is to make data FAIR: findable, accessible, interoperable, and reusable (Wilkinson et al. 2016). These goals are often achieved by including structured, machine-readable metadata that conforms to a defined schema along with the data. Ecological Metadata Language (EML) is one metadata standard that is particularly amenable to studies with rich taxonomy (Jones et al. 2006, 2019). It has been adopted by multiple research organizations including the Environmental Data Initiative (EDI), National Ecological Observatory Network (NEON), Global Biodiversity Information Facility (GBIF), Swedish Biodiversity Data Infrastructure (SBDI), French Biodiversity Hub (“Pole National de Donnees de Biodiversite”), U.S. National Park Service, and others.

Nevertheless, actual availability of data and metadata varies (Federer 2018; Tedersoo et al. 2021), perhaps because there is a need for more infrastructure and tools to meet the goals of open data and open science (Huston, Edge, and Bernier 2019). Multiple solutions have been presented, including ezEML, a tool for authoring metadata in Ecological Metadata Language and publishing data and metadata to a repository (Vanderbilt et al. 2022). ezEML has an intuitive graphical user interface that is relatively easy to learn; however, it does have some drawbacks. For instance, ezEML is not scriptable, which makes repeated deployments of the same or similar workflows challenging and can limit reproducibility. ezEML also requires that users upload their data to an external site for processing, which may not be suitable for sensitive data.

Here we introduce the NPSdataverse, a series of R packages for authoring, editing, and checking EML metadata locally in a robust, repeatable, and scriptable fashion. R packages within the NPSdataverse leverage earlier work using R to create and manipulate XML-based EML files (Boettiger 2019). Building upon that framework, we add user-friendly EML creation workflows; integration with taxonomic databases; fast, easy editing of existing metadata; congruence checks to test correspondence between data and metadata; and integration with public repositories such as the National Park Service’s DataStore. R packages within the NPSdataverse also include functions that expedite data quality control, facilitate data interoperability, provide the ability to download data directly from DataStore, and leverage the rich EML associated with the data regardless of repository of origin.
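To make the idea of a congruence check concrete, the following base R sketch compares the column names of a data table with the attribute names declared in its metadata. It is an illustration only: `check_columns_match()` is a hypothetical helper, not a function from DPchecker or any other NPSdataverse package, and the toy data frame and attribute vector stand in for a real .csv file and the attribute list of a real EML document.

```r
# Hypothetical congruence check (not a DPchecker function): compare the
# column names of a data table with the attribute names declared in metadata.
check_columns_match <- function(data, attribute_names) {
  cols <- names(data)
  list(
    # TRUE only when names and their order agree exactly
    congruent = identical(cols, attribute_names),
    # attributes documented in the metadata but absent from the data
    missing_from_data = setdiff(attribute_names, cols),
    # columns present in the data but not documented in the metadata
    undocumented_in_metadata = setdiff(cols, attribute_names)
  )
}

# Toy data standing in for a .csv file loaded with read.csv().
obs <- data.frame(
  site = c("A", "B"),
  species = c("Ursus arctos", "Lynx rufus"),
  count = c(2L, 5L)
)

# Attribute names as they might be extracted from EML metadata (assumed).
declared <- c("site", "species", "date")

result <- check_columns_match(obs, declared)
```

Here `result$congruent` is `FALSE` because the metadata declares a `date` attribute that the data lack, while the `count` column is undocumented. Returning the two difference sets, rather than a bare logical, tells the author which side needs editing.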

References

Baker, Robert, Judd Patterson, and Joe DeVivo. 2025. NPSdataverse: Tools and Packages for Data and Metadata Manipulation. https://doi.org/10.57830/2313107.

Boettiger, Carl. 2019. “Ecological Metadata as Linked Data.” Journal of Open Source Software 4 (34): 1276. https://doi.org/10.21105/joss.01276.

Federer, Lisa M., Christopher W. Belter, Douglas J. Joubert, et al. 2018. “Data Sharing in PLOS ONE: An Analysis of Data Availability Statements.” PLOS ONE 13 (5): 1–12. https://doi.org/10.1371/journal.pone.0194768.

“H. R. 4174.” 2018. Law. H.R.4174 - 115th Congress. https://www.congress.gov/bill/115th-congress/house-bill/4174.

Huston, P, VL Edge, and E Bernier. 2019. “Open Science/Open Data: Reaping the Benefits of Open Data in Public Health.” Canada Communicable Disease Report 45 (11): 252.

Jones, Matthew, Margaret O’Brien, Bryce Mecum, Carl Boettiger, Mark Schildhauer, Mitchell Maier, Timothy Whiteaker, Stevan Earl, and Steven Chong. 2019. “Ecological Metadata Language Version 2.2.0.” https://doi.org/10.32614/cran.package.eml.

Jones, Matthew, Mark P. Schildhauer, O. J. Reichman, and Shawn Bowers. 2006. “The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere.” Annual Review of Ecology, Evolution, and Systematics 37: 519–44. https://doi.org/10.1146/annurev.ecolsys.37.091305.110031.

Nelson, Alondra et al. 2022. “Memorandum for the Heads of Executive Departments and Agencies: Ensuring Free, Immediate, and Equitable Access to Federally Funded Research.”

Springer. 2023. “Data Availability Statement.” https://www.springer.com/gp/editorial-policies/data-availability-statement.

Tedersoo, Leho, Rainer Küngas, Ester Oras, Kajar Köster, Helen Eenmaa, Äli Leijen, Margus Pedaste, et al. 2021. “Data Sharing Practices and Data Availability Upon Request Differ Across Scientific Disciplines.” Scientific Data 8 (1): 192. https://doi.org/10.1038/s41597-021-00981-0.

The National Science Foundation Open Government Plan 3.5. 2016. Alexandria, VA, USA: National Science Foundation; Publication number: NSF 16-131. https://www.nsf.gov/notices/general/national-science-foundation-open-government-plan-40.

Vanderbilt, Kristin, Jon Ide, Corinna Gries, Susanne Grossman-Clarke, Paul Hanson, Margaret O’Brien, Mark Servilla, Colin Smith, Robert Waide, and Kyle Zollo-Venecek. 2022. “Publishing Ecological Data in a Repository: An Easy Workflow for Everyone.” The Bulletin of the Ecological Society of America 103 (4): e2018. https://doi.org/10.1002/bes2.2018.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wiley. 2022. “Wiley’s Data Sharing Policies.” https://authorservices.wiley.com/author-resources/Journal-Authors/open-access/data-sharing-citation/data-sharing-policy.html.

Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18.