DPchecker • DPchecker

Install DP checker

You can install DPchecker as part of the NPSdataverse using:

install.packages("devtools")
devtools::install_github("nationalparkservice/NPSdataverse")
library(NPSdataverse)

Check a data package

The entire package

The most common use case for DPchecker is to run a single function, run_congruence_checks() that will run all of the DPchecker tests at once. To do this you will need your fully constructed data package in a single folder consisting of: * EML-formatted metadata with a file a name that ends in _metadata.xml * UTF-8 encoded .csv files You will also need to supply the path to your data package. If you are using Rstudio and have started a new project, you can put the data package folder in your Rproject folder and tell R where to find it:

run_congruence_checks("my_data_package_folder")

If your data package is somewhere else on your hard drive, you will have to describe the path to the data package folder. In this example, the data package folder is a folder called “nps_data” located in the Downloads folder (and “username” would be your username):

dp <- "C:/Users/username/Downloads/my_data_package_folder"
run_congruence_checks(dp)

Metadata only

In some cases, you may want to check just the EML metadata file for completeness without checking whether it properly coincides with the data files (perhaps you are trouble shooting a metadata issue or were sent just the metadata file to check). In that case, you can restrict the run_congruence_checks() function to just check metadata elements:

# In this case "my_data_package_folder" need only contain the metadata file but could include .csvs
dp <- "C:/Users/username/Downloads/my_data_package_folder"
run_congruence_checks(dp, check_metadata_only = TRUE)

Generate a log file

If you want to generate a log file of the run_congruence_checks() results you can do so. The log file may be useful for collaborating on trouble shooting or may simply be handy for your records. Log files should not be included in the data package upload. The log file will be written to the directory of your Rproject by default, but you can also specify the directory it should be saved to.

# save log file to current working directory:
run_congruence_checks(dp, output_filename = "congruence_log_YYYY-MM-DD")

# save the log file to another directory:
save_here <- "C:/Users/username/Documents"
run_congruence_checks(dp, output_filename = "congruence_log_YYYY-MM-DD", output_dir = save_here)

Interpreting results

DPchecker tests are designed to help data package creators produce high quality, complete data packages that can fully leverage DataStore’s ability to ingest machine-readable metadata, and be maximally useful to downstream data users. The same set of tests will also be useful for data package reviewers.

Passing a test is indicated with a green check mark ( $\checkmark$ ). When a test fails, it may fail with an error (a red $\times$ ) or a warning (a yellow exclamation mark, !).

Errors must be addressed prior to upload. Please modify your data package so that DPchecker does not return any errors.

Warnings are helpful indications that the data package creator may want to look into something. It may not be wrong, but it might be unusual. For instance, if a data package lacked taxonomic or geographic coverage it would fail the taxonomic or geographic coverage test with a warning because while lacking taxonomy or geography is unusual, it may not be incorrect. Warnings may also be used to alert data package creators of best practices - for instance if an abstract is less than 20 words long the test will produce a warning suggesting the data package creator consider writing a more informative abstract.

Tests conducted

DPchecker v0.3.2 and above runs two types of tests: metadata only tests and tests to determine whether the metadata and data files are congruent. Metadata tests can be broken down in to two sub-categories, metadata compliance and metadata completeness. The tests run and the order in which they are run are listed below.

Metadata compliance

These tests determine whether the metadata is schema valid and adheres to some rules for data packages. They only require the *_metadata.xml file to run and do not require that the data files be present. These include:

The metadata file is schema valid (test_validate_schema())
Each filename is used exactly once in the metadata (test_dup_meta_entries())
The version of EML is supported (test_metadata_version()
Metadata indicates that Each data file has a single-character field delimiter (test_delimiter())
Metadata indicates that each data file contains exactly one header row (test_header_num())
Metadata indicates that data files do not have footers (test_footer())
Metadata contains taxonomic coverage element (test_taxonomic_cov())
Metadata contains geographic coverage element (test_geographic_cov())
Metadata contain NPS content unit links (test_content_units())
Metadata contains a Digital Object Identifier (DOI) (test_doi())
Metadata DOI is properly formatted (test_doi_format())
Metadata contains URLs for each data table (test_datatable_urls)
Metadata URLs are properly formatted and correspond to the DOI indicated in the metadata (test_datatable_urls_doi)
Metadata contains a publisher element (test_publisher())
Metadata indicates data column names begin with a letter and do not contain spaces or special characters (test_valid_fieldnames())
Metadata indicates that file names being with a letter and do not contain special characters or spaces. (test_valid_filenames())
Metadata contains emails, but only .gov emails (test_pii_meta_emails())

EML elements required for DataStore:

These tests ensure that the EML elements necessary for DataStore to properly extract metadata and populate a reference exist, are in the correct location, and are properly formatted. These elements are also often the aspects of metadata that will be passed on to other repositories and search engines such as DataCite data.gov and google’s dataset search. Therefore, these checks may throw warnings with suggestions on best practices - such as removing stray characters from abstracts or suggesting a more informative title if your title is unusually short. Required EML element tests only require the *_metadata.xml file to run and do not require that the data files be present.

Creator element exists and if individual creators exist, they all have valid (<3 words) surNames (test_creator())
Publication date is present and in the correct ISO-8601 format (test_pub_date())
Data package title is present in metadata (test_dp_title())
Data package metadata contains at least one keyword (test_keywords())
Metadata states data was created by or for NPS (test_by_for_nps())
Metadata indicates the publisher is the National Park Service (test_publisher_name())
Metadata indicates the publisher state is CO (test_publisher_state())
Metadata indicates the publisher city is Fort Collins (test_publisher_city())
Metadata contains a well formatted abstract for the data package (test_dp_abstract())
Metadata contains a well formatted methods section for the data package (test_methods())
All dataTables listed in metadata have a unique file description (test_file_descript())
Metadata contains a valid CUI code (test_cui_dissemination())
Metadata contains a valid license name (test_license())
Metadata contains an Intellectual Rights statement (test_int_rights()
All attributes listed in metadata have attribute definitions (test_attribute_defs())
All attributes listed in metadata have storage types associated with them (test_storage_type())
All attribute storage types are valid values (test_storage_type())

Recommended EML elements

These elements aren’t required. If they are missing, the tests will generate a warning that you can choose to ignore. However, if you have included these elements, please resolve any errors before submitting the data package.

All individual Creators have an ORCiD associated with them (test_orcid_exists())
All ORCiDs are properly formatted (test_orcid_format())
All ORCiDs resolve to an ORCiD profile (test_orcid_resolves())
All ORCiDs resolve to an ORCiD profile that matches the Creator’s last name (test_orcid_match())
The metadata contains a well formatted additionalInfo (“Notes” on DataStore) section (test_notes())
The metadata contains a DataStore Project reference in “projects”(test_project())

Metadata and Data Congruence

These functions check to make sure the values and fields in the metadata file accurately corresponds to the data files supplied. These test require the entire data package - both the *_metadata.xml file and all data files (*.csv) must be present.

All data files are listed in metadata and all metadata file names refer to data files (test_file_name_match())
All columns in data match all columns in metadata (test_fields_match())
All NAs (missing data) are properly accounted for in metadata (test_missing_data())
Columns indicated as numeric in metadata contain only numeric values and missing value codes in the data (test_numeric_fields())
Columns indicated as dates in metadata have matching date formats in the metadata and the data. This checks each cell in each date column against the format provided in the metadata and so can take some time for larger data packages (test_dates_parse())
Columns indicated as dates in metadata contain values that fall within the stated temporal coverage in metadata (test_date_range())

Data and Metadata Compliance

These functions check the data and metadata files for compliance. Please resolve any errors before uploading your data.

Data files do not contain any email addresses that constitute personally identifiable information (PII) (test_pii_data_emails()