DPchecker
Install DPchecker
You can install DPchecker as part of the NPSdataverse using:
install.packages("devtools")
devtools::install_github("nationalparkservice/NPSdataverse")
library(NPSdataverse)
Check a data package
The entire package
The most common use case for DPchecker is to run a single function, run_congruence_checks(), that runs all of the DPchecker tests at once. To do this you will need your fully constructed data package in a single folder consisting of:
* EML-formatted metadata with a file name that ends in _metadata.xml
* UTF-8 encoded .csv files
You will also need to supply the path to your data package. If you are using RStudio and have started a new project, you can put the data package folder in your R project folder and tell R where to find it:
run_congruence_checks("my_data_package_folder")
If your data package is somewhere else on your hard drive, you will have to give the path to the data package folder. In this example, the data package folder is called “my_data_package_folder” and is located in the Downloads folder (“username” would be your username):
dp<-"C:/Users/username/Downloads/my_data_package_folder"
run_congruence_checks(dp)
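Before running the checks, it can help to confirm that the folder has the layout described above. The following is a minimal sketch in base R and is not part of DPchecker: the helper name check_package_layout() is hypothetical, it looks only at file names (not at encoding or EML validity), and the expectation of exactly one metadata file is an assumption.

```r
# Hypothetical helper (not a DPchecker function): report whether a folder
# looks like a data package, judging only by file names.
check_package_layout <- function(dir) {
  files <- list.files(dir)
  metadata <- grep("_metadata\\.xml$", files, value = TRUE)
  csvs <- grep("\\.csv$", files, value = TRUE, ignore.case = TRUE)
  list(
    metadata_files = metadata,
    csv_files = csvs,
    # Assumption: exactly one metadata file is expected per package
    looks_ok = length(metadata) == 1 && length(csvs) >= 1
  )
}

# Build the path from its parts so the separators are handled for you
dp <- file.path("C:/Users/username/Downloads", "my_data_package_folder")
layout <- check_package_layout(dp)
layout$looks_ok
```

If looks_ok is FALSE, inspecting layout$metadata_files and layout$csv_files shows which piece is missing before any DPchecker test is run.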
Metadata only
In some cases, you may want to check just the EML metadata file for completeness without checking whether it is congruent with the data files (perhaps you are troubleshooting a metadata issue or were sent just the metadata file to check). In that case, you can restrict the run_congruence_checks() function to check only the metadata elements:
# In this case "my_data_package_folder" need only contain the metadata file but could include .csvs
dp<-"C:/Users/username/Downloads/my_data_package_folder"
run_congruence_checks(dp, check_metadata_only = TRUE)
Generate a log file
If you want to generate a log file of the run_congruence_checks() results, you can do so. The log file may be useful for collaborating on troubleshooting or may simply be handy for your records. Log files should not be included in the data package upload. By default the log file is written to the directory of your R project (the current working directory), but you can also specify the directory it should be saved to.
# save log file to current working directory:
run_congruence_checks(dp, output_filename = "congruence_log_YYYY-MM-DD")
# save the log file to another directory:
save_here <- "C:/Users/username/Documents"
run_congruence_checks(dp, output_filename = "congruence_log_YYYY-MM-DD", output_dir = save_here)
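The YYYY-MM-DD in the file names above is a placeholder. One way to fill it in is to build the name from today's date with base R. This is only a sketch; the run_congruence_checks() call is left commented out because it needs DPchecker and a real data package.

```r
# Build a log file name that carries today's date,
# in the form "congruence_log_YYYY-MM-DD"
log_name <- paste0("congruence_log_", format(Sys.Date(), "%Y-%m-%d"))
log_name

# Then pass it on, using the same arguments as in the examples above:
# run_congruence_checks(dp, output_filename = log_name, output_dir = save_here)
```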
Interpreting results
DPchecker tests are designed to help data package creators produce high quality, complete data packages that can fully leverage DataStore’s ability to ingest machine-readable metadata, and be maximally useful to downstream data users. The same set of tests will also be useful for data package reviewers.
Passing a test is indicated with a green check mark (✔). When a test fails, it may fail with an error (a red ✖) or a warning (a yellow exclamation mark, !).
Errors must be addressed prior to upload. Please modify your data package so that DPchecker does not return any errors.
Warnings indicate that the data package creator may want to look into something. It may not be wrong, but it might be unusual. For instance, if a data package lacks taxonomic or geographic coverage, it will fail the taxonomic or geographic coverage test with a warning because, while lacking taxonomy or geography is unusual, it may not be incorrect. Warnings may also be used to alert data package creators to best practices. For instance, if an abstract is fewer than 20 words long, the test will produce a warning suggesting the data package creator consider writing a more informative abstract.
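To make the difference between a pass, a warning, and an error concrete, here is a small base R sketch of the abstract-length example. It is illustrative only: abstract_status() is a hypothetical function, DPchecker's own test may count words differently, and treating an empty abstract as an error is an assumption.

```r
# Illustrative only: classify an abstract the way the text describes,
# with a warning (not an error) when it is shorter than 20 words.
abstract_status <- function(abstract, min_words = 20) {
  words <- strsplit(trimws(abstract), "\\s+")[[1]]
  n_words <- sum(nzchar(words))
  if (n_words == 0) {
    "error"      # assumption: a missing abstract is treated as an error
  } else if (n_words < min_words) {
    "warning"    # present but short: worth a second look
  } else {
    "pass"
  }
}

abstract_status("Bird counts from 2019.")
```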
Tests conducted
DPchecker v0.3.2 and above runs two types of tests: metadata-only tests and tests to determine whether the metadata and data files are congruent. Metadata tests can be broken down into two sub-categories, metadata compliance and metadata completeness. The tests run and the order in which they are run are listed below.
Metadata compliance
These tests determine whether the metadata is schema valid and adheres to some rules for data packages. They only require the *_metadata.xml file to run and do not require that the data files be present. These include:
- The metadata file is schema valid (test_validate_schema())
- Each filename is used exactly once in the metadata (test_dup_meta_entries())
- The version of EML is supported (test_metadata_version())
- Metadata indicates that each data file has a single-character field delimiter (test_delimiter())
- Metadata indicates that each data file contains exactly one header row (test_header_num())
- Metadata indicates that data files do not have footers (test_footer())
- Metadata contains taxonomic coverage element (test_taxonomic_cov())
- Metadata contains geographic coverage element (test_geographic_cov())
- Metadata contains a Digital Object Identifier (DOI) (test_doi())
- Metadata DOI is properly formatted (test_doi_format())
- Metadata contains URLs for each data table (test_datatable_urls())
- Metadata URLs are properly formatted and correspond to the DOI indicated in the metadata (test_datatable_urls_doi())
- Metadata contains a publisher element (test_publisher())
- Metadata indicates data column names begin with a letter and do not contain spaces or special characters (test_valid_fieldnames())
- Metadata indicates that file names begin with a letter and do not contain special characters or spaces (test_valid_filenames())
- Metadata contains no email addresses other than .gov addresses (test_pii_meta_emails())
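The naming rule for data columns (begin with a letter, no spaces or special characters) can be tried out on your own column names with a regular expression. This is a sketch in base R, not DPchecker's implementation; is_valid_name() is a hypothetical helper, and allowing underscores is an assumption made for the example.

```r
# Illustrative pattern: starts with a letter, then only letters, digits,
# or underscores. Allowing underscores is an assumption made for this sketch.
is_valid_name <- function(x) grepl("^[A-Za-z][A-Za-z0-9_]*$", x)

cols <- c("site_id", "2019_count", "air temp", "pH", "depth(m)")
data.frame(column = cols, valid = is_valid_name(cols))
```

In this example only site_id and pH satisfy the pattern; the others start with a digit, contain a space, or contain parentheses.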
EML elements required for DataStore
These tests ensure that the EML elements necessary for DataStore to properly extract metadata and populate a reference exist, are in the correct location, and are properly formatted. These elements are also often the aspects of metadata that will be passed on to other repositories and search engines such as DataCite, data.gov, and Google’s Dataset Search. For that reason, these checks may throw warnings with suggestions on best practices, such as removing stray characters from abstracts or choosing a more informative title if your title is unusually short. Required EML element tests only require the *_metadata.xml file to run and do not require that the data files be present.
- Creator element exists and, if individual creators exist, they all have valid (<3 words) surNames (test_creator())
- Publication date is present and in the correct ISO-8601 format (test_pub_date())
- Data package title is present in metadata (test_dp_title())
- Data package metadata contains at least one keyword (test_keywords())
- Metadata states data was created by or for NPS (test_by_for_nps())
- Metadata indicates the publisher is the National Park Service (test_publisher_name())
- Metadata indicates the publisher state is CO (test_publisher_state())
- Metadata indicates the publisher city is Fort Collins (test_publisher_city())
- Metadata contains a well formatted abstract for the data package (test_dp_abstract())
- Metadata contains a well formatted methods section for the data package (test_methods())
- All dataTables listed in metadata have a unique file description (test_file_descript())
- Metadata contains a valid CUI code (test_cui_dissemination())
- Metadata contains a valid license name (test_license())
- Metadata contains an Intellectual Rights statement (test_int_rights())
- All attributes listed in metadata have attribute definitions (test_attribute_defs())
- All attributes listed in metadata have storage types associated with them (test_storage_type())
- All attribute storage types are valid values (test_storage_type())
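As an illustration of the publication date requirement, the sketch below checks whether a string is a real calendar date written as YYYY-MM-DD. It is a base R example, not DPchecker's code; is_iso_date() is a hypothetical helper, and accepting only the YYYY-MM-DD form of ISO-8601 is an assumption.

```r
# Illustrative check for a calendar date written as YYYY-MM-DD.
# ISO 8601 allows other forms too; this sketch only accepts that one.
is_iso_date <- function(x) {
  shaped <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  # as.Date() returns NA for impossible dates such as 2023-02-30
  parsed <- !is.na(as.Date(x, format = "%Y-%m-%d"))
  shaped & parsed
}

is_iso_date(c("2023-07-04", "07/04/2023", "2023-02-30"))
```

Only the first value passes: the second is not in year-month-day order and the third is not a real date.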
Recommended EML elements
These elements aren’t required. If they are missing, the tests will generate a warning that you can choose to ignore. However, if you have included these elements, please resolve any errors before submitting the data package.
- All individual Creators have an ORCiD associated with them (test_orcid_exists())
- All ORCiDs are properly formatted (test_orcid_format())
- All ORCiDs resolve to an ORCiD profile (test_orcid_resolves())
- All ORCiDs resolve to an ORCiD profile that matches the Creator’s last name (test_orcid_match())
- The metadata contains a well formatted additionalInfo (“Notes” on DataStore) section (test_notes())
- The metadata contains a DataStore Project reference in “projects” (test_project())
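For a quick look at what a properly shaped ORCiD looks like, the sketch below tests for four groups of four characters separated by hyphens, where the last character may be a digit or X. It is a base R illustration with a hypothetical helper name. Whether DPchecker expects the bare identifier or the full https://orcid.org/ URL is not stated here, so the sketch accepts both; it does not verify the check digit or that the profile exists.

```r
# Illustrative shape check only (no checksum, no lookup).
# Assumption: both the bare iD and the https://orcid.org/ form are accepted.
is_orcid_shaped <- function(x) {
  grepl(
    "^(https://orcid\\.org/)?[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$",
    x
  )
}

is_orcid_shaped(c(
  "0000-0002-1825-0097",
  "https://orcid.org/0000-0002-1825-0097",
  "0000-0002-1825-97"
))
```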
Metadata and Data Congruence
These functions check that the values and fields in the metadata file accurately correspond to the data files supplied. These tests require the entire data package: both the *_metadata.xml file and all data files (*.csv) must be present.
- All data files are listed in metadata and all metadata file names refer to data files (test_file_name_match())
- All columns in data match all columns in metadata (test_fields_match())
- All NAs (missing data) are properly accounted for in metadata (test_missing_data())
- Columns indicated as numeric in metadata contain only numeric values and missing value codes in the data (test_numeric_fields())
- Columns indicated as dates in metadata have matching date formats in the metadata and the data (test_dates_parse()). This checks each cell in each date column against the format provided in the metadata and so can take some time for larger data packages
- Columns indicated as dates in metadata contain values that fall within the stated temporal coverage in metadata (test_date_range())
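The numeric-column rule above can be approximated in base R when you want to find the offending column yourself. This is a sketch under stated assumptions, not the logic of test_numeric_fields(): is_numeric_column() is a hypothetical helper, and the missing value code "NR" is made up for the example.

```r
# Illustrative check: after dropping the declared missing value codes,
# does every remaining value convert to a number?
is_numeric_column <- function(values, missing_codes = "NA") {
  values <- as.character(values)
  kept <- values[!(is.na(values) | values %in% missing_codes)]
  # as.numeric() yields NA (with a warning) for text that is not a number
  converted <- suppressWarnings(as.numeric(kept))
  !anyNA(converted)
}

depth <- c("1.5", "2", "NR", "shallow")   # "NR" is a hypothetical missing code
is_numeric_column(depth, missing_codes = "NR")        # "shallow" is not numeric
is_numeric_column(depth[1:3], missing_codes = "NR")   # only numbers and the code
```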
Data and Metadata Compliance
These functions check the data and metadata files for compliance. Please resolve any errors before uploading your data.
- Data files do not contain any email addresses that constitute personally identifiable information (PII) (test_pii_data_emails())
- Metadata do not contain GPS coordinates if the data package is not public (test_public_points())
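If the email test reports a problem, a simple pattern search can help locate the addresses. The sketch below is base R and illustrative only: find_emails() and the pattern are hypothetical, the pattern is deliberately loose, and singling out non-.gov addresses is an assumption borrowed from the metadata email test listed earlier.

```r
# Illustrative pattern for text that looks like an email address
email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# Return every match found in a character vector
find_emails <- function(x) {
  unlist(regmatches(x, gregexpr(email_pattern, x)))
}

cells <- c("contact jane_doe@nps.gov", "site 12", "j.smith@example.com; ok")
found <- find_emails(cells)
found

# Addresses that do not end in .gov (assumed to be the ones to review)
found[!grepl("\\.gov$", found, ignore.case = TRUE)]
```

To scan a data file, the same function could be applied to each column of the data frame returned by read.csv().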