NIH Common Fund Metabolomics Data Sharing Plan

Fiehn, Oliver (fiehn@ucdavis.edu)
Subramaniam, Shankar (shankar@sdsc.edu)

The main purpose of this document is to serve as a data sharing plan for all grants and projects funded by the Common Fund Metabolomics Program, applicable across the consortium. Readers seeking guidance for NIH-funded grants more broadly are advised to visit the links at the end of this document.

Introduction

Data sharing is essential for enhanced utilization of research results for translation into knowledge, products, and procedures to improve human health. Since 1996, NIH has required data sharing in areas such as DNA sequences, mapping information and crystallographic coordinates. Protein and DNA sequences are made available to researchers through public data archives, such as GenBank or the Gene Expression Omnibus. Data sharing allows the scientific community to validate research findings, create new data sets by combining data from multiple sources and explore topics not envisioned by the investigators who generated the initial data set. With the creation of the Metabolomics Data Repository managed by the Data Repository and Coordination Center (DRCC), the NIH acknowledges the importance of data sharing for metabolomics.

Metabolomics represents the systematic study of low molecular weight molecules found in a biological sample, providing a "snapshot" of the current and actual state of the cell or organism at a specific point in time. Thus, the metabolome represents the functional activity of biological systems. As with other ‘omics’, metabolites are conserved across animals, plants and microbial species, facilitating the extrapolation of research findings in laboratory animals to humans. Common technologies for measuring the metabolome include mass spectrometry (MS) and nuclear magnetic resonance spectroscopy (NMR), which can measure hundreds to thousands of unique chemical entities.

Data sharing in metabolomics will include primary raw data and the biological and analytical meta-data necessary to interpret these data. Through cooperation between investigators, metabolomics laboratories and data coordinating centers, these data sets should provide a rich resource for the research community to enhance preclinical, clinical and translational research.

Rules and requirements

  1. Research projects that require data sharing:
    All research projects supported by the NIH Common Fund Metabolomics Program that create metabolomics data as part of the funded research will be required to submit data and metadata to the Metabolomics Data Repository managed by the Data Repository and Coordination Center (DRCC). Other NIH-funded projects that generate metabolomics data are strongly encouraged to submit data to the DRCC.
  2. Direct cost limits:
    Data sharing is required for all research projects funded by the Common Fund Metabolomics Program, independent of funding amount and direct costs. Applicants may request reasonable funds within project proposals to cover the cost of local storage, data curation efforts and upload to the DRCC. Data sharing funds should be requested in the budget and budget justification section of the grant or project proposal. The Regional Comprehensive Metabolomics Research Cores (RCMRCs) and the DRCC will be available for consultation with the PI and can provide cost estimates based on the scope of the project.
  3. Exemptions from data sharing:
    Investigators are expected to include a data-sharing plan in their application stating how they will share the data or, if they cannot share the data, a clear explanation of why the data cannot be shared. Exemptions are not granted automatically. Applicants are required to contact NIH Common Fund Metabolomics Program staff prior to submission if a request for withholding data is included in an application.
  4. Start date for data sharing:
    Data sharing as described in this plan applies to all grants supported by the NIH Common Fund Metabolomics Program from the beginning of the grant award.
  5. Definition of metabolomics data to be shared:
    Data to be shared includes four general data types: (1) the raw data generated by the metabolomics laboratory, (2) the analytical metadata, (3) the associated biological and clinical data in compliance with HIPAA guidelines and (4) the final result matrix with quantitative or semi-quantitative metabolite values and appropriate substance identifiers.
    5.1) Raw data includes:
      • The spectrometric, spectrographic and chromatographic data as created by the instrument software.
      • A description of the platform and vendor software version used to generate and analyze raw data files.

      An open exchange format submission is encouraged, as long as the raw data and exchange format contain the same level of information. File names should use identifiers that can be linked to the final result matrix of an experiment.

    5.2) Analytical metadata includes:
      • Details on how samples were obtained at the biological or clinical laboratory.
      • Sample storage conditions.
      • Sample preparation and extraction protocols.
      • Analytical methods, including the instrument used, described in enough detail to allow for an independent replication of the experiment.
    5.3) Biological metadata includes:
      • The taxonomic definition of species, organs, cell types or cell line information that was used in in-vivo and in-vitro experiments.
      • Submission of more detailed metadata is highly encouraged, including animal husbandry, dietary information, or important human subject metadata such as age, body mass index, gender, co-morbidities, fasting state, medication and other anonymized information.
      • All biological and clinical metadata should adhere to HIPAA regulations ensuring patient confidentiality (see point 6 below).
    5.4) The final result matrix (metabolites x sample ID) includes a list of all known and, where appropriate, unknown metabolites for each given experimental sample.
      • The final data results matrix must contain the same local sample identifiers specified in the accompanying metadata file(s) and raw data file(s) in order to ensure an unambiguous relationship between experimental metadata, results and raw data.
      • The results matrix may consist of measurements for known and/or unknown (unidentified) metabolites.
      • In the case of known metabolites, the InChIKey and/or PubChem compound ID (if these are available) should be provided. Other compound identifiers (e.g. KEGG ID, ChemSpider ID) will be translated to the corresponding InChIKey and PubChem compound ID using the UC Davis Chemical Translation Service.
      • In the case of unknown metabolites, the local identifier and other annotations such as measured m/z value, retention index and type should be provided in order to track these metabolites across different experiments.
      • The results matrix must contain the units of measurement (pmol/ml, ng/sample, MS peak height, MS peak area, etc).
  6. Privacy issues including IRB and HIPAA compliance:
    It is the responsibility of the investigators, their Institutional Review Boards (IRB), and their institutions to protect the rights of participants and the confidentiality of their data. The federal Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule requires that data be free of identifiers that would permit linkages to individual research participants, and exclude variables that could lead to deductive disclosure of the identity of individual subjects (http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html). Effective strategies should be adopted to minimize the risk of disclosing a participant's identity. No data that allows for disclosure of a patient's identity should be submitted to the Metabolomics Data Repository managed by the Data Repository and Coordination Center (DRCC).
  7. Embargo times:
    Investigators who collect the data have a legitimate interest in benefiting from their investment of time and effort. Data should be shared no later than the acceptance for first publication of the findings from the data set. The embargo time for data sharing will expire one year from the end of the active grant. In the case of grant renewal, the data sharing embargo time will expire with the end of the funding period for the renewed grant. PIs can release data in a tiered manner, but all data must be released to the public no later than the expiration of the embargo time.
  8. Non-compliance:
    Non-compliance with the data sharing requirements may result in delays in funding for the non-competing segment of grant awards funded by the Common Fund Metabolomics Program.
  9. Amendments and revisions to the Data Sharing plan:
    The data sharing policies are subject to revision based on new data sharing policies being generated by NIH and OSTP.
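
The identifier requirements in items 5.1 and 5.4 above (raw file names, metadata and the results matrix must all refer to the same local sample identifiers) can be checked before upload. The following is a minimal sketch only, not a DRCC tool. It assumes, purely for illustration, that each raw file is named after its sample identifier (for example `S01.mzML`) and that the identifiers from the metadata file and the results matrix have already been read into lists; the function name `check_sample_ids` is hypothetical.

```python
from pathlib import PurePath


def check_sample_ids(metadata_ids, raw_file_names, matrix_ids):
    """Report sample identifiers that do not line up across the submission.

    metadata_ids   -- sample identifiers listed in the metadata file(s)
    raw_file_names -- raw data file names; ASSUMPTION: the file name without
                      its extension is the sample identifier
    matrix_ids     -- sample identifiers used in the results matrix
    """
    metadata_ids = set(metadata_ids)
    matrix_ids = set(matrix_ids)
    # Strip the extension from each raw file name to recover its identifier.
    raw_ids = {PurePath(name).stem for name in raw_file_names}
    all_ids = metadata_ids | matrix_ids

    return {
        "missing_from_matrix": sorted(metadata_ids - matrix_ids),
        "missing_from_metadata": sorted(matrix_ids - metadata_ids),
        "missing_raw_file": sorted(all_ids - raw_ids),
    }


# Example call with made-up identifiers.
report = check_sample_ids(
    metadata_ids=["S01", "S02", "S03"],
    raw_file_names=["S01.mzML", "S02.mzML"],
    matrix_ids=["S01", "S02", "S04"],
)
```

An empty list under every key means the three parts of the submission refer to the same set of samples. Laboratories whose file names embed the identifier differently would need to adapt the line that derives `raw_ids`.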

Frequently asked questions

The frequently asked questions (FAQs) section covers questions that may or may not be covered in the Data Sharing Document text.

Q: When will the metabolomics data sharing policy be put in place?
A: Data sharing under the specific guidelines of this plan will become a mandatory requirement for all Common Fund metabolomics research projects starting October 1, 2013.

Q: I have an NIH study that was active before the NIH data sharing policy was put in place. Do I need to submit data from such studies?
A: No, only data from studies funded by the NIH Common Fund Metabolomics Program need to be submitted to the Metabolomics Data Repository located at the DRCC. However, older data sets are welcome, and investigators are highly encouraged to submit the data whenever possible.

Q: Are paid or commercial services obtained from a metabolomics core laboratory covered by the data sharing policy?
A: Paid metabolomics core laboratory services are not covered per se. However, if the principal investigator is paying with Common Fund Metabolomics Program funds, or the costs of the project are offset in part by the Common Fund Metabolomics Program, it is the PI's responsibility to submit such data to the metabolomics data center.

Q: Is the metabolomics core laboratory director responsible for submitting data to the data center?
A: No, it is the principal investigator’s responsibility to send data to the metabolomics data center. However, investigators may have the NIH Common Fund Regional Comprehensive Metabolomics Research Cores (RCMRCs) submit files on their behalf.

Q: I obtained metabolomics funding from another federal agency such as NSF, DOE, EPA. Am I required to submit data to the Metabolomics Data Repository at DRCC?
A: No, only studies funded by the NIH Common Fund Metabolomics Program need to comply, but submission of any metabolomics data that is relevant to biomedical research and satisfies the data deposition requirements of the Metabolomics Data Repository is highly encouraged.

Q: I am a metabolomics researcher from another country. Can I submit metabolomics data?
A: Yes, any metabolomics data of relevance to human health is welcome.

Q: What is the license of the data once it is publicly available?
A: After the embargo date the data will be in the public domain.
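
The embargo date follows from the timing rules in item 7 of the rules above. The following is a minimal sketch of one reading of those rules, not an official calculation: the embargo expires one year after the end of the active grant, or with the end of the renewed funding period if the grant is renewed, and data should be shared no later than acceptance of the first publication. The function names and parameters are hypothetical.

```python
from datetime import date


def add_one_year(d):
    """Return the same calendar day one year later."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # 29 February has no counterpart in the following year; use 28 February.
        return d.replace(year=d.year + 1, day=28)


def release_deadline(grant_end, renewed_end=None, publication_accepted=None):
    """Latest public release date under one reading of item 7 (an assumption).

    grant_end            -- end date of the active grant
    renewed_end          -- end of the renewed funding period, if renewed
    publication_accepted -- acceptance date of the first publication, if any
    """
    if renewed_end is not None:
        embargo_expiry = renewed_end
    else:
        embargo_expiry = add_one_year(grant_end)

    # Data should be shared no later than acceptance of the first publication.
    if publication_accepted is not None:
        return min(embargo_expiry, publication_accepted)
    return embargo_expiry


# Example with made-up dates: a grant ending 31 August 2015, no renewal.
deadline = release_deadline(date(2015, 8, 31))
```

Investigators should confirm the applicable dates with program staff rather than rely on a calculation like this one.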

Q: Can I retract data that is incorrect or contains false annotations?
A: Curators from the Metabolomics Data Repository will work with you to resolve such issues.

Q: How can I make sure my data sets are correctly connected to my multiple NIH grants?
A: The Metabolomics Data Repository has a system function that allows selection of active NIH grants based on the NIH RePORTER system. This information will automatically link related studies. Additionally, the principal investigator can update that information on the grant website using the eRA Commons login.

Q: I am an NIH Common Fund Metabolomics Program funded principal investigator (PI) and I collaborate with a clinical/medical principal investigator. The medical PI cannot release any patient data and therefore will not be able to share any data. How can I comply with the data sharing policy?
A: Medical information should not be released to the public or to the metabolomics data repository in a form that allows individuals to be identified. Data should be free of identifiers that would permit linkages to individual research participants, and exclude variables that could lead to deductive disclosure of the identity of individual subjects. When data sharing is limited, applicants should explain such limitations in their data sharing plans. Data needs to be in compliance with Health Insurance Portability and Accountability Act (HIPAA) rules (http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html).

Q: Is patient data safe in the metabolomics data center?
A: The Metabolomics Data Repository will make sure that all deposited data are securely maintained. However, it is the responsibility of the investigator to ensure that the clinical data are free of identifiers that would permit linkages to individual research participants, and exclude variables that could lead to deductive disclosure of the identity of individual subjects. Data needs to be in compliance with the Health Insurance Portability and Accountability Act (HIPAA) privacy rules before it is submitted to the Metabolomics Data Repository (http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html).

Q: I received Pilot and Feasibility funding through the Common Fund Metabolomics Program Regional Comprehensive Metabolomics Research Cores (RCMRCs). Are there any special rules that apply?
A: Yes, all data generated through Pilot and Feasibility funding must be deposited with the metabolomics DRCC. However, the data will be kept in a safe, firewalled zone and will be released after the embargo time is over. RCMRCs will work with the investigators to clarify and resolve any concerns before the data are transferred to the Metabolomics Data Repository.

Q: My data sets are so large it will take weeks to upload all the data.
A: Please talk to the Metabolomics Data Repository curators about sending the data in physical form, such as hard disks or solid state disks (SSDs) (see contact information on the web site, http://www.metabolomicsworkbench.org).

Q: I have data from an older, pre-2012 NIH-funded study. Can I send the data to the metabolomics data center?
A: Yes, submission of any metabolomics data relevant to human health, together with the associated raw data and especially with analytical and biological metadata annotations, is highly encouraged.

Q: How do I pay for preparing data for sharing and archiving?
A: NIH recognizes that it takes time and money to prepare data for sharing. You should request funds for data archiving and sharing as part of your grant application for collecting the data. If you have already collected the data, you may want to ask your NIH Project Officer if a competitive or administrative supplement is suitable for this purpose. You may also contact the Metabolomics Data Repository for assistance in data transfer.

Q: Can I download data from the repository and publish new research based on such data?
A: Yes, as long as the source of the data is correctly cited. That includes the accession number of the experiment as well as associated publications where the data was first released. However, there may be restrictions if you are not willing to share the data with the Metabolomics Data Repository. You are highly encouraged to contact DRCC personnel for more information.

Q: What is meta-data and what should I publish besides the raw data files?
A: Meta-data in metabolomics can be analytical or biological in nature. Meta-data for analytical instrumentation can include the platform type (such as GC-MS, LC-MS or NMR), the instrument and vendor, the experiment type (such as 1H-NMR, COSY or LC-MS/MS) and the protocols for extraction and data processing. Biological meta-data includes taxonomy data, organ and cell type and a minimal description of the experiment, such as drug vs. non-drug treatment or time-course experiments, with appropriate sample size information. The repository will require a minimum meta-data set, and additional data should be submitted as a protocol or standard operating procedure in Word or PDF format. All experiments (other than patient clinical studies) should be completely reproducible, and a complete set of protocols and metadata will ensure this.
We recommend the guidelines of the Metabolomics Society as published below, but the data/metadata requirements may extend beyond what is proposed in these documents. The Metabolomics Society published a set of rules in Metabolomics (ISSN 1573-3890), Volume 3, Number 3, September 2007; http://www.springerlink.com/content/1573-3882/3/3/

Q: Are there any guidelines or examples on how to structure data for submission?
A: See http://www.metabolomicsworkbench.org

Q: How will NIH enforce the data sharing rules?
A: This document mainly provides guidance for NIH Common Fund Metabolomics Program funded grants and projects. Data sharing terms and conditions have already been negotiated with the awardee institutions and investigators at the time of the Notice of Grant Award (NGA). If data sharing requirements are not met, the non-competing award process may be delayed. If investigators seek a waiver for data sharing, a strong and compelling justification will be necessary to explain why the data must be withheld.

Q: Does data sharing pertain only to published data?
A: No. Data-sharing plans should encompass all data from funded research that can be shared without compromising individual subjects' rights and privacy and that would help in analyzing the metabolomics data, regardless of whether the data have been used in a publication. Furthermore, data sharing prior to the publication of major results is encouraged where appropriate, for example when data are collected to provide a resource for the scientific community (as in the case of many large surveys) and may not necessarily lead to any publications. Whenever applicable, raw data from the measurements should also be shared.

Q: How will the quality of the data be evaluated?
A: A quality index that represents missing meta-data or missing annotations will be assigned by curators. Automatic curation tools that can detect missing substance identifiers, missing metadata or broken raw data files are in development.
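
How the quality index is computed is not specified in this document. As a hedged illustration only, a completeness score based on missing metadata could look like the sketch below. The list of required fields, the field names and the scoring formula are all assumptions made for this example and do not describe the repository's curation tools.

```python
# HYPOTHETICAL required fields; the repository defines its own minimum set.
REQUIRED_FIELDS = (
    "species",
    "sample_type",
    "instrument_type",
    "extraction_protocol",
)


def completeness(record, required=REQUIRED_FIELDS):
    """Return (score, missing) for one metadata record.

    score   -- fraction of required fields that have a non-empty value
    missing -- names of the required fields that are absent or empty
    """
    missing = [
        field for field in required
        if not str(record.get(field) or "").strip()
    ]
    score = 1 - len(missing) / len(required)
    return score, missing


# Example with a made-up record: two of the four fields are filled in.
score, missing = completeness({
    "species": "Homo sapiens",
    "sample_type": "plasma",
    "instrument_type": "",
})
```

Running a check of this kind locally before submission lists the fields that still need to be filled in.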

Sources

Resources for data sharing policy guidance for NIH funded grants