INSDC continues to work towards increasing the number of sequences for which the origin of the sample can be precisely located in time and space, by harmonising accurate geographical annotation and date and time of collection information.
In this update, INSDC will elaborate on the plans for the new standards being introduced for spatiotemporal metadata as well as the next steps for implementation.
Technical implementation
Mandatory spatiotemporal data will be captured in pre-existing fields. For sequence flat files, the data will be captured in the source qualifiers: ‘country’ and ‘collection_date’; for BioSamples the data will be captured in country and collection date attributes. BioSample fields, implementation and tooling may differ between partners and the INSDC partners may follow this announcement with individual statements about implementation.
Minimum reporting requirements for these fields are as follows, though further granularity is encouraged:
Location of collection: the locality of isolation of the sequenced sample should be indicated to country level at least and should be provided in terms of political names for nations, oceans or seas using values from the controlled vocabulary at http://www.insdc.org/documents/country-qualifier-vocabulary
Date/time of collection: the date and time at which the specimen was collected should be provided, at least to the nearest year.
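The two minimum requirements above can be sketched as a simple pre-submission check. This is an illustration only, not an INSDC tool. The function name `check_minimum`, the four-entry `EXAMPLE_COUNTRIES` set and the accepted date shapes (YYYY, YYYY-MM, YYYY-MM-DD) are assumptions made for this example. The full controlled vocabulary is at the URL given above, and partners may accept additional date formats.

```python
import re

# Illustrative subset only (assumption): the real controlled vocabulary
# is published at the INSDC URL referenced in the text.
EXAMPLE_COUNTRIES = {"Japan", "United Kingdom", "USA", "Atlantic Ocean"}

# Assumed date shapes for this sketch: YYYY, YYYY-MM or YYYY-MM-DD.
# The year is the minimum granularity required by the announcement.
DATE_PATTERN = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def check_minimum(country: str, collection_date: str) -> list:
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []

    # Assumes finer detail may follow a colon (e.g. "Japan: Shizuoka");
    # only the part before the colon is checked against the vocabulary.
    top_level = country.split(":", 1)[0].strip()
    if top_level not in EXAMPLE_COUNTRIES:
        problems.append(f"country {top_level!r} is not in the example vocabulary")

    if not DATE_PATTERN.fullmatch(collection_date.strip()):
        problems.append(f"collection date {collection_date!r} lacks a valid year")

    return problems


if __name__ == "__main__":
    print(check_minimum("Japan: Shizuoka", "2023"))
    print(check_minimum("Atlantis", "May"))
```

The check deliberately accepts a bare year, since that is the stated minimum. Finer granularity passes as well.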
INSDC recognises that there are valid exemptions from this rule. Accordingly, the INSDC ‘missing value’ reporting standards will be extended with another layer of granularity so that users can report the specific use-cases in which they are unable to provide spatiotemporal metadata. The previous ‘lower-level’ terms ‘not collected’, ‘not provided’ and ‘restricted access’ will be split into use-case specific missing values, which will form a new set of ‘reporting-level’ terms. Users will be encouraged to use these new ‘reporting-level’ INSDC missing value terms going forward. Although other missing values may remain in place for backwards compatibility, partners may discontinue the use of ‘lower-level’ or ‘top-level’ terms in future. The ‘reporting-level’ terms that will be added are listed below:
control sample | Information is not applicable as the sample represents a negative control sample collected in a lab. |
sample group | Information is not applicable as the sample represents a group of samples that do not have a single origin. E.g. for co-assembly or transcriptome assembly. |
synthetic construct | Information does not exist as the sample represents an ab-initio synthetic construct. |
lab stock | Information was not collected as the sample represents a cultured cell line or model organism under long-term lab control. |
third party data | Information does not exist as the metadata was not collected or reported in records predating the 2023 agreement. For use in Third Party data submissions. |
data agreement established pre-2023 | Data agreements were established before introduction of the 2023 INSDC spatiotemporal metadata standard and metadata cannot be provided. A value may be given at a later stage. |
endangered species | Information cannot be reported as the target organism is endangered, e.g. on the IUCN Red List. |
human-identifiable | Information cannot be reported as the metadata would make the sample human-identifiable. |
Users can expect another announcement in a month’s time with an update to the INSDC missing value reporting page at which point these terms will be usable for BioSample registration.
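As a rough illustration of how a submitter's own tooling might distinguish the new ‘reporting-level’ terms from the older ‘lower-level’ ones, a minimal sketch follows. The term strings are taken from the list above. The function name `classify_value`, the returned labels and the case-insensitive matching are assumptions for this example. The exact way each partner expects these terms to be written in a submission is not specified in this announcement.

```python
# Reporting-level terms as listed in this announcement.
REPORTING_LEVEL_TERMS = frozenset({
    "control sample",
    "sample group",
    "synthetic construct",
    "lab stock",
    "third party data",
    "data agreement established pre-2023",
    "endangered species",
    "human-identifiable",
})

# Previous lower-level terms named in this announcement.
LOWER_LEVEL_TERMS = frozenset({
    "not collected",
    "not provided",
    "restricted access",
})


def classify_value(value: str) -> str:
    """Label a field value as 'reporting-level', 'lower-level' or 'value'.

    Matching ignores case and surrounding whitespace (an assumption).
    """
    normalised = value.strip().lower()
    if normalised in REPORTING_LEVEL_TERMS:
        return "reporting-level"
    if normalised in LOWER_LEVEL_TERMS:
        return "lower-level"
    return "value"


if __name__ == "__main__":
    for example in ("lab stock", "not provided", "Japan"):
        print(example, "->", classify_value(example))
```

A tool built along these lines could, for instance, warn when a ‘lower-level’ term is used and suggest choosing one of the more specific terms instead.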
Timeline
The timeline for implementation will be split across two main phases. These phases outline the main milestones at which the new standards will be put in place for different record types. Please note that, between these phases, INSDC partners may also introduce more complex validation to ensure correct usage of the missing values, e.g. cross-referencing ‘model organism’ exceptions against a list of valid taxa, or ensuring that ‘sample group’ declarations contain references to more than one individually registered BioSample.
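The two validation examples mentioned above could look roughly like the sketch below. It is not a description of any partner's actual validation. The function name `exemption_is_plausible`, the three-entry `EXAMPLE_MODEL_TAXA` allow-list, the mapping of the ‘model organism’ check onto the ‘lab stock’ term, and the placeholder accessions are all assumptions made for illustration.

```python
# Hypothetical allow-list (assumption): the actual list of valid taxa
# used by INSDC partners is not given in this announcement.
EXAMPLE_MODEL_TAXA = {
    "Mus musculus",
    "Drosophila melanogaster",
    "Escherichia coli",
}


def exemption_is_plausible(term, organism, referenced_biosamples=()):
    """Apply the two example cross-checks to a declared missing value term."""
    if term == "sample group":
        # A group must point at more than one distinct registered BioSample.
        return len(set(referenced_biosamples)) > 1
    if term == "lab stock":
        # Assumed reading of the 'model organism' check: the organism
        # must appear on the list of valid taxa.
        return organism in EXAMPLE_MODEL_TAXA
    # Other terms are not cross-checked in this sketch.
    return True


if __name__ == "__main__":
    # Placeholder accessions, made up for the example.
    print(exemption_is_plausible("sample group", "metagenome",
                                 ["SAMEA0000001", "SAMEA0000002"]))
    print(exemption_is_plausible("sample group", "metagenome",
                                 ["SAMEA0000001"]))
    print(exemption_is_plausible("lab stock", "Mus musculus"))
```

Duplicate references are collapsed with `set` so that listing the same BioSample twice does not satisfy the "more than one" condition.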
See details of these key phases below.
Phase I – new standard in place for BioSamples by the end of May 2023
It will become mandatory to provide country and collection date metadata for all newly registered BioSamples associated with INSDC data following this date, unless a valid exemption is declared. As a result, all new raw (SRA/ENA/DRA) data and genomes will have associated spatiotemporal metadata in the BioSample.
Phase II – new standard in place for sequences by the end of Dec 2024
It will become mandatory to provide country and collection date metadata for all newly submitted sequence records through any remaining submission routes within 2 years. This includes sequences submitted without BioSample references.
We thank users who have provided feedback so far and encourage further feedback, particularly on whether there are applicable exemptions that are not yet catered for. Please send your feedback to the INSDC member database to which you normally submit:
DDBJ: please email ddbjsub@ddbj.nig.ac.jp
ENA (EMBL-EBI): please email ena-collaborations@ebi.ac.uk
GenBank and SRA (NCBI): please email gb-admin@ncbi.nlm.nih.gov