

DataDictionary, August 11, 2011

ClinVar Data Dictionary

This document defines the data elements represented in the ClinVar database. The document includes descriptions of how data are managed, the XML used to represent each concept, the field name in the spreadsheet version of the submission document, and allowed values. Not all values need be submitted; some will be reported based on information in NCBI’s databases.

Most elements in the database are characterized with respect to the submitter, identifiers used by the submitter, date submitted, date modified, validity status, review status, and whether the data should be public or private. Rather than repeating these elements for each data category defined below, the word Source/Status will be used as a pointer to the Data source/Status section, where the source and status elements are


Status of this document

Draft and incomplete. For discussion only.

General data elements used in multiple contexts


Many concepts in the database are represented by what ClinVar terms an attribute, which is an open-ended structure providing the equivalent of a type of information, the value for that data type, submitter(s) of that attribute, free text comment(s) describing that attribute, and citation(s) related to that attribute. Rather than repeating this description per attribute, the word AttributeSet will be used to indicate that the data are stored using this data structure, with the attribute types expected for that database
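As an illustration only, an AttributeSet could be modeled along the following lines. This is a minimal Python sketch; the class name Attribute, its field names, and the example values are hypothetical and are not taken from the ClinVar XSD or database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    """One entry in an AttributeSet (hypothetical field names)."""
    attr_type: str   # the type of information, e.g. "preferred name"
    value: str       # the value for that type
    submitters: List[str] = field(default_factory=list)  # submitter(s) of the attribute
    comments: List[str] = field(default_factory=list)    # free text comment(s)
    citations: List[str] = field(default_factory=list)   # related citation(s)


def attributes_of_type(attribute_set: List[Attribute], attr_type: str) -> List[Attribute]:
    """Return every attribute in the set that has the given type."""
    return [a for a in attribute_set if a.attr_type == attr_type]


# Illustrative values only: one preferred and one alternate name for a trait.
trait_names = [
    Attribute("preferred name", "Cystic fibrosis", submitters=["Lab A"]),
    Attribute("alternate name", "Mucoviscidosis"),
]
```

Keeping the type as data rather than as a separate field per concept is what makes the structure open-ended: a new attribute type needs a new allowed value, not a new column or element.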



Note: To decrease the reporting burden, methods used to capture clinical data for a submitter shall be shared between ClinVar and the Genetic Testing Registry (GTR). This section summarizes the elements that will be captured to describe an assay method. We recognize that not all will apply to the approaches enumerated in the Evidence section; rules to enforce logical consistency will be applied at the database level and not always in the XSD. More than one method may be submitted per observation. An example would be one method for primary data collection and another for validation. A set of methods can also be submitted to apply to the set of reported


Description (optional): free text describing the method

XML: Method.Description

db: GTR.clinvar.method.description

platform type (e.g. next-gen)

XML: Method.TypePlatform

db: GTR.clinvar.method.method_type

platform name (e.g. Illumina Hi-Seq)

XML: Method.NamePlatform

db: GTR.clinvar.method.platform

Confidence measures (optional, incomplete, will allow user-supplied

method/value pairs)

Operationally, the methods used to assess confidence are themselves methods.

The following types of information may be in scope:

o coverage after removal of duplicates

o quality score of base call or variation type (e.g. are there co-occurring independent single substitutions or one multinucleotide variation)

o quality score of mapping

o was the variant detected on both strands?

Purpose (required) (e.g. review|primary assay|validation)

XML: Method.Purpose


o Name


o Version

XML: Method.Software.version

o Purpose (variant calling, alignments, etc.)


XML: Method.Software.purpose

Minimum value reported (optional)


Maximum value reported (optional)


Citations (optional)


Method of validation (analytical validity) (optional)

o Confirmation by independent technologies

Reference standard (optional)

XML: Method.ReferenceStandard

Type (clinical testing, reference population, case-control, curation, in vivo, in


XML: Method.Type

Primary source (data mining, submitter-generated)

//what does this add over curation/clinical testing??

GTR_test_id (optional)

XML: Method.XRef
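The Method elements above might be serialized roughly as follows. This is a sketch under assumptions, not the ClinVar schema: it assumes that a dotted name such as Method.Purpose denotes a child element Purpose inside a Method element, and the function name build_method is hypothetical. Only Purpose is treated as required, as stated above.

```python
import xml.etree.ElementTree as ET


def build_method(purpose, description=None, type_platform=None, name_platform=None):
    """Build a <Method> element; Purpose is required, the rest are optional."""
    if not purpose:
        raise ValueError("Method.Purpose is required")
    method = ET.Element("Method")
    # Assumption: "Method.X" in this dictionary maps to a child element <X>.
    for tag, text in (
        ("Description", description),
        ("TypePlatform", type_platform),
        ("NamePlatform", name_platform),
        ("Purpose", purpose),
    ):
        if text is not None:
            ET.SubElement(method, tag).text = text
    return method


# Example using the values given in this section.
method = build_method("validation", type_platform="next-gen",
                      name_platform="Illumina Hi-Seq")
xml_text = ET.tostring(method, encoding="unicode")
```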


cell line (optional)

XML: ObservedIn.CellLine

origin (required): germline, somatic, uncertain, not determined

XML: ObservedIn.origin

db: clinvar.observation.cell_line

species (required)

XML: ObservedIn.Species

db: clinvar.observation.txid

age ranges (optional)

XML: ObservedIn.Age

country of origin (optional). If multiple, provided as a semicolon-delimited list.


ethnicity (optional)

XML: ObservedIn.Ethnicity

tissue (optional)

XML: ObservedIn.Tissue


strain/breed (optional)

XML: ObservedIn.Strain

Number of individuals (optional)

XML: ObservedIn.NumberTested

Number of males (optional)

XML: ObservedIn.NumberMales

Number of females (optional)

XML: ObservedIn.NumberFemales

Number of families (optional)

XML: ObservedIn.FamilyInfo.NumFamilies

PositiveFamilyHistory (optional)

XML: ObservedIn.FamilyInfo.FamilyHistory

Public pedigree ID (optional)

XML: ObservedIn.FamilyInfo.PedigreeID

Comment (optional): free text describing the sample

XML: ObservedIn.Comment

db: GTR.dbo.comment.comment
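The required and controlled fields of the sample description could be checked as sketched below. The dictionary keys, function names, and error messages are hypothetical; only the required fields (origin, species), the allowed origin values, and the semicolon-delimited country list come from the descriptions above.

```python
ALLOWED_ORIGINS = {"germline", "somatic", "uncertain", "not determined"}
REQUIRED_FIELDS = ("origin", "species")


def validate_sample(sample):
    """Return a list of problems found in a sample record (empty if none)."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if not sample.get(name)]
    origin = sample.get("origin")
    if origin and origin not in ALLOWED_ORIGINS:
        problems.append(f"origin not allowed: {origin}")
    return problems


def split_countries(value):
    """Split a semicolon-delimited country-of-origin string into a list."""
    return [part.strip() for part in value.split(";") if part.strip()]
```

For example, validate_sample({"origin": "germline", "species": "Homo sapiens"}) returns an empty list, while a record with no species is reported as missing a required field.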

Data source

Submitter name

XML: Submitter.PersonName

Submitter affiliation

XML: Submitter.Affiliation

Contact information


Submitter id

XML: Submitter.PersonID

Date submitted

XML: ClinvarSubmissionID.submitterDate

Date updated

Release status

Submitter’s identifier for the record submitted (optional)

XML: ClinvarSubmission.ClinvarSubmissionID.localKey

Record status (preliminary, under review, reviewed)


Comment [d1]: We need a definition

URL to submitter’s record

Public status: public/private


Citations include published articles and URLs. If a database name and identifier are

supplied, the full text is not required.

Source: the name of the data service providing an id


ID: the identifier in that source


URL: complete URL


CitationText: when there is no database ID for the publication
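One possible reading of the citation rule is sketched below: a citation is acceptable if it carries a Source together with an ID, or a URL, or a CitationText. Treating a URL alone as sufficient is an assumption, and the function name is hypothetical; the dictionary keys follow the element names listed above.

```python
def citation_is_complete(citation):
    """Check that a citation can be resolved without further information.

    Assumed rule: Source + ID, or a URL, or CitationText is enough.
    """
    has_db_ref = bool(citation.get("Source")) and bool(citation.get("ID"))
    has_url = bool(citation.get("URL"))
    has_text = bool(citation.get("CitationText"))
    return has_db_ref or has_url or has_text
```

Under this reading, a Source without an ID (or an ID without a Source) does not identify a publication and would need a URL or CitationText alongside it.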



A free text comment can be provided to describe submitted data.

Text: the content of the comment


Type: public (will be rendered on the web) or private (to explain a submission

and be stored in the database but not rendered on the web.)

XML: Comment.Type

Description of one phenotype (trait)


Preferred name

The name of the phenotype used for reporting from ClinVar by default.

When available, this will be a preferred term from SNOMED CT. Other sources

may include Office of Rare Diseases Research (ORDR), Human Phenotype

Ontology (HPO), OMIM®, MeSH. The submitter’s name will be retained, but

mapped to controlled vocabularies.

Required, only one allowed.



AttributeSet: preferred name


Alternate name(s)

Other names used for this phenotype

Optional, multiple allowed


AttributeSet: alternate name


Preferred acronym

The acronym of the phenotype used for reporting from ClinVar by default. This

usually is reported by OMIM.

Optional, only one allowed.


AttributeSet: preferred symbol

XML: trait.symbol.type=preferred

Alternate acronym(s)

Alternate acronyms of the phenotype.

Optional, multiple allowed.


AttributeSet: alternate symbol

XML: trait.symbol.type=alternate
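The cardinality rules stated for phenotype names and acronyms (required or optional, one or many) can be expressed as a small table and checked mechanically. The sketch below is illustrative; the table layout and function name are hypothetical, while the attribute types and limits are those given above.

```python
from collections import Counter

# (minimum, maximum) occurrences per attribute type; None means unbounded.
CARDINALITY = {
    "preferred name": (1, 1),       # required, only one allowed
    "alternate name": (0, None),    # optional, multiple allowed
    "preferred symbol": (0, 1),     # optional, only one allowed
    "alternate symbol": (0, None),  # optional, multiple allowed
}


def check_cardinality(attr_types):
    """Return the attribute types whose counts violate CARDINALITY."""
    counts = Counter(attr_types)
    violations = []
    for attr_type, (low, high) in CARDINALITY.items():
        n = counts[attr_type]
        if n < low or (high is not None and n > high):
            violations.append(attr_type)
    return violations
```

A trait with one preferred name and two alternate names passes; a trait with two preferred symbols and no preferred name fails on both counts.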




A generic structure to capture values assigned to defined information categories. The values can be words, integers, and/or dates. This structure is used heavily for names of different categories and for observations/findings of different types. Types are restricted by an enumerated list of allowed values per major information set. These restrictions may be applied in the XSD, or only in the underlying relational database.

Optional, multiple allowed



XML: trait.attribute.type in (MIM number, public definition, age of onset,

penetrance, severity, increased risk, decreased risk)

NOTE: mode of inheritance will be stored as an attribute of the

genotype/phenotype relationship, not as an attribute of the phenotype itself.

Classification of the phenotype. The current choices are:

Disease: usually a diagnostic term

Drug response: usually constructed as name of drug + resistance or


Blood group: names of blood groups

Finding: for measures/clinical features

XML: trait.type


Relationships among phenotypes

Maintenance of phenotype-phenotype relationships follows the model of UMLS. The

major categories in use by ClinVar are

parent-child, for example when there is locus heterogeneity in a disorder and

the subtype specific to each locus has been assigned a specific term and thus

has an identifier in the database

manifestation of, for example for the relationship between a finding and a

diagnostic name

Description of variant allele(s)


Genomic Location

Cytogenetic: chromosome and band

Computed by NCBI


o Chromosome nucleotide accession and version

o Variation Description (e.g. HGVS)

o Assembly name (e.g. NCBI36, GRCh37.p3)

o Nucleotide position (single if a point, multiple if a range)

o +/- strand

o Is in duplicated region / is there a pseudogene

o Uncertainty as appropriate

o Computed by NCBI

Location relative to a gene

Based on sequence ontology terms and computed per transcript

Computed by NCBI and/or provided by submitter

Transcript nucleotide accession and versions

Variation Description (e.g. HGVS)

LRG/RefSeqGene intron or exon number (optional)

XML: Measure.Location

Historical intron or exon number (optional)

UTR and upstream/downstream locations


Distance from nearer splice junction

(can be calculated if not provided)

Regulatory site (yes/no or name of promoter/locus control region)

Total exons in transcript

Protein Location

Protein sequence accession and version

Variation Description (e.g. HGVS)

Region name (active site, conserved domain, etc.)

Are there other variations reported in this codon (calculated)

Molecular consequence:

These elements are based on sequence ontology terms when available (should more

be requested?), and, when possible, shall be computed per transcript by NCBI. Rules

for ‘splice_site_lost’

Frameshift (SO:0000865)

Missense (SO:0001783)

Nonsense (SO:0001587)

Synonymous (SO:0001588)

In frame (SO:0001650)

Functional consequence:

These attributes will be provided by the submitter since they require identification of

the consequences of the molecular change.

Loss of function

Gain of function

Skipped exons





Results in nonsense mediated decay

AttributeSet: consequence

XML: Measure.consequence



Optional, multiple allowed



XML: measure.attribute.type in (partial listing: HGVS expression, OMIM allelic variant ID, dbSNP id, dbVar accession, allele name)

Type of variation (incomplete listing, based on sequence ontology


Single nucleotide variant (SO:0001483)

Multiple nucleotide variation (SO:0001013)

Insertion (SO:0000667)

Duplication (SO:1000035)

Deletion (SO:0000159)


Repeat expansion

Relationships among variations

Named haplotype

Compound heterozygote

Co-occurrence (epistatic contribution to asserted phenotype)

Description of the asserted relationship between a set of

phenotypes and a set of variations

Mode of inheritance

Controlled values from Human Phenotype Ontology (HPO)




Clinical significance (optional)

Clinical significance is listed as optional because locus-specific databases (LSDBs) submit curated relationships between variations and disorders without providing significance.

As asserted by the submitter



Record status

Review status: indicates the level of confidence in any assertion

selected from among

o not reviewed

o reviewed by single submitter

o reviewed by expert panel

o reviewed by professional society

o conflict identified (calculated by NCBI if there are multiple submissions

for the same phenotype/allele relationship)
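The "conflict identified" status could be derived along the following lines. This is a sketch under an assumption that the text does not spell out: a conflict is flagged when two or more submissions for the same phenotype/allele relationship assert different clinical significance. The function name, tuple layout, and significance values used in the example are illustrative only.

```python
from collections import defaultdict


def find_conflicts(submissions):
    """Return (phenotype, allele) pairs with more than one distinct assertion.

    Each submission is a (phenotype, allele, clinical_significance) tuple.
    """
    assertions = defaultdict(set)
    for phenotype, allele, significance in submissions:
        assertions[(phenotype, allele)].add(significance)
    return sorted(key for key, values in assertions.items() if len(values) > 1)


# Illustrative data: two submitters disagree on alleleA, and agree on alleleB.
example = [
    ("trait1", "alleleA", "pathogenic"),
    ("trait1", "alleleA", "benign"),
    ("trait1", "alleleB", "benign"),
    ("trait1", "alleleB", "benign"),
]
conflicts = find_conflicts(example)
```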


XML: ObservedIn

Note: Multiple “ObservedIn” structures can be included in a given ClinVar submission in support of the assertion about the clinical significance of a given variant. These may be derived from the literature, population studies, research, in silico predictions, or genetic testing.

In silico (optional)

This section would be used for SIFT, PolyPhen-2, etc. as well as reports of sequence



XML: ObservedIn.Method


Value: value calculated from the indicated method

XML: ObservedIn.Observation

Experimental (optional)

Method (required)

XML: ObservedIn.Method

Citations (Optional)

Sample (required)


Findings: (required) results obtained from the method applied to the sample

Interpretation: (optional) conclusions drawn from the finding

o May be controlled vocabulary: inactive, reduced activity, normal

Observations in humans

Method (required)

Sample (required)



Findings (required)

XML: ObservedIn.Observation (implemented as an Attribute structure)

Data shall be aggregated in the following categories, with options dependent on

the type of study defined by the method. In the XML, these shall be captured as

attributes, with each category of information as an independent type

Clinical testing - Counts from Affected

Note: The field “Independent Observations” in the Sample section indicates if these counts are independent (probands or

singletons) or if counts may include related subjects.

Number of affected subjects tested

XML: ObservedIn.Observation.Type=AffectedTested

Number of chromosomes from affected subjects tested

XML: ObservedIn.Observation.Type=AffectedChrTested

Number of variant chromosomes in affected subjects

XML: ObservedIn.Observation.Type=VariantChrAffected

Number of heterozygotes (exclude compound heterozygotes)

Number of compound/double heterozygotes

Number of hemizygotes

Number of variant homozygotes


Number of de novo occurrences

Number of occurrences in trans with a different loss-of-function (LOF) pathogenic variant where double LOF is lethal

Number of co-occurrences with another likely causative variant that explains the


Number where another variant present that could explain phenotype

Additional Family-based fields (optional, when applicable)

(includes the proband in counts)

Number of informative meioses

Number of concordant affected subjects (geno+pheno+) (relevant MOI)

(2 or more geno+pheno+ in a family)

Number of variant alleles in affected

Number of discordant (geno-pheno+) relatives of proband (relevant MOI)

Number of discordant (geno+pheno-) relatives of proband (relevant MOI)

Number of reference (normal) alleles transmitted from heterozygous parents to affected subject(s)

Number of affected heterozygotes in family(s) (exclude compound heterozygotes)

Number of affected compound/double heterozygotes

Number of affected hemizygotes

Number of affected variant homozygotes

LOD score

Co-occurrence (include data from one sample per family/singleton)

Note: The fields below are repeated observations for each co-occurring variation


Number of heterozygotes of asserted variant with heterozygote of other

variation in trans

Number of heterozygotes of asserted variant with heterozygote of other

variation phase unknown

Number of homozygotes of asserted variation with heterozygote of other

Number of heterozygotes of asserted variation with homozygote of other

Number of homozygotes of asserted variation with homozygote of other

Definition of other variation


o Variation identifier

o Allele

o Gene

Co-occurrence with other pathogenic

(For genotypes at this location occurring with another locus known to affect the same phenotype)

Note: The fields below are a repeating set for each co-occurring allele

Definition of other variation

o Variation identifier

o Allele

o Gene

Number of heterozygotes of asserted variant with heterozygote of other in trans

Number of heterozygotes of asserted variant with heterozygote of other, phase


Number of homozygotes of asserted variant with heterozygote of other

Number of heterozygotes of asserted variant with homozygote of other

Number of homozygotes of asserted variant with homozygote of other

Controls: Unaffected for phenotype being asserted.

These data may include general population surveys such as 1000 Genomes, or sample sets such as population-based carrier screening.

Number of reference homozygotes

Number of heterozygotes

Number of variant homozygotes

o Compute if variant homozygotes observed

Number of variant alleles observed

o Compute allele frequency
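The two computed values in the Controls list can be derived from the genotype counts as sketched below. The sketch assumes a diploid autosomal locus, so each subject contributes two alleles; hemizygous (for example X-linked) data would need different handling. The function names are hypothetical.

```python
def variant_homozygotes_observed(var_hom):
    """True if at least one variant homozygote was seen among controls."""
    return var_hom > 0


def allele_frequency(ref_hom, het, var_hom):
    """Variant allele frequency from genotype counts (diploid assumption).

    Each heterozygote carries one variant allele and each variant
    homozygote carries two; every subject contributes two alleles in total.
    """
    total_alleles = 2 * (ref_hom + het + var_hom)
    if total_alleles == 0:
        raise ValueError("no genotypes observed")
    variant_alleles = het + 2 * var_hom
    return variant_alleles / total_alleles
```

For instance, with 90 reference homozygotes, 9 heterozygotes, and 1 variant homozygote there are 11 variant alleles among 200, giving a frequency of 0.055.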

Case-Control Study

Number of affected reference homozygotes

Number of affected heterozygotes

Number of affected variant homozygotes


Number of unaffected reference homozygotes

Number of unaffected heterozygotes

Number of unaffected variant homozygotes



Subject-based, observation specific

NOTE: This section is included for completeness, but the data will not be stored in ClinVar. Instead,

individual-level data submitted to dbGaP/genotype archive/BioSamples will be aggregated for

submission to ClinVar

Anonymous case ID (Lab ID)

Anonymous pedigree ID

Patient Age

Patient Gender

Patient sub-phenotype

Proband status (yes|no)

Description of a gene


Preferred name

The full name from HUGO Gene Nomenclature Committee (HGNC)

Required, only one allowed.




Alternate name(s)

Other names used for this gene. Provided via the Gene database.

Optional, multiple allowed





Preferred symbol

The official symbol from HGNC.

Required, only one allowed.



XML: measure.symbol.type=preferred

Alternate symbol(s)

Alternate gene symbols from Gene.



Optional, multiple allowed.



XML: measure.symbol.type=alternate

Examples would be HGNC ids, GeneID, MIM number, chromosome, cytogenetic

band, chromosome sequence location, related pseudogenes/paralogs

NOTE: many of these are not duplicated in ClinVar but are provided by NCBI as

imports from the Gene database.


XML: measure.attribute.type=[]

Gene: subtypes of genes are defined by the Gene database.

XML: measure.type


Description of the relationship between a set of

phenotypes and a set of genes

Note: Documentation incomplete. This will be represented similar to the phenotype-variation


Accessions and versions

SCV. The accession for each submitted assertion. This is provided by NCBI, but shall be

included in a submission that needs an update.

RCV. The accession for a reviewed assertion.

Discussion topics

Should ClinVar store in silico data?

Regarding in silico data, I think one challenge is that storing this derived data may not be the most effective way to capture it, as opposed to maintaining source linkages to auto-derive that data in real time, or as close to real time as possible (even if to the user it looks like a stored element). So it is probably best to capture data on pieces of evidence that cannot be re-derived, including population frequency in cases vs. controls, de novo occurrence, LOD scores for positive meiotic segregations with disease, etc. That said, it would be useful to capture the basis for why we made certain


Should ClinVar provide reports of different approaches to classify


To that extent, we also need to start sharing the rules we all use to classify variants based upon these elements. To start that discussion, I have attached the rules we are using today for variant classification (this version is a bit cardiomyopathy-centric, but we’re trying to make it applicable across all highly penetrant Mendelian diseases). Would love to see others’ documents in this area.
