SynVar

Variant expansion and normalisation

Background

Genetic variants are attracting increasing interest for their role in disease, for the design of new drugs, and for refining treatment efficacy through stratification. However, variant interpretation depends on time-consuming curation tasks. To support variant interpretation efforts and decisions based on the latest evidence, we propose Variomes [1], a service performing variant-specific triage of publications.

To increase the comprehensiveness of Variomes, we developed SynVar. This tool performs variant expansion and normalisation across representation levels. The task raises several challenges:

Variants can be represented at different levels - genomic (g.), cDNA (c., transcript-based), or protein (p.) - with a combinatorial (many-to-many) relationship between them.
Variant descriptions depend on a reference sequence on which the variation is described, to avoid positional ambiguity.
The majority of variants mentioned in the literature do not follow a standard nomenclature.

While several variant databases and registries exist, such as ClinVar, dbSNP, and the ClinGen Allele Registry, relying on them as reference terminologies presents several limitations:

Depending on the resource, variants may be represented at different molecular levels (genomic, coding DNA, or protein), preventing a direct one-to-one correspondence between representations. In addition, some records are position-specific but not allele-specific, reducing precision.
Dependence on external databases limits the retrieval of newly described variants that are not yet catalogued.
These resources do not capture non-standard or alternative variant expressions as they appear in the literature.

Description

To enable effective retrieval of variants in the literature, we developed a variant expansion and normalisation tool. For a given variant, including variants not described in existing databases, it generates the corresponding descriptions at the genomic (g.), cDNA (transcript-based, c.), and protein (p.) levels, both in the HGVS format and in the many non-standard yet frequently used forms found in the literature. The input variant can be given at any of these description levels.

Supported variant types

SynVar supports the following variant types according to HGVS nomenclature:

Substitutions (SNPs): Single nucleotide or amino acid changes (e.g. V600E, c.1799T>A, g.55181378G>A)
Deletions: Deletion of one or more nucleotides or amino acids (e.g. E746_A750del, c.2235_2249del)
Duplications: Duplication of one or more nucleotides or amino acids (e.g. V600dup, c.1799dup)
Insertions: Insertion of one or more nucleotides or amino acids (e.g. c.7397_7398insT)
Deletion-insertions (delins): Combined deletion and insertion (e.g. c.112_117delinsAT)
Frameshifts: Variants causing a frameshift (e.g. p.Arg97fs, c.289delC)
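
The variant types above can be told apart from the syntax of the description alone. The following is a minimal sketch of such a classification, not SynVar's actual implementation: the regular expressions and the function name classify_variant are assumptions made for illustration. Because it looks only at syntax, a cDNA deletion that happens to cause a frameshift (such as c.289delC) is reported as a deletion.

```python
import re

# Hypothetical patterns, checked in order. "delins" must be tested before
# "del" and "ins", otherwise a delins would be reported as a deletion.
_TYPE_PATTERNS = [
    ("frameshift", re.compile(r"fs")),
    ("delins", re.compile(r"delins")),
    ("deletion", re.compile(r"del")),
    ("duplication", re.compile(r"dup")),
    ("insertion", re.compile(r"ins")),
    # Nucleotide substitution (c.1799T>A) or protein substitution
    # in one-letter (V600E) or three-letter (p.Val600Glu) form.
    ("substitution", re.compile(
        r"[ACGT]>[ACGT]$"
        r"|^(p\.)?([A-Z]|[A-Z][a-z]{2})\d+([A-Z]|[A-Z][a-z]{2})$"
    )),
]


def classify_variant(description):
    """Return the variant type suggested by the syntax, or None if unknown."""
    for name, pattern in _TYPE_PATTERNS:
        if pattern.search(description):
            return name
    return None


if __name__ == "__main__":
    for example in ("V600E", "c.2235_2249del", "c.112_117delinsAT", "p.Arg97fs"):
        print(example, classify_variant(example))
```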

Isoform support

SynVar can recognize and process variants specified on protein isoforms. When the optional parameter iso=true is provided, the tool expands the variant to all available isoforms of the gene. The system accepts:

Gene names (e.g. TP53) - validates against the canonical isoform first. If the variant is not valid on the canonical isoform, the system automatically searches other isoforms. When iso=true, expands to all isoforms.
RefSeq protein identifiers (e.g. NP_001119586.1) - recognizes the specific isoform corresponding to the RefSeq ID. When iso=true, expands to all isoforms.

Example: TP53 R248W with iso=true returns variant representations for all 9 TP53 isoforms. The variant is first validated on the canonical isoform (P04637-1); if it is not valid there, the other isoforms are searched. With iso=true, all 9 isoforms are returned regardless of which isoform validated the variant.

Workflow

Use-cases

Protein variant: by default, the change is validated on the reference sequence of the canonical isoform, as retrieved through the UniProt API [2]. The valid variant is then back-translated into the possible cDNA variants using the back-translator tool from Mutalyzer [3]. Finally, each cDNA variant is mapped onto its genomic position (GRCh37 and GRCh38 builds) using VariantValidator [4].

cDNA variant: the variant is validated and mapped onto its genomic position using VariantValidator [4], which also translates it into the corresponding protein variant.

Genomic variant: the variant is validated and converted to the cDNA variants using VariantValidator [4], if not intergenic. VariantValidator also provides the translation into protein variants. If intergenic, only genomic variant representations are generated.

dbSNP id: The genomic variants associated with the dbSNP [5] id are retrieved through the NCBI E-utilities services. Conversion and translation from the genomic variant then proceed as described above.

ClinGen Allele Registry ID: The genomic variant corresponding to the ClinGen Allele Registry ID (CA ID) is retrieved through the ClinGen Allele Registry [6]. Genomic mapping and translation then proceed as described above.
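
The use-cases above can be summarised as a lookup from input level to an ordered list of steps. The table below is only a reading aid transcribed from the descriptions. The dictionary, the step wording, and the function name steps_for are illustrative assumptions and do not reflect SynVar's internal code.

```python
# Steps applied to a genomic variant; reused by the dbSNP and ClinGen entries,
# which first resolve the identifier to one or more genomic variants.
_GENOME_STEPS = [
    "validate and convert to cDNA, unless intergenic (VariantValidator)",
    "translate to protein (VariantValidator)",
]

WORKFLOW = {
    "protein": [
        "validate on the canonical isoform reference sequence (UniProt)",
        "back-translate to the possible cDNA variants (Mutalyzer)",
        "map to genome, GRCh37 and GRCh38 (VariantValidator)",
    ],
    "cdna": [
        "validate and map to genome (VariantValidator)",
        "translate to protein (VariantValidator)",
    ],
    "genome": list(_GENOME_STEPS),
    "dbsnp": ["retrieve genomic variants (NCBI E-utilities)"] + _GENOME_STEPS,
    "clingen": ["retrieve genomic variant (ClinGen Allele Registry)"] + _GENOME_STEPS,
}


def steps_for(level):
    """Return the ordered steps for an input level; raises KeyError if unknown."""
    return WORKFLOW[level]


if __name__ == "__main__":
    for number, step in enumerate(steps_for("protein"), start=1):
        print(number, step)
```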

Output

Results are returned as a list of genomic variants (defined by chromosome, position, reference allele and alternate allele), along with their corresponding cDNA (transcript-based) and protein variants, grouped by genes and isoforms. By default, the output is in XML format. The main elements are the following:

synonym: Aliases, i.e. alternative names of genes and proteins
hgvs: Variant description in the standard HGVS format. A primary standardized HGVS representation is provided, together with additional HGVS representations at the genomic (g.), coding DNA (c., transcript-based), and protein (p.) levels
syntactic-variation: Alternative textual representations of the variant as encountered in the literature, including non-standard and commonly used mention forms
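
As a rough illustration of how these elements might be collected for a literature query, the sketch below parses an XML document with Python's standard library. Only the three element names come from the description above. The sample document, its root and wrapper elements, the placeholder alias, and the function name collect_terms are assumptions, and the real response may be nested differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the wrapper elements and values are made up for the
# example; only the synonym / hgvs / syntactic-variation names are documented.
SAMPLE_XML = """
<variant-list>
  <variant>
    <synonym>ALIAS-1</synonym>
    <hgvs>p.Val600Glu</hgvs>
    <hgvs>c.1799T&gt;A</hgvs>
    <syntactic-variation>V600E</syntactic-variation>
    <syntactic-variation>Val600Glu</syntactic-variation>
  </variant>
</variant-list>
"""

ELEMENTS = ("synonym", "hgvs", "syntactic-variation")


def collect_terms(xml_text):
    """Group the text of the main output elements by element name."""
    root = ET.fromstring(xml_text.strip())
    terms = {}
    for tag in ELEMENTS:
        # iter() walks the whole tree, so the nesting depth does not matter.
        terms[tag] = [
            element.text.strip()
            for element in root.iter(tag)
            if element.text and element.text.strip()
        ]
    return terms


if __name__ == "__main__":
    for tag, values in collect_terms(SAMPLE_XML).items():
        print(tag, values)
```

Searching by element name rather than by a fixed path keeps the sketch independent of the grouping by genes and isoforms.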

Programmatic access

URL

https://synvar.sibils.org/generate/literature/fromMutation

Parameters

variant: Variant description, ClinGen Allele Registry ID, or dbSNP id (e.g. V617F, Val600Glu, rs113488022, CA251544, BRAF V600E). Required.

Optional parameters

ref: Gene name, chromosome number or name (e.g. JAK2, BRAF, 9, X). If omitted and the variant parameter contains the gene/reference information (e.g. BRAF V600E), the system extracts it automatically. Not needed when using database identifiers (dbSNP, ClinGen).
level: Level of the provided variant description: protein, cdna (or transcript for backwards compatibility), genome, dbsnp, or clingen. Optional (default: any). The cdna level refers to the HGVS c. notation (coding DNA, transcript-based coordinates). When set to any or omitted, the system attempts to detect the variant level automatically based on the variant syntax and on the validity of the variant at each level. Note: Specifying the level explicitly is more efficient as it avoids testing all possible levels.
iso: Validate on and expand to all isoforms: false (default) or true. When set to true, detects and expands the variant to all available isoforms of the gene.
map: Require genome mapping for output: true (default) or false. When set to false, outputs syntactic variations even if the variant could not be mapped to the genome. Useful for generating literature search terms for variants that cannot be validated or mapped.
norm: Return only normalized identifiers: false (default) or true. When set to true, returns only HGVS, dbSNP ID, and ClinGen Allele ID without syntactic variations.
format: Output format: xml (default), json (same structure as XML but in JSON format), or beacon (Beacon v2 JSON format).
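
The parameters above can be assembled into a request URL with the Python standard library. This is a minimal sketch: the function name build_url and its keyword names are assumptions, and only non-default values are added to the query string. Values are percent-encoded, so a space becomes %20 and > becomes %3E. That the service decodes these in the usual way is assumed here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://synvar.sibils.org/generate/literature/fromMutation"

# Accepted values as listed in the parameter descriptions.
LEVELS = {"any", "protein", "cdna", "transcript", "genome", "dbsnp", "clingen"}
FORMATS = {"xml", "json", "beacon"}


def build_url(variant, ref=None, level=None, iso=False,
              map_genome=True, norm=False, fmt="xml"):
    """Build a SynVar query URL; parameters left at their default are omitted."""
    if level is not None and level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")

    params = {"variant": variant}
    if ref:
        params["ref"] = ref
    if level:
        params["level"] = level
    if iso:
        params["iso"] = "true"
    if not map_genome:
        params["map"] = "false"
    if norm:
        params["norm"] = "true"
    if fmt != "xml":
        params["format"] = fmt
    # quote_via=quote encodes spaces as %20 rather than "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(build_url("V600E", ref="BRAF", level="protein", fmt="json"))
    print(build_url("TP53 R248W", iso=True))
```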

Examples

Special cases with map parameter

https://synvar.sibils.org/generate/literature/fromMutation?ref=BRAF&variant=V600K&level=protein&map=false (map=false: generates syntactic variations even without genome mapping. V600K is a protein substitution that corresponds to an indel at DNA level, which may complicate mapping)
https://synvar.sibils.org/generate/literature/fromMutation?ref=BRAF&variant=V600K&level=protein (map=true by default: only outputs if genome mapping succeeds)

Automatic detection (without ref or level parameters)

https://synvar.sibils.org/generate/literature/fromMutation?variant=BRAF V600E (gene name in variant: automatically extracts BRAF as ref and detects protein level)
https://synvar.sibils.org/generate/literature/fromMutation?variant=V600E&ref=BRAF (no level specified: automatically detects protein level from V600E syntax)
https://synvar.sibils.org/generate/literature/fromMutation?variant=c.1799T>A&ref=BRAF (no level specified: automatically detects cDNA level from c. prefix)
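
A client can make a similar guess locally before calling the service, for instance to set the level parameter explicitly. The sketch below covers only the syntactic half of the detection. The patterns and the function names split_ref and guess_level are assumptions for illustration, whereas the service itself also checks the validity of the variant at each level.

```python
import re

# Hypothetical syntax rules, checked in order.
_LEVEL_PATTERNS = [
    ("dbsnp", re.compile(r"^rs\d+$")),
    ("clingen", re.compile(r"^CA\d+$")),
    ("cdna", re.compile(r"^c\.")),
    ("genome", re.compile(r"^g\.")),
    # p. prefix optional; one-letter or three-letter residue, then a position.
    ("protein", re.compile(r"^(p\.)?([A-Z]|[A-Z][a-z]{2})\d+")),
]


def split_ref(text):
    """Split 'BRAF V600E' into ('BRAF', 'V600E'); otherwise return (None, text)."""
    parts = text.split()
    if len(parts) == 2:
        return parts[0], parts[1]
    return None, text


def guess_level(variant):
    """Return the level suggested by the syntax, or 'any' if nothing matches."""
    for level, pattern in _LEVEL_PATTERNS:
        if pattern.search(variant):
            return level
    return "any"


if __name__ == "__main__":
    ref, variant = split_ref("BRAF V600E")
    print(ref, variant, guess_level(variant))
```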

Variant extraction from complex text

https://synvar.sibils.org/generate/literature/fromMutation?variant=In melanoma patients, mutations in BRAF and NRAS genes are common. We identified the V600E and Q61R substitutions associated with poor prognosis (extracts "V600E" with "BRAF" and "Q61R" with "NRAS" from distant mentions in sentence)

Normalization only (norm parameter)

https://synvar.sibils.org/generate/literature/fromMutation?ref=BRAF&variant=V600E&level=protein&norm=true (norm=true: returns only HGVS, rsID, and CAID without syntactic variations)
https://synvar.sibils.org/generate/literature/fromMutation?ref=BRAF&variant=V600E&level=protein (norm=false by default: includes all syntactic variations for literature search)

Output formats

https://synvar.sibils.org/generate/literature/fromMutation?ref=BRAF&variant=V600E&level=protein (XML format, default)
https://synvar.sibils.org/generate/literature/fromMutation?ref=BRAF&variant=V600E&level=protein&format=json (JSON format with same structure as XML)
https://synvar.sibils.org/generate/literature/fromMutation?ref=BRAF&variant=V600E&level=protein&format=beacon (Beacon v2 JSON format)

Search interface

Fields

Gene/Chromosome: Gene name or chromosome number/name (e.g. JAK2, BRAF, 9, X, MT). The field can be empty if a dbSNP or ClinGen Allele Registry ID is searched.
Variant: Variant in the following format: V600E (for protein, p. notation) or c.1799T>A (for cDNA, c. notation) or g.140753336A>T (for genomic, g. notation) or a dbSNP id (e.g. rs113488022) or ClinGen Allele Registry ID (e.g. CA251544).
Level: Level of the provided variant description (protein, cdna, genome, dbsnp or clingen). Note: transcript is also accepted as an alias for cdna.

Template programs

Example scripts to query the service and parse the output:

Python: queryVariant.py - Includes examples for XML and JSON parsing, isoform expansion
Java: queryVariant.java - Demonstrates XML parsing and extracting HGVS notations and alternative textual representations

References

Mottaz A, Pasche E, Michel PA, Mottin L, Teodoro D, Ruch P. Designing an Optimal Expansion Method to Improve the Recall of a Genomic Variant Curation-Support Service. Stud Health Technol Inform. 2022 May 25;294:839-843. doi: 10.3233/SHTI220603.
Pasche E, Mottaz A, Caucheteur D, Gobeill J, Michel PA, Ruch P. Variomes: a high recall search engine to support the curation of genomic variants. Bioinformatics. 2022 Apr 28;38(9):2595-2601. doi: 10.1093/bioinformatics/btac146.
The UniProt Consortium (2023). UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51(D1), D523–D531. https://doi.org/10.1093/nar/gkac1052
den Dunnen J. T. (2016). Sequence Variant Descriptions: HGVS Nomenclature and Mutalyzer. Current protocols in human genetics, 90, 7.13.1–7.13.19. https://doi.org/10.1002/cphg.2
Freeman, P. J., Hart, R. K., Gretton, L. J., Brookes, A. J., & Dalgleish, R. (2018). VariantValidator: Accurate validation, mapping, and formatting of sequence variation descriptions. Human mutation, 39(1), 61–68. https://doi.org/10.1002/humu.23348
Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP: a database of single nucleotide polymorphisms. Nucleic acids research, 28(1), 352–355. https://doi.org/10.1093/nar/28.1.352
Pawliczek, P., Patel, R. Y., Ashmore, L. R., Jackson, A. R., Bizon, C., Nelson, T., Powell, B., Freimuth, R. R., Strande, N., Shah, N., Riegel, B., Meeks, M., Levy, M. A., Kattman, B., Berg, J. S., & Harrison, S. M. (2018). ClinGen Allele Registry links information about genetic variants. Human mutation, 39(11), 1690–1701. https://doi.org/10.1002/humu.23637