SynVar
Variant expansion and normalisation
Background
Genetic variants are attracting growing interest for their role in pathologies, for the design of new drugs, and for refining treatment efficacy through stratification. However, variant interpretation depends on time-consuming curation tasks. To support variant interpretation efforts and decisions based on the latest evidence, we propose Variomes [1], a service that performs variant-specific triage of publications.
To increase the comprehensiveness of Variomes, we developed SynVar. This tool performs variant expansion and normalisation across representation levels. The task faces several challenges:
- Variants can be represented at different levels - genomic (g.), cDNA (c., transcript-based), or protein (p.) - with a combinatorial (many-to-many) relationship between them.
- Variant descriptions depend on a reference sequence on which the variation is described, to avoid positional ambiguity.
- The majority of variants mentioned in the literature do not follow a standard nomenclature.
While several variant databases and registries exist, such as ClinVar, dbSNP, and the ClinGen Allele Registry, relying on them as reference terminologies presents several limitations:
- Depending on the resource, variants may be represented at different molecular levels (genomic, coding DNA, or protein), preventing a direct one-to-one correspondence between representations. In addition, some records are position-specific but not allele-specific, reducing precision.
- Dependence on external databases limits the retrieval of newly described variants that are not yet catalogued.
- These resources do not capture non-standard or alternative variant expressions as they appear in the literature.
Description
To enable smooth and effective retrieval of variants in the literature, we developed a variant expansion and normalisation tool. For a given variant, including variants not described in existing databases, it generates the corresponding descriptions at the genomic (g.), cDNA (transcript-based, c.), and protein (p.) levels, both in the HGVS format and in many non-standard yet frequently used forms found in the literature. Expansion and normalisation can start from any description level.
Supported variant types
SynVar supports the following variant types according to HGVS nomenclature:
- Substitutions (SNPs): Single nucleotide or amino acid changes (e.g. V600E, c.1799T>A, g.55181378G>A)
- Deletions: Deletion of one or more nucleotides or amino acids (e.g. E746_A750del, c.2235_2249del)
- Duplications: Duplication of one or more nucleotides or amino acids (e.g. V600dup, c.1799dup)
- Insertions: Insertion of one or more nucleotides or amino acids (e.g. c.7397_7398insT)
- Deletion-insertions (delins): Combined deletion and insertion (e.g. c.112_117delinsAT)
- Frameshifts: Variants causing a frameshift (e.g. p.Arg97fs, c.289delC)
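The variant types listed above can largely be told apart by syntax. The following is a minimal sketch of such a classification, not SynVar's own implementation; the function name and the patterns are assumptions made for illustration. It looks only at the text of the description, so a consequence such as the frameshift caused by c.289delC is reported as a plain deletion.

```python
import re

# Ordered (type, pattern) pairs: "delins" must be tested before "del" and "ins".
# These patterns are illustrative assumptions, not the rules used by SynVar.
_TYPE_PATTERNS = [
    ("frameshift", re.compile(r"fs")),
    ("delins", re.compile(r"delins")),
    ("deletion", re.compile(r"del")),
    ("duplication", re.compile(r"dup")),
    ("insertion", re.compile(r"ins")),
    # Nucleotide substitution (c./g.) or amino acid substitution (1- or 3-letter codes).
    ("substitution", re.compile(
        r"^(?:[cg]\.)?\d+[ACGT]>[ACGT]$"
        r"|^(?:p\.)?[A-Z][a-z]{0,2}\d+(?:[A-Z][a-z]{0,2}|\*)$"
    )),
]


def classify_variant(description: str) -> str:
    """Return a coarse variant type based on syntax alone, or 'unknown'."""
    text = description.strip()
    for name, pattern in _TYPE_PATTERNS:
        if pattern.search(text):
            return name
    return "unknown"


if __name__ == "__main__":
    for example in ["V600E", "c.1799T>A", "E746_A750del", "c.1799dup",
                    "c.7397_7398insT", "c.112_117delinsAT", "p.Arg97fs"]:
        print(example, "->", classify_variant(example))
```

The order of the patterns matters: testing "delins" first keeps a deletion-insertion from being reported as a simple deletion.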
Isoform support
SynVar can recognize and process variants specified on protein isoforms. When the optional parameter iso=true is provided, the tool expands the variant to all available isoforms of the gene. The system accepts:
- Gene names (e.g. TP53): the variant is validated against the canonical isoform first. If it is not valid on the canonical isoform, the system automatically searches the other isoforms. When iso=true, the variant is expanded to all isoforms.
- RefSeq protein identifiers (e.g. NP_001119586.1): the system recognizes the specific isoform corresponding to the RefSeq ID. When iso=true, the variant is expanded to all isoforms.
Example: TP53 R248W with iso=true returns variant representations for all 9 TP53 isoforms. The variant is first validated on the canonical isoform (P04637-1); if it is not valid there, the system automatically searches the other isoforms. Because iso=true, all 9 isoforms are returned regardless of which isoform the variant was initially validated on.
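The validation order described above (canonical isoform first, then the other isoforms, with optional expansion to all of them) can be sketched as follows. This is an illustration under stated assumptions, not SynVar's code: the function names, the isoform identifiers and the toy sequences are hypothetical, and the mapping of residue positions between isoforms is not modelled.

```python
from typing import Dict, List, Optional


def residue_matches(sequence: str, ref_aa: str, position: int) -> bool:
    """True if `sequence` carries `ref_aa` at the 1-based `position`."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == ref_aa


def first_valid_isoform(sequences: Dict[str, str], canonical: str,
                        ref_aa: str, position: int) -> Optional[str]:
    """Try the canonical isoform first, then the others in dictionary order."""
    ordered = [canonical] + [iso_id for iso_id in sequences if iso_id != canonical]
    for iso_id in ordered:
        if residue_matches(sequences[iso_id], ref_aa, position):
            return iso_id
    return None


def select_isoforms(sequences: Dict[str, str], canonical: str,
                    ref_aa: str, position: int, iso: bool = False) -> List[str]:
    """Return the isoforms to report: none if validation fails everywhere,
    all of them if iso is True, otherwise only the validating isoform."""
    anchor = first_valid_isoform(sequences, canonical, ref_aa, position)
    if anchor is None:
        return []
    return list(sequences) if iso else [anchor]


if __name__ == "__main__":
    # Hypothetical toy sequences, far shorter than any real protein.
    toy = {"X-1": "MKRW", "X-2": "MKW", "X-3": "MRW"}
    print(select_isoforms(toy, "X-1", "R", 3))            # valid on the canonical isoform
    print(select_isoforms(toy, "X-1", "W", 3))            # falls back to another isoform
    print(select_isoforms(toy, "X-1", "W", 3, iso=True))  # expanded to all isoforms
```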
Workflow
Use-cases
Protein variant: by default, the change is validated on the reference sequence of the canonical isoform, as retrieved through the UniProt API [2]. The valid variant is then back-translated into the possible cDNA variants, using the back-translator tool from Mutalyzer [3]. Finally, each cDNA variant is mapped onto its genomic position (GRCh37 and GRCh38 builds) using VariantValidator [4].
cDNA variant: the variant is validated and mapped onto genome position using VariantValidator [4], which also translates it into the corresponding protein variant.
Genomic variant: the variant is validated and converted to the cDNA variants using VariantValidator [4], if not intergenic. VariantValidator also provides the translation into protein variants. If intergenic, only genomic variant representations are generated.
dbSNP id: The different genomic variants associated with the dbSNP [5] id are retrieved through the NCBI eutils services. The conversion and translation procedure from each genomic variant is similar to the one described above.
ClinGen Allele Registry ID: The genomic variant corresponding to the ClinGen Allele Registry ID (CA ID) is retrieved through the ClinGen Allele Registry [6]. The genomic mapping and translation are similar to those described above.
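To make the back-translation step of the protein use-case more concrete, the sketch below enumerates the single-nucleotide substitutions that turn a known reference codon into a codon for the alternate amino acid. It is a simplified illustration, not the Mutalyzer back-translator: it assumes the reference codon is already known, uses the standard genetic code, ignores changes that need more than one nucleotide, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def single_substitutions(ref_codon: str, aa_position: int, alt_aa: str) -> list:
    """List c. substitutions changing `ref_codon` (codon number `aa_position`)
    into a codon that encodes `alt_aa` (one-letter code)."""
    first_nt = 3 * (aa_position - 1) + 1  # c. position of the codon's first base
    results = []
    for offset, ref_base in enumerate(ref_codon):
        for alt_base in "ACGT":
            if alt_base == ref_base:
                continue
            codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
            if CODON_TABLE[codon] == alt_aa:
                results.append(f"c.{first_nt + offset}{ref_base}>{alt_base}")
    return results


if __name__ == "__main__":
    # BRAF codon 600 is GTG (Val); a change to Glu needs GTG -> GAG.
    print(single_substitutions("GTG", 600, "E"))  # ['c.1799T>A']
```

When the reference codon is unknown, every codon of the reference amino acid has to be considered, which is one reason why a single protein variant can correspond to several cDNA variants.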
Output
Results are returned as a list of genomic variants (defined by chromosome, position, reference allele and alternate allele), along with their corresponding cDNA (transcript-based) and protein variants, grouped by genes and isoforms. The output is in XML format. The main elements are the following:
- synonym > aliases: Alternative names of genes and proteins
- hgvs: Variant description in the standard HGVS format. A primary standardized HGVS representation is provided, together with additional HGVS representations at the genomic (g.), coding DNA (c., transcript-based), and protein (p.) levels
- syntactic-variation: Alternative textual representations of the variant as encountered in the literature, including non-standard and commonly used mention forms
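A minimal sketch of how these elements could be collected from the XML output with the Python standard library is given below. The sample document is made up for illustration: only the element names come from the list above, while the root element and the nesting are assumptions, so the code searches the whole tree by tag name rather than relying on a fixed path.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the real output layout may nest these elements differently.
SAMPLE_XML = """<result>
  <synonym>BRAF</synonym>
  <hgvs>c.1799T&gt;A</hgvs>
  <hgvs>p.Val600Glu</hgvs>
  <syntactic-variation>V600E</syntactic-variation>
  <syntactic-variation>Val600Glu</syntactic-variation>
</result>"""

TAGS = ("synonym", "hgvs", "syntactic-variation")


def collect_elements(xml_text: str, tags=TAGS) -> dict:
    """Map each tag name to the non-empty text values found anywhere in the tree."""
    root = ET.fromstring(xml_text)
    collected = {}
    for tag in tags:
        values = []
        for element in root.iter(tag):
            text = (element.text or "").strip()
            if text:
                values.append(text)
        collected[tag] = values
    return collected


if __name__ == "__main__":
    for tag, values in collect_elements(SAMPLE_XML).items():
        print(tag, values)
```

Searching by tag name keeps the sketch independent of the exact nesting, at the cost of losing the grouping by gene and isoform that the real output provides.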
Programmatic access
URL
https://synvar.sibils.org/generate/literature/fromMutation
Parameters
- variant: Variant description, ClinGen Allele Registry ID, or dbSNP id (e.g. V617F, Val600Glu, rs113488022, CA251544, BRAF V600E). Required.
Optional parameters
- ref: Gene name, chromosome number or name (e.g. JAK2, BRAF, 9, X). Optional. If not provided and the variant parameter contains the gene/reference information (e.g. BRAF V600E), the system will automatically extract it. Also optional when using database identifiers (dbSNP, ClinGen).
- level: Level of the provided variant description: protein, cdna (or transcript for backwards compatibility), genome, dbsnp, or clingen. Optional (default: any). The cdna level refers to the HGVS c. notation (coding DNA, transcript-based coordinates). When set to any or omitted, the system attempts to detect the variant level automatically based on the variant syntax and on the validity of the variant at each level. Note: Specifying the level explicitly is more efficient as it avoids testing all possible levels.
- iso: Validate on and expand to all isoforms: false (default) or true. When set to true, detects and expands the variant to all available isoforms of the gene.
- map: Require genome mapping for output: true (default) or false. When set to false, outputs syntactic variations even if the variant could not be mapped to the genome. Useful for generating literature search terms for variants that cannot be validated or mapped.
- norm: Return only normalized identifiers: false (default) or true. When set to true, returns only HGVS, dbSNP ID, and ClinGen Allele ID without syntactic variations.
- format: Output format: xml (default), json (same structure as XML but in JSON format), or beacon (Beacon v2 JSON format).
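The parameters above can be assembled into a request as sketched below. This assumes that the service accepts them as a GET query string on the URL given above; the helper names are hypothetical and no error handling for HTTP failures is included.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://synvar.sibils.org/generate/literature/fromMutation"
OPTIONAL_PARAMETERS = {"ref", "level", "iso", "map", "norm", "format"}


def build_query_url(variant: str, **options) -> str:
    """Build the request URL; booleans are sent as 'true' or 'false'."""
    unknown = set(options) - OPTIONAL_PARAMETERS
    if unknown:
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    params = {"variant": variant}
    for name, value in options.items():
        if value is None:
            continue  # omitted parameters keep their server-side defaults
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = value
    return f"{BASE_URL}?{urlencode(params)}"


def fetch(url: str, timeout: float = 30.0) -> str:
    """Download the response body as text (assumes a UTF-8 encoded response)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    url = build_query_url("V600E", ref="BRAF", level="protein", format="json")
    print(url)
    # print(fetch(url))  # uncomment to query the live service
```

Characters such as ">" in c.1799T>A are percent-encoded by urlencode, so cDNA and genomic descriptions can be passed unchanged.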
Examples
Substitutions (SNPs)
Deletions
Duplications and Insertions
Isoform-specific queries
Database identifiers
Special cases with map parameter
Automatic detection (without ref or level parameters)
Variant extraction from complex text
Normalization only (norm parameter)
Search interface
Fields
- Gene/Chromosome: Gene name or chromosome number/name (e.g. JAK2, BRAF, 9, X, MT). The field can be empty if a dbSNP or ClinGen Allele Registry ID is searched.
- Variant: Variant in the following format: V600E (for protein, p. notation) or c.1799T>A (for cDNA, c. notation) or g.140753336A>T (for genomic, g. notation) or a dbSNP id (e.g. rs113488022) or ClinGen Allele Registry ID (e.g. CA251544).
- Level: Level of the provided variant description (protein, cdna, genome, dbsnp or clingen). Note: transcript is also accepted as an alias for cdna.
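Because specifying the level explicitly avoids testing every level on the server side, a client may want to pre-fill it from the variant syntax. The following sketch is a rough syntactic guess only; the function name and the patterns are assumptions, and the service's own detection also relies on validating the variant at each level.

```python
import re

# Ordered (level, pattern) pairs; identifiers are tested before variant descriptions.
_LEVEL_PATTERNS = [
    ("dbsnp", re.compile(r"^rs\d+$", re.IGNORECASE)),
    ("clingen", re.compile(r"^CA\d+$")),
    ("cdna", re.compile(r"^c\.")),
    ("genome", re.compile(r"^g\.")),
    # Optional p. prefix, then a 1- or 3-letter amino acid code and a position.
    ("protein", re.compile(r"^(?:p\.)?[A-Z][a-z]{0,2}\d+")),
]


def guess_level(variant: str) -> str:
    """Return a value for the level parameter, or 'any' when syntax is inconclusive."""
    text = variant.strip()
    for level, pattern in _LEVEL_PATTERNS:
        if pattern.match(text):
            return level
    return "any"


if __name__ == "__main__":
    for example in ["V600E", "c.1799T>A", "g.140753336A>T",
                    "rs113488022", "CA251544"]:
        print(example, "->", guess_level(example))
```

Returning "any" for inconclusive input leaves the decision to the service instead of forcing a possibly wrong level.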
Template programs
Example scripts to query the service and parse the output:
- Python: queryVariant.py - Includes examples for XML and JSON parsing, isoform expansion
- Java: queryVariant.java - Demonstrates XML parsing and extracting HGVS notations and alternative textual representations
References
- Mottaz A, Pasche E, Michel PA, Mottin L, Teodoro D, Ruch P. Designing an Optimal Expansion Method to Improve the Recall of a Genomic Variant Curation-Support Service. Stud Health Technol Inform. 2022 May 25;294:839-843. doi: 10.3233/SHTI220603.
- Pasche E, Mottaz A, Caucheteur D, Gobeill J, Michel PA, Ruch P. Variomes: a high recall search engine to support the curation of genomic variants. Bioinformatics. 2022 Apr 28;38(9):2595-2601. doi: 10.1093/bioinformatics/btac146.
- The UniProt Consortium (2023). UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51(D1), D523–D531. https://doi.org/10.1093/nar/gkac1052
- den Dunnen J. T. (2016). Sequence Variant Descriptions: HGVS Nomenclature and Mutalyzer. Current protocols in human genetics, 90, 7.13.1–7.13.19. https://doi.org/10.1002/cphg.2
- Freeman, P. J., Hart, R. K., Gretton, L. J., Brookes, A. J., & Dalgleish, R. (2018). VariantValidator: Accurate validation, mapping, and formatting of sequence variation descriptions. Human mutation, 39(1), 61–68. https://doi.org/10.1002/humu.23348
- Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP: a database of single nucleotide polymorphisms. Nucleic acids research, 28(1), 352–355. https://doi.org/10.1093/nar/28.1.352
- Pawliczek, P., Patel, R. Y., Ashmore, L. R., Jackson, A. R., Bizon, C., Nelson, T., Powell, B., Freimuth, R. R., Strande, N., Shah, N., Riegel, B., Meeks, M., Levy, M. A., Kattman, B., Berg, J. S., & Harrison, S. M. (2018). ClinGen Allele Registry links information about genetic variants. Human mutation, 39(11), 1690–1701. https://doi.org/10.1002/humu.23637