Variant expansion and normalisation
Genetic variants are drawing increasing interest for their role in pathologies, both for designing new drugs and for refining treatment efficacy through stratification. However, variant interpretation depends on time-consuming curation tasks. To support variant interpretation efforts and decisions based on the latest evidence, we propose Variomes [1, 2], a service performing variant-specific triage of publications.
To increase the comprehensiveness of Variomes, we developed SynVar. This tool performs variant expansion and normalisation across representation levels. The task faces several challenges:
While several variant databases and registries exist, such as ClinVar, dbSNP, and the ClinGen Allele Registry, relying on them as reference terminologies presents several limitations:
To enable smooth and effective retrieval of variants in the literature, we developed a variant expansion and normalisation tool that generates, for a given variant (including variants not described in existing databases), its corresponding descriptions at the genomic (g.), cDNA (transcript-based, c.), and protein (p.) levels, both in the HGVS format and in the many non-standard yet frequently used forms found in the literature. The input variant can be given at any of these description levels.
SynVar supports the following variant types according to HGVS nomenclature:
In addition to HGVS notation, variants can be provided in the following formats, which are automatically converted to HGVS before processing:
The following optional parameters generate additional synonyms in mode=expand:
Protein variant: by default, the change is validated against the reference sequence of the canonical isoform, as retrieved through the UniProt API [3]. The validated variant is then back-translated into its possible cDNA variants using the back-translator tool from Mutalyzer [4]. Finally, each cDNA variant is mapped to its genomic position (GRCh37 and GRCh38 builds) using VariantValidator [5].
cDNA variant: the variant is validated and mapped to its genomic position using VariantValidator [5], which also translates it into the corresponding protein variant.
Genomic variant: the variant is validated using VariantValidator [5]. If it is not intergenic, it is converted to the corresponding cDNA variants, and VariantValidator also provides the translation into protein variants. If it is intergenic, only genomic variant representations are generated.
dbSNP id: the genomic variants associated with the dbSNP [6] id are retrieved through the NCBI eutils services. Conversion and translation then follow the genomic variant procedure described above.
ClinGen Allele Registry ID: the genomic variant corresponding to the ClinGen Allele Registry ID (CA ID) is retrieved through the ClinGen Allele Registry [7]. Conversion and translation then follow the genomic variant procedure described above.
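Since each input type follows its own route, a client may want to guess the description level before sending a request. The sketch below is a rough heuristic and not SynVar's own detection logic: the function name `guess_level` and the regular expressions are assumptions made for illustration, and anything they do not recognise falls back to `any`.

```python
import re

# Illustrative patterns only; SynVar's actual detection rules are not described here.
_LEVEL_PATTERNS = [
    ("dbsnp", re.compile(r"^rs\d+$", re.IGNORECASE)),        # e.g. rs113488022
    ("clingen", re.compile(r"^CA\d+$")),                      # e.g. CA251544
    ("genome", re.compile(r"(^|:)g\.")),                      # e.g. NC_...:g.123A>T
    ("cdna", re.compile(r"(^|:)c\.")),                        # e.g. NM_...:c.1799T>A
    ("protein", re.compile(r"(^|:)p\.|^[A-Z]\d+[A-Z*]$")),    # e.g. p.Val600Glu, V600E
]


def guess_level(description: str) -> str:
    """Guess a value for the level parameter; return "any" when unsure."""
    tokens = description.split()
    if not tokens:
        return "any"
    # A leading gene name ("BRAF V600E") is ignored: only the last token is inspected.
    candidate = tokens[-1]
    for level, pattern in _LEVEL_PATTERNS:
        if pattern.search(candidate):
            return level
    return "any"
```

Returning `any` for unrecognised input leaves the final decision to the service, which detects the level automatically in that case.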
Results are returned as a list of genomic variants, along with their corresponding cDNA (transcript-based) and protein variants, grouped by genes and isoforms. The output content depends on the mode parameter:
mode=expand (default): full variant expansion with the following elements:
mode=normalize: normalized identifiers without syntactic variations:
The output format is controlled by the format parameter: xml (default), json (same structure in JSON), or vrs (GA4GH VRS-structured JSON including a VRS Allele object derived from the SPDI representation; the vrs format implies mode=normalize).
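SPDI positions are 0-based interbase coordinates, whereas VCF positions are 1-based, which is why the same BRAF substitution reads NC_000007.14:140753335:A:T in SPDI and 7:140753336:A:T in VCF-style notation. The sketch below shows that offset for single-nucleotide substitutions only. The function name `spdi_to_vcf` is hypothetical, and indels are deliberately rejected because VCF requires an anchor base that this sketch does not compute.

```python
def spdi_to_vcf(spdi: str) -> str:
    """Convert a single-nucleotide SPDI string to chrom:pos:ref:alt (1-based)."""
    accession, position, deleted, inserted = spdi.split(":")
    bases = {"A", "C", "G", "T"}
    if deleted.upper() not in bases or inserted.upper() not in bases:
        raise ValueError("only single-nucleotide substitutions are handled")

    # RefSeq chromosome accessions: NC_000001 to NC_000022 are chromosomes
    # 1 to 22, NC_000023 is X and NC_000024 is Y.
    stem = accession.split(".")[0]
    if not stem.startswith("NC_"):
        raise ValueError(f"not a RefSeq chromosome accession: {accession}")
    number = int(stem[3:])
    if not 1 <= number <= 24:
        raise ValueError(f"unsupported chromosome accession: {accession}")
    chrom = {23: "X", 24: "Y"}.get(number, str(number))

    # SPDI is 0-based (interbase); VCF POS is 1-based.
    return f"{chrom}:{int(position) + 1}:{deleted}:{inserted}"
```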
https://synvar.sibils.org/api
The previous URL /generate/literature/fromMutation is still supported for backward compatibility.
| Parameter | Description | Example | Default value | Requirement |
|---|---|---|---|---|
| variant | Variant description, ClinGen Allele Registry ID, or dbSNP id. Can include the gene/reference or free text containing variants. Also accepts SPDI, VCF, IVS, and HGVS repeat notation as input. Free text (variant in standard or non-standard format, with or without gene/reference). | V600E, BRAF V600E, c.1799T>A, NM_004333.6:c.1799T>A, rs113488022, CA251544, NC_000007.14:140753335:A:T (SPDI), 7:140753336:A:T (VCF), IVS1+1G>A | no default value | mandatory |
| ref | Gene name or chromosome number/name. Optional when included in the variant field, or when using dbSNP/ClinGen identifiers. Free text (gene name, chromosome number/name, sequence accession: RefSeq NM_/NP_/NC_, Ensembl ENST/ENSP, LRG). | BRAF, JAK2, 9, X | no default value | optional |
| level | Level of the variant description. When set to any, the level is detected automatically. Possible values: protein, cdna (or transcript), genome, genome38 (or genome_grch38), genome37 (or genome_grch37), dbsnp, clingen, any. The genome38/genome37 shortcuts combine level=genome with assembly filtering. | protein | any | optional |
| iso | Expand to all available isoforms of the gene. Possible values: true, false. | true | false | optional |
| map | Require genome mapping. When true, results are only returned if genome mapping succeeds. When false, outputs syntactic variations even without successful genome mapping. Possible values: true, false. | true | false | optional |
| mode | Processing mode. expand generates all synonyms and syntactic variations. normalize returns only normalized identifiers (HGVS, dbSNP, ClinGen, SPDI, VCF) without syntactic variations. Possible values: expand, normalize. The previous parameter norm=true is equivalent to mode=normalize. | normalize | expand | optional |
| format | Output format. xml and json return the same structure in different formats. vrs returns a GA4GH VRS-structured JSON with HGVS, SPDI, VCF and VRS Allele (implies mode=normalize). Possible values: xml, json, vrs. | json | xml | optional |
| startMet | Enable Start Met ±1 shift. Generates additional protein synonyms at position−1 and accepts input with +1 fallback (e.g. BRAF V600E also generates V599E). Possible values: true, false. | true | false | optional |
| insForDup | Generate insertion-equivalent synonyms for duplications (e.g. A763dup → A763_Y764insA). Possible values: true, false. | true | false | optional |
| leftAlign | Generate left-aligned (shifted) synonyms for deletions and duplications in repetitive regions. HGVS mandates 3' alignment; this adds left-aligned and intermediate forms. Possible values: true, false. | true | false | optional |
| assembly | Restrict genomic mapping to a specific genome assembly. When not specified, both GRCh38 and GRCh37 mappings are returned. Possible values: GRCh38, GRCh37, hg38, hg19. Alternatively, the assembly can be specified via the level parameter: genome38 or genome_grch38 for GRCh38, genome37 or genome_grch37 for GRCh37. | GRCh38 | both assemblies | optional |
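To make the startMet and insForDup options more concrete, the sketch below reproduces the two examples given for them (V600E giving V599E, and A763dup giving A763_Y764insA) on one-letter protein notation. It is an illustration of the string transformations only, not SynVar's implementation: the function names are hypothetical, and the residue following the duplication is passed in by the caller as an assumption, since finding it requires the protein reference sequence.

```python
import re

_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z*])$")  # e.g. V600E
_DUPLICATION = re.compile(r"^([A-Z])(\d+)dup$")        # e.g. A763dup


def start_met_synonym(variant: str) -> str:
    """Shift a one-letter protein substitution to position - 1."""
    match = _SUBSTITUTION.match(variant)
    if match is None:
        raise ValueError(f"not a one-letter substitution: {variant}")
    ref, position, alt = match.group(1), int(match.group(2)), match.group(3)
    if position < 2:
        raise ValueError("no residue precedes position 1")
    return f"{ref}{position - 1}{alt}"


def ins_for_dup(variant: str, next_residue: str) -> str:
    """Rewrite a single-residue duplication as the equivalent insertion.

    next_residue is the reference residue right after the duplicated one;
    it must come from the protein sequence and is supplied by the caller here.
    """
    match = _DUPLICATION.match(variant)
    if match is None:
        raise ValueError(f"not a single-residue duplication: {variant}")
    residue, position = match.group(1), int(match.group(2))
    return f"{residue}{position}_{next_residue}{position + 1}ins{residue}"
```

For example, `start_met_synonym("V600E")` gives `V599E` and `ins_for_dup("A763dup", "Y")` gives `A763_Y764insA`, matching the examples in the table.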
Example scripts to query the service and parse the output:
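As one possible starting point, the Python sketch below builds a request from the parameters documented above and decodes a JSON response. It assumes that requests are sent as GET query strings to the base URL listed on this page; the helper names `build_url` and `fetch_json` are hypothetical, and no assumption is made about the structure of the returned JSON, which is handed back as decoded by the standard library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as listed on this page; treating it as the query endpoint is an assumption.
BASE_URL = "https://synvar.sibils.org/api"

# Allowed values, copied from the parameter table.
ENUM_PARAMS = {
    "level": {"protein", "cdna", "transcript", "genome", "genome38",
              "genome_grch38", "genome37", "genome_grch37", "dbsnp",
              "clingen", "any"},
    "mode": {"expand", "normalize"},
    "format": {"xml", "json", "vrs"},
    "assembly": {"GRCh38", "GRCh37", "hg38", "hg19"},
}
BOOLEAN_PARAMS = {"iso", "map", "startMet", "insForDup", "leftAlign"}


def build_url(variant: str, **options) -> str:
    """Return a request URL after checking options against the table."""
    if not variant:
        raise ValueError("the variant parameter is mandatory")
    query = {"variant": variant}
    for name, value in options.items():
        if name == "ref":
            query[name] = str(value)
        elif name in BOOLEAN_PARAMS:
            query[name] = "true" if value else "false"
        elif name in ENUM_PARAMS:
            if value not in ENUM_PARAMS[name]:
                raise ValueError(f"invalid value for {name}: {value!r}")
            query[name] = value
        else:
            raise ValueError(f"unknown parameter: {name}")
    return f"{BASE_URL}?{urlencode(query)}"


def fetch_json(variant: str, **options):
    """Query the service with format=json and return the decoded response."""
    options["format"] = "json"
    with urlopen(build_url(variant, **options), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Performs a network call; prints whatever structure the service returns.
    print(fetch_json("V600E", ref="BRAF", level="protein", mode="normalize"))
```

Validating parameters before sending the request keeps typos such as an unsupported mode from turning into harder-to-read server-side errors.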