G2P VEP plugin

The G2P VEP plugin identifies likely disease-causing genes based on the knowledge encoded in the G2P database. It runs as part of the Ensembl Variant Effect Predictor (VEP).

Ensembl Variant Effect Predictor

Ensembl VEP predicts the molecular consequence of a variant and reports further optional annotation.

If the input file contains variant data for a set of individuals, VEP generates one line of output per individual for each pair of variant allele and overlapping transcript.

How the plugin works:

The G2P VEP plugin adds further annotation to each line of output based on the individual's genotypes and the knowledge contained in the G2P database. The plugin uses a set of filters to identify potentially causal variants. If the plugin counts a sufficient number of potentially causal variants (variant hits) for a G2P gene, it reports the gene as likely disease-causing, together with all variants that passed the filters. The number of variant hits required is derived from the allelic requirement of the gene, which is stored in the G2P database.

By default the plugin adds the following information to the VEP output: individual information, gene symbol or HGNC ID, global allele frequency data from 1000 Genomes Phase 3 for any colocated variant, SIFT predictions and PolyPhen-2 predictions.

The plugin by default also checks for existing variants that are colocated with the given variants and will exclude those flagged as failed by Ensembl QC checks.

Filtering rules:

A variant is considered potentially causal if it passes all of the following filtering steps:

  1. The variant overlaps a G2P gene
  2. The variant consequence is in the list of severe consequences. The default list contains the following terms: splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant, inframe_insertion, inframe_deletion, missense_variant, coding_sequence_variant, start_lost, transcript_ablation, transcript_amplification, protein_altering_variant
  3. All allele frequencies of co-located variants in reference populations (1000 Genomes project, gnomAD) need to be below a given threshold. The default threshold is 0.005 for an allele in a biallelic gene and 0.0001 for an allele in a monoallelic gene.
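
The three filtering steps above can be sketched in Python. This is an illustrative sketch only, not the plugin's Perl implementation: the function name, its arguments and the shape of the frequency dictionary are hypothetical, and treating a frequency equal to the threshold as passing is an assumption.

```python
# Default severe consequence terms, copied from filtering step 2 above.
SEVERE_CONSEQUENCES = {
    "splice_donor_variant", "splice_acceptor_variant", "stop_gained",
    "frameshift_variant", "stop_lost", "initiator_codon_variant",
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "coding_sequence_variant", "start_lost", "transcript_ablation",
    "transcript_amplification", "protein_altering_variant",
}

# Default allele frequency thresholds from filtering step 3 above.
DEFAULT_AF_THRESHOLDS = {"biallelic": 0.005, "monoallelic": 0.0001}


def is_potentially_causal(overlaps_g2p_gene, consequences, population_afs, gene_class):
    """Return True if a variant passes all three filtering steps.

    overlaps_g2p_gene: True if the variant overlaps a G2P gene
    consequences:      iterable of SO consequence terms for the variant
    population_afs:    dict of population short name -> allele frequency
                       (hypothetical input shape)
    gene_class:        "biallelic" or "monoallelic"
    """
    # Step 1: the variant must overlap a G2P gene.
    if not overlaps_g2p_gene:
        return False
    # Step 2: at least one consequence must be in the severe list.
    if not SEVERE_CONSEQUENCES.intersection(consequences):
        return False
    # Step 3: every reference population frequency must respect the threshold.
    # Assumption: a frequency equal to the threshold still passes, and a
    # variant without any known frequency passes this step.
    threshold = DEFAULT_AF_THRESHOLDS[gene_class]
    return all(af <= threshold for af in population_afs.values())
```

For example, a missense variant with a frequency of 0.001 in one population would pass for a biallelic gene but fail for a monoallelic gene under the default thresholds.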

The sufficient number of variant hits is determined by the gene's allelic requirement.

G2P supports the following allelic requirements: biallelic_autosomal, monoallelic_autosomal, mitochondrial, monoallelic_Y_hem, monoallelic_X_hem, monoallelic_X_het, monoallelic_PAR and biallelic_PAR. To ensure compatibility with our old terminologies, we still support the allelic requirements monoallelic, biallelic, hemizygous, x-linked dominant and x-linked dominance.

Gene classification: biallelic
  G2P allelic requirement:
  • biallelic_autosomal
  • biallelic (supporting old terminologies)
  • biallelic_PAR
  Filtering rules:
  • A count of at least 2 heterozygous variants or 1 homozygous variant, each of which passes all other filtering rules
    af => 0.005, rules => {HET => 2, HOM => 1}

Gene classification: monoallelic
  G2P allelic requirement:
  • monoallelic (supporting old terminologies)
  • monoallelic_autosomal
  • monoallelic_PAR
  • monoallelic_X_hem
  • monoallelic_X_het
  • monoallelic_Y_hem
  • mitochondrial
  • hemizygous (supporting old terminologies)
  • x-linked dominant (supporting old terminologies)
  • x-linked dominance (supporting old terminologies)
  Filtering rules:
  • A count of at least 1 heterozygous variant or 1 homozygous variant, each of which passes all other filtering rules
    af => 0.0001, rules => {HET => 1, HOM => 1}
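
As a rough illustration of how the HET and HOM rules above decide whether a gene is reported, the following Python sketch counts the variant hits of one individual in one gene. The function name and input format are hypothetical; the plugin itself is written in Perl and may organise this logic differently.

```python
from collections import Counter

# Required number of variant hits per gene classification, as listed above.
RULES = {
    "biallelic": {"HET": 2, "HOM": 1},
    "monoallelic": {"HET": 1, "HOM": 1},
}


def gene_is_reported(gene_class, zygosities):
    """Decide whether a gene has enough variant hits to be reported.

    gene_class: "biallelic" or "monoallelic"
    zygosities: list of "HET" or "HOM", one entry per variant in this gene
                that already passed all other filtering rules
                (hypothetical input shape)
    """
    counts = Counter(zygosities)
    # The gene is reported if either the HET or the HOM requirement is met.
    return any(counts[zyg] >= needed for zyg, needed in RULES[gene_class].items())
```

With these rules, a single heterozygous hit is enough for a monoallelic gene, while a biallelic gene needs two heterozygous hits or one homozygous hit.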

    Installing and running the VEP and G2P VEP plugin

    For installation and running the VEP script please refer to the VEP GitHub repository and VEP documentation pages. Plugins are installed and configured during the VEP installation. The G2P VEP plugin is located in the VEP plugins repository.

    To run the G2P VEP plugin, add the --plugin argument to the VEP command, for example:

    ./vep -i input.vcf --plugin G2P,file='DDG2P.csv'

    The file passed to the G2P plugin is the panel file from G2P or PanelApp. The plugin cannot run without this file.


    Options are passed to the plugin as key=value pairs

    Each key is listed with its description, its input or default value, and its effect on the output. Empty fields are omitted.

    file
      Description: Path to the G2P data file. The file needs to be uncompressed.
        - Download from http://www.ebi.ac.uk/gene2phenotype/downloads
        - Download from PanelApp
      Output: The plugin cannot run without this data file. Data from this file is used in the filtering process. The text output and HTML output are also annotated with data from this file.

    af_monoallelic
      Description: Maximum allele frequency for inclusion for monoallelic genes.
      Input or default value: 0.0001
      Example: A different value can be set with
        ./vep -i input.vcf --plugin G2P,file='DDG2P.csv',af_monoallelic=0.00001

    af_biallelic
      Description: Maximum allele frequency for inclusion for biallelic genes.
      Input or default value: 0.005
      Example: A different value can be set with
        ./vep -i input.vcf --plugin G2P,file='DDG2P.csv',af_biallelic=0.05

    confidence_levels
      Description: Confidence levels to include: definitive, strong, limited, moderate, confirmed, probable, possible, both RD and IF. Confidence levels from our old terminology are still supported. Separate multiple values with '&'. See https://www.ebi.ac.uk/gene2phenotype/terminology
      Input or default value: Supported values: definitive, strong, moderate, confirmed, probable, limited. By default the plugin reports: definitive, strong, moderate, confirmed, probable.
      Output: Confidence levels determine which genes are used in the filtering process. The G2P confidence level is reported in the HTML and text output. Some G2P entries carry the flag "Requires clinical review"; this is reported in the HTML and text output to show that the results require careful consideration.

    all_confidence_levels
      Description: Set value to 1 to include all confidence levels: definitive, strong, limited, moderate, confirmed, probable and possible.
      Input or default value: 0

    af_from_vcf
      Description: Set value to 1 to include allele frequencies from VCF files. The location of the VCF file is configured in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json or ensembl-vep/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json, depending on how the ensembl-variation API was installed.
      Input or default value: 0
      Output: This option can be used to filter against population frequency sets (UK10K and TOPMed) which are not in the Ensembl VEP reference data cache but for which VCF files are available. Filtering using additional VCF files takes more time than using the VEP cache only.

    af_from_vcf_keys
      Description: Select VCF collections. Separate multiple values with '&'. Should only be used if option af_from_vcf is used.
      Input or default value: VCF collections presently supported are uk10k (assembly GRCh37 and GRCh38) and topmed (assembly GRCh37 and GRCh38).
      Output: The specified VCF collections are used in the filtering process to determine the maximum allele frequency. For example, if a variant in gnomADg_v3.1.2 has an allele frequency higher than the frequency specified for the G2P gene, it is excluded.

    variant_include_list
      Description: A list of variants to include even if they do not pass allele frequency filtering. The include list needs to be a sorted, bgzipped and tabixed VCF file.

    types
      Description: SO consequence types to include. Separate multiple values with '&'.
      Input or default value: splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant, inframe_insertion, inframe_deletion, missense_variant, coding_sequence_variant, start_lost, transcript_ablation, transcript_amplification, protein_altering_variant

    log_dir
      Description: The log_dir is required to store log files, which are used for writing intermediate results. The log_dir should be empty. The log files can be consulted for any frequency filtering decisions.
      Input or default value: current_working_dir/g2p_log_dir_[year]_[mon]_[mday]_[hour]_[min]_[sec]
      Output: log_dir contains information on genes and variants that did not pass all the filtering rules.

    txt_report
      Description: Write all G2P complete genes and attributes to a txt file.
      Input or default value: current_working_dir/txt_report_[year]_[mon]_[mday]_[hour]_[min]_[sec].txt
      Output: The G2P plugin output that contains a summary report of genes passing VEP-G2P filtering.

    html_report
      Description: Write all G2P complete genes and attributes to an HTML file.
      Input or default value: current_working_dir/html_report_[year]_[mon]_[mday]_[hour]_[min]_[sec].html
      Output: The G2P plugin output that contains a summary report of genes passing VEP-G2P filtering, for visualization in a web browser.

    filter_by_gene_symbol
      Description: The plugin by default filters by HGNC ID using G2P panel files. Set this option to 1 to filter by gene symbol.
      Input or default value: 0
      Output: This is the default option using PanelApp files.

    only_mane
      Description: The plugin by default filters every transcript. Set this option to 1 to filter only MANE transcripts.
      Input or default value: 0
      Output: Information may be lost using this option.
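
Since options are passed as comma-separated key=value pairs, with '&' separating multiple values of one option, the plugin argument can be assembled programmatically. The helper below is a hypothetical convenience function, not part of VEP or the plugin; it only builds the string that follows --plugin.

```python
def build_g2p_plugin_arg(file, **options):
    """Build the value for VEP's --plugin argument for the G2P plugin.

    file:    path to the uncompressed G2P or PanelApp panel file
    options: further plugin options as keyword arguments; list or tuple
             values are joined with '&' as described in the options above
    """
    parts = ["G2P", f"file={file}"]
    for key, value in options.items():
        if isinstance(value, (list, tuple)):
            # Multiple values for one option are separated with '&'.
            value = "&".join(str(v) for v in value)
        parts.append(f"{key}={value}")
    return ",".join(parts)


arg = build_g2p_plugin_arg(
    "DDG2P.csv",
    af_biallelic=0.05,
    confidence_levels=["definitive", "strong"],
)
print(arg)  # G2P,file=DDG2P.csv,af_biallelic=0.05,confidence_levels=definitive&strong
```

Two practical caveats: very small frequencies are safer passed as strings (for example "0.00001"), because Python may render small floats in scientific notation, and the resulting argument should be quoted on a shell command line because '&' is a shell metacharacter.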

    Allele frequencies

    The G2P plugin filters input variants on allele frequencies. The allele frequencies are retrieved from major genotyping projects like the 1000 Genomes project and gnomAD. The VEP provides a cache which contains allele frequencies in order to speed up the variant annotation.

    To additionally filter on allele frequencies from VCF files, for populations that are not in the VEP cache, set the G2P plugin option af_from_vcf to 1:

    ./vep -i input.vcf --plugin G2P,file='DDG2P.csv',af_from_vcf=1

    Available population allele frequency data

    Reference population short name | Description | Source
    minor_allele_freq | global allele frequency (AF) from 1000 Genomes Phase 3 data | VEP cache
    AA | Exome Sequencing Project 6500:African_American | VEP cache
    AFR | 1000GENOMES:phase_3:AFR | VEP cache
    AMR | 1000GENOMES:phase_3:AMR | VEP cache
    EA | Exome Sequencing Project 6500:European_American | VEP cache
    EAS | 1000GENOMES:phase_3:EAS | VEP cache
    EUR | 1000GENOMES:phase_3:EUR | VEP cache
    SAS | 1000GENOMES:phase_3:SAS | VEP cache
    gnomADe | Genome Aggregation Database:Total | VEP cache and VCF file
    gnomADe:afr | Genome Aggregation Database exomes r2.1:African/African American | VEP cache and VCF file
    gnomADe:amr | Genome Aggregation Database exomes r2.1:Latino | VEP cache and VCF file
    gnomADe:asj | Genome Aggregation Database exomes r2.1:Ashkenazi Jewish | VEP cache and VCF file
    gnomADe:eas | Genome Aggregation Database exomes r2.1:East Asian | VEP cache and VCF file
    gnomADe:fin | Genome Aggregation Database exomes r2.1:Finnish | VEP cache and VCF file
    gnomADe:NFE | Genome Aggregation Database exomes r2.1:Non-Finnish European | VEP cache and VCF file
    gnomADe:oth | Genome Aggregation Database exomes r2.1:Other (population not assigned) | VEP cache and VCF file
    gnomADe:SAS | Genome Aggregation Database exomes r2.1:South Asian | VEP cache and VCF file
    gnomADg:ALL | Genome Aggregation Database genomes v3:All gnomAD genomes individuals | VEP cache and VCF file
    gnomADg:afr | Genome Aggregation Database genomes v3:African/African American | VEP cache and VCF file
    gnomADg:ami | Genome Aggregation Database genomes v3:Amish | VEP cache and VCF file
    gnomADg:amr | Genome Aggregation Database genomes v3:Latino/Admixed American | VEP cache and VCF file
    gnomADg:asj | Genome Aggregation Database genomes v3:Ashkenazi Jewish | VEP cache and VCF file
    gnomADg:eas | Genome Aggregation Database genomes v3:East Asian | VEP cache and VCF file
    gnomADg:fin | Genome Aggregation Database genomes v3:Finnish | VEP cache and VCF file
    gnomADg:nfe | Genome Aggregation Database genomes v3:Non-Finnish European | VEP cache and VCF file
    gnomADg:sas | Genome Aggregation Database genomes v3:South Asian | VEP cache and VCF file
    gnomADg:oth | Genome Aggregation Database genomes v3:Other (population not assigned) | VEP cache and VCF file
    TOPMed | Trans-Omics for Precision Medicine (TOPMed) Program | VCF file
    ALSPAC | UK10K:ALSPAC cohort | VCF file
    TWINSUK | UK10K:TWINSUK cohort | VCF file

    Example input and output files

    Speed and Optimization

    PanelApp

    The G2P VEP plugin accepts PanelApp data files as input. We use the following mappings to translate between the terminologies used by G2P and PanelApp.

    G2P confidence -> PanelApp Gene Ratings
      Definitive: Green
      Strong: Amber
      Moderate: Amber
      Limited: Red

    G2P allelic requirement -> PanelApp model of inheritance
      monoallelic_autosomal, monoallelic_PAR:
      • MONOALLELIC, autosomal or pseudoautosomal, not imprinted
      • MONOALLELIC, autosomal or pseudoautosomal, maternally imprinted (paternal allele expressed)
      • MONOALLELIC, autosomal or pseudoautosomal, paternally imprinted (maternal allele expressed)
      • MONOALLELIC, autosomal or pseudoautosomal, imprinted status unknown
      • BOTH monoallelic and biallelic, autosomal or pseudoautosomal
      • BOTH monoallelic and biallelic (but BIALLELIC mutations cause a more SEVERE disease form), autosomal or pseudoautosomal

      biallelic_autosomal, biallelic_PAR:
      • BIALLELIC, autosomal or pseudoautosomal
      • BOTH monoallelic and biallelic, autosomal or pseudoautosomal
      • BOTH monoallelic and biallelic (but BIALLELIC mutations cause a more SEVERE disease form), autosomal or pseudoautosomal

      monoallelic_X_hem:
      • X-LINKED: hemizygous mutation in males, biallelic mutations in females

      monoallelic_X_het:
      • X-LINKED: hemizygous mutation in males, monoallelic mutations in females may cause disease (may be less severe, later onset than males)
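
The mappings above can be written down as lookup tables. The Python sketch below is illustrative only: the variable and function names are hypothetical and the plugin's internal representation may differ. It shows that a PanelApp model of inheritance labelled "BOTH" maps to both the monoallelic and the biallelic G2P requirements.

```python
# G2P confidence -> PanelApp gene rating, as listed above.
G2P_CONFIDENCE_TO_PANELAPP_RATING = {
    "Definitive": "Green",
    "Strong": "Amber",
    "Moderate": "Amber",
    "Limited": "Red",
}

_BOTH = "BOTH monoallelic and biallelic, autosomal or pseudoautosomal"
_BOTH_SEVERE = ("BOTH monoallelic and biallelic (but BIALLELIC mutations cause "
                "a more SEVERE disease form), autosomal or pseudoautosomal")

# G2P allelic requirements -> PanelApp models of inheritance, as listed above.
G2P_TO_PANELAPP_MODELS = {
    ("monoallelic_autosomal", "monoallelic_PAR"): [
        "MONOALLELIC, autosomal or pseudoautosomal, not imprinted",
        "MONOALLELIC, autosomal or pseudoautosomal, maternally imprinted (paternal allele expressed)",
        "MONOALLELIC, autosomal or pseudoautosomal, paternally imprinted (maternal allele expressed)",
        "MONOALLELIC, autosomal or pseudoautosomal, imprinted status unknown",
        _BOTH,
        _BOTH_SEVERE,
    ],
    ("biallelic_autosomal", "biallelic_PAR"): [
        "BIALLELIC, autosomal or pseudoautosomal",
        _BOTH,
        _BOTH_SEVERE,
    ],
    ("monoallelic_X_hem",): [
        "X-LINKED: hemizygous mutation in males, biallelic mutations in females",
    ],
    ("monoallelic_X_het",): [
        "X-LINKED: hemizygous mutation in males, monoallelic mutations in females "
        "may cause disease (may be less severe, later onset than males)",
    ],
}


def allelic_requirements_for(model_of_inheritance):
    """Return the G2P allelic requirements mapped to a PanelApp model of inheritance."""
    requirements = []
    for g2p_terms, models in G2P_TO_PANELAPP_MODELS.items():
        if model_of_inheritance in models:
            requirements.extend(g2p_terms)
    return requirements
```

A model of inheritance that does not appear in the mapping yields an empty list in this sketch; how the plugin treats such entries is not stated here.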