gene2phenotype

G2P VEP plugin

The G2P VEP plugin identifies likely disease-causing genes based on the knowledge encoded in the G2P database. It runs as part of the Ensembl Variant Effect Predictor (VEP).

Ensembl Variant Effect Predictor

Ensembl VEP predicts the molecular consequence of a variant and reports further optional annotation.

If the input file contains variant data for a set of individuals, VEP generates one line of output per individual for each pair of variant allele and overlapping transcript.

How the plugin works:

The G2P VEP plugin adds further annotation to each line of output based on the individual's genotypes and the knowledge contained in the G2P database. The plugin uses a set of filters to identify potentially causal variants. If it counts a sufficient number of potentially causal variants (variant hits) for a G2P gene, it reports the gene as likely disease-causing, together with all variants that passed the filters. The number of variant hits that counts as sufficient is derived from the allelic requirement of the gene, which is stored in the G2P database.

By default the plugin adds the following information to the VEP output: individual information, gene symbol or HGNC ID, global allele frequency from 1000 Genomes Phase 3 data for any co-located variant, SIFT predictions and PolyPhen-2 predictions.

The plugin by default also checks for existing variants that are colocated with the given variants and will exclude those flagged as failed by Ensembl QC checks.

Filtering rules:

A variant is considered potentially causal if it passes all of the following filtering steps:

The variant overlaps a G2P gene
The variant consequence is in the list of severe consequences. The default list contains the following terms: splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant, inframe_insertion, inframe_deletion, missense_variant, coding_sequence_variant, start_lost, transcript_ablation, transcript_amplification, protein_altering_variant.
All allele frequencies from co-located variants in reference populations (1000 Genomes project, gnomAD) are below a given threshold. The default maximum frequency for an allele in a biallelic gene is 0.005, and for an allele in a monoallelic gene it is 0.0001.
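
The three filtering steps above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not the plugin's own code (the plugin is part of VEP). The function name, its arguments and the handling of variants without any frequency data are assumptions made for the example.

```python
# Default severe consequence terms, as listed in the filtering rules.
SEVERE_CONSEQUENCES = {
    "splice_donor_variant", "splice_acceptor_variant", "stop_gained",
    "frameshift_variant", "stop_lost", "initiator_codon_variant",
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "coding_sequence_variant", "start_lost", "transcript_ablation",
    "transcript_amplification", "protein_altering_variant",
}

# Default maximum allele frequency per gene classification.
DEFAULT_AF_THRESHOLDS = {"biallelic": 0.005, "monoallelic": 0.0001}


def is_potentially_causal(overlaps_g2p_gene, consequences, population_afs, gene_class):
    """Return True only if the variant passes every filtering step.

    overlaps_g2p_gene: bool, the variant overlaps a G2P gene
    consequences:      iterable of SO consequence terms for the variant
    population_afs:    dict of reference population short name -> allele frequency
    gene_class:        "biallelic" or "monoallelic"
    """
    if not overlaps_g2p_gene:
        return False
    if not SEVERE_CONSEQUENCES.intersection(consequences):
        return False
    threshold = DEFAULT_AF_THRESHOLDS[gene_class]
    # Assumption: "below" is read as strictly less than, and a variant with
    # no known frequencies passes this step.
    return all(af < threshold for af in population_afs.values())
```

For example, a missense variant with a frequency of 0.001 in one population would pass for a biallelic gene but not for a monoallelic gene under the default thresholds.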

The sufficient number of variant hits is determined by the gene's allelic requirement.

G2P supports the following allelic requirements: biallelic_autosomal, monoallelic_autosomal, mitochondrial, monoallelic_Y_hem, monoallelic_X_hem, monoallelic_X_het, monoallelic_PAR and biallelic_PAR. To ensure compatibility with our old terminologies, we still support the allelic requirements monoallelic, biallelic, hemizygous, x-linked dominant and x-linked dominance.

Gene classification	G2P allelic requirement	Filtering rules
biallelic	biallelic_autosomal, biallelic_PAR, biallelic (old terminology)	At least 2 heterozygous variants or 1 homozygous variant that pass all other filtering rules. af => 0.005, rules => {HET => 2, HOM => 1}
monoallelic	monoallelic_autosomal, monoallelic_PAR, monoallelic_X_hem, monoallelic_X_het, monoallelic_Y_hem, mitochondrial, monoallelic (old terminology), hemizygous (old terminology), x-linked dominant (old terminology), x-linked dominance (old terminology)	1 heterozygous variant or 1 homozygous variant that passes all other filtering rules. af => 0.0001, rules => {HET => 1, HOM => 1}
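
As a rough illustration of the table above, the following Python sketch maps an allelic requirement to its gene classification and checks whether the variant hits for one gene in one individual are sufficient. The names classify and gene_is_reported are hypothetical. The sketch assumes that the listed counts are minimums and that the input contains only variants that already passed all other filtering rules.

```python
BIALLELIC = {"biallelic_autosomal", "biallelic_PAR", "biallelic"}
MONOALLELIC = {
    "monoallelic_autosomal", "monoallelic_PAR", "monoallelic_X_hem",
    "monoallelic_X_het", "monoallelic_Y_hem", "mitochondrial",
    # old terminologies that are still supported
    "monoallelic", "hemizygous", "x-linked dominant", "x-linked dominance",
}

# Default frequency threshold and required hit counts per gene classification.
RULES = {
    "biallelic": {"af": 0.005, "HET": 2, "HOM": 1},
    "monoallelic": {"af": 0.0001, "HET": 1, "HOM": 1},
}


def classify(allelic_requirement):
    """Map a G2P allelic requirement to 'biallelic' or 'monoallelic'."""
    if allelic_requirement in BIALLELIC:
        return "biallelic"
    if allelic_requirement in MONOALLELIC:
        return "monoallelic"
    raise ValueError(f"unsupported allelic requirement: {allelic_requirement}")


def gene_is_reported(allelic_requirement, zygosities):
    """Decide whether a gene has enough variant hits in one individual.

    zygosities: list of "HET" or "HOM", one entry per variant in the gene
    that passed all other filtering rules for this individual.
    """
    rules = RULES[classify(allelic_requirement)]
    het = zygosities.count("HET")
    hom = zygosities.count("HOM")
    # Assumption: the counts in the rules are treated as minimums.
    return het >= rules["HET"] or hom >= rules["HOM"]
```

With these rules a single heterozygous hit is enough for a monoallelic gene, while a biallelic gene needs two heterozygous hits or one homozygous hit.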

Installing and running the VEP and G2P VEP plugin

For installation and running the VEP script please refer to the VEP GitHub repository and VEP documentation pages. Plugins are installed and configured during the VEP installation. The G2P VEP plugin is located in the VEP plugins repository.

To run the G2P VEP plugin, add the following argument to the VEP command:

--plugin G2P,file='DDG2P.csv'

The file passed to the plugin is a panel file from G2P or PanelApp (DDG2P.csv in this example). The plugin cannot be run without this file.

[Figure: VEP G2P plugin overview]

Options are passed to the plugin as comma-separated key=value pairs:

Key	Description	Input or Default value	Output
file	Path to the G2P data file. The file needs to be uncompressed. Download it from http://www.ebi.ac.uk/gene2phenotype/downloads or from PanelApp.	Required. The plugin cannot run without this data file.	Data from this file is used in the filtering process. The text and HTML outputs are also annotated with data from this file.
af_monoallelic	maximum allele frequency for inclusion for monoallelic genes	0.0001	A different value can be used by ./vep -i input.vcf --plugin G2P,file='DDG2P.csv',af_monoallelic=0.00001
af_biallelic	maximum allele frequency for inclusion for biallelic genes	0.005	A different value can be used by ./vep -i input.vcf --plugin G2P,file='DDG2P.csv',af_biallelic=0.05
confidence_levels	Confidence levels to include: definitive, strong, limited, moderate, confirmed, probable, possible, both RD and IF. Confidence levels from our old terminology are still supported. Separate multiple values with '&'. See https://www.ebi.ac.uk/gene2phenotype/terminology	Supported values: definitive, strong, moderate, confirmed, probable, limited. By default the plugin reports: definitive, strong, moderate, confirmed, probable	Confidence levels determine which genes are used in the filtering process. The G2P confidence level is reported in the HTML and text output. Some G2P entries have the flag "Requires clinical review"; this is reported in the HTML and text output to show that the results require careful consideration.
all_confidence_levels	Set value to 1 to include all confidence levels: definitive, strong, limited, moderate, confirmed, probable and possible	0
af_from_vcf	set value to 1 to include allele frequencies from VCF files. The location of the VCF file is configured in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json or ensembl-vep/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json depending on how the ensembl-variation API was installed	0	This option can be used to filter against population frequency sets (UK10K and TOPMed) which are not in the Ensembl VEP reference data cache but for which VCF files are available. Filtering using additional VCF files takes more time than using the VEP cache only.
af_from_vcf_keys	Select VCF collections. Separate multiple values with '&'. Should only be used if the option af_from_vcf is set.	VCF collections presently supported are uk10k (assembly GRCh37 and GRCh38) and topmed (assembly GRCh37 and GRCh38).	The specified VCF collections are used in the filtering process to determine the maximum allele frequency. For example, if a variant in gnomADg_v3.1.2 has an allele frequency higher than the frequency specified for the G2P gene, it is excluded.
variant_include_list	A list of variants to include even if variants do not pass allele frequency filtering. The include list needs to be a sorted, bgzipped and tabixed VCF file.
types	SO consequence types to include. Separate multiple values with '&'.	splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant, inframe_insertion, inframe_deletion, missense_variant, coding_sequence_variant, start_lost, transcript_ablation, transcript_amplification, protein_altering_variant
log_dir	The log_dir is required to store log files, which are used for writing intermediate results. The log_dir should be empty. The log files can be consulted for any frequency filtering decisions.	current_working_dir/g2p_log_dir_[year]_[mon]_[mday]_[hour]_[min]_[sec]	log_dir contains information on genes and variants that did not pass all the filtering rules.
txt_report	Write all G2P complete genes and attributes to txt file	current_working_dir/txt_report_[year]_[mon]_[mday]_[hour]_[min]_[sec].txt	The G2P plugin output that contains a summary report of genes passing VEP-G2P filtering
html_report	Write all G2P complete genes and attributes to html file	current_working_dir/html_report_[year]_[mon]_[mday]_[hour]_[min]_[sec].html	The G2P plugin output that contains a summary report of genes passing VEP-G2P filtering for visualization in a web browser.
filter_by_gene_symbol	By default the plugin filters by HGNC ID when using G2P panel files. Set this option to 1 to filter by gene symbol.	0	Filtering by gene symbol is the default when using PanelApp files.
only_mane	The plugin by default filters every transcript. This option is set to 1 to ensure filtering of only MANE transcripts	0	Information may be lost using this option.
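
When VEP is launched from a script, the key=value options in the table can be assembled programmatically. The sketch below is one possible way to do this in Python. The helper names build_g2p_plugin_arg and build_vep_command are hypothetical and not part of VEP; the only things taken from this page are the G2P,key=value,... format, the '&' separator for multiple values and the ./vep -i input.vcf --plugin invocation.

```python
def build_g2p_plugin_arg(file, **options):
    """Build the value for --plugin from the panel file and further options.

    Lists and tuples are joined with '&', the separator for multiple values.
    Pass small frequencies as strings (for example "0.00001") so that Python
    does not render them in scientific notation.
    """
    if not file:
        raise ValueError("the G2P plugin cannot run without a panel file")
    parts = ["G2P", f"file={file}"]
    for key, value in options.items():
        if isinstance(value, (list, tuple)):
            value = "&".join(str(v) for v in value)
        parts.append(f"{key}={value}")
    return ",".join(parts)


def build_vep_command(input_vcf, plugin_arg):
    """Return the command as an argument list, which avoids shell quoting."""
    return ["./vep", "-i", input_vcf, "--plugin", plugin_arg]


# Example: lower the monoallelic threshold and restrict the confidence levels.
arg = build_g2p_plugin_arg(
    "DDG2P.csv",
    af_monoallelic="0.00001",
    confidence_levels=["definitive", "strong"],
)
command = build_vep_command("input.vcf", arg)
```

The resulting list can be handed to subprocess.run(command). Passing a list rather than a single shell string means the '&' in multi-value options does not need to be quoted.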

Allele frequencies

The G2P plugin filters input variants on allele frequencies. The allele frequencies are retrieved from major genotyping projects like the 1000 Genomes project and gnomAD. The VEP provides a cache which contains allele frequencies in order to speed up the variant annotation.

To also filter on allele frequencies from VCF files (for example UK10K and TOPMed, which are not in the VEP cache), set the G2P plugin option af_from_vcf to 1:

./vep -i input.vcf --plugin G2P,file='DDG2P.csv',af_from_vcf=1

Available population allele frequency data

reference population short name	description	source
minor_allele_freq	global allele frequency (AF) from 1000 Genomes Phase 3 data	VEP cache
AA	Exome Sequencing Project 6500:African_American	VEP cache
AFR	1000GENOMES:phase_3:AFR	VEP cache
AMR	1000GENOMES:phase_3:AMR	VEP cache
EA	Exome Sequencing Project 6500:European_American	VEP cache
EAS	1000GENOMES:phase_3:EAS	VEP cache
EUR	1000GENOMES:phase_3:EUR	VEP cache
SAS	1000GENOMES:phase_3:SAS	VEP cache
gnomADe	Genome Aggregation Database:Total	VEP cache and VCF file
gnomADe:afr	Genome Aggregation Database exomes r2.1:African/African American	VEP cache and VCF file
gnomADe:amr	Genome Aggregation Database exomes r2.1:Latino	VEP cache and VCF file
gnomADe:asj	Genome Aggregation Database exomes r2.1:Ashkenazi Jewish	VEP cache and VCF file
gnomADe:eas	Genome Aggregation Database exomes r2.1:East Asian	VEP cache and VCF file
gnomADe:fin	Genome Aggregation Database exomes r2.1:Finnish	VEP cache and VCF file
gnomADe:NFE	Genome Aggregation Database exomes r2.1:Non-Finnish European	VEP cache and VCF file
gnomADe:oth	Genome Aggregation Database exomes r2.1:Other (population not assigned)	VEP cache and VCF file
gnomADe:SAS	Genome Aggregation Database exomes r2.1:South Asian	VEP cache and VCF file
gnomADg:ALL	Genome Aggregation Database genomes v3:All gnomAD genomes individuals	VEP Cache and VCF file
gnomADg:afr	Genome Aggregation Database genomes v3:African/African American	VEP Cache and VCF file
gnomADg:ami	Genome Aggregation Database genomes v3:Amish	VEP Cache and VCF file
gnomADg:amr	Genome Aggregation Database genomes v3:Latino/Admixed American	VEP Cache and VCF file
gnomADg:asj	Genome Aggregation Database genomes v3:Ashkenazi Jewish	VEP Cache and VCF file
gnomADg:eas	Genome Aggregation Database genomes v3:East Asian	VEP Cache and VCF file
gnomADg:fin	Genome Aggregation Database genomes v3:Finnish	VEP Cache and VCF file
gnomADg:nfe	Genome Aggregation Database genomes v3:Non-Finnish European	VEP Cache and VCF file
gnomADg:sas	Genome Aggregation Database genomes v3:South Asian	VEP Cache and VCF file
gnomADg:oth	Genome Aggregation Database genomes v3:Other (population not assigned)	VEP Cache and VCF file
TOPMed	Trans-Omics for Precision Medicine (TOPMed) Program	VCF file
ALSPAC	UK10K:ALSPAC cohort	VCF file
TWINSUK	UK10K:TWINSUK cohort	VCF file

Example input and output files

Speed and Optimization

VEP can look up existing annotations from locally installed cache files in order to increase the speed of computation. The VEP installation process will guide you through the cache file selection and installation process.
The VEP documentation describes more ways to make sure that your VEP installation is running as fast as possible.

PanelApp

The G2P VEP plugin accepts PanelApp data files as input. We use the following mappings to translate between the terminologies used by G2P and PanelApp.

G2P	PanelApp
G2P confidence	Gene Ratings
Definitive	Green
Strong	Amber
Moderate	Amber
Limited	Red
Allelic requirement	Model of inheritance from PanelApp
monoallelic_autosomal monoallelic_PAR	MONOALLELIC, autosomal or pseudoautosomal, not imprinted
	MONOALLELIC, autosomal or pseudoautosomal, maternally imprinted (paternal allele expressed)
	MONOALLELIC, autosomal or pseudoautosomal, paternally imprinted (maternal allele expressed)
	MONOALLELIC, autosomal or pseudoautosomal, imprinted status unknown
	BOTH monoallelic and biallelic, autosomal or pseudoautosomal
	BOTH monoallelic and biallelic (but BIALLELIC mutations cause a more SEVERE disease form), autosomal or pseudoautosomal
biallelic_autosomal biallelic_PAR	BIALLELIC, autosomal or pseudoautosomal
	BOTH monoallelic and biallelic, autosomal or pseudoautosomal
	BOTH monoallelic and biallelic (but BIALLELIC mutations cause a more SEVERE disease form), autosomal or pseudoautosomal
monoallelic_X_hem	X-LINKED: hemizygous mutation in males, biallelic mutations in females
monoallelic_X_het	X-LINKED: hemizygous mutation in males, monoallelic mutations in females may cause disease (may be less severe, later onset than males)
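
The mappings in the table above can be written down as plain lookup tables. The Python sketch below is an illustration only: the dictionary and function names are hypothetical, and the strings are copied from the table rather than taken from the plugin's source.

```python
# G2P confidence -> PanelApp gene rating.
G2P_CONFIDENCE_TO_PANELAPP_RATING = {
    "Definitive": "Green",
    "Strong": "Amber",
    "Moderate": "Amber",
    "Limited": "Red",
}

MONO = ["monoallelic_autosomal", "monoallelic_PAR"]
BI = ["biallelic_autosomal", "biallelic_PAR"]
BOTH = "BOTH monoallelic and biallelic, autosomal or pseudoautosomal"
BOTH_SEVERE = (
    "BOTH monoallelic and biallelic (but BIALLELIC mutations cause a more "
    "SEVERE disease form), autosomal or pseudoautosomal"
)

# PanelApp model of inheritance -> G2P allelic requirements.
PANELAPP_MODEL_TO_G2P = {
    "MONOALLELIC, autosomal or pseudoautosomal, not imprinted": MONO,
    "MONOALLELIC, autosomal or pseudoautosomal, maternally imprinted (paternal allele expressed)": MONO,
    "MONOALLELIC, autosomal or pseudoautosomal, paternally imprinted (maternal allele expressed)": MONO,
    "MONOALLELIC, autosomal or pseudoautosomal, imprinted status unknown": MONO,
    # The BOTH models appear under the monoallelic and the biallelic rows.
    BOTH: MONO + BI,
    BOTH_SEVERE: MONO + BI,
    "BIALLELIC, autosomal or pseudoautosomal": BI,
    "X-LINKED: hemizygous mutation in males, biallelic mutations in females": ["monoallelic_X_hem"],
    "X-LINKED: hemizygous mutation in males, monoallelic mutations in females may cause disease (may be less severe, later onset than males)": ["monoallelic_X_het"],
}


def panelapp_ratings_for(confidences):
    """Return the set of PanelApp ratings matching the given G2P confidences."""
    return {G2P_CONFIDENCE_TO_PANELAPP_RATING[c] for c in confidences}


def g2p_requirements_for(model_of_inheritance):
    """Return the G2P allelic requirements for a PanelApp model (empty if unmapped)."""
    return list(PANELAPP_MODEL_TO_G2P.get(model_of_inheritance, []))
```

Note that both Strong and Moderate map to Amber, so the confidence mapping cannot be inverted without losing information.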