Workflows
All data analysis for the eQTL Catalogue is performed using reproducible and containerised Nextflow workflows.
The four primary workflows are:
- RNA-seq quantification: eQTL-Catalogue/rnaseq
- Gene expression QC and normalisation: eQTL-Catalogue/qcnorm
- Genotype QC and imputation: eQTL-Catalogue/genimpute
- Association testing and fine mapping: eQTL-Catalogue/qtlmap
More details about running these workflows can be found in this tutorial.
Methods
A detailed description of the methods can be found in our flagship paper. Below is a short overview of the main analysis steps.
Gene expression and splicing quantification
The RNA sequencing quantification workflow is available from the eQTL-Catalogue/rnaseq GitHub repository. The workflow implements the following five quantification methods:
- gene expression: RNA-seq reads were aligned to the GRCh38 reference genome using HISAT2 and reads overlapping GENCODE v39 transcript annotations were counted using featureCounts.
- exon expression: DEXSeq was used to convert GENCODE v39 transcript annotations to non-overlapping exon annotations. Reads overlapping exons were counted using featureCounts.
- transcript usage: Salmon was used to estimate the expression levels of all annotated transcripts in GENCODE v39.
- txrevise event usage: txrevise was used to convert Ensembl 105 transcript annotations to independent promoter, splice junction and 3ʹ end usage events. Salmon was used to estimate the expression levels of those events.
- LeafCutter: LeafCutter was used to quantify splice junction usage.
Normalisation
Briefly, we use the following normalisation strategies:
- gene counts: Conditional quantile normalisation with cqn using gene length and GC content as covariates followed by inverse normal transformation.
- exon counts: Conditional quantile normalisation with cqn using exon length and GC content as covariates followed by inverse normal transformation.
- transcript usage: Transcript usage is calculated by dividing the transcript expression estimates (TPM units) by the total expression of all transcripts of the same gene. Transcript usage values (0…1 scale) are further standardised using inverse normal transformation.
- txrevise event usage: Promoter, splice junction and 3ʹ end event usage is calculated by dividing the event expression estimates (TPM units) by the total expression of all events of the same class (promoters, splicing events, 3ʹ end events) within the same gene. Txrevise event usage values (0…1 scale) are further standardised using inverse normal transformation.
- LeafCutter: LeafCutter junction usage values are normalised in the same way as txrevise event usage and transcript usage estimates.
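The usage-based strategies above can be illustrated with a minimal Python sketch. It is not the code used by the eQTL-Catalogue/qcnorm workflow: the function names, the handling of unexpressed genes, the averaging of tied ranks and the `(rank - 0.5) / n` quantile convention are all assumptions made for this example.

```python
from statistics import NormalDist


def usage_ratios(tpm_by_transcript):
    """Divide each transcript's TPM by the total TPM of all transcripts of the same gene."""
    total = sum(tpm_by_transcript.values())
    if total == 0:
        # Assumption: usage is treated as undefined when the gene is not expressed.
        return {tx: float("nan") for tx in tpm_by_transcript}
    return {tx: tpm / total for tx, tpm in tpm_by_transcript.items()}


def inverse_normal_transform(values, offset=0.5):
    """Rank-based inverse normal transformation of one feature across samples.

    Assumption: ranks are mapped to quantiles as (rank - offset) / n and tied
    values receive their average rank; other conventions exist.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the group of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]


# Usage of two hypothetical transcripts of one gene in a single sample.
usage = usage_ratios({"tx1": 30.0, "tx2": 10.0})

# Usage of one transcript across three hypothetical samples, standardised.
z_scores = inverse_normal_transform([0.2, 0.5, 0.3])
```

The same ratio applies to txrevise events if the dictionary holds only events of one class (promoters, splicing events or 3ʹ end events) within one gene.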
Genotype imputation and quality control
For most datasets, we performed basic genotype QC and imputed the datasets to the 1000 Genomes 30x on GRCh38 reference panel with the eQTL-Catalogue/genimpute workflow.
Association testing
The association testing pipeline is available from the eQTL-Catalogue/qtlmap GitHub repository. The main analysis steps are:
- Perform principal component analysis (PCA) of the genotype data with PLINK 1.9.
- Perform PCA of the molecular phenotype data (prcomp function in R).
- Perform association testing with QTLtools.
- Perform statistical fine mapping with susieR.
In association testing, we use the following parameters:
- Use the first six principal components from the genotype data and the first six principal components from the molecular trait data as covariates.
- Test all variants that are within +/- 1 Mb of the start of the gene (as defined in Ensembl).
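As an illustration of the cis window described above, the following Python sketch selects the variants that would be tested for one gene. It is not taken from the qtlmap workflow: the function name, the `(variant_id, chromosome, position)` tuple layout and the inclusive window boundaries are assumptions made for this example.

```python
CIS_WINDOW_BP = 1_000_000  # +/- 1 Mb around the gene start


def cis_variants(gene_chrom, gene_start, variants, window=CIS_WINDOW_BP):
    """Return IDs of variants on the gene's chromosome within `window` bp of its start.

    `variants` is an iterable of (variant_id, chromosome, position) tuples;
    this layout is hypothetical and chosen only for the example.
    """
    lower = max(0, gene_start - window)
    upper = gene_start + window
    # Assumption: both window boundaries are inclusive.
    return [
        variant_id
        for variant_id, chrom, pos in variants
        if chrom == gene_chrom and lower <= pos <= upper
    ]


# Hypothetical variants around a gene starting at chr1:1,500,000.
example_variants = [
    ("v1", "1", 500_000),
    ("v2", "1", 2_600_000),
    ("v3", "2", 1_500_000),
    ("v4", "1", 2_500_000),
]
selected = cis_variants("1", 1_500_000, example_variants)
```

Here `v1` and `v4` fall inside the window, `v2` lies just outside it and `v3` is on a different chromosome.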