We have carried out several pilots to investigate extending Catalog scope and alternative methods of data acquisition with a view to incorporating these into future Catalog improvements. Each of these pilots has been designed to allow us to estimate the feasibility of the approach, resources needed to implement and support this, and obtain feedback from users.
We are investigating expanding the scope of the Catalog to include sequencing-based association studies. Currently the Catalog only includes studies which use array-based genotyping technologies.
Sequencing based-association studies differ from array-based studies in a number of ways. Sequencing allows the genotyping of many more, particularly rare, variants. The statistical power of single variant association tests for rare variants is low unless sample sizes or effect sizes are very large, and the problem of multiple testing is significant. To address these issues, different statistical methods have been developed, particularly to evaluate aggregate association with multiple variants in a region (e.g. gene). The input to these analyses are usually a specific sub-set of variants (e.g. predicted loss of function).
We initially investigated the inclusion of sequencing-based association studies in 2017 (see below). We have now performed another review (May 2019) of the literature (see below).
See the list of SEQUENCING-BASED ASSOCIATION STUDIES we have identified as potentially eligible for inclusion in the GWAS Catalog. This list includes curated meta-data (trait, single/multi-variant analyses, whole-genome/whole-exome coverage, variant number, curator comments). In order to maintain data consistency we are defining eligibility in line with the current Catalog. That is population-based association study assaying >100,000 SNPs distributed genome- or exome-wide.
The GWAS Catalog can now host summary statistics for sequencing-based association studies. If you are an author and have summary statistics to share please contact us (gwas-info@ebi.ac.uk).
Should we include single-variant analyses?
Should we include multi-variant (e.g. gene-based) analyses?
If so, how much detail would you need on the rules used to create the sub-set of input variants? (e.g. missense, predicated LOF)
How much detail do you need on the statistical methods?
We want to provide you useful, structured meta-data that would allow you to make a basic interpretation of the associations, but remember, we’re limited to what authors provide! Also bear in mind, the Catalog is manually curated and we need to balance resources with the usefulness of the data to the user community.
Please share your thoughts on this form or email gwas-info@ebi.ac.uk
Please suggest sequencing-based publications you think should be included in the Catalog here. These will be reviewed and assessed by a Catalog curator.
Identification of relevant publications
Publications were identified via a number of sources (standard Catalog machine learning search, cohort websites, grant reporting mechanisms, various PubMed and EuropePMC searches, references).
Identification of studies is challenging as, unlike array-based GWAS, there is lack of community consensus on language describing these studies, and sequencing data are used in a wide variety of non-relevant analyses.
Determining study eligibility
In order to maintain data consistency we initially defined eligibility in line with the current Catalog. That is population-based association study assaying >100,000 SNPs distributed genome- or exome-wide.
We observed that due to the strict filtering of variants that is common in sequencing-based association studies the number of variants analysed often falls around or below our cut-off of 100,000 variants. Similarly analyses are often targeted to specific genomic regions. While most of these analyses are clearly outside the remit of the Catalog there are larger-scale studies that may be of relevance. Therefore eligibility criteria could be re-assessed in the future.
There are still low numbers of sequencing association studies.
We have identified approximately 75 eligible publications (2012 to mid 2019, May 2019 review).
Eligible publications, as well as a selection of ineligible publications, are listed here with GWAS Catalog curator assessments. The list is a living document and we encourage users to suggest additional publications by adding to it.
Observations on eligible studies:
The majority of publications perform both multi-variant and single variant analyses; 65% both multi- and single variant, 21% multi-variant only, 15% single variant only (2017-2019).
There is increased community consensus on statistical methods, SKAT-O is dominant for multi-variant analyses.
A pilot study was performed to assess the feasibility of including sequencing-based association studies in the GWAS Catalog. Literature was reviewed and a sample set of publications were curated. Users were surveyed regarding which aspects of these studies are important for interpretation. The outcome of the pilot was that it is not immediately possible to expand the scope of the Catalog to include sequencing-based studies, primarily because of the wide range of additional meta-data required and the varied types of statistical analyses performed. There was a lack of community consensus on which data elements were required for interpretation, particularly regarding statistical analysis.
Ten studies which include sequencing-based association analyses were curated as part of the pilot. Study level and association data are provided in the standard GWAS Catalog download format with additional columns for sequencing specific data.
These data are available on our FTP server. Study level and association data are provided in the standard GWAS Catalog download format, with additional fields specific to sequencing studies provided as additional columns (see the README file for further detail).
Additional study level data fields are: Genotyping technology (discovery/initial), Sequencing platform, Array platform, Sequencing coverage, Number of variants analysed, Genome build, Analysis level, For multiple variants - number of variants per test, Statistical tests used to generate reported p-values, Statistical test used for extracted p-values, Why this test was chosen for extraction, Types of variant included in analysis, P-value threshold stated by author. Single variant/multi-variant analysis is indicated after the author name. The absence of either indicates that entry is currently eligible and published in the standard GWAS Catalog.
Additional association level data is provided in the P-VALUE (TEXT) field where 'TEST'=statistical test and 'NUM_VAR'=number of variants analysed.
Feedback from our users has indicated a high demand for targeted array studies to be included in the GWAS Catalog.
In 2016 we carried out a pilot on the inclusion of targeted array based GWAS in the Catalog involving:
Designing and carrying out a literature search to identify all studies that carry out genotyping and association analysis of at least 100,000 SNPs.
This identified approximately 150 targeted array studies that have been published up to June 2016 (including 62 Immunochip, 40 Metabochip and 36 exome arrays)
Curation of a representative set of these studies indicated that inclusion would require minimal changes to Catalog infrastructure and has enabled us to estimate the additional curation resources required to include these in the future.
Data from the studies curated as part of the targeted array can be found on our FTP server.
Following on from this, we are currently piloting the inclusion of targeted array studies in the Catalog. Prioritisation of targeted and exome array studies for inclusion in the Catalog is now performed by 1) relevance of the trait analysed 2) user request, with Open Targets being the main user in this phase of the pilot. In September 2017, Open Targets (www.opentargets.org) requested curation of fifty-five currently out-of-scope GWAS publications for inclusion in the GWAS Catalog. Moreover, the GWAS Catalog team have preliminarily identified over 150 publications based on targeted or exome array analysis from 2012 to 2017. These will also be curated as part of the inclusion pilot.
To support scaling of curation alternative methods of data acquisition have been explored. The pilot involved:
Designing a prototype deposition system was using online deposition forms (created in Cognito Forms, see figures below) with templates for sample and association results (created in Google Sheets).
Emailing 115 authors of 79 selected GWAS publications (53 whole-genome array and 26 targeted array) and inviting them to submit data using the test submission system. The deposition pilot was also advertised on Twitter, which was re-tweeted to over 20,000 followers.
We received an uptake of 10% from direct emails, with no uptake from Twitter.
Feedback from submitters, along with review of the submitted data, indicated that the format of deposition is easy to understand and allows authors to submit all relevant data with a high level of accuracy and rapidly.
Even a small rate of deposition represents a gain for the Catalog as it removes a lengthy paper reading and literature extraction step and the quality of deposited data is high. It should also be noted that retrospective deposition is not our preferred model and reduces take up.
Author deposition prototype, main page
Author deposition prototype, sample descriptions page
Author deposition prototype, associations upload page
Data submitted as part of the author deposition pilot can be found on our FTP server.