Gene-based GWAS data requirements

This page describes the columns that should be present in your gene-based GWAS summary statistics file for submission to the GWAS Catalog.

We want the community's feedback!

If you have any feedback about the data requirements listed below, please contact gwas-info@ebi.ac.uk

Example table

`chromosome`	`base_pair_start`	`base_pair_end`	`hgnc_symbol`	`neg_log10_p_value`	`beta`	`standard_error`	`n_snps`
8	94925972	94949378	TP53INP1	9.45	0.048	0.008	42
13	32315086	32400268	BRCA2	13.661	-0.035	0.003	115

Minimum data requirements

A gene name (hgnc_symbol or ensembl_gene_id), a p-value, and position information (chromosome, base_pair_start, base_pair_end) are required fields for gene-based GWAS.

Field (column) structure

Column names must appear exactly as shown below. Any differences or typos in your column names will cause validation errors.

Your analysis must report genome-wide results

GWAS Catalog submissions are expected to be full genome-wide datasets, not just top hits.
Gene-based GWAS analyses should contain at least 10,000 pre-QC genes.
It's OK if quality control steps have reduced the number of rows in your final dataset below this minimum, but please ensure you are submitting the full set of genes that were analysed in your study, including results that did not reach genome-wide significance.

Gene name

Required — select one

At least one gene identifier must be provided.

Column	Description
`hgnc_symbol`	HGNC symbol (e.g. TP53)
`ensembl_gene_id`	Ensembl gene ID (e.g. ENSG00000141510)

p value representation

Required — select one

How are p values stored in your file? Please only include one field.

Column	Description
`p_value`	p value (e.g. 0.00034). Smaller p values are more significant.
`neg_log10_p_value`	Negative log₁₀ p value (e.g. 3.47). Larger values indicate greater significance.
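
If your analysis software outputs one representation and you would rather submit the other, the conversion is a single logarithm. The following is a minimal sketch; the function names are illustrative and are not part of any GWAS Catalog tool.

```python
import math


def to_neg_log10(p_value):
    """Convert a p value in (0, 1] to its negative log10 representation."""
    if not 0 < p_value <= 1:
        raise ValueError("p value must be greater than 0 and at most 1")
    return -math.log10(p_value)


def from_neg_log10(neg_log10_p):
    """Convert a negative log10 p value back to a plain p value."""
    return 10 ** (-neg_log10_p)


print(round(to_neg_log10(0.00034), 2))  # prints 3.47
```

The negative log₁₀ form can be the safer choice for extremely small p values, which may underflow to zero when stored as ordinary floating point numbers.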

Position

Required — all fields required

How is the position of the gene represented?

Column	Description
`chromosome`	Chromosome where the gene is located
`base_pair_start`	Start position of the gene (0/1 based)
`base_pair_end`	End position of the gene (0/1 based)
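
The required fields described so far (gene name, p value and position) can be checked against a file header before submission. The sketch below is illustrative only: `check_required_columns` is a hypothetical helper, not the official validator, and it assumes you have already read the header row into a list of column names.

```python
GENE_ID_COLUMNS = {"hgnc_symbol", "ensembl_gene_id"}
P_VALUE_COLUMNS = {"p_value", "neg_log10_p_value"}
POSITION_COLUMNS = {"chromosome", "base_pair_start", "base_pair_end"}


def check_required_columns(header):
    """Return a list of problems found in a list of column names."""
    columns = set(header)
    problems = []

    # At least one gene identifier must be present.
    if not columns & GENE_ID_COLUMNS:
        problems.append("missing gene identifier: hgnc_symbol or ensembl_gene_id")

    # Exactly one p value representation should be present.
    if len(columns & P_VALUE_COLUMNS) != 1:
        problems.append("exactly one of p_value or neg_log10_p_value is required")

    # All three position columns are required.
    missing = POSITION_COLUMNS - columns
    if missing:
        problems.append("missing position columns: " + ", ".join(sorted(missing)))

    return problems


# Header of the example table at the top of this page.
header = [
    "chromosome", "base_pair_start", "base_pair_end", "hgnc_symbol",
    "neg_log10_p_value", "beta", "standard_error", "n_snps",
]
print(check_required_columns(header))  # prints []
```

Column names are compared exactly, so a header such as `Chromosome` or `pvalue` would be reported as missing.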

Effect size

Optional

Reporting effect sizes is optional for gene-based GWAS. The most common choices are beta and odds ratio.

Column	Description
`beta`	Regression coefficient.
`odds_ratio`	Odds ratio estimate.
`z_score`	Z-score statistic.
`hazard_ratio`	Hazard ratio estimate.

Uncertainty estimate

Conditional

If you include an effect size, you must provide an uncertainty estimate. The most common choices are standard error for beta, and confidence intervals for odds ratios or hazard ratios.

Required when an effect size is provided.

Column	Description
`standard_error`	Standard error.
`ci_lower`	Lower bound of the confidence interval (used with odds ratio or hazard ratio).
`ci_upper`	Upper bound of the confidence interval (used with odds ratio or hazard ratio).

Other fields

Optional

Please consider including this data to improve the quality of your submission.

Column	Description
`n`	Sample size per gene.
`n_snps`	Number of SNPs included in the gene-based test.

You can also include a reasonable number of extra fields relevant to your study in your submission.

Validation rules

These rules are enforced during validation. The same rules apply in both the web tool and the command line interface.

Beta requires standard error: If beta is selected as an effect size, standard_error must also be included as an uncertainty estimate. Standard error may only be used with beta.
Confidence intervals require odds ratio or hazard ratio: Confidence interval bounds are only valid when odds_ratio or hazard_ratio is selected as an effect size. If one bound is provided, both must be provided.
Z-score does not accept uncertainty estimates: Z-score is a standardised statistic. When z_score is the sole or primary effect size, no uncertainty estimate (standard error or confidence interval) should be provided.
Odds ratio and hazard ratio are mutually exclusive: Odds ratio and hazard ratio cannot both be provided in the same file.
Primary effect size must be designated: If more than one effect size column is present, one must be designated as the primary measure of effect. The primary measure of effect is placed in a standardised column in the output file.
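
The rules above can be expressed as a small set of header checks. This is a rough sketch of a literal reading of the five rules, not the validator's actual implementation. The function name and the `primary` argument are hypothetical, since this page does not specify how the primary effect size is designated.

```python
EFFECT_COLUMNS = {"beta", "odds_ratio", "z_score", "hazard_ratio"}
CI_COLUMNS = {"ci_lower", "ci_upper"}
RATIO_COLUMNS = {"odds_ratio", "hazard_ratio"}


def check_effect_rules(header, primary=None):
    """Return a list of rule violations for a list of column names.

    `primary` is a hypothetical way of naming the primary effect size.
    """
    columns = set(header)
    effects = columns & EFFECT_COLUMNS
    has_se = "standard_error" in columns
    ci_found = columns & CI_COLUMNS
    problems = []

    # Beta requires standard error; standard error only goes with beta.
    if "beta" in effects and not has_se:
        problems.append("beta requires standard_error")
    if has_se and "beta" not in effects:
        problems.append("standard_error may only be used with beta")

    # Confidence intervals need an odds ratio or hazard ratio, and both bounds.
    if ci_found and not effects & RATIO_COLUMNS:
        problems.append("confidence intervals require odds_ratio or hazard_ratio")
    if len(ci_found) == 1:
        problems.append("ci_lower and ci_upper must be provided together")

    # A sole or primary z-score takes no uncertainty estimate.
    z_is_main = effects == {"z_score"} or primary == "z_score"
    if z_is_main and (has_se or ci_found):
        problems.append("z_score does not accept an uncertainty estimate")

    # Odds ratio and hazard ratio cannot appear together.
    if RATIO_COLUMNS <= effects:
        problems.append("odds_ratio and hazard_ratio are mutually exclusive")

    # With several effect sizes, one must be named as primary.
    if len(effects) > 1 and primary not in effects:
        problems.append("a primary effect size must be designated")

    return problems


print(check_effect_rules(["hgnc_symbol", "p_value", "beta", "standard_error"]))  # prints []
```

Returning a list of messages rather than raising on the first failure lets a submitter see every problem in one pass.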
