Rare-variant association testing for sequencing data with the sequence kernel association test

Sat Jan 01 2011

Michael C Wu et al. Am J Hum Genet. 2011.

Abstract

Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of classical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel association test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants (common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based variance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segmenting the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome, and whole-genome sequence association studies.
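The score-based variance-component statistic mentioned in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not appear in this abstract: a continuous trait, a null model with an intercept only (no covariates), a weighted linear kernel, and Beta(1, 25) weights on minor allele frequency as the default weighting. The function names are hypothetical, scaling conventions differ between implementations, and the analytic p value step (a mixture of chi-square distributions) is left out entirely.

```python
import math


def beta_1_25_density(maf):
    """Beta(1, 25) density at maf; closed form is 25 * (1 - maf)**24."""
    return 25.0 * (1.0 - maf) ** 24


def skat_q_statistic(genotypes, phenotype):
    """Sketch of Q = sum_j w_j * S_j**2 for a continuous trait.

    genotypes: list of n rows, each holding p minor-allele counts (0, 1, 2).
    phenotype: list of n trait values.
    Assumption: intercept-only null model, so residuals are y - mean(y).
    """
    n = len(phenotype)
    p = len(genotypes[0])
    mean_y = math.fsum(phenotype) / n
    residuals = [y - mean_y for y in phenotype]

    q = 0.0
    for j in range(p):
        column = [row[j] for row in genotypes]
        maf = math.fsum(column) / (2.0 * n)
        # Assumed convention: sqrt(w_j) is the Beta(1, 25) density at MAF_j.
        weight = beta_1_25_density(maf) ** 2
        # Per-variant score: genotype column times null-model residuals.
        score = math.fsum(g * r for g, r in zip(column, residuals))
        q += weight * score ** 2
    return q


if __name__ == "__main__":
    toy_genotypes = [[0, 1], [1, 0], [0, 0], [2, 0]]
    toy_phenotype = [1.0, 2.0, 0.5, 3.0]
    print(skat_q_statistic(toy_genotypes, toy_phenotype))
```

Only the null model is fitted, which is why a test of this form is cheap: the residuals are computed once and reused for every region. Rarer variants receive larger weights under the assumed Beta(1, 25) scheme.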

Copyright © 2011 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

Figures

Figure 1

Simulation-Study-Based Power Comparisons of SKAT and Burden Tests. Empirical power at $\alpha = 10^{-6}$ under an assumption that 5% of the rare variants with MAF < 3% within random 30 kb regions were causal. Top panel: continuous phenotypes with maximum effect size ($|\beta|$) equal to 1.6 when MAF = $10^{-4}$; bottom panel: case-control studies with maximum OR = 5 when MAF = $10^{-4}$. Regression coefficients for the s causal variants were assumed to be a decreasing function of MAF, $|\beta_j| = c\,|\log_{10} \mathrm{MAF}_j|$ ($j = 1, \ldots, p$; see Figure S2), where c was chosen to result in these maximum effect sizes. From left to right, the plots consider settings in which the coefficients for the causal rare variants are 100% positive (0% negative), 80% positive (20% negative), and 50% positive (50% negative). Total sample sizes considered are 500, 1000, 2500, and 5000, with half being cases in case-control studies. For each setting, six methods are compared: SKAT, SKAT in which 10% of the genotypes were set to missing and then imputed (SKAT_M), restricted SKAT (rSKAT) in which unweighted SKAT is applied to variants with MAF < 3%, the weighted sum burden test (W) with the same weights as used by SKAT, counting-based burden test (N), and the CAST method (C). All the burden tests used MAF < 3% as the threshold. For each method, power was estimated as the proportion of p values < $\alpha$ among 1000 simulated data sets.
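As a worked illustration of the effect-size rule in the Figure 1 caption, the sketch below solves for the constant c from the stated maximum effect at MAF = 10^-4 and evaluates the empirical power as a proportion of p values below α. It assumes that the case-control coefficient is on the log odds ratio scale, so a maximum OR of 5 corresponds to a maximum |β| of ln 5; the function names are hypothetical and the p values in the usage lines are made up for demonstration only.

```python
import math


def effect_size_constant(max_effect, reference_maf=1e-4):
    """Solve max_effect = c * |log10(reference_maf)| for c."""
    return max_effect / abs(math.log10(reference_maf))


def causal_effect(maf, c):
    """|beta_j| = c * |log10(MAF_j)|: rarer variants get larger effects."""
    return c * abs(math.log10(maf))


def empirical_power(p_values, alpha):
    """Proportion of simulated data sets whose p value falls below alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Continuous trait: maximum |beta| = 1.6 at MAF = 1e-4, so c is about 0.4.
c_continuous = effect_size_constant(1.6)

# Case-control (assumption: beta is a log odds ratio): maximum OR = 5.
c_binary = effect_size_constant(math.log(5.0))

if __name__ == "__main__":
    for maf in (1e-4, 1e-3, 1e-2, 3e-2):
        beta = causal_effect(maf, c_continuous)
        odds_ratio = math.exp(causal_effect(maf, c_binary))
        print(f"MAF={maf:g}  |beta|={beta:.3f}  OR={odds_ratio:.3f}")

    # Made-up p values, purely to show the power calculation.
    demo_p_values = [2e-7, 5e-3, 8e-8, 0.4]
    print(empirical_power(demo_p_values, alpha=1e-6))
```

Under this rule the effect shrinks as the variant becomes more common, reaching its stated maximum only at MAF = 10^-4.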

Figure 2

Sample Sizes Required for Reaching 80% Power. Analytically estimated sample sizes required for reaching 80% power to detect rare variants associated with a continuous phenotype (top panel) or a dichotomous phenotype in case-control studies in which half are cases (bottom panel) at the $\alpha = 10^{-6}$, $10^{-3}$, and $10^{-2}$ levels, under the assumption that 5% of rare variants with MAF < 3% within the 30 kb regions are causal. Plots correspond to 100%, 80%, and 50% of the causal variants being associated with an increase in the continuous phenotype or in the risk of the dichotomous phenotype. Regression coefficients for the s causal variants were assumed to be the same decreasing function of MAF as that in Figure 1. Required total sample sizes are plotted against the maximum effect sizes (ORs) when MAF = $10^{-4}$. Estimated total sample sizes were averaged over 100 random 30 kb regions.

Figure 3

Power Comparisons Based on Simulation and Analytic Estimation. Power as a function of total sample size, estimated by simulation with 1000 replicates and by the proposed power formula, for continuous and dichotomous case-control traits. Simulation configurations correspond to those used in Figure 1, in which 80% of the regression coefficients for the causal rare variants were positive.
