sgkit.gwas_linear_regression
- sgkit.gwas_linear_regression(ds, *, dosage, covariates, traits, add_intercept=True, call_genotype='call_genotype', merge=True)
Run linear regression to identify continuous trait associations with genetic variants.
This method solves ordinary least squares (OLS) regressions for each variant simultaneously and reports effect statistics as defined in [1]. This is made possible by removing sample (i.e. person/individual) covariates through orthogonal projection of both the genetic variant and phenotype data [2]. A consequence of this projection is that effect sizes and significances cannot be reported for covariates, only for variants.
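The projection idea described above can be illustrated with a minimal NumPy sketch. This is not sgkit's implementation: the simulated data, the `project_out` helper, and all variable names are assumptions made for illustration. It residualizes the dosages and the trait on the covariates, then fits every variant with a single vectorized univariate regression, and compares one variant against a full multiple regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_variants, n_covars = 200, 5, 2

# Simulated inputs (illustrative only): dosages in {0, 1, 2}, samples x variants
G = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)
# Covariate matrix with an explicit intercept column
X = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, n_covars))])
# Continuous trait driven by variant 0 plus covariate effects and noise
y = 0.5 * G[:, 0] + X @ np.array([1.0, -0.3, 0.2]) + rng.normal(size=n_samples)


def project_out(X, A):
    """Hypothetical helper: residual of A after least-squares regression on X."""
    coef, *_ = np.linalg.lstsq(X, A, rcond=None)
    return A - X @ coef


# Remove the covariates from both the dosages and the trait
Gp = project_out(X, G)
yp = project_out(X, y)

# One univariate regression per variant, computed for all variants at once
gg = (Gp ** 2).sum(axis=0)
beta = (Gp.T @ yp) / gg

# Residual degrees of freedom: samples - covariate columns - 1 (the variant)
dof = n_samples - X.shape[1] - 1
rss = ((yp[:, None] - Gp * beta) ** 2).sum(axis=0)
t_value = beta / np.sqrt(rss / dof / gg)

# Cross-check variant 0 against a full regression on [covariates, variant]
full = np.column_stack([X, G[:, 0]])
coef_full, *_ = np.linalg.lstsq(full, y, rcond=None)
```

In this sketch, `beta[0]` agrees with the last entry of `coef_full` up to floating-point error, which is why the covariates never need to be refit per variant. It also shows why no coefficients are available for the covariates themselves: they are removed before the per-variant fit.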
- Parameters
- ds :
Dataset
Dataset containing necessary dependent and independent variables.
- dosage :
Hashable
Name of genetic dosage variable. Defined by sgkit.variables.dosage_spec.
- covariates :
Hashable | Sequence[Hashable]
Names of covariate variables (1D or 2D). Defined by sgkit.variables.covariates_spec.
- traits :
Hashable | Sequence[Hashable]
Names of trait variables (1D or 2D). Defined by sgkit.variables.traits_spec.
- add_intercept :
bool (default: True)
Add intercept term to covariate set, by default True.
- call_genotype :
Hashable (default: 'call_genotype')
Input variable name holding call_genotype. Defined by sgkit.variables.call_genotype_spec.
- merge :
bool (default: True)
If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
Warning
Regression statistics from this implementation are only valid when an intercept is present. The add_intercept flag is a convenience for adding one when not already present, but there is currently no parameterization for intercept-free regression.
Additionally, both covariate and trait arrays will be rechunked to have blocks along the sample (row) dimension but not the column dimension (i.e. they must be tall and skinny).
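As a rough illustration of what "an intercept is present" means for a covariate matrix, the NumPy sketch below prepends a column of ones when no all-ones column exists. The `with_intercept` helper is hypothetical and is not part of sgkit. The check for an existing column of ones is an assumption of this sketch, not a description of how `add_intercept` behaves internally.

```python
import numpy as np


def with_intercept(covariates):
    """Hypothetical helper: ensure a (samples x covariates) matrix has an intercept.

    A 1D input is treated as a single covariate column. A column of ones is
    prepended only if no existing column is entirely ones.
    """
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]
    # Assumption of this sketch: an all-ones column counts as an intercept
    if np.any(np.all(covariates == 1.0, axis=0)):
        return covariates
    ones = np.ones((covariates.shape[0], 1))
    return np.hstack([ones, covariates])


# A single covariate measured on three samples
design = with_intercept([1.0, 2.0, 3.0])
```

Here `design` has one row per sample and two columns (intercept, covariate), which is the tall-and-skinny layout the warning refers to: many rows along the sample dimension and few columns.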
- Return type
Dataset
- Returns
Dataset containing (N = num variants, O = num traits):
- variant_linreg_beta [array-like, shape: (N, O)]
Beta values associated with each variant and trait
- variant_linreg_t_value [array-like, shape: (N, O)]
T statistics for each beta
- variant_linreg_p_value [array-like, shape: (N, O)]
P values as float in [0, 1]
References
- [1] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Science & Business Media.
- [2] Loh, Po-Ru, George Tucker, Brendan K. Bulik-Sullivan, Bjarni J. Vilhjálmsson, Hilary K. Finucane, Rany M. Salem, Daniel I. Chasman, et al. 2015. “Efficient Bayesian Mixed-Model Analysis Increases Association Power in Large Cohorts.” Nature Genetics 47 (3): 284–90.