sgkit.gwas_linear_regression
- sgkit.gwas_linear_regression(ds, *, dosage, covariates, traits, add_intercept=True, call_genotype='call_genotype', merge=True)
Run linear regression to identify continuous trait associations with genetic variants.
This method solves OLS regressions for each variant simultaneously and reports effect statistics as defined in [1]. This is facilitated by the removal of sample (i.e. person/individual) covariates through orthogonal projection of both the genetic variant and phenotype data [2]. A consequence of this projection is that effect sizes and significances cannot be reported for covariates, only for variants.
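The projection idea described above can be illustrated with a minimal NumPy sketch. This is not sgkit's implementation (which, per the warning on this page, works on chunked arrays); the function name `projected_ols` and the simulated inputs are assumptions made only for illustration. It residualizes dosages and traits against the covariates, then computes a beta and t statistic for every variant/trait pair at once.

```python
import numpy as np


def projected_ols(G, X, Y):
    """Hypothetical helper: per-variant OLS after projecting out covariates.

    G: (n_samples, n_variants) dosages
    X: (n_samples, n_covariates) covariates, intercept column included
    Y: (n_samples, n_traits) continuous traits
    Returns beta and t, each of shape (n_variants, n_traits), plus the
    residual degrees of freedom.
    """
    n, c = X.shape
    # Orthonormal basis for the column space of the covariates.
    Q, _ = np.linalg.qr(X)
    # Remove the covariate component from both dosages and traits.
    Gp = G - Q @ (Q.T @ G)
    Yp = Y - Q @ (Q.T @ Y)
    # Each variant is now a one-predictor regression on the residuals.
    gg = (Gp ** 2).sum(axis=0)                 # (n_variants,)
    beta = (Gp.T @ Yp) / gg[:, None]           # (n_variants, n_traits)
    # Residual sum of squares: ||Yp||^2 - beta^2 * ||Gp||^2.
    yy = (Yp ** 2).sum(axis=0)                 # (n_traits,)
    rss = yy[None, :] - beta ** 2 * gg[:, None]
    # One parameter per covariate plus one for the variant itself.
    dof = n - c - 1
    t = beta / np.sqrt(rss / dof / gg[:, None])
    return beta, t, dof


# Simulated inputs (illustrative only).
rng = np.random.default_rng(0)
n, m = 40, 5
G = rng.integers(0, 3, size=(n, m)).astype(float)
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 covariate
Y = rng.normal(size=(n, 2))

beta, t, dof = projected_ols(G, X, Y)
```

Because the covariates are projected out before the per-variant fit, no coefficient for them is ever estimated, which is why only variant-level statistics are available. P values would follow from a two-sided t test on `t` with `dof` degrees of freedom.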
- Parameters:
- ds
Dataset
Dataset containing necessary dependent and independent variables.
- dosage
Hashable
Name of genetic dosage variable. Defined by sgkit.variables.call_dosage_spec.
- covariates
Hashable | Sequence[Hashable]
Names of covariate variables (1D or 2D). Defined by sgkit.variables.covariates_spec.
- traits
Hashable | Sequence[Hashable]
Names of trait variables (1D or 2D). Defined by sgkit.variables.traits_spec.
- add_intercept
bool (default: True)
Add intercept term to covariate set, by default True.
- call_genotype
Hashable (default: 'call_genotype')
Input variable name holding call_genotype. Defined by sgkit.variables.call_genotype_spec.
- merge
bool (default: True)
If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
Warning
Regression statistics from this implementation are only valid when an intercept is present. The add_intercept flag is a convenience for adding one when not already present, but there is currently no parameterization for intercept-free regression.
Additionally, both covariate and trait arrays will be rechunked to have blocks along the sample (row) dimension but not the column dimension (i.e. they must be tall and skinny).
- Return type:
Dataset
- Returns:
Dataset containing (N = num variants, O = num traits):
- variant_linreg_beta[array-like, shape: (N, O)]
Beta values associated with each variant and trait
- variant_linreg_t_value[array-like, shape: (N, O)]
T statistics for each beta
- variant_linreg_p_value[array-like, shape: (N, O)]
P values as float in [0, 1]
References
- [1] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Science & Business Media.
- [2] Loh, Po-Ru, George Tucker, Brendan K. Bulik-Sullivan, Bjarni J. Vilhjálmsson, Hilary K. Finucane, Rany M. Salem, Daniel I. Chasman, et al. 2015. “Efficient Bayesian Mixed-Model Analysis Increases Association Power in Large Cohorts.” Nature Genetics 47 (3): 284–90.