sgkit.gwas_linear_regression

sgkit.gwas_linear_regression(ds, *, dosage, covariates, traits, add_intercept=True, call_genotype='call_genotype', merge=True)

Run linear regression to identify continuous trait associations with genetic variants.

This method solves an ordinary least squares (OLS) regression for every variant simultaneously and reports effect statistics as defined in [1]. This is made possible by removing sample (i.e. person/individual) covariates through orthogonal projection of both the genetic variant and phenotype data [2]. A consequence of this projection is that effect sizes and significances cannot be reported for covariates, only for variants.
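The projection approach described above can be sketched in plain NumPy. This is an illustrative reimplementation under simplifying assumptions (dense in-memory arrays, no missing values, covariates that already include an intercept column), not sgkit's actual code. The helper names `project_out` and `per_variant_ols` are hypothetical.

```python
import numpy as np


def project_out(covariates, array):
    """Return the residuals of `array` after regressing it on `covariates`."""
    coef, *_ = np.linalg.lstsq(covariates, array, rcond=None)
    return array - covariates @ coef


def per_variant_ols(dosage, covariates, traits):
    """Single-variant OLS for all variants and traits at once.

    dosage:     (samples, variants)
    covariates: (samples, C), assumed to include an intercept column
    traits:     (samples, traits)
    Returns beta and t-value arrays of shape (variants, traits).
    """
    n_samples, n_cov = covariates.shape
    G = project_out(covariates, dosage)  # residualized dosages
    Y = project_out(covariates, traits)  # residualized traits
    gtg = (G ** 2).sum(axis=0)[:, None]  # (variants, 1)
    beta = (G.T @ Y) / gtg
    # Residual sum of squares for each variant/trait pair.
    rss = (Y ** 2).sum(axis=0)[None, :] - beta ** 2 * gtg
    # Degrees of freedom: samples minus covariates minus the variant term.
    dof = n_samples - n_cov - 1
    t_value = beta / np.sqrt(rss / dof / gtg)
    return beta, t_value


rng = np.random.default_rng(0)
n_samples, n_variants, n_traits = 50, 4, 2
dosage = rng.integers(0, 3, size=(n_samples, n_variants)).astype(float)
covariates = np.column_stack([np.ones(n_samples), rng.normal(size=n_samples)])
traits = rng.normal(size=(n_samples, n_traits))

beta, t_value = per_variant_ols(dosage, covariates, traits)

# Cross-check variant 0 / trait 0 against a full OLS fit that keeps the covariates.
X_full = np.column_stack([covariates, dosage[:, 0]])
coef_full, *_ = np.linalg.lstsq(X_full, traits[:, 0], rcond=None)
print(np.isclose(coef_full[-1], beta[0, 0]))  # True
```

The variant coefficient from the full model matches the projected estimate, while the covariate coefficients never appear in the projected computation. That is why statistics are reported for variants only.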

Parameters:
ds Dataset

Dataset containing necessary dependent and independent variables.

dosage Hashable

Name of genetic dosage variable. Defined by sgkit.variables.call_dosage_spec.

covariates Hashable | Sequence[Hashable]

Names of covariate variables (1D or 2D). Defined by sgkit.variables.covariates_spec.

traits Hashable | Sequence[Hashable]

Names of trait variables (1D or 2D). Defined by sgkit.variables.traits_spec.

add_intercept bool (default: True)

Whether to add an intercept term to the covariate set.

call_genotype Hashable (default: 'call_genotype')

Input variable name holding call_genotype. Defined by sgkit.variables.call_genotype_spec.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Warning

Regression statistics from this implementation are only valid when an intercept is present. The add_intercept flag is a convenience for adding one when not already present, but there is currently no parameterization for intercept-free regression.

Additionally, both covariate and trait arrays will be rechunked to have blocks along the sample (row) dimension but not the column dimension (i.e. they must be tall and skinny).
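As a rough illustration of what adding an intercept amounts to, the sketch below prepends a column of ones to a covariate array. It assumes covariates held in a NumPy array with samples along the rows. The helper name `with_intercept` is hypothetical and is not part of sgkit.

```python
import numpy as np


def with_intercept(covariates):
    """Prepend a column of ones to a (samples,) or (samples, C) covariate array."""
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]  # promote 1D covariate to a single column
    ones = np.ones((covariates.shape[0], 1))
    return np.hstack([ones, covariates])


age = np.array([31.0, 45.0, 52.0, 28.0])  # made-up example covariate
X = with_intercept(age)
print(X.shape)  # (4, 2)
```

The result stays tall and skinny: one row per sample and only a handful of columns.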

Return type:

Dataset

Returns:

Dataset containing (N = num variants, O = num traits):

variant_linreg_beta [array-like, shape: (N, O)]

Beta values (effect sizes) associated with each variant and trait

variant_linreg_t_value [array-like, shape: (N, O)]

t-statistics for each beta

variant_linreg_p_value [array-like, shape: (N, O)]

P-values as floats in [0, 1]

References

  • [1] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Science & Business Media.

  • [2] Loh, Po-Ru, George Tucker, Brendan K. Bulik-Sullivan, Bjarni J. Vilhjálmsson, Hilary K. Finucane, Rany M. Salem, Daniel I. Chasman, et al. 2015. “Efficient Bayesian Mixed-Model Analysis Increases Association Power in Large Cohorts.” Nature Genetics 47 (3): 284–90.