sgkit.regenie¶
- sgkit.regenie(ds, *, dosage, covariates, traits, variant_contig='variant_contig', variant_block_size=None, sample_block_size=None, alphas=None, add_intercept=True, normalize=False, orthogonalize=False, merge=True, **kwargs)¶
Regenie trait transformation.
REGENIE is a whole-genome regression technique that produces trait estimates for association tests. These estimates are subtracted from trait values and sampling statistics (p-values, standard errors, etc.) are evaluated against the residuals. See the REGENIE preprint [1] for more details. For a simpler technical overview, see [2] for a detailed description of the individual stages and separate regression models involved.
- Parameters
- dosage :
str
str
Name of genetic dosage variable. Defined by
sgkit.variables.dosage_spec
.- covariates :
Hashable
|Sequence
[Hashable
]Union
[Hashable
,Sequence
[Hashable
]] Names of covariate variables (1D or 2D). Defined by
sgkit.variables.covariates_spec
.- traits :
Hashable
|Sequence
[Hashable
]Union
[Hashable
,Sequence
[Hashable
]] Names of trait variables (1D or 2D). Defined by
sgkit.variables.traits_spec
.- variant_contig :
Hashable
Hashable
(default:'variant_contig'
) Name of the variant contig input variable. Definied by
sgkit.variables.variant_contig_spec
.- variant_block_size :
int
|Tuple
[int
, …] |None
Union
[int
,Tuple
[int
, …],None
] (default:None
) Number of variants in each block. If int, this describes the number of variants in each block but the last which may be smaller. If Tuple[int, …], this must describe the desired number of variants in each block individually. Defaults to 1000 or num variants // 2, whichever is smaller.
- sample_block_size :
int
|Tuple
[int
, …] |None
Union
[int
,Tuple
[int
, …],None
] (default:None
) Number of samples in each block. If int, this describes the number of samples in each block but the last which may be smaller. If Tuple[int, …], this must describe the desired number of samples in each block individually. Defaults to 10 sample blocks split roughly across all possible samples or the number of samples, if that number is < 10.
- alphas :
Sequence
[float
] |None
Optional
[Sequence
[float
]] (default:None
) List of alpha values to use for regularization, by default None. If not provided, these will be set automatically based on datasize and apriori heritability assumptions.
- add_intercept :
bool
bool
(default:True
) Whether or not to add intercept to covariates, by default True.
- normalize :
bool
bool
(default:False
) Rescale genotypes, traits, and covariates to have mean 0 and stdev 1, by default False.
- orthogonalize :
bool
bool
(default:False
) Experimental: Remove covariates through orthogonalization of genotypes and traits, by default False.
- merge :
bool
bool
(default:True
) If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
- dosage :
Warning
Binary traits are not yet supported so all outcomes provided must be continuous.
- Return type
- Returns
A dataset containing the following variables:
- regenie_base_prediction (blocks, alphas, samples, outcomes): Stage 1
predictions from ridge regression reduction. Defined by
sgkit.variables.regenie_base_prediction_spec
.
- regenie_meta_prediction (samples, outcomes): Stage 2 predictions from
the best meta estimator trained on the out-of-sample Stage 1 predictions. Defined by
sgkit.variables.regenie_meta_prediction_spec
.
- regenie_loco_prediction (contigs, samples, outcomes): LOCO predictions
resulting from Stage 2 predictions ignoring effects for variant blocks on held out contigs. This will be absent if the data provided does not contain at least 2 contigs. Defined by
sgkit.variables.regenie_loco_prediction_spec
.
- Raises
ValueError – If dosage, covariates, and trait arrays do not have the same number of samples.
Examples
>>> import numpy as np >>> from sgkit.testing import simulate_genotype_call_dataset >>> from sgkit.stats.regenie import regenie >>> n_variant, n_sample, n_contig, n_covariate, n_trait, seed = 100, 50, 2, 3, 5, 0 >>> rs = np.random.RandomState(seed) >>> ds = simulate_genotype_call_dataset(n_variant=n_variant, n_sample=n_sample, n_contig=n_contig, seed=seed) >>> ds["call_dosage"] = (("variants", "samples"), rs.normal(size=(n_variant, n_sample))) >>> ds["sample_covariate"] = (("samples", "covariates"), rs.normal(size=(n_sample, n_covariate))) >>> ds["sample_trait"] = (("samples", "traits"), rs.normal(size=(n_sample, n_trait))) >>> res = regenie(ds, dosage="call_dosage", covariates="sample_covariate", traits="sample_trait", merge=False) >>> res.compute() <xarray.Dataset> Dimensions: (alphas: 5, blocks: 2, contigs: 2, outcomes: 5, samples: 50) Dimensions without coordinates: alphas, blocks, contigs, outcomes, samples Data variables: regenie_base_prediction (blocks, alphas, samples, outcomes) float64 0.33... regenie_meta_prediction (samples, outcomes) float64 -0.4588 0.78 ... 0.3734 regenie_loco_prediction (contigs, samples, outcomes) float64 0.4886 ... ...
References
[1] - Mbatchou, J., L. Barnard, J. Backman, and A. Marcketta. 2020. “Computationally Efficient Whole Genome Regression for Quantitative and Binary Traits.” bioRxiv. https://www.biorxiv.org/content/10.1101/2020.06.19.162354v2.abstract.
[2] - https://glow.readthedocs.io/en/latest/tertiary/whole-genome-regression.html