sgkit.regenie#

sgkit.regenie(ds, *, dosage, covariates, traits, variant_contig='variant_contig', variant_block_size=None, sample_block_size=None, alphas=None, add_intercept=True, normalize=False, orthogonalize=False, merge=True, **kwargs)#

Regenie trait transformation.

REGENIE is a whole-genome regression technique that produces trait estimates for association tests. These estimates are subtracted from trait values and sampling statistics (p-values, standard errors, etc.) are evaluated against the residuals. See the REGENIE preprint [1] for more details. For a simpler technical overview, see [2] for a detailed description of the individual stages and separate regression models involved.

Parameters
dosage : str

Name of genetic dosage variable. Defined by sgkit.variables.call_dosage_spec.

covariates : Hashable | Sequence[Hashable]Union[Hashable, Sequence[Hashable]]

Names of covariate variables (1D or 2D). Defined by sgkit.variables.covariates_spec.

traits : Hashable | Sequence[Hashable]Union[Hashable, Sequence[Hashable]]

Names of trait variables (1D or 2D). Defined by sgkit.variables.traits_spec.

variant_contig : Hashable (default: 'variant_contig')

Name of the variant contig input variable. Definied by sgkit.variables.variant_contig_spec.

variant_block_size : int | Tuple[int, ...] | NoneUnion[int, Tuple[int, ...], None] (default: None)

Number of variants in each block. If int, this describes the number of variants in each block but the last which may be smaller. If Tuple[int, …], this must describe the desired number of variants in each block individually. Defaults to 1000 or num variants // 2, whichever is smaller.

sample_block_size : int | Tuple[int, ...] | NoneUnion[int, Tuple[int, ...], None] (default: None)

Number of samples in each block. If int, this describes the number of samples in each block but the last which may be smaller. If Tuple[int, …], this must describe the desired number of samples in each block individually. Defaults to 10 sample blocks split roughly across all possible samples or the number of samples, if that number is < 10.

alphas : Sequence[float] | NoneOptional[Sequence[float]] (default: None)

List of alpha values to use for regularization, by default None. If not provided, these will be set automatically based on datasize and apriori heritability assumptions.

add_intercept : bool (default: True)

Whether or not to add intercept to covariates, by default True.

normalize : bool (default: False)

Rescale genotypes, traits, and covariates to have mean 0 and stdev 1, by default False.

orthogonalize : bool (default: False)

Experimental: Remove covariates through orthogonalization of genotypes and traits, by default False.

merge : bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Warning

Binary traits are not yet supported so all outcomes provided must be continuous.

Return type

Dataset

Returns

A dataset containing the following variables:

  • regenie_base_prediction (blocks, alphas, samples, outcomes): Stage 1

    predictions from ridge regression reduction. Defined by sgkit.variables.regenie_base_prediction_spec.

  • regenie_meta_prediction (samples, outcomes): Stage 2 predictions from

    the best meta estimator trained on the out-of-sample Stage 1 predictions. Defined by sgkit.variables.regenie_meta_prediction_spec.

  • regenie_loco_prediction (contigs, samples, outcomes): LOCO predictions

    resulting from Stage 2 predictions ignoring effects for variant blocks on held out contigs. This will be absent if the data provided does not contain at least 2 contigs. Defined by sgkit.variables.regenie_loco_prediction_spec.

Raises

ValueError – If dosage, covariates, and trait arrays do not have the same number of samples.

Examples

>>> import numpy as np
>>> from sgkit.testing import simulate_genotype_call_dataset
>>> from sgkit.stats.regenie import regenie
>>> n_variant, n_sample, n_contig, n_covariate, n_trait, seed = 100, 50, 2, 3, 5, 0
>>> rs = np.random.RandomState(seed)
>>> ds = simulate_genotype_call_dataset(n_variant=n_variant, n_sample=n_sample, n_contig=n_contig, seed=seed)
>>> ds["call_dosage"] = (("variants", "samples"), rs.normal(size=(n_variant, n_sample)))
>>> ds["sample_covariate"] = (("samples", "covariates"), rs.normal(size=(n_sample, n_covariate)))
>>> ds["sample_trait"] = (("samples", "traits"), rs.normal(size=(n_sample, n_trait)))
>>> res = regenie(ds, dosage="call_dosage", covariates="sample_covariate", traits="sample_trait", merge=False)
>>> res.compute() 
<xarray.Dataset>
Dimensions:                  (blocks: 2, alphas: 5, samples: 50, outcomes: 5,
                              contigs: 2)
Dimensions without coordinates: blocks, alphas, samples, outcomes, contigs
Data variables:
    regenie_base_prediction  (blocks, alphas, samples, outcomes) float64 0.33...
    regenie_meta_prediction  (samples, outcomes) float64 -0.4588 0.78 ... 0.3734
    regenie_loco_prediction  (contigs, samples, outcomes) float64 0.4886 ... ...

References

[1] - Mbatchou, J., L. Barnard, J. Backman, and A. Marcketta. 2020. “Computationally Efficient Whole Genome Regression for Quantitative and Binary Traits.” bioRxiv. https://www.biorxiv.org/content/10.1101/2020.06.19.162354v2.abstract.

[2] - https://glow.readthedocs.io/en/latest/tertiary/whole-genome-regression.html