sgkit.regenie#

sgkit.regenie(ds, *, dosage, covariates, traits, variant_contig='variant_contig', variant_block_size=None, sample_block_size=None, alphas=None, add_intercept=True, normalize=False, orthogonalize=False, merge=True, **kwargs)#

Regenie trait transformation.

REGENIE is a whole-genome regression technique that produces trait estimates for association tests. These estimates are subtracted from trait values and sampling statistics (p-values, standard errors, etc.) are evaluated against the residuals. See the REGENIE preprint [1] for more details. For a simpler technical overview, see [2] for a detailed description of the individual stages and separate regression models involved.

Parameters:
dosage str

Name of genetic dosage variable. Defined by sgkit.variables.call_dosage_spec.

covariates Hashable | Sequence[Hashable]Union[Hashable, Sequence[Hashable]]

Names of covariate variables (1D or 2D). Defined by sgkit.variables.covariates_spec.

traits Hashable | Sequence[Hashable]Union[Hashable, Sequence[Hashable]]

Names of trait variables (1D or 2D). Defined by sgkit.variables.traits_spec.

variant_contig Hashable (default: 'variant_contig')

Name of the variant contig input variable. Definied by sgkit.variables.variant_contig_spec.

variant_block_size int | Tuple[int, ...] | NoneUnion[int, Tuple[int, ...], None] (default: None)

Number of variants in each block. If int, this describes the number of variants in each block but the last which may be smaller. If Tuple[int, …], this must describe the desired number of variants in each block individually. Defaults to 1000 or num variants // 2, whichever is smaller.

sample_block_size int | Tuple[int, ...] | NoneUnion[int, Tuple[int, ...], None] (default: None)

Number of samples in each block. If int, this describes the number of samples in each block but the last which may be smaller. If Tuple[int, …], this must describe the desired number of samples in each block individually. Defaults to 10 sample blocks split roughly across all possible samples or the number of samples, if that number is < 10.

alphas Sequence[float] | NoneOptional[Sequence[float]] (default: None)

List of alpha values to use for regularization, by default None. If not provided, these will be set automatically based on datasize and apriori heritability assumptions.

add_intercept bool (default: True)

Whether or not to add intercept to covariates, by default True.

normalize bool (default: False)

Rescale genotypes, traits, and covariates to have mean 0 and stdev 1, by default False.

orthogonalize bool (default: False)

Experimental: Remove covariates through orthogonalization of genotypes and traits, by default False.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Warning

Binary traits are not yet supported so all outcomes provided must be continuous.

Return type:

Dataset

Returns:

: A dataset containing the following variables:

  • regenie_base_prediction (blocks, alphas, samples, outcomes): Stage 1

    predictions from ridge regression reduction. Defined by sgkit.variables.regenie_base_prediction_spec.

  • regenie_meta_prediction (samples, outcomes): Stage 2 predictions from

    the best meta estimator trained on the out-of-sample Stage 1 predictions. Defined by sgkit.variables.regenie_meta_prediction_spec.

  • regenie_loco_prediction (contigs, samples, outcomes): LOCO predictions

    resulting from Stage 2 predictions ignoring effects for variant blocks on held out contigs. This will be absent if the data provided does not contain at least 2 contigs. Defined by sgkit.variables.regenie_loco_prediction_spec.

Raises:

ValueError – If dosage, covariates, and trait arrays do not have the same number of samples.

Examples

>>> import numpy as np
>>> from sgkit.testing import simulate_genotype_call_dataset
>>> from sgkit.stats.regenie import regenie
>>> n_variant, n_sample, n_contig, n_covariate, n_trait, seed = 100, 50, 2, 3, 5, 0
>>> rs = np.random.RandomState(seed)
>>> ds = simulate_genotype_call_dataset(n_variant=n_variant, n_sample=n_sample, n_contig=n_contig, seed=seed)
>>> ds["call_dosage"] = (("variants", "samples"), rs.normal(size=(n_variant, n_sample)))
>>> ds["sample_covariate"] = (("samples", "covariates"), rs.normal(size=(n_sample, n_covariate)))
>>> ds["sample_trait"] = (("samples", "traits"), rs.normal(size=(n_sample, n_trait)))
>>> res = regenie(ds, dosage="call_dosage", covariates="sample_covariate", traits="sample_trait", merge=False)
>>> res.compute() 
<xarray.Dataset> Size: 26kB
Dimensions:                  (blocks: 2, alphas: 5, samples: 50, outcomes: 5,
                              contigs: 2)
Dimensions without coordinates: blocks, alphas, samples, outcomes, contigs
Data variables:
    regenie_base_prediction  (blocks, alphas, samples, outcomes) float64 20kB ...
    regenie_meta_prediction  (samples, outcomes) float64 2kB -0.4588 ... 0.3734
    regenie_loco_prediction  (contigs, samples, outcomes) float64 4kB 0.4886 ...

References

[1] - Mbatchou, J., L. Barnard, J. Backman, and A. Marcketta. 2020. “Computationally Efficient Whole Genome Regression for Quantitative and Binary Traits.” bioRxiv. https://www.biorxiv.org/content/10.1101/2020.06.19.162354v2.abstract.

[2] - https://glow.readthedocs.io/en/latest/tertiary/whole-genome-regression.html