sgkit.hardy_weinberg_test

sgkit.hardy_weinberg_test(ds, *, genotype_count='variant_genotype_count', ploidy=None, alleles=None, merge=True)

Exact test for Hardy-Weinberg equilibrium (HWE) as described in Wigginton et al. (2005) [1].

Parameters:
ds Dataset

Dataset containing genotype calls or precomputed genotype counts.

genotype_count Optional[Hashable] (default: 'variant_genotype_count')

Name of the variable containing precomputed genotype counts for each variant, as described in sgkit.variables.variant_genotype_count_spec. If the variable is not present in ds, it will be computed using count_variant_genotypes(), which automatically assigns coordinates to the genotypes dimension.

ploidy Optional[int] (default: None)

Genotype ploidy. Defaults to the size of the ploidy dimension of the provided dataset. If the ploidy dimension is not present, this value must be set explicitly. HWE calculations are currently supported only for diploid datasets, i.e. ploidy must equal 2.

alleles Optional[int] (default: None)

Genotype allele count. Defaults to the size of the alleles dimension of the provided dataset. If the alleles dimension is not present, this value must be set explicitly. HWE calculations are currently supported only for biallelic datasets, i.e. alleles must equal 2.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Warning

This function is only applicable to diploid, biallelic datasets. The genotype_count array should have three columns corresponding to the genotypes dimension. These columns should have coordinates '0/0', '0/1', and '1/1' which respectively contain counts for homozygous reference, heterozygous, and homozygous alternate genotypes.
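The count layout described in this warning can be illustrated without sgkit. The helper below is a hypothetical sketch, not part of the library: the name `count_genotypes` and the use of negative allele values to mark missing calls are assumptions made for this example. It tallies diploid, biallelic calls for one variant into three counts ordered as '0/0', '0/1', '1/1'.

```python
from collections import Counter

# Column order assumed for this example, matching the coordinates
# named in the warning above.
GENOTYPES = ("0/0", "0/1", "1/1")


def count_genotypes(calls):
    """Count diploid, biallelic genotype calls for a single variant.

    `calls` is an iterable of (allele_a, allele_b) pairs with alleles
    coded 0 (reference) or 1 (alternate). Any negative allele is
    treated as missing (an assumption of this sketch) and the whole
    call is skipped.
    """
    counts = Counter()
    for a, b in calls:
        if a < 0 or b < 0:
            continue  # skip calls with a missing allele
        if a > 1 or b > 1:
            raise ValueError("only biallelic calls (alleles 0 and 1) are handled")
        # Sort the alleles so that 1/0 and 0/1 land in the same column.
        counts[f"{min(a, b)}/{max(a, b)}"] += 1
    return [counts[g] for g in GENOTYPES]


# One variant observed in five samples, one of which is missing.
calls = [(0, 0), (0, 1), (1, 0), (1, 1), (-1, -1)]
print(dict(zip(GENOTYPES, count_genotypes(calls))))
```

Sorting the alleles within each call means the phase of a heterozygous call does not affect which column it is counted in.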

Return type:

Dataset

Returns:

Dataset containing (N = num variants):

variant_hwe_p_value [array-like, shape: (N,)]

P values from HWE test for each variant as float in [0, 1].

Raises:
  • NotImplementedError – If the dataset is not limited to biallelic, diploid genotypes.

  • ValueError – If the ploidy or the number of alleles is not specified and is not present as a dimension in the dataset.

  • ValueError – If no coordinates are assigned to the genotypes dimension.

  • KeyError – If any of the genotypes '0/0', '0/1' or '1/1' is not present among the coordinates of the genotypes dimension.

References

  • [1] Wigginton, Janis E., David J. Cutler, and Goncalo R. Abecasis. 2005. “A Note on Exact Tests of Hardy-Weinberg Equilibrium.” American Journal of Human Genetics 76 (5): 887–93.
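
For orientation, the exact test cited in [1] can be sketched in plain Python. This is a minimal illustration and not sgkit's own implementation: the function name `hwe_exact_p_value` and its argument names are hypothetical, and it handles one variant at a time from its three genotype counts. It computes the relative probability of every possible heterozygote count given the observed allele counts, then sums the probabilities that are no larger than that of the observed count.

```python
import math


def hwe_exact_p_value(obs_hets, obs_hom1, obs_hom2):
    """Exact HWE p-value for one diploid, biallelic variant (sketch)."""
    if min(obs_hets, obs_hom1, obs_hom2) < 0:
        raise ValueError("genotype counts must be non-negative")

    obs_homc = max(obs_hom1, obs_hom2)  # common homozygote count
    obs_homr = min(obs_hom1, obs_hom2)  # rare homozygote count
    n = obs_hets + obs_homc + obs_homr  # number of genotyped samples
    if n == 0:
        return math.nan

    rare = 2 * obs_homr + obs_hets  # copies of the rare allele

    # Unnormalised probabilities indexed by heterozygote count.
    probs = [0.0] * (rare + 1)

    # Start near the expected heterozygote count, which must have the
    # same parity as the rare allele count.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs[mid] = 1.0
    total = 1.0

    # Recurrence towards fewer heterozygotes.
    hets, homr = mid, (rare - mid) // 2
    homc = n - hets - homr
    while hets > 1:
        probs[hets - 2] = (
            probs[hets] * hets * (hets - 1) / (4.0 * (homr + 1) * (homc + 1))
        )
        total += probs[hets - 2]
        hets, homr, homc = hets - 2, homr + 1, homc + 1

    # Recurrence towards more heterozygotes.
    hets, homr = mid, (rare - mid) // 2
    homc = n - hets - homr
    while hets <= rare - 2:
        probs[hets + 2] = (
            probs[hets] * 4.0 * homr * homc / ((hets + 2.0) * (hets + 1.0))
        )
        total += probs[hets + 2]
        hets, homr, homc = hets + 2, homr - 1, homc - 1

    # Sum every outcome that is no more likely than the observed one.
    target = probs[obs_hets]
    p_value = sum(p for p in probs if p <= target) / total
    return min(1.0, p_value)


# Two samples carrying opposite homozygous genotypes.
print(hwe_exact_p_value(obs_hets=0, obs_hom1=1, obs_hom2=1))
```

In this small case only two outcomes are possible given two copies of each allele: zero heterozygotes (probability 1/3) or two heterozygotes (probability 2/3), so the observed configuration yields a p-value of 1/3.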