sgkit.genomic_relationship#

sgkit.genomic_relationship(ds, *, call_dosage='call_dosage', estimator=None, ancestral_frequency=None, ploidy=None, merge=True)#

Compute a genomic relationship matrix (GRM).

Parameters

ds : Dataset: Dataset containing call genotype dosages.
call_dosage : Hashable (default: 'call_dosage'): Input variable name holding call_dosage as defined by sgkit.variables.call_dosage_spec.
estimator : {‘VanRaden’} | NoneOptional[Literal[‘VanRaden’]] (default: None): Specifies a relatedness estimator to use. Currently the only option is "VanRaden" which uses the method described by VanRaden 2008 [1] and generalized to autopolyploids by Ashraf et al 2016 [2] and Bilton 2020 [3].
ancestral_frequency : Hashable | NoneOptional[Hashable] (default: None): Frequency of variant alleles corresponding to call_dosage within the ancestral/base/reference population. These values should range from zero to one. If the sample population was derived under Hardy-Weinberg equilibrium, then the ancestral frequencies may be approximated as the mean dosage of the sample population divided by its ploidy.
ploidy : int | NoneOptional[int] (default: None): Ploidy level of all samples within the dataset. By default this is inferred from the size of the “ploidy” dimension of the dataset.
merge : bool (default: True): If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type

Dataset

Returns

A dataset containing sgkit.variables.stat_genomic_relationship_spec which is a matrix of pairwise relationships among all samples. The dimensions are named samples_0 and samples_1.

Warning

This function is only applicable to fixed-ploidy, biallelic datasets.

Raises

ValueError – If an unknown estimator is specified.
ValueError – If ploidy is not specified and cannot be inferred.
ValueError – If ancestral_frequency is required but not specified.
ValueError – If ancestral_frequency is the incorrect shape.

Examples

>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=6, n_sample=3, seed=0)
>>> ds = sg.count_call_alleles(ds)
>>> # use reference allele count as dosage
>>> ds["call_dosage"] = ds.call_allele_count[:,:,0]
>>> ds.call_dosage.values 
    array([[2, 1, 1],
        [1, 1, 1],
        [2, 1, 0],
        [2, 1, 1],
        [1, 0, 0],
        [1, 1, 2]], dtype=uint8)
>>> # use sample population frequency as ancestral frequency
>>> ds["sample_frequency"] = ds.call_dosage.mean(dim="samples") / ds.dims["ploidy"]
>>> ds = sg.genomic_relationship(ds, ancestral_frequency="sample_frequency")
>>> ds.stat_genomic_relationship.values 
    array([[ 0.93617021, -0.21276596, -0.72340426],
        [-0.21276596,  0.17021277,  0.04255319],
        [-0.72340426,  0.04255319,  0.68085106]])

References

[1] - P. M. VanRaden 2008. “Efficient Methods to Compute Genomic Predictions.” Journal of Dairy Science 91 (11): 4414-4423.

[2] - B. H. Ashraf, S. Byrne, D. Fé, A. Czaban, T. Asp, M. G. Pedersen, I. Lenk, N. Roulund, T. Didion, C. S. Jensen, J. Jensen and L. L. Janss 2016. “Estimating genomic heritabilities at the level of family-pool samples of perennial ryegrass using genotyping-by-sequencing” Theoretical Applied Genetics 129: 45-52

[3] - T. Bilton 2020. “Developing statistical methods for genetic analysis of genotypes from genotyping-by-sequencing data” PhD thesis, University of Otago.