sgkit.genomic_relationship#

sgkit.genomic_relationship(ds, *, call_dosage='call_dosage', estimator=None, ancestral_frequency=None, ploidy=None, skipna=False, merge=True)#

Compute a genomic relationship matrix (AKA the GRM or G-matrix).

Parameters
ds : Dataset

Dataset containing call genotype dosages.

call_dosage : Hashable (default: 'call_dosage')

Input variable name holding call_dosage as defined by sgkit.variables.call_dosage_spec.

estimator : {‘VanRaden’} | NoneOptional[Literal[‘VanRaden’]] (default: None)

Specifies a relatedness estimator to use. Currently the only option is "VanRaden" which uses the method described by VanRaden 2008 [1] and generalized to autopolyploids by Ashraf et al 2016 [2] and Bilton 2020 [3].

ancestral_frequency : Hashable | NoneOptional[Hashable] (default: None)

Frequency of variant alleles corresponding to call_dosage within the ancestral/base/reference population. These values should range from zero to one. If the sample population was derived under Hardy-Weinberg equilibrium, then the ancestral frequencies may be approximated as the mean dosage of the sample population divided by its ploidy.

ploidy : int | NoneOptional[int] (default: None)

Ploidy level of all samples within the dataset. By default this is inferred from the size of the “ploidy” dimension of the dataset.

skipna : bool (default: False)

If True, missing (nan) values of ‘call_dosage’ will be skipped so that the relationship between each pair of individuals is estimated using only variants where both samples have non-nan values.

merge : bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type

Dataset

Returns

A dataset containing sgkit.variables.stat_genomic_relationship_spec which is a matrix of pairwise relationships among all samples. The dimensions are named samples_0 and samples_1.

Warning

This function is only applicable to fixed-ploidy, biallelic datasets.

Raises
  • ValueError – If an unknown estimator is specified.

  • ValueError – If ploidy is not specified and cannot be inferred.

  • ValueError – If ancestral_frequency is required but not specified.

  • ValueError – If ancestral_frequency is the incorrect shape.

Examples

Diploid dataset without missing data:

>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=6, n_sample=3, seed=0)
>>> ds = sg.count_call_alleles(ds)
>>> # use reference allele count as dosage
>>> ds["call_dosage"] = ds.call_allele_count[:,:,0]
>>> ds.call_dosage.values 
    array([[2, 1, 1],
           [1, 1, 1],
           [2, 1, 0],
           [2, 1, 1],
           [1, 0, 0],
           [1, 1, 2]], dtype=uint8)
>>> # use sample population frequency as ancestral frequency
>>> ds["sample_frequency"] = ds.call_dosage.mean(dim="samples") / ds.dims["ploidy"]
>>> ds = sg.genomic_relationship(ds, ancestral_frequency="sample_frequency")
>>> ds.stat_genomic_relationship.values 
    array([[ 0.93617021, -0.21276596, -0.72340426],
           [-0.21276596,  0.17021277,  0.04255319],
           [-0.72340426,  0.04255319,  0.68085106]])

Skipping partial or missing genotype calls:

>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(
...     n_variant=6,
...     n_sample=4,
...     missing_pct=0.05,
...     seed=0,
... )
>>> ds = sg.count_call_alleles(ds)
>>> ds["call_dosage"] = xr.where(
...     ds.call_genotype_mask.any(dim="ploidy"),
...     np.nan,
...     ds.call_allele_count[:,:,1],  # alternate allele
... )
>>> ds.call_dosage.values  
array([[ 0.,  1.,  1.,  1.],
       [ 1., nan,  0.,  1.],
       [ 2.,  0.,  1.,  1.],
       [ 1.,  2., nan,  1.],
       [ 1.,  0.,  1.,  2.],
       [ 2.,  2.,  0.,  0.]])
>>> ds["sample_frequency"] = ds.call_dosage.mean(
...     dim="samples", skipna=True
... ) / ds.dims["ploidy"]
>>> ds = sg.genomic_relationship(
...     ds, ancestral_frequency="sample_frequency", skipna=True
... )
>>> ds.stat_genomic_relationship.values  
array([[ 0.9744836 , -0.16978417, -0.58417266, -0.33778858],
       [-0.16978417,  1.45323741, -0.47619048, -0.89496403],
       [-0.58417266, -0.47619048,  0.62446043,  0.34820144],
       [-0.33778858, -0.89496403,  0.34820144,  0.79951397]])

Using mean imputation to replace missing genotype calls:

>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(
...     n_variant=6,
...     n_sample=4,
...     missing_pct=0.05,
...     seed=0,
... )
>>> ds = sg.count_call_alleles(ds)
>>> ds["call_dosage"] = xr.where(
...     ds.call_genotype_mask.any(dim="ploidy"),
...     np.nan,
...     ds.call_allele_count[:,:,1],  # alternate allele
... )
>>> # use mean imputation to replace missing dosage
>>> ds["call_dosage_imputed"] = xr.where(
...     ds.call_genotype_mask.any(dim="ploidy"),
...     ds.call_dosage.mean(dim="samples", skipna=True),
...     ds.call_dosage,
... )
>>> ds.call_dosage_imputed.values  
array([[0.        , 1.        , 1.        , 1.        ],
       [1.        , 0.66666667, 0.        , 1.        ],
       [2.        , 0.        , 1.        , 1.        ],
       [1.        , 2.        , 1.33333333, 1.        ],
       [1.        , 0.        , 1.        , 2.        ],
       [2.        , 2.        , 0.        , 0.        ]])
>>> ds["sample_frequency"] = ds.call_dosage.mean(
...     dim="samples", skipna=True
... ) / ds.dims["ploidy"]
>>> ds = sg.genomic_relationship(
...     ds,
...     call_dosage="call_dosage_imputed",
...     ancestral_frequency="sample_frequency",
... )
>>> ds.stat_genomic_relationship.values  
array([[ 0.9744836 , -0.14337789, -0.49331713, -0.33778858],
       [-0.14337789,  1.2272175 , -0.32806804, -0.75577157],
       [-0.49331713, -0.32806804,  0.527339  ,  0.29404617],
       [-0.33778858, -0.75577157,  0.29404617,  0.79951397]])

References

[1] - P. M. VanRaden 2008. “Efficient Methods to Compute Genomic Predictions.” Journal of Dairy Science 91 (11): 4414-4423.

[2] - B. H. Ashraf, S. Byrne, D. Fé, A. Czaban, T. Asp, M. G. Pedersen, I. Lenk, N. Roulund, T. Didion, C. S. Jensen, J. Jensen and L. L. Janss 2016. “Estimating genomic heritabilities at the level of family-pool samples of perennial ryegrass using genotyping-by-sequencing” Theoretical Applied Genetics 129: 45-52

[3] - T. Bilton 2020. “Developing statistical methods for genetic analysis of genotypes from genotyping-by-sequencing data” PhD thesis, University of Otago.