sgkit.genomic_relationship#

sgkit.genomic_relationship(ds, *, call_dosage='call_dosage', estimator=None, ancestral_frequency=None, ploidy=None, skipna=False, merge=True)#

Compute a genomic relationship matrix (AKA the GRM or G-matrix).

The following estimators are supported:

  • VanRaden: the first estimator described by VanRaden 2008 [1] and generalized to autopolyploids by Ashraf et al 2016 [2] and Bilton 2020 [3].

  • Endelman-Jannink: a shrinkage estimator described by Endelman and Jannick 2012 [4]. This is based on the VanRaden estimator and aims to improve the accuracy of estimated breeding values with low-density markers.

Parameters:
ds Dataset

Dataset containing call genotype dosages.

call_dosage Hashable (default: 'call_dosage')

Input variable name holding call_dosage as defined by sgkit.variables.call_dosage_spec.

estimator Literal['VanRaden', 'Endelman-Jannink'] | NoneOptional[Literal['VanRaden', 'Endelman-Jannink']] (default: None)

Specifies the relatedness estimator to use. Must be one of 'VanRaden' (the default) or 'Endelman-Jannink'.

ancestral_frequency Hashable | NoneOptional[Hashable] (default: None)

Frequency of variant alleles corresponding to call_dosage within the ancestral/base/reference population. These values should range from zero to one. If the sample population was derived under Hardy-Weinberg equilibrium, then the ancestral frequencies may be approximated as the mean dosage of the sample population divided by its ploidy.

ploidy int | NoneOptional[int] (default: None)

Ploidy level of all samples within the dataset. By default this is inferred from the size of the “ploidy” dimension of the dataset.

skipna bool (default: False)

If True, missing (nan) values of ‘call_dosage’ will be skipped so that the relationship between each pair of individuals is estimated using only variants where both samples have non-nan values.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type:

Dataset

Returns:

: A dataset containing sgkit.variables.stat_genomic_relationship_spec which is a matrix of pairwise relationships among all samples. The dimensions are named samples_0 and samples_1.

Raises:
  • ValueError – If an unknown estimator is specified.

  • ValueError – If ploidy is not specified and cannot be inferred.

  • ValueError – If ancestral_frequency is required but not specified.

  • ValueError – If ancestral_frequency is the incorrect shape.

Note

The shrinkage parameter for the Endelman-Jannink estimator depends upon the total number of variants in the dataset. Monomorphic variants need to be removed from the dataset in order for the resulting estimates to match those found using the default settings in rrBLUP [5].

Warning

This function is only applicable to fixed-ploidy, biallelic datasets.

Examples

Diploid dataset without missing data:

>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=6, n_sample=3, seed=0)
>>> ds = sg.count_call_alleles(ds)
>>> # use reference allele count as dosage
>>> ds["call_dosage"] = ds.call_allele_count[:,:,0]
>>> ds.call_dosage.values 
    array([[2, 1, 1],
           [1, 1, 1],
           [2, 1, 0],
           [2, 1, 1],
           [1, 0, 0],
           [1, 1, 2]], dtype=uint8)
>>> # use sample population frequency as ancestral frequency
>>> ds["sample_frequency"] = ds.call_dosage.mean(dim="samples") / ds.sizes["ploidy"]
>>> ds = sg.genomic_relationship(ds, ancestral_frequency="sample_frequency")
>>> ds.stat_genomic_relationship.values 
    array([[ 0.93617021, -0.21276596, -0.72340426],
           [-0.21276596,  0.17021277,  0.04255319],
           [-0.72340426,  0.04255319,  0.68085106]])

Skipping partial or missing genotype calls:

>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(
...     n_variant=6,
...     n_sample=4,
...     missing_pct=0.05,
...     seed=0,
... )
>>> ds = sg.count_call_alleles(ds)
>>> ds["call_dosage"] = xr.where(
...     ds.call_genotype_mask.any(dim="ploidy"),
...     np.nan,
...     ds.call_allele_count[:,:,1],  # alternate allele
... )
>>> ds.call_dosage.values  
array([[ 0.,  1.,  1.,  1.],
       [ 1., nan,  0.,  1.],
       [ 2.,  0.,  1.,  1.],
       [ 1.,  2., nan,  1.],
       [ 1.,  0.,  1.,  2.],
       [ 2.,  2.,  0.,  0.]])
>>> ds["sample_frequency"] = ds.call_dosage.mean(
...     dim="samples", skipna=True
... ) / ds.sizes["ploidy"]
>>> ds = sg.genomic_relationship(
...     ds, ancestral_frequency="sample_frequency", skipna=True
... )
>>> ds.stat_genomic_relationship.values  
array([[ 0.9744836 , -0.16978417, -0.58417266, -0.33778858],
       [-0.16978417,  1.45323741, -0.47619048, -0.89496403],
       [-0.58417266, -0.47619048,  0.62446043,  0.34820144],
       [-0.33778858, -0.89496403,  0.34820144,  0.79951397]])

Using mean imputation to replace missing genotype calls:

>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(
...     n_variant=6,
...     n_sample=4,
...     missing_pct=0.05,
...     seed=0,
... )
>>> ds = sg.count_call_alleles(ds)
>>> ds["call_dosage"] = xr.where(
...     ds.call_genotype_mask.any(dim="ploidy"),
...     np.nan,
...     ds.call_allele_count[:,:,1],  # alternate allele
... )
>>> # use mean imputation to replace missing dosage
>>> ds["call_dosage_imputed"] = xr.where(
...     ds.call_genotype_mask.any(dim="ploidy"),
...     ds.call_dosage.mean(dim="samples", skipna=True),
...     ds.call_dosage,
... )
>>> ds.call_dosage_imputed.values  
array([[0.        , 1.        , 1.        , 1.        ],
       [1.        , 0.66666667, 0.        , 1.        ],
       [2.        , 0.        , 1.        , 1.        ],
       [1.        , 2.        , 1.33333333, 1.        ],
       [1.        , 0.        , 1.        , 2.        ],
       [2.        , 2.        , 0.        , 0.        ]])
>>> ds["sample_frequency"] = ds.call_dosage.mean(
...     dim="samples", skipna=True
... ) / ds.sizes["ploidy"]
>>> ds = sg.genomic_relationship(
...     ds,
...     call_dosage="call_dosage_imputed",
...     ancestral_frequency="sample_frequency",
... )
>>> ds.stat_genomic_relationship.values  
array([[ 0.9744836 , -0.14337789, -0.49331713, -0.33778858],
       [-0.14337789,  1.2272175 , -0.32806804, -0.75577157],
       [-0.49331713, -0.32806804,  0.527339  ,  0.29404617],
       [-0.33778858, -0.75577157,  0.29404617,  0.79951397]])

References

[1] - P. M. VanRaden 2008. “Efficient Methods to Compute Genomic Predictions.” Journal of Dairy Science 91 (11): 4414-4423.

[2] - B. H. Ashraf, S. Byrne, D. Fé, A. Czaban, T. Asp, M. G. Pedersen, I. Lenk, N. Roulund, T. Didion, C. S. Jensen, J. Jensen and L. L. Janss 2016. “Estimating genomic heritabilities at the level of family-pool samples of perennial ryegrass using genotyping-by-sequencing” Theoretical Applied Genetics 129: 45-52

[3] - T. Bilton 2020. “Developing statistical methods for genetic analysis of genotypes from genotyping-by-sequencing data” PhD thesis, University of Otago.

[4] - J. B. Endelman and J. -L. Jannink 2012. “Shrinkage Estimation of the Realized Relationship Matrix” G3 2: 1405-1413.

[5] - J. B. Endelman “Ridge regression and other kernels for genomic selection with R package” Plant Genome 4: 250-255.