sgkit.genomic_relationship#
- sgkit.genomic_relationship(ds, *, call_dosage='call_dosage', estimator=None, ancestral_frequency=None, ploidy=None, skipna=False, merge=True)#
Compute a genomic relationship matrix (AKA the GRM or G-matrix).
The following estimators are supported:
VanRaden: the first estimator described by VanRaden 2008 [1] and generalized to autopolyploids by Ashraf et al 2016 [2] and Bilton 2020 [3].
Endelman-Jannink: a shrinkage estimator described by Endelman and Jannick 2012 [4]. This is based on the VanRaden estimator and aims to improve the accuracy of estimated breeding values with low-density markers.
- Parameters:
- ds
Dataset
Dataset containing call genotype dosages.
- call_dosage
Hashable
(default:'call_dosage'
) Input variable name holding call_dosage as defined by
sgkit.variables.call_dosage_spec
.- estimator {‘VanRaden’, ‘Endelman-Jannink’} |
None
Optional
[Literal
['VanRaden'
,'Endelman-Jannink'
]] (default:None
) Specifies the relatedness estimator to use. Must be one of
'VanRaden'
(the default) or'Endelman-Jannink'
.- ancestral_frequency
Hashable
|None
Optional
[Hashable
] (default:None
) Frequency of variant alleles corresponding to call_dosage within the ancestral/base/reference population. These values should range from zero to one. If the sample population was derived under Hardy-Weinberg equilibrium, then the ancestral frequencies may be approximated as the mean dosage of the sample population divided by its ploidy.
- ploidy
int
|None
Optional
[int
] (default:None
) Ploidy level of all samples within the dataset. By default this is inferred from the size of the “ploidy” dimension of the dataset.
- skipna
bool
(default:False
) If True, missing (nan) values of ‘call_dosage’ will be skipped so that the relationship between each pair of individuals is estimated using only variants where both samples have non-nan values.
- merge
bool
(default:True
) If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
- ds
- Return type:
- Returns:
: A dataset containing
sgkit.variables.stat_genomic_relationship_spec
which is a matrix of pairwise relationships among all samples. The dimensions are namedsamples_0
andsamples_1
.- Raises:
ValueError – If an unknown estimator is specified.
ValueError – If ploidy is not specified and cannot be inferred.
ValueError – If ancestral_frequency is required but not specified.
ValueError – If ancestral_frequency is the incorrect shape.
Note
The shrinkage parameter for the Endelman-Jannink estimator depends upon the total number of variants in the dataset. Monomorphic variants need to be removed from the dataset in order for the resulting estimates to match those found using the default settings in rrBLUP [5].
Warning
This function is only applicable to fixed-ploidy, biallelic datasets.
Examples
Diploid dataset without missing data:
>>> import sgkit as sg >>> ds = sg.simulate_genotype_call_dataset(n_variant=6, n_sample=3, seed=0) >>> ds = sg.count_call_alleles(ds) >>> # use reference allele count as dosage >>> ds["call_dosage"] = ds.call_allele_count[:,:,0] >>> ds.call_dosage.values array([[2, 1, 1], [1, 1, 1], [2, 1, 0], [2, 1, 1], [1, 0, 0], [1, 1, 2]], dtype=uint8) >>> # use sample population frequency as ancestral frequency >>> ds["sample_frequency"] = ds.call_dosage.mean(dim="samples") / ds.sizes["ploidy"] >>> ds = sg.genomic_relationship(ds, ancestral_frequency="sample_frequency") >>> ds.stat_genomic_relationship.values array([[ 0.93617021, -0.21276596, -0.72340426], [-0.21276596, 0.17021277, 0.04255319], [-0.72340426, 0.04255319, 0.68085106]])
Skipping partial or missing genotype calls:
>>> import sgkit as sg >>> import xarray as xr >>> ds = sg.simulate_genotype_call_dataset( ... n_variant=6, ... n_sample=4, ... missing_pct=0.05, ... seed=0, ... ) >>> ds = sg.count_call_alleles(ds) >>> ds["call_dosage"] = xr.where( ... ds.call_genotype_mask.any(dim="ploidy"), ... np.nan, ... ds.call_allele_count[:,:,1], # alternate allele ... ) >>> ds.call_dosage.values array([[ 0., 1., 1., 1.], [ 1., nan, 0., 1.], [ 2., 0., 1., 1.], [ 1., 2., nan, 1.], [ 1., 0., 1., 2.], [ 2., 2., 0., 0.]]) >>> ds["sample_frequency"] = ds.call_dosage.mean( ... dim="samples", skipna=True ... ) / ds.sizes["ploidy"] >>> ds = sg.genomic_relationship( ... ds, ancestral_frequency="sample_frequency", skipna=True ... ) >>> ds.stat_genomic_relationship.values array([[ 0.9744836 , -0.16978417, -0.58417266, -0.33778858], [-0.16978417, 1.45323741, -0.47619048, -0.89496403], [-0.58417266, -0.47619048, 0.62446043, 0.34820144], [-0.33778858, -0.89496403, 0.34820144, 0.79951397]])
Using mean imputation to replace missing genotype calls:
>>> import sgkit as sg >>> import xarray as xr >>> ds = sg.simulate_genotype_call_dataset( ... n_variant=6, ... n_sample=4, ... missing_pct=0.05, ... seed=0, ... ) >>> ds = sg.count_call_alleles(ds) >>> ds["call_dosage"] = xr.where( ... ds.call_genotype_mask.any(dim="ploidy"), ... np.nan, ... ds.call_allele_count[:,:,1], # alternate allele ... ) >>> # use mean imputation to replace missing dosage >>> ds["call_dosage_imputed"] = xr.where( ... ds.call_genotype_mask.any(dim="ploidy"), ... ds.call_dosage.mean(dim="samples", skipna=True), ... ds.call_dosage, ... ) >>> ds.call_dosage_imputed.values array([[0. , 1. , 1. , 1. ], [1. , 0.66666667, 0. , 1. ], [2. , 0. , 1. , 1. ], [1. , 2. , 1.33333333, 1. ], [1. , 0. , 1. , 2. ], [2. , 2. , 0. , 0. ]]) >>> ds["sample_frequency"] = ds.call_dosage.mean( ... dim="samples", skipna=True ... ) / ds.sizes["ploidy"] >>> ds = sg.genomic_relationship( ... ds, ... call_dosage="call_dosage_imputed", ... ancestral_frequency="sample_frequency", ... ) >>> ds.stat_genomic_relationship.values array([[ 0.9744836 , -0.14337789, -0.49331713, -0.33778858], [-0.14337789, 1.2272175 , -0.32806804, -0.75577157], [-0.49331713, -0.32806804, 0.527339 , 0.29404617], [-0.33778858, -0.75577157, 0.29404617, 0.79951397]])
References
[1] - P. M. VanRaden 2008. “Efficient Methods to Compute Genomic Predictions.” Journal of Dairy Science 91 (11): 4414-4423.
[2] - B. H. Ashraf, S. Byrne, D. Fé, A. Czaban, T. Asp, M. G. Pedersen, I. Lenk, N. Roulund, T. Didion, C. S. Jensen, J. Jensen and L. L. Janss 2016. “Estimating genomic heritabilities at the level of family-pool samples of perennial ryegrass using genotyping-by-sequencing” Theoretical Applied Genetics 129: 45-52
[3] - T. Bilton 2020. “Developing statistical methods for genetic analysis of genotypes from genotyping-by-sequencing data” PhD thesis, University of Otago.
[4] - J. B. Endelman and J. -L. Jannink 2012. “Shrinkage Estimation of the Realized Relationship Matrix” G3 2: 1405-1413.
[5] - J. B. Endelman “Ridge regression and other kernels for genomic selection with R package” Plant Genome 4: 250-255.