sgkit.genomic_relationship#
- sgkit.genomic_relationship(ds, *, call_dosage='call_dosage', estimator=None, ancestral_frequency=None, ploidy=None, merge=True)#
Compute a genomic relationship matrix (GRM).
- Parameters
- ds :
Dataset
Dataset containing call genotype dosages.
- call_dosage :
Hashable
(default:'call_dosage'
) Input variable name holding call_dosage as defined by
sgkit.variables.call_dosage_spec
.- estimator : {‘VanRaden’} |
None
Optional
[Literal
[‘VanRaden’]] (default:None
) Specifies a relatedness estimator to use. Currently the only option is
"VanRaden"
which uses the method described by VanRaden 2008 [1] and generalized to autopolyploids by Ashraf et al 2016 [2] and Bilton 2020 [3].- ancestral_frequency :
Hashable
|None
Optional
[Hashable
] (default:None
) Frequency of variant alleles corresponding to call_dosage within the ancestral/base/reference population. These values should range from zero to one. If the sample population was derived under Hardy-Weinberg equilibrium, then the ancestral frequencies may be approximated as the mean dosage of the sample population divided by its ploidy.
- ploidy :
int
|None
Optional
[int
] (default:None
) Ploidy level of all samples within the dataset. By default this is inferred from the size of the “ploidy” dimension of the dataset.
- merge :
bool
(default:True
) If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
- ds :
- Return type
- Returns
A dataset containing
sgkit.variables.stat_genomic_relationship_spec
which is a matrix of pairwise relationships among all samples. The dimensions are namedsamples_0
andsamples_1
.
Warning
This function is only applicable to fixed-ploidy, biallelic datasets.
- Raises
ValueError – If an unknown estimator is specified.
ValueError – If ploidy is not specified and cannot be inferred.
ValueError – If ancestral_frequency is required but not specified.
ValueError – If ancestral_frequency is the incorrect shape.
Examples
>>> import sgkit as sg >>> ds = sg.simulate_genotype_call_dataset(n_variant=6, n_sample=3, seed=0) >>> ds = sg.count_call_alleles(ds) >>> # use reference allele count as dosage >>> ds["call_dosage"] = ds.call_allele_count[:,:,0] >>> ds.call_dosage.values array([[2, 1, 1], [1, 1, 1], [2, 1, 0], [2, 1, 1], [1, 0, 0], [1, 1, 2]], dtype=uint8) >>> # use sample population frequency as ancestral frequency >>> ds["sample_frequency"] = ds.call_dosage.mean(dim="samples") / ds.dims["ploidy"] >>> ds = sg.genomic_relationship(ds, ancestral_frequency="sample_frequency") >>> ds.stat_genomic_relationship.values array([[ 0.93617021, -0.21276596, -0.72340426], [-0.21276596, 0.17021277, 0.04255319], [-0.72340426, 0.04255319, 0.68085106]])
References
[1] - P. M. VanRaden 2008. “Efficient Methods to Compute Genomic Predictions.” Journal of Dairy Science 91 (11): 4414-4423.
[2] - B. H. Ashraf, S. Byrne, D. Fé, A. Czaban, T. Asp, M. G. Pedersen, I. Lenk, N. Roulund, T. Didion, C. S. Jensen, J. Jensen and L. L. Janss 2016. “Estimating genomic heritabilities at the level of family-pool samples of perennial ryegrass using genotyping-by-sequencing” Theoretical Applied Genetics 129: 45-52
[3] - T. Bilton 2020. “Developing statistical methods for genetic analysis of genotypes from genotyping-by-sequencing data” PhD thesis, University of Otago.