sgkit.genomic_relationship#
- sgkit.genomic_relationship(ds, *, call_dosage='call_dosage', estimator=None, ancestral_frequency=None, ploidy=None, merge=True)#
Compute a genomic relationship matrix (GRM).
- Parameters
- ds :
Dataset Dataset containing call genotype dosages.
- call_dosage :
Hashable(default:'call_dosage') Input variable name holding call_dosage as defined by
sgkit.variables.call_dosage_spec.- estimator : {‘VanRaden’} |
NoneOptional[Literal[‘VanRaden’]] (default:None) Specifies a relatedness estimator to use. Currently the only option is
"VanRaden"which uses the method described by VanRaden 2008 [1] and generalized to autopolyploids by Ashraf et al 2016 [2] and Bilton 2020 [3].- ancestral_frequency :
Hashable|NoneOptional[Hashable] (default:None) Frequency of variant alleles corresponding to call_dosage within the ancestral/base/reference population. These values should range from zero to one. If the sample population was derived under Hardy-Weinberg equilibrium, then the ancestral frequencies may be approximated as the mean dosage of the sample population divided by its ploidy.
- ploidy :
int|NoneOptional[int] (default:None) Ploidy level of all samples within the dataset. By default this is inferred from the size of the “ploidy” dimension of the dataset.
- merge :
bool(default:True) If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
- ds :
- Return type
- Returns
A dataset containing
sgkit.variables.stat_genomic_relationship_specwhich is a matrix of pairwise relationships among all samples. The dimensions are namedsamples_0andsamples_1.
Warning
This function is only applicable to fixed-ploidy, biallelic datasets.
- Raises
ValueError – If an unknown estimator is specified.
ValueError – If ploidy is not specified and cannot be inferred.
ValueError – If ancestral_frequency is required but not specified.
ValueError – If ancestral_frequency is the incorrect shape.
Examples
>>> import sgkit as sg >>> ds = sg.simulate_genotype_call_dataset(n_variant=6, n_sample=3, seed=0) >>> ds = sg.count_call_alleles(ds) >>> # use reference allele count as dosage >>> ds["call_dosage"] = ds.call_allele_count[:,:,0] >>> ds.call_dosage.values array([[2, 1, 1], [1, 1, 1], [2, 1, 0], [2, 1, 1], [1, 0, 0], [1, 1, 2]], dtype=uint8) >>> # use sample population frequency as ancestral frequency >>> ds["sample_frequency"] = ds.call_dosage.mean(dim="samples") / ds.dims["ploidy"] >>> ds = sg.genomic_relationship(ds, ancestral_frequency="sample_frequency") >>> ds.stat_genomic_relationship.values array([[ 0.93617021, -0.21276596, -0.72340426], [-0.21276596, 0.17021277, 0.04255319], [-0.72340426, 0.04255319, 0.68085106]])
References
[1] - P. M. VanRaden 2008. “Efficient Methods to Compute Genomic Predictions.” Journal of Dairy Science 91 (11): 4414-4423.
[2] - B. H. Ashraf, S. Byrne, D. Fé, A. Czaban, T. Asp, M. G. Pedersen, I. Lenk, N. Roulund, T. Didion, C. S. Jensen, J. Jensen and L. L. Janss 2016. “Estimating genomic heritabilities at the level of family-pool samples of perennial ryegrass using genotyping-by-sequencing” Theoretical Applied Genetics 129: 45-52
[3] - T. Bilton 2020. “Developing statistical methods for genetic analysis of genotypes from genotyping-by-sequencing data” PhD thesis, University of Otago.