sgkit.divergence#
- sgkit.divergence(ds, *, cohort_allele_count='cohort_allele_count', merge=True)#
Compute divergence between pairs of cohorts.
The entry at (i, j) is the divergence between for cohort i and cohort j, except for the case where i and j are the same, in which case the entry is the diversity for cohort i.
By default, values of this statistic are calculated per variant. To compute values in windows, call
window_by_position()
orwindow_by_variant()
before calling this function.- Parameters:
- ds
Dataset
Genotype call dataset.
- cohort_allele_count
Hashable
(default:'cohort_allele_count'
) Cohort allele count variable to use or calculate. Defined by
sgkit.variables.cohort_allele_count_spec
. If the variable is not present inds
, it will be computed usingcount_cohort_alleles()
.- merge
bool
(default:True
) If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
- ds
- Return type:
- Returns:
: A dataset containing the divergence value between pairs of cohorts, as defined by
sgkit.variables.stat_divergence_spec
. Shape (variants, cohorts, cohorts), or (windows, cohorts, cohorts) if windowing information is available.
Warning
This method does not currently support datasets that are chunked along the samples dimension.
Examples
>>> import numpy as np >>> import sgkit as sg >>> import xarray as xr >>> ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4)
>>> # Divide samples into two cohorts >>> sample_cohort = np.repeat([0, 1], ds.sizes["samples"] // 2) >>> ds["sample_cohort"] = xr.DataArray(sample_cohort, dims="samples")
>>> sg.divergence(ds)["stat_divergence"].values array([[[0.5 , 0.5 ], [0.5 , 0.66666667]], [[0.66666667, 0.5 ], [0.5 , 0.5 ]], [[0.66666667, 0.5 ], [0.5 , 0.66666667]], [[0.5 , 0.375 ], [0.375 , 0.5 ]], [[0.5 , 0.625 ], [0.625 , 0.5 ]]])
>>> # Divide into windows of size three (variants) >>> ds = sg.window_by_variant(ds, size=3) >>> sg.divergence(ds)["stat_divergence"].values array([[[1.83333333, 1.5 ], [1.5 , 1.83333333]], [[1. , 1. ], [1. , 1. ]]])