sgkit.diversity#

sgkit.diversity(ds, *, cohort_allele_count='cohort_allele_count', merge=True)#

Compute diversity from cohort allele counts.

By default, values of this statistic are calculated per variant. To compute values in windows, call window_by_position() or window_by_variant() before calling this function.

Parameters:
ds Dataset

Genotype call dataset.

cohort_allele_count Hashable (default: 'cohort_allele_count')

Cohort allele count variable to use or calculate. Defined by sgkit.variables.cohort_allele_count_spec. If the variable is not present in ds, it will be computed using count_cohort_alleles().

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type:

Dataset

Returns:

: A dataset containing the diversity values, as defined by sgkit.variables.stat_diversity_spec. Shape (variants, cohorts), or (windows, cohorts) if windowing information is available.

Warning

This method does not currently support datasets that are chunked along the samples dimension.

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4)
>>> # Divide samples into two cohorts
>>> sample_cohort = np.repeat([0, 1], ds.sizes["samples"] // 2)
>>> ds["sample_cohort"] = xr.DataArray(sample_cohort, dims="samples")
>>> sg.diversity(ds)["stat_diversity"].values 
array([[0.5       , 0.66666667],
    [0.66666667, 0.5       ],
    [0.66666667, 0.66666667],
    [0.5       , 0.5       ],
    [0.5       , 0.5       ]])
>>> # Divide into windows of size three (variants)
>>> ds = sg.window_by_variant(ds, size=3)
>>> sg.diversity(ds)["stat_diversity"].values 
array([[1.83333333, 1.83333333],
    [1.        , 1.        ]])