sgkit.Tajimas_D#

sgkit.Tajimas_D(ds, *, variant_allele_count='variant_allele_count', stat_diversity='stat_diversity', merge=True)#

Compute Tajimas’ D for a genotype call dataset.

By default, values of this statistic are calculated per variant. To compute values in windows, call window_by_position() or window_by_variant() before calling this function.

Parameters:
ds Dataset

Genotype call dataset.

variant_allele_count Hashable (default: 'variant_allele_count')

Variant allele count variable to use or calculate. Defined by sgkit.variables.variant_allele_count_spec. If the variable is not present in ds, it will be computed using count_variant_alleles().

stat_diversity Hashable (default: 'stat_diversity')

Diversity variable to use or calculate. Defined by sgkit.variables.stat_diversity_spec. If the variable is not present in ds, it will be computed using diversity().

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type:

Dataset

Returns:

: A dataset containing the Tajimas’ D value, as defined by sgkit.variables.stat_Tajimas_D_spec. Shape (variants, cohorts), or (windows, cohorts) if windowing information is available.

Warning

This method does not currently support datasets that are chunked along the samples dimension.

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4)
>>> # Divide samples into two cohorts
>>> sample_cohort = np.repeat([0, 1], ds.sizes["samples"] // 2)
>>> ds["sample_cohort"] = xr.DataArray(sample_cohort, dims="samples")
>>> sg.Tajimas_D(ds)["stat_Tajimas_D"].values 
array([[0.88883234, 2.18459998],
       [2.18459998, 0.88883234],
       [2.18459998, 2.18459998],
       [0.88883234, 0.88883234],
       [0.88883234, 0.88883234]])
>>> # Divide into windows of size three (variants)
>>> ds = sg.window_by_variant(ds, size=3)
>>> sg.Tajimas_D(ds)["stat_Tajimas_D"].values 
array([[2.40517586, 2.40517586],
       [1.10393559, 1.10393559]])