sgkit.Tajimas_D

sgkit.Tajimas_D(ds, *, variant_allele_count='variant_allele_count', stat_diversity='stat_diversity', merge=True)

Compute Tajimas’ D for a genotype call dataset.

By default, values of this statistic are calculated per variant. To compute values in windows, call window() before calling this function.

Parameters
ds : DatasetDataset

Genotype call dataset.

variant_allele_count : HashableHashable (default: 'variant_allele_count')

Variant allele count variable to use or calculate. Defined by sgkit.variables.variant_allele_count_spec. If the variable is not present in ds, it will be computed using count_variant_alleles().

stat_diversity : HashableHashable (default: 'stat_diversity')

Diversity variable to use or calculate. Defined by sgkit.variables.stat_diversity_spec. If the variable is not present in ds, it will be computed using diversity().

merge : boolbool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type

DatasetDataset

Returns

A dataset containing the Tajimas’ D value, as defined by sgkit.variables.stat_Tajimas_D_spec. Shape (variants, cohorts), or (windows, cohorts) if windowing information is available.

Warning

This method does not currently support datasets that are chunked along the samples dimension.

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4)
>>> # Divide samples into two cohorts
>>> sample_cohort = np.repeat([0, 1], ds.dims["samples"] // 2)
>>> ds["sample_cohort"] = xr.DataArray(sample_cohort, dims="samples")
>>> sg.Tajimas_D(ds)["stat_Tajimas_D"].values 
array([[-3.35891429, -2.96698697],
    [-2.96698697, -3.35891429],
    [-2.96698697, -2.96698697],
    [-3.35891429, -3.35891429],
    [-3.35891429, -3.35891429]])
>>> # Divide into windows of size three (variants)
>>> ds = sg.window(ds, size=3)
>>> sg.Tajimas_D(ds)["stat_Tajimas_D"].values 
array([[-0.22349574, -0.22349574],
    [-2.18313233, -2.18313233]])