sgkit.divergence#

sgkit.divergence(ds, *, cohort_allele_count='cohort_allele_count', merge=True)#

Compute divergence between pairs of cohorts.

The entry at (i, j) is the divergence between for cohort i and cohort j, except for the case where i and j are the same, in which case the entry is the diversity for cohort i.

By default, values of this statistic are calculated per variant. To compute values in windows, call window_by_position() or window_by_variant() before calling this function.

Parameters:
ds Dataset

Genotype call dataset.

cohort_allele_count Hashable (default: 'cohort_allele_count')

Cohort allele count variable to use or calculate. Defined by sgkit.variables.cohort_allele_count_spec. If the variable is not present in ds, it will be computed using count_cohort_alleles().

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type:

Dataset

Returns:

: A dataset containing the divergence value between pairs of cohorts, as defined by sgkit.variables.stat_divergence_spec. Shape (variants, cohorts, cohorts), or (windows, cohorts, cohorts) if windowing information is available.

Warning

This method does not currently support datasets that are chunked along the samples dimension.

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4)
>>> # Divide samples into two cohorts
>>> sample_cohort = np.repeat([0, 1], ds.sizes["samples"] // 2)
>>> ds["sample_cohort"] = xr.DataArray(sample_cohort, dims="samples")
>>> sg.divergence(ds)["stat_divergence"].values 
array([[[0.5       , 0.5       ],
        [0.5       , 0.66666667]],

    [[0.66666667, 0.5       ],
        [0.5       , 0.5       ]],

    [[0.66666667, 0.5       ],
        [0.5       , 0.66666667]],

    [[0.5       , 0.375     ],
        [0.375     , 0.5       ]],

    [[0.5       , 0.625     ],
        [0.625     , 0.5       ]]])
>>> # Divide into windows of size three (variants)
>>> ds = sg.window_by_variant(ds, size=3)
>>> sg.divergence(ds)["stat_divergence"].values 
array([[[1.83333333, 1.5       ],
        [1.5       , 1.83333333]],

    [[1.        , 1.        ],
        [1.        , 1.        ]]])