sgkit.Fst

sgkit.Fst(ds, *, estimator=None, stat_divergence='stat_divergence', merge=True)

Compute Fst between pairs of cohorts.

By default, values of this statistic are calculated per variant. To compute values in windows, call window_by_position() or window_by_variant() before calling this function.

Parameters:
ds Dataset

Genotype call dataset.

estimator Optional[str] (default: None)

Determines the formula to use for computing Fst. If None (the default) or "Hudson", Fst is calculated using the method of Hudson (1992) as elaborated by Bhatia et al. (2013) (the same estimator as scikit-allel). Other supported estimators include "Nei", the method of Nei (1986) (the same estimator as tskit).

stat_divergence Hashable (default: 'stat_divergence')

Divergence variable to use or calculate. Defined by sgkit.variables.stat_divergence_spec. If the variable is not present in ds, it will be computed using divergence().

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
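
To make the difference between the two estimators concrete, the sketch below applies both formulas to a single divergence matrix using NumPy. It is an illustration under stated assumptions, not sgkit's implementation: the helper names `fst_hudson` and `fst_nei` are hypothetical, the matrix values are made up, and the layout (within-cohort diversity on the diagonal, between-cohort divergence off the diagonal) is assumed. The Nei formula is written as it appears in the tskit documentation.

```python
import numpy as np


def fst_hudson(d):
    """Hudson-style Fst from a (cohorts, cohorts) divergence matrix (hypothetical helper)."""
    within = np.diag(d)
    # Mean within-cohort diversity for every cohort pair (i, j)
    mean_within = (within[:, None] + within[None, :]) / 2
    fst = 1 - mean_within / d
    # Fst of a cohort with itself is undefined
    np.fill_diagonal(fst, np.nan)
    return fst


def fst_nei(d):
    """Nei-style Fst from a (cohorts, cohorts) divergence matrix (hypothetical helper)."""
    within = np.diag(d)
    # Summed within-cohort diversity for every cohort pair (i, j)
    sum_within = within[:, None] + within[None, :]
    fst = 1 - 2 * sum_within / (sum_within + 2 * d)
    np.fill_diagonal(fst, np.nan)
    return fst


# Assumed layout: diagonal = diversity within each cohort,
# off-diagonal = divergence between cohorts (illustrative values)
d = np.array([[2 / 3, 1 / 2],
              [1 / 2, 2 / 3]])

hudson = fst_hudson(d)
nei = fst_nei(d)
```

With these values the Hudson formula gives -1/3 for the cohort pair and the Nei formula gives -1/7, so the two estimators can disagree noticeably on the same input.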

Return type:

Dataset

Returns:

A dataset containing the Fst value between pairs of cohorts, as defined by sgkit.variables.stat_Fst_spec. Shape (variants, cohorts, cohorts), or (windows, cohorts, cohorts) if windowing information is available.

Warning

This method does not currently support datasets that are chunked along the samples dimension.

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4)
>>> # Divide samples into two cohorts
>>> sample_cohort = np.repeat([0, 1], ds.sizes["samples"] // 2)
>>> ds["sample_cohort"] = xr.DataArray(sample_cohort, dims="samples")
>>> sg.Fst(ds)["stat_Fst"].values
array([[[        nan, -0.16666667],
        [-0.16666667,         nan]],

       [[        nan, -0.16666667],
        [-0.16666667,         nan]],

       [[        nan, -0.33333333],
        [-0.33333333,         nan]],

       [[        nan, -0.33333333],
        [-0.33333333,         nan]],

       [[        nan,  0.2       ],
        [ 0.2       ,         nan]]])
>>> # Divide into windows of size three (variants)
>>> ds = sg.window_by_variant(ds, size=3)
>>> sg.Fst(ds)["stat_Fst"].values
array([[[        nan, -0.22222222],
        [-0.22222222,         nan]],

       [[        nan,  0.        ],
        [ 0.        ,         nan]]])
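
Note that the second window's value (0) is not the mean of the per-variant values for the last two variants (-0.33333333 and 0.2). The NumPy sketch below shows one way such a window value can arise, assuming that divergence is summed over the variants in a window before the Hudson formula is applied. This is an assumption about the aggregation, not a statement of sgkit's implementation; the helper `hudson_pair` is hypothetical, and the divergence matrices are illustrative values chosen to reproduce those two per-variant results, not values read from the simulated dataset.

```python
import numpy as np


def hudson_pair(d):
    """Hudson-style Fst for cohorts 0 and 1 from a 2 x 2 divergence matrix (hypothetical helper)."""
    mean_within = (d[0, 0] + d[1, 1]) / 2
    return 1 - mean_within / d[0, 1]


# Hypothetical per-variant divergence, shape (variants, cohorts, cohorts)
div = np.array([
    [[1 / 2, 3 / 8], [3 / 8, 1 / 2]],
    [[1 / 2, 5 / 8], [5 / 8, 1 / 2]],
])

# Fst computed separately for each variant
per_variant = np.array([hudson_pair(d) for d in div])

# Averaging the per-variant ratios
mean_of_ratios = per_variant.mean()

# Summing divergence over the window first, then applying the formula once
ratio_of_sums = hudson_pair(div.sum(axis=0))
```

Here `per_variant` is approximately `[-0.3333, 0.2]`, `mean_of_ratios` is about -0.0667, and `ratio_of_sums` is 0, so the two aggregation strategies give different window values.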