sgkit.observed_heterozygosity

sgkit.observed_heterozygosity(ds, *, call_heterozygosity='call_heterozygosity', sample_cohort='sample_cohort', merge=True)

Compute per-cohort observed heterozygosity.

The observed heterozygosity of a cohort is the mean of individual heterozygosity values among all samples of that cohort as described in individual_heterozygosity(). Calls with a nan value for individual heterozygosity are ignored when calculating the cohort mean.
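
The nan-ignoring cohort mean described above can be sketched with plain NumPy. This is an illustrative sketch only, not sgkit's implementation: the arrays `call_het` and `sample_cohort` are hypothetical inputs, and the loop over cohorts is chosen for readability rather than speed.

```python
import numpy as np

# Hypothetical per-call heterozygosity values with shape (variants, samples);
# nan marks a call with no defined individual heterozygosity.
call_het = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, np.nan, 0.0, 1.0],
])
# Hypothetical cohort label for each sample.
sample_cohort = np.array([0, 0, 1, 1])

n_cohorts = int(sample_cohort.max()) + 1
observed_het = np.full((call_het.shape[0], n_cohorts), np.nan)
for c in range(n_cohorts):
    values = call_het[:, sample_cohort == c]
    valid = ~np.isnan(values)
    count = valid.sum(axis=1)
    total = np.where(valid, values, 0.0).sum(axis=1)
    # Mean over valid calls only; stays nan where a cohort has no valid calls.
    np.divide(total, count, out=observed_het[:, c], where=count > 0)

print(observed_het)
```

In this sketch the nan call at the second variant is dropped from cohort 0, so that cohort's mean is taken over the single remaining sample.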

By default, values of this statistic are calculated per variant. To compute values in windows, call window_by_position() or window_by_variant() before calling this function.

Parameters:
ds Dataset

Dataset containing genotype calls.

call_heterozygosity Hashable (default: 'call_heterozygosity')

Input variable name holding call_heterozygosity as defined by sgkit.variables.call_heterozygosity_spec. If the variable is not present in ds, it will be computed using individual_heterozygosity().

sample_cohort Hashable (default: 'sample_cohort')

Input variable name holding sample_cohort as defined by sgkit.variables.sample_cohort_spec.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type:

Dataset

Returns:

A dataset containing sgkit.variables.stat_observed_heterozygosity_spec of per-cohort observed heterozygosity with shape (variants, cohorts) containing values within the interval [0, 1] or nan.

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4)
>>> # Divide samples into two cohorts
>>> sample_cohort = np.repeat([0, 1], ds.sizes["samples"] // 2)
>>> ds["sample_cohort"] = xr.DataArray(sample_cohort, dims="samples")
>>> sg.observed_heterozygosity(ds)["stat_observed_heterozygosity"].values 
array([[0.5, 1. ],
       [1. , 0.5],
       [0. , 1. ],
       [0.5, 0.5],
       [0.5, 0.5]])
>>> # Divide into windows of size three (variants)
>>> ds = sg.window_by_variant(ds, size=3)
>>> sg.observed_heterozygosity(ds)["stat_observed_heterozygosity"].values 
array([[1.5, 2.5],
       [1. , 1. ]])
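
The windowed values in the example output above equal the per-window sums of the per-variant values, which is why they can exceed 1. The following NumPy-only sketch reproduces that arithmetic from the printed per-variant array; it assumes fixed-size, non-overlapping windows of three variants and is not a description of how sgkit computes windows internally.

```python
import numpy as np

# Per-variant values copied from the first example output above.
per_variant = np.array([
    [0.5, 1.0],
    [1.0, 0.5],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.5, 0.5],
])

size = 3  # window size in variants, matching window_by_variant(ds, size=3)
starts = np.arange(0, per_variant.shape[0], size)  # window start indices: [0, 3]
# Sum the rows of each window; the last window holds the two remaining variants.
windowed = np.add.reduceat(per_variant, starts, axis=0)

print(windowed)
```

Dividing each row of `windowed` by the number of variants in its window would give a per-window mean instead of a sum, if that is the quantity of interest.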