sgkit.cohort_allele_frequencies

sgkit.cohort_allele_frequencies(ds, *, cohort_allele_count='cohort_allele_count', merge=True)

Compute allele frequencies for each cohort.

Parameters:
ds Dataset

Dataset containing genotype calls.

cohort_allele_count Hashable (default: 'cohort_allele_count')

Input variable name holding cohort_allele_count as defined by sgkit.variables.cohort_allele_count_spec. If the variable is not present in ds, it will be computed using count_cohort_alleles().

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset; otherwise, return only the computed output variables. See Dataset merge behavior for more details.

Return type:

Dataset

Returns:

A dataset containing the variable cohort_allele_frequency, as defined by sgkit.variables.cohort_allele_frequency_spec. It holds allele frequencies with shape (variants, cohorts, alleles), where each value is the frequency of an allele among the non-missing allele calls in a cohort.
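
To illustrate what "frequency of non-missing occurrences" means, here is a minimal sketch in plain NumPy rather than sgkit itself. It assumes missing alleles are encoded as -1 and that every cohort has at least one non-missing allele call. The helper name `cohort_frequencies` is hypothetical and not part of the sgkit API.

```python
import numpy as np


def cohort_frequencies(genotypes, sample_cohort, n_cohorts, n_alleles):
    """Hypothetical helper, not an sgkit function.

    genotypes: int array of shape (variants, samples, ploidy);
        -1 is assumed to mark a missing allele.
    sample_cohort: int array of shape (samples,) with cohort indices.
    Returns a float array of shape (variants, cohorts, alleles).
    """
    n_variants = genotypes.shape[0]
    counts = np.zeros((n_variants, n_cohorts, n_alleles), dtype=np.int64)
    for c in range(n_cohorts):
        # Flatten all allele calls of the samples belonging to cohort c.
        calls = genotypes[:, sample_cohort == c, :].reshape(n_variants, -1)
        for a in range(n_alleles):
            # Missing calls (-1) never equal a valid allele index,
            # so they are excluded from every count.
            counts[:, c, a] = (calls == a).sum(axis=1)
    # The denominator is the number of non-missing calls per cohort.
    totals = counts.sum(axis=-1, keepdims=True)
    return counts / totals


# One variant, two cohorts of two diploid samples each;
# one allele call in the second cohort is missing.
gt = np.array([[[0, 0], [0, 1], [1, -1], [1, 1]]])
freq = cohort_frequencies(gt, np.array([0, 0, 1, 1]), n_cohorts=2, n_alleles=2)
print(freq)
```

In this sketch the second cohort has three non-missing calls, all of allele 1, so its frequencies are 0 and 1: the missing call is left out of the denominator instead of being counted as a separate category.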

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4)
>>> # Divide samples into two cohorts
>>> ds["sample_cohort"] = xr.DataArray(np.repeat([0, 1], ds.sizes["samples"] // 2), dims="samples")
>>> sg.display_genotypes(ds) 
samples    S0   S1   S2   S3
variants
0         0/0  1/0  1/0  0/1
1         1/0  0/1  0/0  1/0
2         1/1  0/0  1/0  0/1
3         1/0  1/1  1/1  1/0
4         1/0  0/0  1/0  1/1
>>> sg.cohort_allele_frequencies(ds)["cohort_allele_frequency"].values 
array([[[0.75, 0.25],
        [0.5 , 0.5 ]],

       [[0.5 , 0.5 ],
        [0.75, 0.25]],

       [[0.5 , 0.5 ],
        [0.5 , 0.5 ]],

       [[0.25, 0.75],
        [0.25, 0.75]],

       [[0.75, 0.25],
        [0.25, 0.75]]])
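
As a cross-check, the output above can be reproduced by hand from the displayed genotype table. The following sketch uses only NumPy; the genotype array is transcribed from the table (samples S0 and S1 form cohort 0, S2 and S3 form cohort 1), and the variable names are illustrative rather than taken from sgkit.

```python
import numpy as np

# Genotypes transcribed from the table above: (variants, samples, ploidy).
gt = np.array([
    [[0, 0], [1, 0], [1, 0], [0, 1]],
    [[1, 0], [0, 1], [0, 0], [1, 0]],
    [[1, 1], [0, 0], [1, 0], [0, 1]],
    [[1, 0], [1, 1], [1, 1], [1, 0]],
    [[1, 0], [0, 0], [1, 0], [1, 1]],
])
sample_cohort = np.array([0, 0, 1, 1])
n_cohorts, n_alleles = 2, 2

# One-hot encode each allele call: (variants, samples, ploidy, alleles).
one_hot = gt[..., None] == np.arange(n_alleles)

# Sum over the samples and ploidy of each cohort: (variants, cohorts, alleles).
counts = np.stack(
    [one_hot[:, sample_cohort == c].sum(axis=(1, 2)) for c in range(n_cohorts)],
    axis=1,
)

# Divide each count by the total number of calls in its cohort.
freq = counts / counts.sum(axis=-1, keepdims=True)
print(freq)
```

For variant 0, cohort 0 contains the calls 0/0 and 1/0: three copies of allele 0 and one of allele 1, giving frequencies 0.75 and 0.25, which matches the first row of the output.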