sgkit.identity_by_state

sgkit.identity_by_state(ds, *, call_genotype='call_genotype', call_allele_frequency='call_allele_frequency', method='frequencies', skipna=True, merge=True)

Compute identity by state (IBS) probabilities between all pairs of samples.

The IBS probability between a pair of individuals is the probability that a randomly drawn allele from the first individual is identical in state with a randomly drawn allele from the second individual at a single random locus.
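As a rough illustration of this definition (a plain NumPy sketch, not sgkit's implementation), the probability for one pair of samples can be computed by comparing every allele of the first sample with every allele of the second at each locus, then averaging over loci. The helper name `pairwise_ibs` is hypothetical, and the sketch assumes there are no missing alleles. The genotypes are those of samples S0 and S2 from the Examples section.

```python
import numpy as np

def pairwise_ibs(a, b):
    """IBS probability for two samples with genotypes of shape (variants, ploidy).

    Hypothetical helper for illustration only; assumes no missing alleles.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # Compare each allele of `a` with each allele of `b` at every locus.
    matches = a[:, :, None] == b[:, None, :]
    # The mean over allele pairs is the per-locus probability; the mean over
    # loci corresponds to drawing a single random locus.
    return matches.mean(axis=(1, 2)).mean()

s0 = [[0, 0], [1, 1]]  # genotypes 0/0 and 1/1
s2 = [[1, 0], [1, 0]]  # genotypes 1/0 and 1/0
print(pairwise_ibs(s0, s2))  # 0.5
print(pairwise_ibs(s2, s2))  # 0.5
```

Note that a heterozygous sample compared with itself gives 0.5 rather than 1, because the two alleles are drawn independently.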

Parameters:
ds Dataset

Dataset containing call genotype alleles.

call_genotype Hashable (default: 'call_genotype')

Input variable name holding call_genotype as defined by sgkit.variables.call_genotype_spec. This variable is only required for the “matching” method.

call_allele_frequency Hashable (default: 'call_allele_frequency')

Input variable name holding call_allele_frequency as defined by sgkit.variables.call_allele_frequency_spec. This variable is only required for the “frequencies” method. If the variable is not present in ds, it will be computed using call_allele_frequencies().

method Literal['frequencies', 'matching'] (default: 'frequencies')

The method used for IBS estimation. The default, “frequencies”, calculates IBS probabilities by matrix multiplication of call allele frequencies, which is more efficient when the alleles dimension is small. The “matching” method calculates IBS probabilities directly from the call genotypes, which is more efficient when the alleles dimension is large.

skipna bool (default: True)

If True (the default), missing (nan) allele frequencies will be skipped.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type:

Dataset

Returns:

A dataset containing sgkit.variables.stat_identity_by_state_spec, which is a matrix of pairwise IBS probabilities among all samples. The dimensions are named samples_0 and samples_1.

Examples

>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=2, n_sample=3, seed=2)
>>> sg.display_genotypes(ds) 
samples    S0   S1   S2
variants
0         0/0  1/1  1/0
1         1/1  1/1  1/0
>>> sg.identity_by_state(ds)["stat_identity_by_state"].values 
array([[1. , 0.5, 0.5],
       [0.5, 1. , 0.5],
       [0.5, 0.5, 0.5]])
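
The two values of the method parameter can be sketched in plain NumPy to show why they agree. This is only an illustration under simplifying assumptions (no missing calls, in-memory arrays) and is not how sgkit itself implements either method. The array names `gt`, `af`, `ibs_freq` and `ibs_match` are made up for this sketch, and the genotypes are the ones displayed above.

```python
import numpy as np

# Genotype calls with shape (variants, samples, ploidy), as displayed above.
gt = np.array([
    [[0, 0], [1, 1], [1, 0]],
    [[1, 1], [1, 1], [1, 0]],
])
n_alleles = 2

# "frequencies" style: per-call allele frequencies with shape
# (variants, samples, alleles), then a product summed over alleles.
one_hot = gt[..., None] == np.arange(n_alleles)
af = one_hot.mean(axis=2)
ibs_freq = np.einsum("via,vja->ij", af, af) / gt.shape[0]

# "matching" style: compare alleles directly for every pair of samples,
# then average over variants and over both ploidy axes.
matches = gt[:, :, None, :, None] == gt[:, None, :, None, :]
ibs_match = matches.mean(axis=(0, 3, 4))

print(np.allclose(ibs_freq, ibs_match))  # True
print(ibs_freq)
```

Both arrays equal the matrix returned in the example above. The frequency-based version works on an array whose size grows with the number of alleles, while the matching version compares alleles directly and does not depend on the size of the alleles dimension.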