sgkit.identity_by_state
- sgkit.identity_by_state(ds, *, call_genotype='call_genotype', call_allele_frequency='call_allele_frequency', method='frequencies', skipna=True, merge=True)
Compute identity by state (IBS) probabilities between all pairs of samples.
The IBS probability between a pair of individuals is the probability that a randomly drawn allele from the first individual is identical in state with a randomly drawn allele from the second individual at a single random locus.
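As a minimal sketch of this definition (not sgkit's implementation), the IBS probability for one pair of samples can be computed by enumerating every pair of alleles at each locus and averaging the matching fraction over loci. The helper name `ibs_pair` and the list-of-lists genotype layout are assumptions made for illustration, and missing calls are not handled.

```python
from itertools import product


def ibs_pair(genotypes_a, genotypes_b):
    """Mean probability that a random allele from sample A matches a random
    allele from sample B, averaged over loci (hypothetical helper)."""
    per_locus = []
    for alleles_a, alleles_b in zip(genotypes_a, genotypes_b):
        # Every (allele from A, allele from B) combination at this locus.
        pairs = list(product(alleles_a, alleles_b))
        matches = sum(a == b for a, b in pairs)
        per_locus.append(matches / len(pairs))
    return sum(per_locus) / len(per_locus)


# Diploid genotypes at two loci, one inner list per locus.
s0 = [[0, 0], [1, 1]]
s1 = [[1, 1], [1, 1]]
s2 = [[1, 0], [1, 0]]

print(ibs_pair(s0, s1))  # 0.5: no match at locus 0, full match at locus 1
print(ibs_pair(s2, s2))  # 0.5: a heterozygote matches itself half the time
```

Note that a sample's IBS probability with itself is below 1 whenever it is heterozygous, because two independently drawn alleles need not be the same.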
- Parameters:
- ds
Dataset
Dataset containing call genotype alleles.
- call_genotype
Hashable (default: 'call_genotype')
Input variable name holding call_genotype as defined by sgkit.variables.call_genotype_spec. This variable is only required for the “matching” method.
- call_allele_frequency
Hashable (default: 'call_allele_frequency')
Input variable name holding call_allele_frequency as defined by sgkit.variables.call_allele_frequency_spec. This variable is only required for the “frequencies” method. If the variable is not present in ds, it will be computed using call_allele_frequencies().
- method {‘frequencies’, ‘matching’}
Literal['frequencies', 'matching'] (default: 'frequencies')
The method used for IBS estimation. The default, “frequencies”, calculates IBS probabilities by matrix multiplication of call allele frequencies, which is more efficient when the alleles dimension is small. The “matching” method calculates IBS probabilities directly from the call genotypes and is more efficient when the alleles dimension is large.
- skipna
bool (default: True)
If True (the default), missing (nan) allele frequencies will be skipped.
- merge
bool (default: True)
If True (the default), merge the input dataset and the computed output variables into a single dataset; otherwise, return only the computed output variables. See Dataset merge behavior for more details.
- Return type:
Dataset
- Returns:
A dataset containing sgkit.variables.stat_identity_by_state_spec, which is a matrix of pairwise IBS probabilities among all samples. The dimensions are named samples_0 and samples_1.
Examples
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=2, n_sample=3, seed=2)
>>> sg.display_genotypes(ds)
samples    S0   S1   S2
variants
0         0/0  1/1  1/0
1         1/1  1/1  1/0
>>> sg.identity_by_state(ds)["stat_identity_by_state"].values
array([[1. , 0.5, 0.5],
       [0.5, 1. , 0.5],
       [0.5, 0.5, 0.5]])
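The matrix-multiplication idea behind the “frequencies” method can be sketched in plain NumPy. This is an illustration under stated assumptions rather than sgkit's implementation: the array `call_af` is hand-written per-call allele frequencies for samples S0, S1 and S2 from the example above, laid out as (variants, samples, alleles), and it contains no missing values.

```python
import numpy as np

# Per-call allele frequencies, shape (variants, samples, alleles).
# Hand-written to match the genotypes of S0, S1 and S2 (an assumption
# for illustration; sgkit derives these from the call genotypes).
call_af = np.array(
    [
        [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # variant 0: 0/0, 1/1, 1/0
        [[0.0, 1.0], [0.0, 1.0], [0.5, 0.5]],  # variant 1: 1/1, 1/1, 1/0
    ]
)
n_variants, n_samples, n_alleles = call_af.shape

# Lay out each sample as one row of length variants * alleles.
flat = call_af.transpose(1, 0, 2).reshape(n_samples, n_variants * n_alleles)

# A single matrix product sums p_i * p_j over alleles and variants;
# dividing by the number of variants gives the mean per-locus probability.
ibs = flat @ flat.T / n_variants
print(ibs)
```

The product of two samples' allele frequencies at a locus, summed over alleles, is the chance that one allele drawn from each sample is identical in state, which is why a single matrix product covers all sample pairs at once.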