sgkit.filter_partial_calls

sgkit.filter_partial_calls(ds, *, call_genotype='call_genotype', merge=True)

Replace partial genotype calls with missing values.

Parameters:
ds Dataset

Dataset containing genotype calls.

call_genotype Hashable (default: 'call_genotype')

Input variable name holding call_genotype as defined by sgkit.variables.call_genotype_spec.

merge bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.

Return type:

Dataset

Returns:

Dataset containing sgkit.variables.call_genotype_complete_spec and sgkit.variables.call_genotype_complete_mask_spec in which partial genotype calls are replaced with completely missing genotype calls.

Examples

>>> import sgkit as sg
>>> from sgkit.testing import simulate_genotype_call_dataset
>>> ds = simulate_genotype_call_dataset(n_variant=4, n_sample=2, seed=1, missing_pct=0.3)
>>> sg.display_genotypes(ds) 
samples    S0   S1
variants
0         ./0  ./.
1         ./0  1/1
2         0/1  ./0
3         ./0  0/0
>>> ds2 = sg.filter_partial_calls(ds)
>>> ds2['call_genotype'] = ds2['call_genotype_complete']
>>> ds2['call_genotype_mask'] = ds2['call_genotype_complete_mask']
>>> sg.display_genotypes(ds2) 
samples    S0   S1
variants
0         ./.  ./.
1         ./.  1/1
2         0/1  ./.
3         ./.  0/0
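
To make the replacement rule concrete, the following is a minimal NumPy sketch of the idea, not sgkit's implementation. It assumes fixed ploidy and that missing alleles are encoded as negative values (-1). The helper name mask_partial_calls is hypothetical.

```python
import numpy as np


def mask_partial_calls(call_genotype: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: set every allele of a partial call to -1.

    Assumes shape (variants, samples, ploidy) and that a negative
    value marks a missing allele.
    """
    missing = call_genotype < 0  # per-allele missing flags
    # A call is partial when some, but not all, of its alleles are missing.
    partial = missing.any(axis=-1) & ~missing.all(axis=-1)
    complete = call_genotype.copy()
    complete[partial] = -1  # blank out the whole call
    return complete, complete < 0


# First two variants of the example above: ./0 ./. and ./0 1/1
gt = np.array(
    [
        [[-1, 0], [-1, -1]],
        [[-1, 0], [1, 1]],
    ],
    dtype=np.int8,
)
complete, complete_mask = mask_partial_calls(gt)
print(complete.tolist())
# [[[-1, -1], [-1, -1]], [[-1, -1], [1, 1]]]
```

The input array is copied rather than modified in place, mirroring how the function above adds new variables instead of overwriting call_genotype.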

Notes

With merge=True, the returned dataset still contains the original call_genotype and call_genotype_mask variables. Many sgkit functions default to using call_genotype and/or call_genotype_mask, so to exclude partial calls from further analysis you must either overwrite these variables (see the example) or explicitly pass the new variables as function arguments.