sgkit.ld_prune

sgkit.ld_prune(ds, *, dosage='call_dosage', threshold=0.2, variant_score=None)

Prune variants in linkage disequilibrium (LD).

This method uses a sparse LD matrix to find a maximally independent set (MIS) of variants, and returns a dataset containing only those variants.

No information about which variants are pruned is returned by this method. Consider using sgkit.ld_matrix() and sgkit.maximal_independent_set() to get more insight into the variants that are pruned.

Note: This result is not a true MIS if variant_score is provided and comparisons are based on minor allele frequency or anything else that is not identical for all variants.
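One way the situation in the note above can arise is shown in the toy sketch below. It is a simplified greedy scheme in plain Python, not sgkit's implementation; the function greedy_prune, the LD pairs, and the scores are all hypothetical and chosen only to make the effect visible.

```python
def greedy_prune(n_variants, ld_pairs, scores=None):
    """Return the indexes kept after greedily resolving each LD pair.

    ld_pairs: (i, j) index pairs assumed to exceed the LD threshold.
    scores: optional per-variant priority; the higher-scoring variant of a
            pair is kept. Without scores, the lower index is kept.
    """
    dropped = set()
    for i, j in sorted(ld_pairs):
        if i in dropped or j in dropped:
            continue  # this pair was already resolved by an earlier drop
        if scores is not None and scores[j] > scores[i]:
            dropped.add(i)
        else:
            dropped.add(j)
    return [v for v in range(n_variants) if v not in dropped]


# Hypothetical LD pairs among five variants: 0-1, 1-2 and 3-4.
pairs = [(0, 1), (1, 2), (3, 4)]

kept_default = greedy_prune(5, pairs)
kept_scored = greedy_prune(5, pairs, scores=[0.1, 0.2, 0.3, 0.3, 0.05])

print(kept_default)  # [0, 2, 3]
print(kept_scored)   # [2, 3]
```

Without scores, the kept set cannot be extended: variants 1 and 4 each conflict with a kept variant. With scores, variant 0 loses to variant 1, which then loses to variant 2, so variant 0 stays dropped even though it is not in LD with anything that was kept. The scored result is independent but no longer maximal.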

Parameters:
ds Dataset

Dataset containing genotype dosages. Must already be windowed with window_by_position() or window_by_variant().

dosage Hashable (default: 'call_dosage')

Name of genetic dosage variable. Defined by sgkit.variables.call_dosage_spec.

threshold float (default: 0.2)

R2 threshold below which variant pairs are left out of the LD matrix, so they are not treated as being in LD when pruning. This should almost always be at least slightly above 0, because most datasets contain a very large number of variant pairs with LD close to zero.

variant_score Hashable | None (default: None)

Optional name of variable to use to prioritize variant selection (e.g. minor allele frequency). Defaults to None. Defined by sgkit.variables.variant_score_spec.

Return type:

Dataset

Returns:

A dataset where the variants in linkage disequilibrium have been removed.

Raises:

ValueError – If the dataset is not windowed.

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4)
>>> ds.sizes["variants"]
10
>>> # Calculate dosage
>>> ds["call_dosage"] = ds["call_genotype"].sum(dim="ploidy")
>>> # Divide into windows of size five (variants)
>>> ds = sg.window_by_variant(ds, size=5)
>>> pruned_ds = sg.ld_prune(ds)
>>> pruned_ds.sizes["variants"]
6