sgkit.ld_prune

sgkit.ld_prune(ds, *, dosage='call_dosage', threshold=0.2, variant_score=None)

Prune variants in linkage disequilibrium (LD).

This method uses a sparse LD matrix to find a maximal independent set (MIS) of variants, and returns a dataset containing only those variants.

No information about which variants are pruned is returned by this method. Consider using sgkit.ld_matrix() and sgkit.maximal_independent_set() to get more insight into the variants that are pruned.

Note: This result is not a true MIS if variant_score is provided and comparisons are based on minor allele frequency or anything else that is not identical for all variants.
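The idea of pruning from a sparse LD matrix can be sketched in plain Python. This is only an illustration, not sgkit's implementation: the function name greedy_prune, its arguments, and the rule that the higher-scoring variant of a pair is kept are all assumptions made for the example. Each pair (i, j) stands for two variants whose R2 is at or above the threshold.

```python
def greedy_prune(n_variants, ld_pairs, scores=None):
    """Return the indexes of variants kept after a greedy pruning pass.

    ld_pairs: iterable of (i, j) index pairs with i < j that are in LD.
    scores:   optional per-variant scores (hypothetical convention:
              the higher-scoring variant of a pair is kept).
    """
    dropped = set()
    # Visit pairs in ascending (i, j) order.
    for i, j in sorted(ld_pairs):
        if i in dropped or j in dropped:
            continue  # the pair is already broken up
        if scores is not None and scores[i] < scores[j]:
            dropped.add(i)  # j has the higher score, so i goes
        else:
            dropped.add(j)  # no scores, or i scores at least as high
    return [v for v in range(n_variants) if v not in dropped]


pairs = [(0, 1), (1, 2), (3, 4)]
kept_plain = greedy_prune(5, pairs)
kept_scored = greedy_prune(5, pairs, scores=[0.1, 0.5, 0.2, 0.3, 0.3])
print(kept_plain)   # [0, 2, 3]
print(kept_scored)  # [1, 3]
```

Without scores, a variant is only ever dropped because an earlier, permanently kept variant is in LD with it, so the kept set is a maximal independent set. With scores, a variant that caused a drop can itself be dropped later, which is why the result is then not guaranteed to be a true MIS.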

Parameters
ds Dataset

Dataset containing genotype dosages. Must already be windowed with window_by_position() or window_by_variant().

dosage Hashable (default: 'call_dosage')

Name of genetic dosage variable. Defined by sgkit.variables.call_dosage_spec.

threshold float (default: 0.2)

R2 threshold below which variant pairs are not treated as being in LD. This should almost always be set at least slightly above 0, because most datasets contain a very large number of variant pairs with LD very near zero.

variant_score Hashable | None (default: None)

Optional name of variable to use to prioritize variant selection (e.g. minor allele frequency). Defined by sgkit.variables.variant_score_spec.

Return type

Dataset

Returns

A dataset where the variants in linkage disequilibrium have been removed.

Raises

ValueError – If the dataset is not windowed.

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4)
>>> ds.sizes["variants"]
10
>>> # Calculate dosage
>>> ds["call_dosage"] = ds["call_genotype"].sum(dim="ploidy")
>>> # Divide into windows of size five (variants)
>>> ds = sg.window_by_variant(ds, size=5)
>>> pruned_ds = sg.ld_prune(ds)
>>> pruned_ds.sizes["variants"]
6