sgkit.ld_prune
- sgkit.ld_prune(ds, *, dosage='call_dosage', threshold=0.2, variant_score=None)
Prune variants in linkage disequilibrium (LD).
This method uses a sparse LD matrix to find a maximally independent set (MIS) of variants, and returns a dataset containing only those variants.
No information about which variants are pruned is returned by this method. Consider using sgkit.ld_matrix() and sgkit.maximal_independent_set() to get more insight into the variants that are pruned.
Note: This result is not a true MIS if variant_score is provided and comparisons are based on minor allele frequency or anything else that is not identical for all variants.
- Parameters:
- ds
Dataset
Dataset containing genotype dosages. Must already be windowed with window_by_position() or window_by_variant().
- dosage
Hashable (default: 'call_dosage')
Name of genetic dosage variable. Defined by sgkit.variables.call_dosage_spec.
- threshold
float (default: 0.2)
R2 threshold below which no variant pairs will be returned. This should almost always be at least slightly above 0, because most datasets contain a large density of variant pairs with LD very near zero.
- variant_score
Optional[Hashable] (default: None)
Optional name of variable to use to prioritize variant selection (e.g. minor allele frequency). Defined by sgkit.variables.variant_score_spec.
- Return type:
Dataset
- Returns:
A dataset where the variants in linkage disequilibrium have been removed.
- Raises:
ValueError – If the dataset is not windowed.
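To make the roles of windowing and threshold concrete, here is a minimal pure-Python sketch of how a sparse list of LD pairs could be built. It is an illustration only and not sgkit's implementation: the helper names r_squared and windowed_ld_pairs are hypothetical, the windows are assumed to be non-overlapping blocks of fixed size, and LD is assumed to be the squared Pearson correlation between dosage vectors.

```python
from itertools import combinations
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption for this sketch: a variant with no variation has zero LD.
        return 0.0
    return sxy * sxy / (sxx * syy)


def windowed_ld_pairs(dosage, window_size, threshold):
    """Return (i, j, r2) for same-window variant pairs with r2 >= threshold."""
    pairs = []
    for start in range(0, len(dosage), window_size):
        stop = min(start + window_size, len(dosage))
        # Only variants inside the same window are ever compared.
        for i, j in combinations(range(start, stop), 2):
            r2 = r_squared(dosage[i], dosage[j])
            if r2 >= threshold:
                pairs.append((i, j, r2))
    return pairs


# Hypothetical dosages: rows are variants, columns are samples.
dosage = [
    [0, 1, 2, 1],  # variant 0
    [0, 1, 2, 1],  # variant 1, identical to variant 0
    [2, 1, 0, 1],  # variant 2, mirror image of variant 0
    [1, 0, 1, 2],  # variant 3, uncorrelated with the others
]

small_windows = windowed_ld_pairs(dosage, window_size=2, threshold=0.2)
one_window = windowed_ld_pairs(dosage, window_size=4, threshold=0.2)
print(small_windows)  # [(0, 1, 1.0)]
print(one_window)     # [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0)]
```

With a window of two variants, the pair (0, 2) is never compared, so it cannot be pruned even though it is in perfect LD. Pairs below the threshold, such as (2, 3), are left out of the sparse result entirely.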
Examples
>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4)
>>> ds.sizes["variants"]
10

>>> # Calculate dosage
>>> ds["call_dosage"] = ds["call_genotype"].sum(dim="ploidy")

>>> # Divide into windows of size five (variants)
>>> ds = sg.window_by_variant(ds, size=5)

>>> pruned_ds = sg.ld_prune(ds)
>>> pruned_ds.sizes["variants"]
6
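
The note about variant_score and a "true MIS" can be illustrated with a small greedy selection over LD pairs. The sketch below is pure Python and is not sgkit's algorithm: the function greedy_prune is hypothetical, and it assumes that the higher-scored variant of a pair is the one kept and that, without scores, the earlier variant is kept.

```python
def greedy_prune(n_variants, ld_pairs, scores=None):
    """Return the indices of variants kept after greedy pruning.

    ld_pairs: iterable of (i, j) index pairs treated as "in LD".
    scores: optional per-variant scores; assumed here that higher wins.
    """
    dropped = set()
    # Visit pairs in index order, with the smaller index first in each pair.
    for i, j in sorted((min(p), max(p)) for p in ld_pairs):
        if i in dropped or j in dropped:
            continue  # this pair was already resolved by an earlier drop
        if scores is not None and scores[i] < scores[j]:
            dropped.add(i)  # the later variant has the better score
        else:
            dropped.add(j)  # no scores or a tie: keep the earlier variant
    return [v for v in range(n_variants) if v not in dropped]


# A chain of three variants: 0 is in LD with 1, and 1 is in LD with 2.
chain = [(0, 1), (1, 2)]

kept_unscored = greedy_prune(3, chain)
kept_scored = greedy_prune(3, chain, scores=[1.0, 2.0, 3.0])
print(kept_unscored)  # [0, 2]
print(kept_scored)    # [2]
```

Without scores, the result [0, 2] is a maximal independent set of the chain. With scores, variant 0 is dropped in favor of variant 1, which is then dropped in favor of variant 2. Variant 0 is not in LD with variant 2 and could have been kept, so [2] is independent but not maximal.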