sgkit.ld_matrix

sgkit.ld_matrix(ds, *, dosage='call_dosage', threshold=None, variant_score=None)

Compute a sparse linkage disequilibrium (LD) matrix.

This function computes the Rogers-Huff R2 value for each pair of variants within a window and returns the pairs that exceed the provided threshold as a sparse matrix in dataframe form.
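The windowed, thresholded pair search described above can be sketched in plain Python. This is a minimal illustration and not sgkit's implementation: squared Pearson correlation between dosage vectors stands in for the Rogers-Huff estimator, windows are assumed to be fixed-size and non-overlapping, and the helper names `pearson_r2` and `windowed_ld_pairs` are hypothetical.

```python
from statistics import mean


def pearson_r2(x, y):
    """Squared Pearson correlation between two dosage vectors (stand-in for R2)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when a variant has no variation
    return cov * cov / (vx * vy)


def windowed_ld_pairs(dosages, window_size, threshold=None):
    """Return (i, j, r2) triples with i <= j for variants sharing a window.

    Assumption: windows are consecutive, non-overlapping blocks of
    `window_size` variants; real windowing schemes may differ.
    """
    rows = []
    n = len(dosages)
    for start in range(0, n, window_size):
        stop = min(start + window_size, n)
        for i in range(start, stop):
            for j in range(i, stop):
                r2 = pearson_r2(dosages[i], dosages[j])
                # NaN never passes the comparison, so undefined pairs are dropped
                if threshold is None or r2 >= threshold:
                    rows.append((i, j, r2))
    return rows


# One row per variant, one column per sample (toy dosages in {0, 1, 2}).
dosages = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],  # identical to variant 0
    [2, 1, 0, 1],
    [0, 2, 0, 2],  # uncorrelated with variant 2
]
pairs = windowed_ld_pairs(dosages, window_size=2, threshold=0.5)
```

Variants 0 and 2 are perfectly correlated in this toy data, but they fall in different windows and are therefore never compared, which is the main consequence of windowing.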

Parameters:
ds Dataset

Dataset containing genotype dosages. Must already be windowed with window_by_position() or window_by_variant().

dosage Hashable (default: 'call_dosage')

Name of genetic dosage variable. Defined by sgkit.variables.call_dosage_spec.

threshold float | None (default: None)

R2 threshold below which no variant pairs will be returned. This should almost always be set at least slightly above 0, because most datasets contain a large number of variant pairs with LD very near zero.

variant_score Hashable | None (default: None)

Optional name of the variable used to prioritize variant selection (e.g. minor allele frequency). Defined by sgkit.variables.variant_score_spec.

Return type:

DataFrame

Returns:

Upper triangle (including diagonal) of the LD matrix in COO format, as a dataframe. Fields:

  • i: Row (variant) index 1

  • j: Row (variant) index 2

  • value: R2 value

  • cmp: If variant_score is provided, this is 1, 0, or -1, indicating whether the score of variant i is greater than (1), equal to (0), or less than (-1) the score of variant j
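Because only the upper triangle is returned, a consumer that needs the full symmetric matrix has to mirror each entry. The following is a minimal sketch under that assumption; the helper `coo_upper_to_dense` and the toy rows are hypothetical and are not part of sgkit.

```python
def coo_upper_to_dense(rows, n_variants, fill=0.0):
    """Expand (i, j, value) upper-triangle triples into a symmetric dense matrix.

    Pairs absent from `rows` get `fill`. Note that absence only means the pair
    was below the threshold or not in a shared window, not that its LD is zero.
    """
    dense = [[fill] * n_variants for _ in range(n_variants)]
    for i, j, value in rows:
        dense[i][j] = value
        dense[j][i] = value  # mirror across the diagonal
    return dense


# Toy COO rows in the (i, j, value) layout described above.
rows = [(0, 0, 1.0), (0, 1, 0.8), (1, 1, 1.0), (2, 2, 1.0)]
dense = coo_upper_to_dense(rows, n_variants=3)
```

A dense matrix grows quadratically with the number of variants, so this kind of expansion is only sensible for small regions.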

Raises:

ValueError – If the dataset is not windowed.