sgkit.maximal_independent_set

sgkit.maximal_independent_set(df)

Compute a maximal independent set of variants.

This method is based on the PLINK algorithm, which selects independent vertices from a graph whose edges connect pairs of variants in excessive linkage disequilibrium (LD).

For an outline of this process, see this discussion.

Parameters:
df DataFrame

DataFrame containing a sparse matrix of R² values, typically from sgkit.ld_matrix().

Warning

This algorithm materializes the whole input DataFrame (the LD matrix) in memory.

Raises:

ValueError – If the i and j columns of df are not sorted in ascending order (by i first, then by j)

Return type:

Dataset

Returns:

A dataset containing the indexes of variants to drop, as defined by sgkit.variables.ld_prune_index_to_drop_spec.
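
Example

The following is a minimal pure-Python sketch of the general idea described above: greedily choosing variants to drop from a graph whose edges are variant pairs in high LD. It is not sgkit's implementation. The helper name greedy_independent_set_drops and the plain list of (i, j) tuples are assumptions made for illustration, and the rule "keep the earlier variant, drop the later one" is a simplification that ignores any per-variant scores.

```python
def greedy_independent_set_drops(pairs):
    """Return the sorted indexes to drop, given (i, j) pairs of variants in LD.

    Hypothetical helper for illustration only; not part of sgkit.
    Assumes every pair satisfies i < j.
    """
    if any(i >= j for i, j in pairs):
        raise ValueError("each pair must satisfy i < j")
    # Mirror the documented precondition: sorted ascending by i, then by j.
    if pairs != sorted(pairs):
        raise ValueError("pairs must be sorted ascending by i, then by j")

    dropped = set()
    for i, j in pairs:
        # An edge with an already-dropped endpoint no longer forces a choice.
        if i in dropped or j in dropped:
            continue
        # Simplifying assumption: keep the earlier variant, drop the later one.
        dropped.add(j)
    return sorted(dropped)


# Four variants in a chain: 0-1, 1-2 and 2-3 are each in high LD.
pairs = [(0, 1), (1, 2), (2, 3)]
to_drop = greedy_independent_set_drops(pairs)
print(to_drop)  # [1, 3]
```

Here variants 0 and 2 are kept. No kept pair shares an edge, and each dropped variant is adjacent to a kept one, so the kept set is a maximal independent set. The sort-order check is what makes a single pass sufficient in this sketch: by the time a pair (i, j) is visited, every pair that could have dropped i has already been processed.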