sgkit.window_by_position

sgkit.window_by_position(ds, *, size, step=None, offset=0, variant_contig='variant_contig', variant_position='variant_position', window_start_position=None, merge=True)

Add window information to a dataset, measured by distance along the genome.

Windows are defined over the variants dimension, and are used by some downstream functions to calculate statistics for each window. Windows never span contigs.

Parameters
ds : Dataset

Genotype call dataset.

size : int

The window size, measured by number of base pairs.

step : Optional[int] (default: None)

The distance, measured by number of base pairs, between start positions of windows. May only be set if window_start_position is None. Defaults to size.

offset : int (default: 0)

The window offset, measured by number of base pairs. Defaults to no offset. For centered windows, use a negative offset that is half the window size.

variant_contig : Hashable (default: 'variant_contig')

Name of variable containing variant contig indexes. Defined by sgkit.variables.variant_contig_spec.

variant_position : Hashable (default: 'variant_position')

Name of variable containing variant positions. Must be monotonically increasing within a contig. Defined by sgkit.variables.variant_position_spec.

window_start_position : Optional[Hashable] (default: None)

Optional name of variable to use to define window starts, defined by position in the genome. Defaults to None, which means start positions are at multiples of step, and shifted by offset.

merge : bool (default: True)

If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
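The way size, step and offset interact can be sketched in plain Python. This is an illustration only, not sgkit's implementation: fixed_windows is a hypothetical helper, and it assumes that each window covers the half-open interval [start, start + size) and that window starts run from offset up to the last position in the contig. Both assumptions are consistent with the output shown in the Examples section.

```python
from bisect import bisect_left


def fixed_windows(positions, size, step=None, offset=0):
    """Return (start_index, stop_index) pairs for a single contig.

    Hypothetical helper, not part of sgkit. Assumes `positions` is sorted
    and that a window covers the half-open interval [start, start + size).
    """
    if step is None:
        step = size  # documented default: step falls back to size
    windows = []
    # Assumption: window starts run from `offset` up to the last position.
    for start in range(offset, positions[-1] + 1, step):
        i = bisect_left(positions, start)         # first variant at or after the window start
        j = bisect_left(positions, start + size)  # first variant at or after the window end
        windows.append((i, j))
    return windows


contig0 = [1, 4, 6, 8, 12]
print(fixed_windows(contig0, size=10, offset=1))
# [(0, 4), (4, 5)]
print(fixed_windows(contig0, size=10, step=5, offset=1))
# [(0, 4), (2, 5), (4, 5)]
```

The index pairs are local to one contig. Running the helper separately for each contig and shifting the indexes by the contig's first variant index is one way to honor the rule that windows never span contigs.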

Return type

Dataset

Returns

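A dataset containing the following variables, each with dimension windows:

window_contig: The contig index of each window.

window_start: The index of the first variant in each window.

window_stop: The index one past the last variant in each window.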

Raises

ValueError – If both step and window_start_position are specified.
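The mutual exclusion between step and window_start_position can be pictured with a small stand-alone check. This is a sketch only: resolve_step is a hypothetical function, and the error message is invented for illustration rather than taken from sgkit.

```python
def resolve_step(size, step=None, window_start_position=None):
    """Hypothetical argument check, not sgkit's code."""
    if step is not None and window_start_position is not None:
        # Illustrative message; sgkit's actual wording may differ.
        raise ValueError("Specify only one of step or window_start_position")
    # Documented default: step falls back to size when it is not given.
    return size if step is None else step


print(resolve_step(10))          # 10
print(resolve_step(10, step=5))  # 5
try:
    resolve_step(10, step=5, window_start_position="variant_position")
except ValueError as err:
    print("rejected:", err)
```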

Examples

>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=2, n_contig=2)
>>> ds["variant_position"] = (["variants"], np.array([1, 4, 6, 8, 12, 1, 21, 25, 40, 55]))
>>> ds.variant_contig.values
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
>>> ds.variant_position.values
array([ 1,  4,  6,  8, 12,  1, 21, 25, 40, 55])
>>> # Contiguous equally-spaced windows, each 10 base pairs in size
>>> # and using offset to start windows at 1
>>> ds_w1 = sg.window_by_position(ds, size=10, offset=1, merge=False)
>>> ds_w1
<xarray.Dataset>
Dimensions:        (windows: 8)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 0 0 1 1 1 1 1 1
    window_start   (windows) int64 0 4 5 6 6 8 9 9
    window_stop    (windows) int64 4 5 6 6 8 9 9 10
>>> [ds.variant_position.values[i:j] for i, j in zip(ds_w1.window_start.values, ds_w1.window_stop.values) if i != j] 
[array([1, 4, 6, 8]),
 array([12]),
 array([1]),
 array([21, 25]),
 array([40]),
 array([55])]
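The window_start and window_stop values are indexes into the variants dimension, not genomic positions, so a per-window summary is a slice over those indexes. The sketch below copies the values from the ds_w1 output above into plain lists (so it runs without sgkit) and counts the variants in each window; empty windows are the ones where start equals stop.

```python
# Values copied from the ds_w1 output above.
variant_position = [1, 4, 6, 8, 12, 1, 21, 25, 40, 55]
window_start = [0, 4, 5, 6, 6, 8, 9, 9]
window_stop = [4, 5, 6, 6, 8, 9, 9, 10]

# Number of variants per window; an empty window has start == stop.
counts = [j - i for i, j in zip(window_start, window_stop)]
print(counts)
# [4, 1, 1, 0, 2, 1, 0, 1]

# Positions that fall in the first window, obtained by slicing.
print(variant_position[window_start[0]:window_stop[0]])
# [1, 4, 6, 8]
```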
>>> # Windows centered around positions defined by a variable (variant_position),
>>> # each 10 base pairs in size. Also known as "locus windows".
>>> ds_w2 = sg.window_by_position(ds, size=10, offset=-5, window_start_position="variant_position", merge=False)
>>> ds_w2
<xarray.Dataset>
Dimensions:        (windows: 10)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 0 0 0 0 0 1 1 1 1 1
    window_start   (windows) int64 0 0 0 1 3 5 6 6 8 9
    window_stop    (windows) int64 2 4 4 5 5 6 8 8 9 10
>>> [ds.variant_position.values[i:j] for i, j in zip(ds_w2.window_start.values, ds_w2.window_stop.values)] 
[array([1, 4]),
 array([1, 4, 6, 8]),
 array([1, 4, 6, 8]),
 array([ 4,  6,  8, 12]),
 array([ 8, 12]),
 array([1]),
 array([21, 25]),
 array([21, 25]),
 array([40]),
 array([55])]
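The locus-window result above can be reproduced for a single contig with a short plain-Python sketch. It is not sgkit's implementation: locus_windows is a hypothetical helper that assumes one window per position p, covering the half-open interval [p + offset, p + offset + size).

```python
from bisect import bisect_left


def locus_windows(positions, size, offset=0):
    """Return one (start_index, stop_index) pair per position in a contig.

    Hypothetical helper, not part of sgkit. Assumes `positions` is sorted
    and that the window for position p covers [p + offset, p + offset + size).
    """
    windows = []
    for p in positions:
        lo = p + offset
        i = bisect_left(positions, lo)         # first variant at or after the window start
        j = bisect_left(positions, lo + size)  # first variant at or after the window end
        windows.append((i, j))
    return windows


contig0 = [1, 4, 6, 8, 12]
print(locus_windows(contig0, size=10, offset=-5))
# [(0, 2), (0, 4), (0, 4), (1, 5), (3, 5)]
```

These pairs match the first five window_start and window_stop values of ds_w2. The second contig gives the remaining five once its local indexes are shifted by 5, the index of its first variant.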