sgkit.window_by_position
- sgkit.window_by_position(ds, *, size, step=None, offset=0, variant_contig='variant_contig', variant_position='variant_position', window_start_position=None, merge=True)
 Add window information to a dataset, measured by distance along the genome.
Windows are defined over the variants dimension, and are used by some downstream functions to calculate statistics for each window. Windows never span contigs.
- Parameters:
 - ds 
Dataset Genotype call dataset.
- size 
int The window size, measured by number of base pairs.
- step 
Optional[int] (default: None) The distance, measured by number of base pairs, between start positions of windows. May only be set if window_start_position is None. Defaults to size.
- offset 
int (default: 0) The window offset, measured by number of base pairs. Defaults to no offset. For centered windows, use a negative offset that is half the window size.
- variant_contig 
Hashable (default: 'variant_contig') Name of variable containing variant contig indexes. Defined by sgkit.variables.variant_contig_spec.
- variant_position 
Hashable (default: 'variant_position') Name of variable containing variant positions. Must be monotonically increasing within a contig. Defined by sgkit.variables.variant_position_spec.
- window_start_position 
Optional[Hashable] (default: None) Optional name of variable to use to define window starts, defined by position in the genome. Defaults to None, which means start positions are at multiples of step, and shifted by offset.
- merge 
bool (default: True) If True (the default), merge the input dataset and the computed output variables into a single dataset; otherwise return only the computed output variables. See Dataset merge behavior for more details.
- Return type:
Dataset
- Returns:
A dataset containing the following variables:
  - sgkit.variables.window_contig_spec (windows): The index values of window contigs.
  - sgkit.variables.window_start_spec (windows): The index values of window start positions.
  - sgkit.variables.window_stop_spec (windows): The index values of window stop positions.
- Raises:
ValueError – If both step and window_start_position have been specified.
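The window start and stop variables are index values into the variants dimension, and the examples below slice with values[i:j], so each window can be read as a half-open index range. As a rough illustration of how a downstream per-window statistic might consume them, here is a minimal pure-Python sketch. The helper per_window_mean is hypothetical and not part of sgkit; it averages a per-variant value over each window and returns None for empty windows.

```python
def per_window_mean(values, window_start, window_stop):
    """Average `values` over each half-open index range [start, stop).

    Hypothetical helper for illustration only; not an sgkit function.
    """
    means = []
    for i, j in zip(window_start, window_stop):
        chunk = values[i:j]
        # An empty window (start == stop) contains no variants.
        means.append(sum(chunk) / len(chunk) if chunk else None)
    return means


# Four per-variant values and three windows, the last of which is empty.
result = per_window_mean([1.0, 2.0, 3.0, 4.0], [0, 2, 4], [2, 4, 4])
```

Empty windows are worth handling explicitly, because fixed-size windows laid over sparse positions can contain no variants.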
Examples
>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=2, n_contig=2)
>>> ds["variant_position"] = (["variants"], np.array([1, 4, 6, 8, 12, 1, 21, 25, 40, 55]))
>>> ds.variant_contig.values
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
>>> ds.variant_position.values
array([ 1,  4,  6,  8, 12,  1, 21, 25, 40, 55])
>>> # Contiguous equally-spaced windows, each 10 base pairs in size
>>> # and using offset to start windows at 1
>>> ds_w1 = sg.window_by_position(ds, size=10, offset=1, merge=False)
>>> ds_w1
<xarray.Dataset> Size: 192B
Dimensions:        (windows: 8)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 64B 0 0 1 1 1 1 1 1
    window_start   (windows) int64 64B 0 4 5 6 6 8 9 9
    window_stop    (windows) int64 64B 4 5 6 6 8 9 9 10
>>> [ds.variant_position.values[i:j] for i, j in zip(ds_w1.window_start.values, ds_w1.window_stop.values) if i != j]
[array([1, 4, 6, 8]), array([12]), array([1]), array([21, 25]), array([40]), array([55])]
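To make the index values in this output easier to follow, the following pure-Python sketch reproduces them. It is not sgkit's implementation. It assumes half-open windows [start, start + size) placed at offset, offset + step, and so on up to the last position on each contig, and it assumes variants are already grouped by contig. Both assumptions are consistent with the output shown here but are not taken from the library source. The names fixed_windows and windows_by_contig are hypothetical.

```python
from bisect import bisect_left


def fixed_windows(positions, size, step=None, offset=0):
    """(start, stop) index pairs for sorted positions on a single contig."""
    step = size if step is None else step
    windows = []
    # Assumed rule: keep adding windows until the start passes the last position.
    for start in range(offset, positions[-1] + 1, step):
        i = bisect_left(positions, start)         # first variant at or after start
        j = bisect_left(positions, start + size)  # first variant at or after the window end
        windows.append((i, j))
    return windows


def windows_by_contig(contigs, positions, size, step=None, offset=0):
    """Apply fixed_windows per contig so that no window spans two contigs."""
    out = []
    base = 0  # index of the first variant of the current contig
    for contig in sorted(set(contigs)):
        pos = [p for c, p in zip(contigs, positions) if c == contig]
        for i, j in fixed_windows(pos, size, step, offset):
            out.append((contig, base + i, base + j))
        base += len(pos)
    return out


contigs = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
positions = [1, 4, 6, 8, 12, 1, 21, 25, 40, 55]
windows = windows_by_contig(contigs, positions, size=10, offset=1)
```

Under these assumptions, windows holds eight (contig, start, stop) triples that agree with the window_contig, window_start and window_stop values above, including the two empty windows on contig 1 where start equals stop.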
>>> # Windows centered around positions defined by a variable (variant_position),
>>> # each 10 base pairs in size. Also known as "locus windows".
>>> ds_w2 = sg.window_by_position(ds, size=10, offset=-5, window_start_position="variant_position", merge=False)
>>> ds_w2
<xarray.Dataset> Size: 240B
Dimensions:        (windows: 10)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 80B 0 0 0 0 0 1 1 1 1 1
    window_start   (windows) int64 80B 0 0 0 1 3 5 6 6 8 9
    window_stop    (windows) int64 80B 2 4 4 5 5 6 8 8 9 10
>>> [ds.variant_position.values[i:j] for i, j in zip(ds_w2.window_start.values, ds_w2.window_stop.values)]
[array([1, 4]), array([1, 4, 6, 8]), array([1, 4, 6, 8]), array([ 4,  6,  8, 12]), array([ 8, 12]), array([1]), array([21, 25]), array([21, 25]), array([40]), array([55])]
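The locus-window output can be checked by hand in the same spirit. The sketch below is not the library's code, and the name locus_windows is hypothetical. It assumes that each variant at position p defines a half-open window [p + offset, p + offset + size), which with size=10 and offset=-5 is [p - 5, p + 5). It handles a single contig, so indexes for later contigs need the number of preceding variants added.

```python
from bisect import bisect_left


def locus_windows(positions, size, offset=0):
    """One (start, stop) index pair per variant, for sorted positions on one contig."""
    windows = []
    for p in positions:
        lo = p + offset  # assumed window start for this variant
        i = bisect_left(positions, lo)         # first variant inside the window
        j = bisect_left(positions, lo + size)  # first variant past the window
        windows.append((i, j))
    return windows


contig0 = locus_windows([1, 4, 6, 8, 12], size=10, offset=-5)
contig1 = locus_windows([1, 21, 25, 40, 55], size=10, offset=-5)
# Shift contig 1 by the five variants that precede it on contig 0.
contig1_global = [(i + 5, j + 5) for i, j in contig1]
```

Here contig0 and contig1_global together give the same start and stop indexes as ds_w2 above, which suggests the centered windows include a variant at p - 5 but exclude one at p + 5.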