sgkit.window_by_position
- sgkit.window_by_position(ds, *, size, step=None, offset=0, variant_contig='variant_contig', variant_position='variant_position', window_start_position=None, merge=True)
Add window information to a dataset, measured by distance along the genome.
Windows are defined over the variants dimension, and are used by some downstream functions to calculate statistics for each window. Windows never span contigs.
- Parameters:
- ds
Dataset
Genotype call dataset.
- size
int
The window size, measured by number of base pairs.
- step
int | None (default: None)
The distance, measured by number of base pairs, between start positions of windows. May only be set if window_start_position is None. Defaults to size.
- offset
int (default: 0)
The window offset, measured by number of base pairs. Defaults to no offset. For centered windows, use a negative offset that is half the window size.
- variant_contig
Hashable (default: 'variant_contig')
Name of variable containing variant contig indexes. Defined by sgkit.variables.variant_contig_spec.
- variant_position
Hashable (default: 'variant_position')
Name of variable containing variant positions. Must be monotonically increasing within a contig. Defined by sgkit.variables.variant_position_spec.
- window_start_position
Hashable | None (default: None)
Optional name of variable to use to define window starts, defined by position in the genome. Defaults to None, which means start positions are at multiples of step, and shifted by offset.
- merge
bool (default: True)
If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
- Return type:
- Returns:
A dataset containing the following variables:
sgkit.variables.window_contig_spec (windows): The index values of window contigs.
sgkit.variables.window_start_spec (windows): The index values of window start positions.
sgkit.variables.window_stop_spec (windows): The index values of window stop positions.
- Raises:
ValueError – If both of step and window_start_position have been specified.
Examples
>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=2, n_contig=2)
>>> ds["variant_position"] = (["variants"], np.array([1, 4, 6, 8, 12, 1, 21, 25, 40, 55]))
>>> ds.variant_contig.values
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
>>> ds.variant_position.values
array([ 1,  4,  6,  8, 12,  1, 21, 25, 40, 55])
>>> # Contiguous equally-spaced windows, each 10 base pairs in size
>>> # and using offset to start windows at 1
>>> ds_w1 = sg.window_by_position(ds, size=10, offset=1, merge=False)
>>> ds_w1
<xarray.Dataset> Size: 192B
Dimensions:        (windows: 8)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 64B 0 0 1 1 1 1 1 1
    window_start   (windows) int64 64B 0 4 5 6 6 8 9 9
    window_stop    (windows) int64 64B 4 5 6 6 8 9 9 10
>>> [ds.variant_position.values[i:j] for i, j in zip(ds_w1.window_start.values, ds_w1.window_stop.values) if i != j]
[array([1, 4, 6, 8]), array([12]), array([1]), array([21, 25]), array([40]), array([55])]
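The index values in the fixed-window example can be reproduced by hand with a binary search over the sorted positions of each contig. The following is a minimal pure-Python sketch, not sgkit's implementation: fixed_windows is a hypothetical helper, and it assumes that each contig's variants are stored contiguously and that windows run from offset up to the last position on the contig. That assumption matches the output above, but sgkit's behavior on other inputs may differ.

```python
from bisect import bisect_left


def fixed_windows(contigs, positions, size, step=None, offset=0):
    """Hypothetical helper (not part of sgkit): fixed-size windows per contig.

    Window k on a contig covers the half-open base-pair interval
    [offset + k * step, offset + k * step + size).
    Returns three lists: window contig, start index, stop index (exclusive).
    """
    step = size if step is None else step
    if step <= 0:
        raise ValueError("step must be positive")
    window_contig, window_start, window_stop = [], [], []
    for contig in sorted(set(contigs)):
        # Assumption: all variants of a contig sit next to each other.
        first = contigs.index(contig)
        last = first + contigs.count(contig)  # exclusive
        contig_positions = positions[first:last]
        lower = offset
        # Windows are generated per contig, so none can span two contigs.
        while lower <= contig_positions[-1]:
            window_contig.append(contig)
            window_start.append(first + bisect_left(contig_positions, lower))
            window_stop.append(first + bisect_left(contig_positions, lower + size))
            lower += step
    return window_contig, window_start, window_stop


contigs = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
positions = [1, 4, 6, 8, 12, 1, 21, 25, 40, 55]
w_contig, w_start, w_stop = fixed_windows(contigs, positions, size=10, offset=1)
print(w_contig)  # [0, 0, 1, 1, 1, 1, 1, 1]
print(w_start)   # [0, 4, 5, 6, 6, 8, 9, 9]
print(w_stop)    # [4, 5, 6, 6, 8, 9, 9, 10]
```

Windows whose start and stop indexes are equal (for example 6 to 6) contain no variants, which is why the list comprehension above filters on i != j.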
>>> # Windows centered around positions defined by a variable (variant_position),
>>> # each 10 base pairs in size. Also known as "locus windows".
>>> ds_w2 = sg.window_by_position(ds, size=10, offset=-5, window_start_position="variant_position", merge=False)
>>> ds_w2
<xarray.Dataset> Size: 240B
Dimensions:        (windows: 10)
Dimensions without coordinates: windows
Data variables:
    window_contig  (windows) int64 80B 0 0 0 0 0 1 1 1 1 1
    window_start   (windows) int64 80B 0 0 0 1 3 5 6 6 8 9
    window_stop    (windows) int64 80B 2 4 4 5 5 6 8 8 9 10
>>> [ds.variant_position.values[i:j] for i, j in zip(ds_w2.window_start.values, ds_w2.window_stop.values)]
[array([1, 4]), array([1, 4, 6, 8]), array([1, 4, 6, 8]), array([ 4,  6,  8, 12]), array([ 8, 12]), array([1]), array([21, 25]), array([21, 25]), array([40]), array([55])]
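For locus windows, each start position p yields the half-open interval [p + offset, p + offset + size), so an offset of -5 with a size of 10 centers the window on p. The sketch below illustrates that arithmetic for a single contig in pure Python. locus_windows is a hypothetical helper written for this illustration and is not how sgkit computes the result internally.

```python
from bisect import bisect_left


def locus_windows(positions, start_positions, size, offset=0):
    """Hypothetical helper (not part of sgkit): one window per start position.

    positions must be sorted and belong to a single contig.
    Returns (starts, stops): index ranges into positions, stop exclusive.
    """
    starts = [bisect_left(positions, p + offset) for p in start_positions]
    stops = [bisect_left(positions, p + offset + size) for p in start_positions]
    return starts, stops


contig0 = [1, 4, 6, 8, 12]
starts, stops = locus_windows(contig0, contig0, size=10, offset=-5)
print(starts)  # [0, 0, 0, 1, 3]
print(stops)   # [2, 4, 4, 5, 5]

# Variant positions falling in each window, as in the example above
print([contig0[i:j] for i, j in zip(starts, stops)])
# [[1, 4], [1, 4, 6, 8], [1, 4, 6, 8], [4, 6, 8, 12], [8, 12]]
```

The values agree with the first five windows of ds_w2. The windows for the second contig follow the same rule, with 5 (the number of variants on the first contig) added to each index.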