How do I …#

Create a test dataset?#

Call simulate_genotype_call_dataset() to create a test xarray.Dataset:

In [1]: import sgkit as sg

In [2]: ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, n_contig=23, missing_pct=.1)

Look at the dataset summary?#

Print using the xarray.Dataset repr:

In [3]: ds
Out[3]: 
<xarray.Dataset>
Dimensions:             (variants: 100, alleles: 2, samples: 50, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_contig      (variants) int64 0 0 0 0 0 1 1 ... 21 21 21 22 22 22 22
    variant_position    (variants) int64 0 1 2 3 4 0 1 2 3 ... 3 0 1 2 3 0 1 2 3
    variant_allele      (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'A'
    sample_id           (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

Get the values for a variable in a dataset?#

Call xarray.Variable.values:

In [4]: ds.variant_contig.values
Out[4]: 
array([ 0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  3,  3,
        3,  3,  3,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  6,  6,  6,  6,
        6,  7,  7,  7,  7,  7,  8,  8,  8,  8,  9,  9,  9,  9, 10, 10, 10,
       10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14,
       15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19,
       19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22])

In [5]: ds["variant_contig"].values # equivalent alternative
Out[5]: 
array([ 0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  3,  3,
        3,  3,  3,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  6,  6,  6,  6,
        6,  7,  7,  7,  7,  7,  8,  8,  8,  8,  9,  9,  9,  9, 10, 10, 10,
       10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14,
       15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19,
       19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22])

Warning

Calling values materializes a variable’s data in memory, so is only suitable for small datasets.

Find the definition for a variable in a dataset?#

Use the comment attribute on the variable:

In [6]: ds.variant_contig.comment
Out[6]: 'Index corresponding to contig name for each variant. In some less common\nscenarios, this may also be equivalent to the contig names if the data\ngenerating process used contig names that were also integers.'

All the variables defined in sgkit are documented on the Variables API page.

Look at the genotypes?#

Call display_genotypes():

In [7]: sg.display_genotypes(ds, max_variants=10)
Out[7]: 
samples    S0   S1   S2   S3   S4  ...  S45  S46  S47  S48  S49
variants                           ...                         
0         0/0  1/0  1/0  0/1  1/0  ...  1/1  0/0  1/0  0/0  1/1
1         1/1  1/0  1/.  ./0  1/0  ...  1/1  0/1  1/0  1/1  1/0
2         0/1  1/1  1/1  1/0  1/1  ...  0/0  0/1  0/0  0/0  1/1
3         1/1  0/0  1/1  ./1  0/1  ...  0/1  1/0  0/1  0/.  0/.
4         1/0  0/1  0/1  0/1  0/0  ...  1/0  1/1  0/0  1/.  1/0
...       ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
7         1/1  1/1  ./0  1/1  0/1  ...  0/0  0/.  1/0  1/0  0/1
8         1/.  1/0  ./0  0/1  1/0  ...  0/1  1/.  0/0  1/0  0/0
9         0/1  0/0  0/0  0/1  0/0  ...  0/1  0/1  1/0  1/0  0/0
10        1/1  0/0  ./1  1/0  0/0  ...  0/0  0/0  1/1  0/1  1/0
11        1/1  0/.  0/0  0/1  1/.  ...  1/0  0/.  0/1  0/1  0/0

[100 rows x 50 columns]

Subset the variables?#

Use Xarray’s pandas-like method for selecting variables:

In [8]: ds[["variant_contig", "variant_position", "variant_allele"]]
Out[8]: 
<xarray.Dataset>
Dimensions:           (variants: 100, alleles: 2)
Dimensions without coordinates: variants, alleles
Data variables:
    variant_contig    (variants) int64 0 0 0 0 0 1 1 1 ... 21 21 21 22 22 22 22
    variant_position  (variants) int64 0 1 2 3 4 0 1 2 3 4 ... 3 0 1 2 3 0 1 2 3
    variant_allele    (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'T' b'A'
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

Alternatively, you can drop variables that you want to remove:

In [9]: ds.drop_vars(["variant_contig", "variant_position", "variant_allele"])
Out[9]: 
<xarray.Dataset>
Dimensions:             (samples: 50, variants: 100, ploidy: 2)
Dimensions without coordinates: samples, variants, ploidy
Data variables:
    sample_id           (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

Subset to a genomic range?#

Set an index on the dataset, then call xarray.Dataset.sel():

In [10]: ds.set_index(variants=("variant_contig", "variant_position")).sel(variants=(0, slice(2, 4)))
Out[10]: 
<xarray.Dataset>
Dimensions:             (variants: 3, alleles: 2, samples: 50, ploidy: 2)
Coordinates:
  * variants            (variants) MultiIndex
  - variant_contig      (variants) int64 0 0 0
  - variant_position    (variants) int64 2 3 4
Dimensions without coordinates: alleles, samples, ploidy
Data variables:
    variant_allele      (variants, alleles) |S1 b'T' b'G' b'G' b'G' b'C' b'G'
    sample_id           (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 0 1 1 1 1 ... 1 -1 1 0
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

An API to make this easier is under discussion. Please add your requirements to https://github.com/pystatgen/sgkit/pull/658.

Get the list of samples?#

Get the values for the sample_id variable:

In [11]: ds.sample_id.values
Out[11]: 
array(['S0', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10',
       'S11', 'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S19',
       'S20', 'S21', 'S22', 'S23', 'S24', 'S25', 'S26', 'S27', 'S28',
       'S29', 'S30', 'S31', 'S32', 'S33', 'S34', 'S35', 'S36', 'S37',
       'S38', 'S39', 'S40', 'S41', 'S42', 'S43', 'S44', 'S45', 'S46',
       'S47', 'S48', 'S49'], dtype='<U3')

Subset the samples?#

Call xarray.Dataset.sel() and xarray.DataArray.isin():

In [12]: ds.sel(samples=ds.sample_id.isin(["S30", "S32"]))
Out[12]: 
<xarray.Dataset>
Dimensions:             (variants: 100, alleles: 2, samples: 2, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_contig      (variants) int64 0 0 0 0 0 1 1 ... 21 21 21 22 22 22 22
    variant_position    (variants) int64 0 1 2 3 4 0 1 2 3 ... 3 0 1 2 3 0 1 2 3
    variant_allele      (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'A'
    sample_id           (samples) <U3 'S30' 'S32'
    call_genotype       (variants, samples, ploidy) int8 0 -1 0 0 0 ... 1 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool False True ... False
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

Define a new variable based on others?#

Use Xarray’s dictionary like methods, or xarray.Dataset.assign():

In [13]: ds["pos0"] = ds.variant_position - 1

In [14]: ds.assign(pos0 = ds.variant_position - 1) # alternative
Out[14]: 
<xarray.Dataset>
Dimensions:             (variants: 100, alleles: 2, samples: 50, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_contig      (variants) int64 0 0 0 0 0 1 1 ... 21 21 21 22 22 22 22
    variant_position    (variants) int64 0 1 2 3 4 0 1 2 3 ... 3 0 1 2 3 0 1 2 3
    variant_allele      (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'A'
    sample_id           (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
    pos0                (variants) int64 -1 0 1 2 3 -1 0 1 ... -1 0 1 2 -1 0 1 2
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

Get summary stats?#

Call sample_stats() or variant_stats() as appropriate:

In [15]: sg.sample_stats(ds)
Out[15]: 
<xarray.Dataset>
Dimensions:             (samples: 50, variants: 100, alleles: 2, ploidy: 2)
Dimensions without coordinates: samples, variants, alleles, ploidy
Data variables: (12/13)
    sample_n_called     (samples) int64 81 78 80 77 88 82 ... 76 80 77 90 81 81
    sample_call_rate    (samples) float64 0.81 0.78 0.8 0.77 ... 0.9 0.81 0.81
    sample_n_het        (samples) int64 40 37 40 39 41 33 ... 47 37 40 49 50 37
    sample_n_hom_ref    (samples) int64 20 22 19 21 21 23 ... 9 19 20 17 19 24
    sample_n_hom_alt    (samples) int64 21 19 21 17 26 26 ... 20 24 17 24 12 20
    sample_n_non_ref    (samples) int64 61 56 61 56 67 59 ... 67 61 57 73 62 57
    ...                  ...
    variant_position    (variants) int64 0 1 2 3 4 0 1 2 3 ... 3 0 1 2 3 0 1 2 3
    variant_allele      (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'A'
    sample_id           (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool False False ... False
    pos0                (variants) int64 -1 0 1 2 3 -1 0 1 ... -1 0 1 2 -1 0 1 2
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

In [16]: sg.variant_stats(ds)
Out[16]: 
<xarray.Dataset>
Dimensions:                   (variants: 100, alleles: 2, samples: 50, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables: (12/16)
    variant_n_called          (variants) int64 40 42 43 40 41 ... 42 38 43 43 40
    variant_call_rate         (variants) float64 0.8 0.84 0.86 ... 0.86 0.86 0.8
    variant_n_het             (variants) int64 22 30 18 26 29 ... 23 22 26 21 19
    variant_n_hom_ref         (variants) int64 11 3 13 3 6 8 ... 6 10 9 12 8 13
    variant_n_hom_alt         (variants) int64 7 9 12 11 6 17 ... 15 9 7 5 14 8
    variant_n_non_ref         (variants) int64 29 39 30 37 35 ... 32 29 31 35 27
    ...                        ...
    variant_position          (variants) int64 0 1 2 3 4 0 1 2 ... 1 2 3 0 1 2 3
    variant_allele            (variants, alleles) |S1 b'T' b'C' ... b'T' b'A'
    sample_id                 (samples) <U3 'S0' 'S1' 'S2' ... 'S47' 'S48' 'S49'
    call_genotype             (variants, samples, ploidy) int8 0 0 1 0 ... 1 0 0
    call_genotype_mask        (variants, samples, ploidy) bool False ... False
    pos0                      (variants) int64 -1 0 1 2 3 -1 0 ... 1 2 -1 0 1 2
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

Filter variants?#

Call xarray.Dataset.sel() on the variants dimension:

In [17]: ds2 = sg.hardy_weinberg_test(ds)

In [18]: ds2.sel(variants=(ds2.variant_hwe_p_value > 1e-2))
Out[18]: 
<xarray.Dataset>
Dimensions:              (variants: 99, alleles: 2, samples: 50, ploidy: 2)
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_hwe_p_value  (variants) float64 dask.array<chunksize=(99,), meta=np.ndarray>
    variant_n_het        (variants) int64 22 18 26 29 18 15 ... 23 22 26 21 19
    variant_n_hom_ref    (variants) int64 11 13 3 6 8 13 9 ... 11 6 10 9 12 8 13
    variant_n_hom_alt    (variants) int64 7 12 11 6 17 10 14 ... 9 15 9 7 5 14 8
    variant_n_non_ref    (variants) int64 29 30 37 35 35 25 ... 32 29 31 35 27
    variant_contig       (variants) int64 0 0 0 0 1 1 1 ... 21 21 21 22 22 22 22
    variant_position     (variants) int64 0 2 3 4 0 1 2 3 4 ... 0 1 2 3 0 1 2 3
    variant_allele       (variants, alleles) |S1 b'T' b'C' b'T' ... b'T' b'A'
    sample_id            (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
    call_genotype        (variants, samples, ploidy) int8 0 0 1 0 1 ... 0 1 0 0
    call_genotype_mask   (variants, samples, ploidy) bool False False ... False
    pos0                 (variants) int64 -1 1 2 3 -1 0 1 2 ... 0 1 2 -1 0 1 2
Attributes:
    contigs:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, ...
    source:   sgkit-0.4.1.dev20+g839eb9a9

Note

Filtering causes an eager Xarray computation.

Find which new variables were added by a method?#

Use xarray.Dataset.data_vars to compare the new dataset variables to the old:

In [19]: ds2 = sg.sample_stats(ds)

In [20]: set(ds2.data_vars) - set(ds.data_vars)
Out[20]: 
{'sample_call_rate',
 'sample_n_called',
 'sample_n_het',
 'sample_n_hom_alt',
 'sample_n_hom_ref',
 'sample_n_non_ref'}

Save results to a Zarr file?#

Call save_dataset():

In [21]: sg.save_dataset(ds, "ds.zarr")

Note

Zarr datasets must have equal-sized chunks (except for the final chunk, which may be smaller), so you may have to rechunk the dataset first.

Load a dataset from Zarr?#

Call load_dataset():

In [22]: ds = sg.load_dataset("ds.zarr")