How do I …#

Create a test dataset?#

Call simulate_genotype_call_dataset() to create a test xarray.Dataset:

In [1]: import sgkit as sg

In [2]: ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, n_contig=23, missing_pct=.1)

Look at the dataset summary?#

Print using the xarray.Dataset repr:

In [3]: ds
Out[3]: 
<xarray.Dataset> Size: 23kB
Dimensions:             (contigs: 23, variants: 100, alleles: 2, samples: 50,
                         ploidy: 2)
Dimensions without coordinates: contigs, variants, alleles, samples, ploidy
Data variables:
    contig_id           (contigs) <U2 184B '0' '1' '2' '3' ... '20' '21' '22'
    variant_contig      (variants) int64 800B 0 0 0 0 0 1 ... 21 21 22 22 22 22
    variant_position    (variants) int64 800B 0 1 2 3 4 0 1 2 ... 1 2 3 0 1 2 3
    variant_allele      (variants, alleles) |S1 200B b'T' b'C' ... b'T' b'A'
    sample_id           (samples) <U3 600B 'S0' 'S1' 'S2' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 10kB 0 0 1 0 ... 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool 10kB False ... False
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

Get the values for a variable in a dataset?#

Call xarray.Variable.values:

In [4]: ds.variant_contig.values
Out[4]: 
array([ 0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  3,  3,
        3,  3,  3,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  6,  6,  6,  6,
        6,  7,  7,  7,  7,  7,  8,  8,  8,  8,  9,  9,  9,  9, 10, 10, 10,
       10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14,
       15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19,
       19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22])

In [5]: ds["variant_contig"].values # equivalent alternative
Out[5]: 
array([ 0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  3,  3,
        3,  3,  3,  4,  4,  4,  4,  4,  5,  5,  5,  5,  5,  6,  6,  6,  6,
        6,  7,  7,  7,  7,  7,  8,  8,  8,  8,  9,  9,  9,  9, 10, 10, 10,
       10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14,
       15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19,
       19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22])

Warning

Calling values materializes a variable’s data in memory, so is only suitable for small datasets.

Find the definition for a variable in a dataset?#

Use the comment attribute on the variable:

In [6]: ds.variant_contig.comment
Out[6]: 'Index corresponding to contig name for each variant. In some less common\nscenarios, this may also be equivalent to the contig names if the data\ngenerating process used contig names that were also integers.'

All the variables defined in sgkit are documented on the Variables API page.

Look at the genotypes?#

Call display_genotypes():

In [7]: sg.display_genotypes(ds, max_variants=10)
Out[7]: 
samples    S0   S1   S2   S3   S4  ...  S45  S46  S47  S48  S49
variants                           ...                         
0         0/0  1/0  1/0  0/1  1/0  ...  1/1  0/0  1/0  0/0  1/1
1         1/1  1/0  1/.  ./0  1/0  ...  1/1  0/1  1/0  1/1  1/0
2         0/1  1/1  1/1  1/0  1/1  ...  0/0  0/1  0/0  0/0  1/1
3         1/1  0/0  1/1  ./1  0/1  ...  0/1  1/0  0/1  0/.  0/.
4         1/0  0/1  0/1  0/1  0/0  ...  1/0  1/1  0/0  1/.  1/0
...       ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
6         1/1  1/1  ./0  1/1  0/1  ...  0/0  0/.  1/0  1/0  0/1
7         1/.  1/0  ./0  0/1  1/0  ...  0/1  1/.  0/0  1/0  0/0
8         0/1  0/0  0/0  0/1  0/0  ...  0/1  0/1  1/0  1/0  0/0
9         1/1  0/0  ./1  1/0  0/0  ...  0/0  0/0  1/1  0/1  1/0
10        1/1  0/.  0/0  0/1  1/.  ...  1/0  0/.  0/1  0/1  0/0

[100 rows x 50 columns]

Subset the variables?#

Use Xarray’s pandas-like method for selecting variables:

In [8]: ds[["variant_contig", "variant_position", "variant_allele"]]
Out[8]: 
<xarray.Dataset> Size: 2kB
Dimensions:           (variants: 100, alleles: 2)
Dimensions without coordinates: variants, alleles
Data variables:
    variant_contig    (variants) int64 800B 0 0 0 0 0 1 1 ... 21 21 22 22 22 22
    variant_position  (variants) int64 800B 0 1 2 3 4 0 1 2 ... 0 1 2 3 0 1 2 3
    variant_allele    (variants, alleles) |S1 200B b'T' b'C' b'C' ... b'T' b'A'
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

Alternatively, you can drop variables that you want to remove:

In [9]: ds.drop_vars(["variant_contig", "variant_position", "variant_allele"])
Out[9]: 
<xarray.Dataset> Size: 21kB
Dimensions:             (contigs: 23, samples: 50, variants: 100, ploidy: 2)
Dimensions without coordinates: contigs, samples, variants, ploidy
Data variables:
    contig_id           (contigs) <U2 184B '0' '1' '2' '3' ... '20' '21' '22'
    sample_id           (samples) <U3 600B 'S0' 'S1' 'S2' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 10kB 0 0 1 0 ... 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool 10kB False ... False
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

Subset to a genomic range?#

Set an index on the dataset, then call xarray.Dataset.sel():

In [10]: ds.set_index(variants=("variant_contig", "variant_position")).sel(variants=(0, slice(2, 4)))
Out[10]: 
<xarray.Dataset> Size: 1kB
Dimensions:             (contigs: 23, variants: 3, alleles: 2, samples: 50,
                         ploidy: 2)
Coordinates:
  * variants            (variants) object 24B MultiIndex
  * variant_contig      (variants) int64 24B 0 0 0
  * variant_position    (variants) int64 24B 2 3 4
Dimensions without coordinates: contigs, alleles, samples, ploidy
Data variables:
    contig_id           (contigs) <U2 184B '0' '1' '2' '3' ... '20' '21' '22'
    variant_allele      (variants, alleles) |S1 6B b'T' b'G' b'G' b'G' b'C' b'G'
    sample_id           (samples) <U3 600B 'S0' 'S1' 'S2' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 300B 0 1 1 1 ... -1 1 0
    call_genotype_mask  (variants, samples, ploidy) bool 300B False ... False
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

An API to make this easier is under discussion. Please add your requirements to sgkit-dev/sgkit#658.

Get the list of samples?#

Get the values for the sample_id variable:

In [11]: ds.sample_id.values
Out[11]: 
array(['S0', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10',
       'S11', 'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S19',
       'S20', 'S21', 'S22', 'S23', 'S24', 'S25', 'S26', 'S27', 'S28',
       'S29', 'S30', 'S31', 'S32', 'S33', 'S34', 'S35', 'S36', 'S37',
       'S38', 'S39', 'S40', 'S41', 'S42', 'S43', 'S44', 'S45', 'S46',
       'S47', 'S48', 'S49'], dtype='<U3')

Subset the samples?#

Call xarray.Dataset.sel() and xarray.DataArray.isin():

In [12]: ds.sel(samples=ds.sample_id.isin(["S30", "S32"]))
Out[12]: 
<xarray.Dataset> Size: 3kB
Dimensions:             (contigs: 23, variants: 100, alleles: 2, samples: 2,
                         ploidy: 2)
Dimensions without coordinates: contigs, variants, alleles, samples, ploidy
Data variables:
    contig_id           (contigs) <U2 184B '0' '1' '2' '3' ... '20' '21' '22'
    variant_contig      (variants) int64 800B 0 0 0 0 0 1 ... 21 21 22 22 22 22
    variant_position    (variants) int64 800B 0 1 2 3 4 0 1 2 ... 1 2 3 0 1 2 3
    variant_allele      (variants, alleles) |S1 200B b'T' b'C' ... b'T' b'A'
    sample_id           (samples) <U3 24B 'S30' 'S32'
    call_genotype       (variants, samples, ploidy) int8 400B 0 -1 0 0 ... 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool 400B False ... False
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

Define a new variable based on others?#

Use Xarray’s dictionary like methods, or xarray.Dataset.assign():

In [13]: ds["pos0"] = ds.variant_position - 1

In [14]: ds.assign(pos0 = ds.variant_position - 1) # alternative
Out[14]: 
<xarray.Dataset> Size: 23kB
Dimensions:             (contigs: 23, variants: 100, alleles: 2, samples: 50,
                         ploidy: 2)
Dimensions without coordinates: contigs, variants, alleles, samples, ploidy
Data variables:
    contig_id           (contigs) <U2 184B '0' '1' '2' '3' ... '20' '21' '22'
    variant_contig      (variants) int64 800B 0 0 0 0 0 1 ... 21 21 22 22 22 22
    variant_position    (variants) int64 800B 0 1 2 3 4 0 1 2 ... 1 2 3 0 1 2 3
    variant_allele      (variants, alleles) |S1 200B b'T' b'C' ... b'T' b'A'
    sample_id           (samples) <U3 600B 'S0' 'S1' 'S2' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 10kB 0 0 1 0 ... 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool 10kB False ... False
    pos0                (variants) int64 800B -1 0 1 2 3 -1 0 ... 0 1 2 -1 0 1 2
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

Get summary stats?#

Call sample_stats() or variant_stats() as appropriate:

In [15]: sg.sample_stats(ds)
Out[15]: 
<xarray.Dataset> Size: 26kB
Dimensions:             (samples: 50, contigs: 23, variants: 100, alleles: 2,
                         ploidy: 2)
Dimensions without coordinates: samples, contigs, variants, alleles, ploidy
Data variables: (12/14)
    sample_n_called     (samples) int64 400B dask.array<chunksize=(50,), meta=np.ndarray>
    sample_call_rate    (samples) float64 400B dask.array<chunksize=(50,), meta=np.ndarray>
    sample_n_het        (samples) int64 400B dask.array<chunksize=(50,), meta=np.ndarray>
    sample_n_hom_ref    (samples) int64 400B dask.array<chunksize=(50,), meta=np.ndarray>
    sample_n_hom_alt    (samples) int64 400B dask.array<chunksize=(50,), meta=np.ndarray>
    sample_n_non_ref    (samples) int64 400B dask.array<chunksize=(50,), meta=np.ndarray>
    ...                  ...
    variant_position    (variants) int64 800B 0 1 2 3 4 0 1 2 ... 1 2 3 0 1 2 3
    variant_allele      (variants, alleles) |S1 200B b'T' b'C' ... b'T' b'A'
    sample_id           (samples) <U3 600B 'S0' 'S1' 'S2' ... 'S47' 'S48' 'S49'
    call_genotype       (variants, samples, ploidy) int8 10kB 0 0 1 0 ... 1 0 0
    call_genotype_mask  (variants, samples, ploidy) bool 10kB False ... False
    pos0                (variants) int64 800B -1 0 1 2 3 -1 0 ... 0 1 2 -1 0 1 2
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

In [16]: sg.variant_stats(ds)
Out[16]: 
<xarray.Dataset> Size: 32kB
Dimensions:                   (variants: 100, alleles: 2, contigs: 23,
                               samples: 50, ploidy: 2)
Dimensions without coordinates: variants, alleles, contigs, samples, ploidy
Data variables: (12/17)
    variant_n_called          (variants) int64 800B dask.array<chunksize=(100,), meta=np.ndarray>
    variant_call_rate         (variants) float64 800B dask.array<chunksize=(100,), meta=np.ndarray>
    variant_n_het             (variants) int64 800B dask.array<chunksize=(100,), meta=np.ndarray>
    variant_n_hom_ref         (variants) int64 800B dask.array<chunksize=(100,), meta=np.ndarray>
    variant_n_hom_alt         (variants) int64 800B dask.array<chunksize=(100,), meta=np.ndarray>
    variant_n_non_ref         (variants) int64 800B dask.array<chunksize=(100,), meta=np.ndarray>
    ...                        ...
    variant_position          (variants) int64 800B 0 1 2 3 4 0 ... 2 3 0 1 2 3
    variant_allele            (variants, alleles) |S1 200B b'T' b'C' ... b'A'
    sample_id                 (samples) <U3 600B 'S0' 'S1' 'S2' ... 'S48' 'S49'
    call_genotype             (variants, samples, ploidy) int8 10kB 0 0 ... 0 0
    call_genotype_mask        (variants, samples, ploidy) bool 10kB False ......
    pos0                      (variants) int64 800B -1 0 1 2 3 -1 ... 2 -1 0 1 2
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

Filter variants?#

Call xarray.Dataset.sel() on the variants dimension:

In [17]: ds2 = sg.hardy_weinberg_test(ds)

In [18]: ds2.sel(variants=(ds2.variant_hwe_p_value > 1e-2).compute())
Out[18]: 
<xarray.Dataset> Size: 26kB
Dimensions:                 (variants: 99, genotypes: 3, contigs: 23,
                             alleles: 2, samples: 50, ploidy: 2)
Coordinates:
  * genotypes               (genotypes) <U3 36B '0/0' '0/1' '1/1'
Dimensions without coordinates: variants, contigs, alleles, samples, ploidy
Data variables:
    variant_hwe_p_value     (variants) float64 792B dask.array<chunksize=(99,), meta=np.ndarray>
    variant_genotype_count  (variants, genotypes) uint64 2kB dask.array<chunksize=(99, 3), meta=np.ndarray>
    genotype_id             (genotypes) <U3 36B dask.array<chunksize=(3,), meta=np.ndarray>
    contig_id               (contigs) <U2 184B '0' '1' '2' ... '20' '21' '22'
    variant_contig          (variants) int64 792B 0 0 0 0 1 1 ... 21 22 22 22 22
    variant_position        (variants) int64 792B 0 2 3 4 0 1 2 ... 2 3 0 1 2 3
    variant_allele          (variants, alleles) |S1 198B b'T' b'C' ... b'T' b'A'
    sample_id               (samples) <U3 600B 'S0' 'S1' 'S2' ... 'S48' 'S49'
    call_genotype           (variants, samples, ploidy) int8 10kB 0 0 1 ... 0 0
    call_genotype_mask      (variants, samples, ploidy) bool 10kB False ... F...
    pos0                    (variants) int64 792B -1 1 2 3 -1 0 ... 1 2 -1 0 1 2
Attributes:
    contigs:  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
    source:   sgkit-unknown

Note

The call to compute is needed to avoid an Xarray error.

Find which new variables were added by a method?#

Use xarray.Dataset.data_vars to compare the new dataset variables to the old:

In [19]: ds2 = sg.sample_stats(ds)

In [20]: set(ds2.data_vars) - set(ds.data_vars)
Out[20]: 
{'sample_call_rate',
 'sample_n_called',
 'sample_n_het',
 'sample_n_hom_alt',
 'sample_n_hom_ref',
 'sample_n_non_ref'}

Save results to a Zarr file?#

Call save_dataset():

In [21]: sg.save_dataset(ds, "ds.zarr")

Note

Zarr datasets must have equal-sized chunks (except for the final chunk, which may be smaller), so you may have to rechunk the dataset first.

Load a dataset from Zarr?#

Call load_dataset():

In [22]: ds = sg.load_dataset("ds.zarr")