How do I …#
Create a test dataset?#
Call simulate_genotype_call_dataset()
to create a test xarray.Dataset
:
In [1]: import sgkit as sg
In [2]: ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, n_contig=23, missing_pct=.1)
Look at the dataset summary?#
Print using the xarray.Dataset
repr
:
In [3]: ds
Out[3]:
<xarray.Dataset>
Dimensions: (contigs: 23, variants: 100, alleles: 2, samples: 50,
ploidy: 2)
Dimensions without coordinates: contigs, variants, alleles, samples, ploidy
Data variables:
contig_id (contigs) <U2 '0' '1' '2' '3' ... '19' '20' '21' '22'
variant_contig (variants) int64 0 0 0 0 0 1 1 ... 21 21 21 22 22 22 22
variant_position (variants) int64 0 1 2 3 4 0 1 2 3 ... 3 0 1 2 3 0 1 2 3
variant_allele (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'A'
sample_id (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
call_genotype (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 0 0
call_genotype_mask (variants, samples, ploidy) bool False False ... False
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
Get the values for a variable in a dataset?#
Call xarray.Variable.values
:
In [4]: ds.variant_contig.values
Out[4]:
array([ 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6,
6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10,
10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14,
15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19,
19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22])
In [5]: ds["variant_contig"].values # equivalent alternative
Out[5]:
array([ 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6,
6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10,
10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14,
15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19,
19, 19, 19, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 22])
Warning
Calling values
materializes a variable’s data in memory, so is only suitable for small datasets.
Find the definition for a variable in a dataset?#
Use the comment
attribute on the variable:
In [6]: ds.variant_contig.comment
Out[6]: 'Index corresponding to contig name for each variant. In some less common\nscenarios, this may also be equivalent to the contig names if the data\ngenerating process used contig names that were also integers.'
All the variables defined in sgkit are documented on the Variables API page.
Look at the genotypes?#
Call display_genotypes()
:
In [7]: sg.display_genotypes(ds, max_variants=10)
Out[7]:
samples S0 S1 S2 S3 S4 ... S45 S46 S47 S48 S49
variants ...
0 0/0 1/0 1/0 0/1 1/0 ... 1/1 0/0 1/0 0/0 1/1
1 1/1 1/0 1/. ./0 1/0 ... 1/1 0/1 1/0 1/1 1/0
2 0/1 1/1 1/1 1/0 1/1 ... 0/0 0/1 0/0 0/0 1/1
3 1/1 0/0 1/1 ./1 0/1 ... 0/1 1/0 0/1 0/. 0/.
4 1/0 0/1 0/1 0/1 0/0 ... 1/0 1/1 0/0 1/. 1/0
... ... ... ... ... ... ... ... ... ... ... ...
6 1/1 1/1 ./0 1/1 0/1 ... 0/0 0/. 1/0 1/0 0/1
7 1/. 1/0 ./0 0/1 1/0 ... 0/1 1/. 0/0 1/0 0/0
8 0/1 0/0 0/0 0/1 0/0 ... 0/1 0/1 1/0 1/0 0/0
9 1/1 0/0 ./1 1/0 0/0 ... 0/0 0/0 1/1 0/1 1/0
10 1/1 0/. 0/0 0/1 1/. ... 1/0 0/. 0/1 0/1 0/0
[100 rows x 50 columns]
Subset the variables?#
Use Xarray’s pandas-like method for selecting variables:
In [8]: ds[["variant_contig", "variant_position", "variant_allele"]]
Out[8]:
<xarray.Dataset>
Dimensions: (variants: 100, alleles: 2)
Dimensions without coordinates: variants, alleles
Data variables:
variant_contig (variants) int64 0 0 0 0 0 1 1 1 ... 21 21 21 22 22 22 22
variant_position (variants) int64 0 1 2 3 4 0 1 2 3 4 ... 3 0 1 2 3 0 1 2 3
variant_allele (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'T' b'A'
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
Alternatively, you can drop variables that you want to remove:
In [9]: ds.drop_vars(["variant_contig", "variant_position", "variant_allele"])
Out[9]:
<xarray.Dataset>
Dimensions: (contigs: 23, samples: 50, variants: 100, ploidy: 2)
Dimensions without coordinates: contigs, samples, variants, ploidy
Data variables:
contig_id (contigs) <U2 '0' '1' '2' '3' ... '19' '20' '21' '22'
sample_id (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
call_genotype (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 0 0
call_genotype_mask (variants, samples, ploidy) bool False False ... False
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
Subset to a genomic range?#
Set an index on the dataset, then call xarray.Dataset.sel()
:
In [10]: ds.set_index(variants=("variant_contig", "variant_position")).sel(variants=(0, slice(2, 4)))
Out[10]:
<xarray.Dataset>
Dimensions: (contigs: 23, variants: 3, alleles: 2, samples: 50,
ploidy: 2)
Coordinates:
* variants (variants) object MultiIndex
* variant_contig (variants) int64 0 0 0
* variant_position (variants) int64 2 3 4
Dimensions without coordinates: contigs, alleles, samples, ploidy
Data variables:
contig_id (contigs) <U2 '0' '1' '2' '3' ... '19' '20' '21' '22'
variant_allele (variants, alleles) |S1 b'T' b'G' b'G' b'G' b'C' b'G'
sample_id (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
call_genotype (variants, samples, ploidy) int8 0 1 1 1 1 ... 1 -1 1 0
call_genotype_mask (variants, samples, ploidy) bool False False ... False
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
An API to make this easier is under discussion. Please add your requirements to sgkit-dev/sgkit#658.
Get the list of samples?#
Get the values for the sample_id
variable:
In [11]: ds.sample_id.values
Out[11]:
array(['S0', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10',
'S11', 'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S19',
'S20', 'S21', 'S22', 'S23', 'S24', 'S25', 'S26', 'S27', 'S28',
'S29', 'S30', 'S31', 'S32', 'S33', 'S34', 'S35', 'S36', 'S37',
'S38', 'S39', 'S40', 'S41', 'S42', 'S43', 'S44', 'S45', 'S46',
'S47', 'S48', 'S49'], dtype='<U3')
Subset the samples?#
Call xarray.Dataset.sel()
and xarray.DataArray.isin()
:
In [12]: ds.sel(samples=ds.sample_id.isin(["S30", "S32"]))
Out[12]:
<xarray.Dataset>
Dimensions: (contigs: 23, variants: 100, alleles: 2, samples: 2,
ploidy: 2)
Dimensions without coordinates: contigs, variants, alleles, samples, ploidy
Data variables:
contig_id (contigs) <U2 '0' '1' '2' '3' ... '19' '20' '21' '22'
variant_contig (variants) int64 0 0 0 0 0 1 1 ... 21 21 21 22 22 22 22
variant_position (variants) int64 0 1 2 3 4 0 1 2 3 ... 3 0 1 2 3 0 1 2 3
variant_allele (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'A'
sample_id (samples) <U3 'S30' 'S32'
call_genotype (variants, samples, ploidy) int8 0 -1 0 0 0 ... 1 1 0 0
call_genotype_mask (variants, samples, ploidy) bool False True ... False
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
Define a new variable based on others?#
Use Xarray’s dictionary like methods, or xarray.Dataset.assign()
:
In [13]: ds["pos0"] = ds.variant_position - 1
In [14]: ds.assign(pos0 = ds.variant_position - 1) # alternative
Out[14]:
<xarray.Dataset>
Dimensions: (contigs: 23, variants: 100, alleles: 2, samples: 50,
ploidy: 2)
Dimensions without coordinates: contigs, variants, alleles, samples, ploidy
Data variables:
contig_id (contigs) <U2 '0' '1' '2' '3' ... '19' '20' '21' '22'
variant_contig (variants) int64 0 0 0 0 0 1 1 ... 21 21 21 22 22 22 22
variant_position (variants) int64 0 1 2 3 4 0 1 2 3 ... 3 0 1 2 3 0 1 2 3
variant_allele (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'A'
sample_id (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
call_genotype (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 0 0
call_genotype_mask (variants, samples, ploidy) bool False False ... False
pos0 (variants) int64 -1 0 1 2 3 -1 0 1 ... -1 0 1 2 -1 0 1 2
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
Get summary stats?#
Call sample_stats()
or variant_stats()
as appropriate:
In [15]: sg.sample_stats(ds)
Out[15]:
<xarray.Dataset>
Dimensions: (samples: 50, contigs: 23, variants: 100, alleles: 2,
ploidy: 2)
Dimensions without coordinates: samples, contigs, variants, alleles, ploidy
Data variables: (12/14)
sample_n_called (samples) int64 dask.array<chunksize=(50,), meta=np.ndarray>
sample_call_rate (samples) float64 dask.array<chunksize=(50,), meta=np.ndarray>
sample_n_het (samples) int64 dask.array<chunksize=(50,), meta=np.ndarray>
sample_n_hom_ref (samples) int64 dask.array<chunksize=(50,), meta=np.ndarray>
sample_n_hom_alt (samples) int64 dask.array<chunksize=(50,), meta=np.ndarray>
sample_n_non_ref (samples) int64 dask.array<chunksize=(50,), meta=np.ndarray>
... ...
variant_position (variants) int64 0 1 2 3 4 0 1 2 3 ... 3 0 1 2 3 0 1 2 3
variant_allele (variants, alleles) |S1 b'T' b'C' b'C' ... b'T' b'A'
sample_id (samples) <U3 'S0' 'S1' 'S2' 'S3' ... 'S47' 'S48' 'S49'
call_genotype (variants, samples, ploidy) int8 0 0 1 0 1 ... 1 0 1 0 0
call_genotype_mask (variants, samples, ploidy) bool False False ... False
pos0 (variants) int64 -1 0 1 2 3 -1 0 1 ... -1 0 1 2 -1 0 1 2
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
In [16]: sg.variant_stats(ds)
Out[16]:
<xarray.Dataset>
Dimensions: (variants: 100, alleles: 2, contigs: 23,
samples: 50, ploidy: 2)
Dimensions without coordinates: variants, alleles, contigs, samples, ploidy
Data variables: (12/17)
variant_n_called (variants) int64 dask.array<chunksize=(100,), meta=np.ndarray>
variant_call_rate (variants) float64 dask.array<chunksize=(100,), meta=np.ndarray>
variant_n_het (variants) int64 dask.array<chunksize=(100,), meta=np.ndarray>
variant_n_hom_ref (variants) int64 dask.array<chunksize=(100,), meta=np.ndarray>
variant_n_hom_alt (variants) int64 dask.array<chunksize=(100,), meta=np.ndarray>
variant_n_non_ref (variants) int64 dask.array<chunksize=(100,), meta=np.ndarray>
... ...
variant_position (variants) int64 0 1 2 3 4 0 1 2 ... 1 2 3 0 1 2 3
variant_allele (variants, alleles) |S1 b'T' b'C' ... b'T' b'A'
sample_id (samples) <U3 'S0' 'S1' 'S2' ... 'S47' 'S48' 'S49'
call_genotype (variants, samples, ploidy) int8 0 0 1 0 ... 1 0 0
call_genotype_mask (variants, samples, ploidy) bool False ... False
pos0 (variants) int64 -1 0 1 2 3 -1 0 ... 1 2 -1 0 1 2
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
Filter variants?#
Call xarray.Dataset.sel()
on the variants
dimension:
In [17]: ds2 = sg.hardy_weinberg_test(ds)
In [18]: ds2.sel(variants=(ds2.variant_hwe_p_value > 1e-2).compute())
Out[18]:
<xarray.Dataset>
Dimensions: (variants: 99, genotypes: 3, contigs: 23,
alleles: 2, samples: 50, ploidy: 2)
Coordinates:
* genotypes (genotypes) <U3 '0/0' '0/1' '1/1'
Dimensions without coordinates: variants, contigs, alleles, samples, ploidy
Data variables:
variant_hwe_p_value (variants) float64 dask.array<chunksize=(99,), meta=np.ndarray>
variant_genotype_count (variants, genotypes) uint64 dask.array<chunksize=(99, 3), meta=np.ndarray>
genotype_id (genotypes) <U3 dask.array<chunksize=(3,), meta=np.ndarray>
contig_id (contigs) <U2 '0' '1' '2' '3' ... '20' '21' '22'
variant_contig (variants) int64 0 0 0 0 1 1 1 ... 21 21 22 22 22 22
variant_position (variants) int64 0 2 3 4 0 1 2 3 ... 0 1 2 3 0 1 2 3
variant_allele (variants, alleles) |S1 b'T' b'C' b'T' ... b'T' b'A'
sample_id (samples) <U3 'S0' 'S1' 'S2' ... 'S47' 'S48' 'S49'
call_genotype (variants, samples, ploidy) int8 0 0 1 0 ... 0 1 0 0
call_genotype_mask (variants, samples, ploidy) bool False ... False
pos0 (variants) int64 -1 1 2 3 -1 0 1 ... 0 1 2 -1 0 1 2
Attributes:
contigs: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', ...
source: sgkit-0.8.1.dev8+g155fbb76.d20240624
Note
The call to compute
is needed to avoid an Xarray error.
Find which new variables were added by a method?#
Use xarray.Dataset.data_vars
to compare the new dataset variables to the old:
In [19]: ds2 = sg.sample_stats(ds)
In [20]: set(ds2.data_vars) - set(ds.data_vars)
Out[20]:
{'sample_call_rate',
'sample_n_called',
'sample_n_het',
'sample_n_hom_alt',
'sample_n_hom_ref',
'sample_n_non_ref'}
Save results to a Zarr file?#
Call save_dataset()
:
In [21]: sg.save_dataset(ds, "ds.zarr")
Note
Zarr datasets must have equal-sized chunks (except for the final chunk, which may be smaller), so you may have to rechunk the dataset first.
Load a dataset from Zarr?#
Call load_dataset()
:
In [22]: ds = sg.load_dataset("ds.zarr")