User Guide¶
IO¶
PLINK¶
The sgkit.io.plink.read_plink() function loads a single PLINK dataset (its bed, bim, and fam files) as Dask
arrays within an xarray.Dataset.
PLINK IO support is an “extra” feature within sgkit and requires additional dependencies. To install sgkit with PLINK support using pip:
$ pip install --pre 'sgkit[plink]'
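A minimal sketch of loading a PLINK fileset; the path prefix below is hypothetical and assumes mydata.bed, mydata.bim, and mydata.fam exist side by side:

In [1]: import sgkit as sg

In [2]: from sgkit.io.plink import read_plink

In [3]: ds = read_plink(path="mydata")  # reads mydata.bed, mydata.bim, mydata.fam lazily as Dask arrays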
VCF¶
The sgkit.io.vcf.vcf_to_zarr() function converts one or more VCF files to
Zarr files stored in sgkit’s Xarray data representation, which can then be opened
as an xarray.Dataset.
See Reading VCF for installation instructions and details on using VCF in sgkit.
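A minimal sketch of the conversion, assuming a hypothetical input file mydata.vcf.gz and output store mydata.zarr:

In [1]: import sgkit as sg

In [2]: from sgkit.io.vcf import vcf_to_zarr

In [3]: vcf_to_zarr("mydata.vcf.gz", "mydata.zarr")  # convert the VCF to a Zarr store on disk

In [4]: ds = sg.load_dataset("mydata.zarr")  # open the store as an xarray.Dataset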
Converting genetic data to Zarr¶
TODO: Describe the process and motivation for converting genetic file formats to Zarr prior to analysis
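In the meantime, here is a minimal sketch of the round trip, with a hypothetical simulated.zarr path: write a dataset to a chunked Zarr store once, then reopen it lazily for later analyses.

In [1]: import sgkit as sg

In [2]: ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50)

In [3]: sg.save_dataset(ds, "simulated.zarr")   # one-off conversion to chunked Zarr storage

In [4]: ds = sg.load_dataset("simulated.zarr")  # later analyses read from the Zarr store lazily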
Working with cloud-native data¶
TODO: Show how to read/write Zarr (and VCF?) data in cloud storage
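A minimal sketch of opening a Zarr store held in object storage, assuming a hypothetical gs://my-bucket/mydata.zarr path and that the matching fsspec filesystem implementation (here gcsfs) is installed:

In [1]: import fsspec

In [2]: import xarray as xr

In [3]: store = fsspec.get_mapper("gs://my-bucket/mydata.zarr")  # key-value view of the bucket path

In [4]: ds = xr.open_zarr(store)  # data is only fetched when computations are triggered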
Deploying sgkit on a cluster¶
TODO: Create a tutorial on running sgkit at scale
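A minimal sketch of pointing Dask at a cluster before running sgkit computations; the scheduler address below is hypothetical, and Client() with no arguments starts a local cluster instead:

In [1]: from dask.distributed import Client

In [2]: client = Client("tcp://scheduler-address:8786")  # connect to an existing Dask scheduler

Once a client is registered, Dask-backed computations on sgkit datasets execute on the cluster’s workers.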
Adding custom data to a Dataset¶
TODO: Show how something like sample metadata can be joined to an existing Xarray dataset. Also briefly explain indexing and uniqueness within Xarray/Pandas, since this is critical for understanding joins.
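A minimal sketch of attaching per-sample metadata, using a hypothetical pandas DataFrame keyed by sample_id; the key column must be unique so that reindexing against the dataset’s sample order is unambiguous:

In [1]: import pandas as pd

In [2]: import sgkit as sg

In [3]: ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=3)

In [4]: df = pd.DataFrame({"sample_id": ds.sample_id.values, "population": ["A", "A", "B"]})

In [5]: pops = df.set_index("sample_id").loc[ds.sample_id.values, "population"]  # align metadata to the dataset's sample order

In [6]: ds["sample_population"] = ("samples", pops.to_numpy())  # new variable on the samples dimension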
Using GPUs¶
TODO: Show CuPy examples
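A minimal sketch of moving genotype calls onto a GPU with CuPy, assuming a CUDA-capable device is available; unlike the Numba example below, this simple count does not treat missing calls specially:

In [1]: import cupy as cp

In [2]: import sgkit as sg

In [3]: ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=10)

In [4]: gt = cp.asarray(ds.call_genotype.values)  # copy genotype calls to device memory

In [5]: alt_count = (gt > 0).sum(axis=-1)  # per-call alt allele count, computed on the GPU

In [6]: alt_count_host = cp.asnumpy(alt_count)  # copy the result back to the host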
Custom Computations¶
TODO: Finish explaining how Numba works and how users might apply it
Here is an example that computes an alternate (alt) allele count for each genotype call, returning -1 for calls with any missing allele:
In [1]: import numba
In [2]: import sgkit as sg
In [3]: import numpy as np
In [4]: ds = sg.simulate_genotype_call_dataset(5, 3, missing_pct=.2)
In [5]: def alt_allele_count(gt):
   ...:     out = np.full(gt.shape[:2], -1, dtype=np.int64)
   ...:     for i, j in np.ndindex(*out.shape):
   ...:         if np.all(gt[i, j] >= 0):
   ...:             out[i, j] = np.sum(gt[i, j] > 0)
   ...:     return out
   ...: 
In [6]: numba.njit(alt_allele_count)(ds.call_genotype.values)
Out[6]:
array([[ 0,  1,  1],
       [ 1,  1, -1],
       [-1,  1,  2],
       [ 0, -1, -1],
       [ 1,  2,  2]])
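One possible way to apply such a kernel across the whole dataset (including chunk-wise over Dask-backed arrays) is xarray.apply_ufunc; this is a sketch rather than an established sgkit pattern, and the call_alt_allele_count variable name is chosen here purely for illustration:

In [7]: import xarray as xr

In [8]: ds["call_alt_allele_count"] = xr.apply_ufunc(
   ...:     numba.njit(alt_allele_count),
   ...:     ds.call_genotype,
   ...:     input_core_dims=[["ploidy"]],  # the ploidy axis is consumed by the kernel
   ...:     dask="parallelized",           # also works chunk-wise on Dask-backed arrays
   ...:     output_dtypes=[np.int64],
   ...: )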