User Guide¶
IO¶
PLINK¶
The sgkit.io.plink.read_plink() function loads a single PLINK dataset as Dask arrays within an xarray.Dataset from bed, bim, and fam files.
PLINK IO support is an “extra” feature within sgkit and requires additional dependencies. To install sgkit with PLINK support using pip:
$ pip install --pre 'sgkit[plink]'
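For example, given a dataset stored as example.bed, example.bim, and example.fam (a hypothetical file prefix), a minimal sketch of loading it looks like:

from sgkit.io.plink import read_plink

# Load the dataset from the shared "example" prefix of the .bed/.bim/.fam files
ds = read_plink(path="example")
print(ds)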
VCF¶
The sgkit.io.vcf.vcf_to_zarr() function converts one or more VCF files to Zarr files stored in sgkit's Xarray data representation, which can then be opened as an xarray.Dataset.
See Reading VCF for installation instructions, and details on using VCF in sgkit.
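As a minimal sketch, assuming a hypothetical samples.vcf.gz input and that the VCF extra is installed, conversion and loading might look like:

import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr

# Convert the VCF to a Zarr store on disk, then open it lazily as an xarray.Dataset
vcf_to_zarr("samples.vcf.gz", "samples.zarr")
ds = sg.load_dataset("samples.zarr")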
Converting genetic data to Zarr¶
TODO: Describe the process and motivation for converting genetic file formats to Zarr prior to analysis
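In the meantime, a minimal sketch of the round trip, using a simulated dataset and a hypothetical genotypes.zarr path, is:

import sgkit as sg

ds = sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=50)
sg.save_dataset(ds, "genotypes.zarr")   # write once, in a chunked, compressed layout
ds = sg.load_dataset("genotypes.zarr")  # re-open lazily for analysis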
Working with cloud-native data¶
TODO: Show how to read/write Zarr (and VCF?) data in cloud storage
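As a sketch, assuming a hypothetical gs://my-bucket/genotypes.zarr store and that the matching fsspec backend (gcsfs for Google Cloud Storage) is installed:

import fsspec
import xarray as xr

# Open a Zarr store directly from object storage; data is only read as needed
store = fsspec.get_mapper("gs://my-bucket/genotypes.zarr")
ds = xr.open_zarr(store)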
Deploying sgkit on a cluster¶
TODO: Create a tutorial on running sgkit at scale
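As a rough sketch, assuming a Dask cluster is already running at a hypothetical scheduler address:

import sgkit as sg
from dask.distributed import Client

# Connect to the cluster; subsequent Dask-backed computations run on its workers
client = Client("tcp://scheduler-address:8786")

ds = sg.load_dataset("genotypes.zarr")
ds = sg.variant_stats(ds)  # builds a lazy task graph
ds = ds.compute()          # executes across the cluster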
Adding custom data to a Dataset¶
TODO: Show how something like sample metadata can be joined to an existing Xarray dataset. Also briefly explain indexing and uniqueness within Xarray/Pandas, since this is critical for understanding joins.
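As a sketch, using a simulated dataset and hypothetical per-sample metadata (the values must be ordered to match the dataset's samples dimension):

import pandas as pd
import sgkit as sg

ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=3)
df = pd.DataFrame({"population": ["A", "A", "B"]})

# Attach the metadata as a new variable along the "samples" dimension
ds["sample_population"] = ("samples", df["population"].to_numpy())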
Using GPUs¶
TODO: Show CuPy examples
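As a sketch, assuming a CUDA-capable GPU and that CuPy is installed:

import cupy as cp
import sgkit as sg

ds = sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=100)

# Move the genotype calls onto the GPU and count alternate alleles there
gt = cp.asarray(ds.call_genotype.values)
alt_count = (gt > 0).sum(axis=-1)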
Custom Computations¶
TODO: Finish explaining how Numba works and how users might apply it
Here is an example that demonstrates computing an alternate allele count with a Numba-compiled function:
In [1]: import numba
In [2]: import sgkit as sg
In [3]: import numpy as np
In [4]: ds = sg.simulate_genotype_call_dataset(5, 3, missing_pct=.2)
In [5]: def alt_allele_count(gt):
   ...:     # One count per (variant, sample) call; -1 marks calls with missing alleles
   ...:     out = np.full(gt.shape[:2], -1, dtype=np.int64)
   ...:     for i, j in np.ndindex(*out.shape):
   ...:         if np.all(gt[i, j] >= 0):
   ...:             out[i, j] = np.sum(gt[i, j] > 0)
   ...:     return out
   ...:
In [6]: numba.njit(alt_allele_count)(ds.call_genotype.values)
Out[6]:
array([[ 0, 1, 1],
[ 1, 1, -1],
[-1, 1, 2],
[ 0, -1, -1],
[ 1, 2, 2]])
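The same compiled kernel can be applied blockwise when the genotype calls are chunked Dask arrays (for example after loading from Zarr); a rough sketch using dask.array.map_blocks:

import dask.array as da

# Chunk only along the variants dimension so each block keeps full samples/ploidy axes
gt = da.from_array(ds.call_genotype.values, chunks=(2, -1, -1))
counts = da.map_blocks(
    numba.njit(alt_allele_count),
    gt,
    drop_axis=2,      # the kernel reduces over the ploidy axis
    dtype=np.int64,
)
counts.compute()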