User Guide

IO

VCF

The sgkit.io.vcf.vcf_to_zarr() function converts one or more VCF files to Zarr, using sgkit’s Xarray data representation. The resulting Zarr store can then be opened as an xarray.Dataset.

See Reading VCF for installation instructions and details on using VCF in sgkit.
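As a rough illustration of the conversion described above, the sketch below wraps the two steps (convert, then open) in a small helper. The helper name vcf_to_dataset and the file paths are hypothetical, and the code assumes sgkit is installed with its optional VCF dependencies. It uses sgkit.load_dataset() to open the Zarr store and relies on the default conversion settings.

```python
def vcf_to_dataset(vcf_path, zarr_path):
    """Convert a VCF file to Zarr, then open the result as an xarray.Dataset.

    Hypothetical helper for illustration; not part of sgkit.
    """
    # Imported lazily because VCF support relies on optional dependencies.
    import sgkit as sg
    from sgkit.io.vcf import vcf_to_zarr

    # Write the Zarr store using the default conversion settings.
    vcf_to_zarr(vcf_path, zarr_path)

    # Open the converted store as an xarray.Dataset.
    return sg.load_dataset(zarr_path)


# Example call with hypothetical paths (requires a real VCF file on disk):
# ds = vcf_to_dataset("example.vcf.gz", "example.zarr")
```

Keeping the conversion separate from the analysis means the VCF only needs to be parsed once; later sessions can open the Zarr store directly.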

Converting genetic data to Zarr

TODO: Describe the process and motivation for converting genetic file formats to Zarr prior to analysis

Working with cloud-native data

TODO: Show how to read/write Zarr (and VCF?) data in cloud storage

Deploying sgkit on a cluster

TODO: Create a tutorial on running sgkit at scale

Adding custom data to a Dataset

TODO: Show how something like sample metadata can be joined to an existing Xarray dataset. Also briefly explain indexing and uniqueness within Xarray/Pandas, since this is critical for understanding joins.

PCA

TODO: Describe the upstream tools for PCA (i.e. those in dask-ml/scikit-learn)

Using GPUs

TODO: Show CuPy examples

Custom Computations

TODO: Finish explaining how Numba works and how users might apply it

Here is an example that counts the alternate (non-reference) alleles in each genotype call, returning -1 for any call with a missing allele:

In [1]: import numba

In [2]: import sgkit as sg

In [3]: import numpy as np

In [4]: ds = sg.simulate_genotype_call_dataset(5, 3, missing_pct=.2)

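In [5]: def alt_allele_count(gt):
   ...:     # One count per (variant, sample) call; -1 marks calls with missing alleles
   ...:     out = np.full(gt.shape[:2], -1, dtype=np.int64)
   ...:     for i, j in np.ndindex(*out.shape):
   ...:         # Count only calls in which every allele is present (non-negative)
   ...:         if np.all(gt[i, j] >= 0):
   ...:             # Alleles with index > 0 are alternate alleles
   ...:             out[i, j] = np.sum(gt[i, j] > 0)
   ...:     return out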
   ...: 

In [6]: numba.njit(alt_allele_count)(ds.call_genotype.values)
Out[6]: 
array([[ 0,  1,  1],
       [ 1,  1, -1],
       [-1,  1,  2],
       [ 0, -1, -1],
       [ 1,  2,  2]])
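
For comparison, the same per-call logic can be written without an explicit loop. The following is a minimal sketch in plain NumPy, not part of the sgkit API: the function name alt_allele_count_vectorized and the small hand-made input array are assumptions made for illustration. It can serve as a quick cross-check of a Numba-compiled function.

```python
import numpy as np


def alt_allele_count_vectorized(gt):
    """Count alternate alleles per call; -1 where any allele is missing.

    Hypothetical helper for illustration; expects an array shaped
    (variants, samples, ploidy) with negative values marking missing alleles.
    """
    gt = np.asarray(gt)
    # A call is complete only if every allele along the ploidy axis is >= 0.
    complete = (gt >= 0).all(axis=-1)
    # Alleles with index > 0 are alternate alleles.
    counts = (gt > 0).sum(axis=-1)
    # Keep the count for complete calls and use -1 as the missing sentinel.
    return np.where(complete, counts, -1).astype(np.int64)


# Hand-made input: 2 variants x 3 samples x ploidy 2.
gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],
        [[-1, 0], [1, -1], [0, 2]],
    ],
    dtype=np.int8,
)
result = alt_allele_count_vectorized(gt)
print(result)
```

The loop-based version is the kind of code Numba is designed to compile, while the vectorized form trades temporary arrays for brevity; which is faster depends on the data size.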