.. _getting_started: ********************** Getting Started ********************** .. contents:: Table of contents: :local: Installation ------------ You can install sgkit with pip. Python 3.9, 3.10, or 3.11 is required:: $ pip install sgkit Alternatively, you can use conda:: $ conda install -c conda-forge sgkit .. Overview -------- Sgkit is a general purpose toolkit for quantitative and population genetics. The primary goal of sgkit is to take advantage of powerful tools in the `PyData ecosystem `_ to facilitate interactive analysis of large-scale datasets. The main libraries we use are: - `Xarray `_: N-D labeling for arrays and datasets - `Dask `_: Parallel computing on chunked arrays - `Zarr `_: Storage for chunked arrays - `Numpy `_: N-D in-memory array manipulation - `Pandas `_: Tabular data frame manipulation - `CuPy `_: Numpy-like array interface for CUDA GPUs Data structures --------------- Sgkit uses Xarray `Dataset `_ objects to model genetic data. Users are free to manipulate quantities within these objects as they see fit and a set of conventions for variable names, dimensions and underlying data types is provided to aid in workflow standardization. The example below illustrates a ``Dataset`` format that would result from an assay expressible as `VCF `_, `PLINK `_ or `BGEN `_. This is a guideline however, and a ``Dataset`` seen in practice might include many more or fewer variables and dimensions. .. This image was generated as an export from https://docs.google.com/drawings/d/1NheB6LCvvkB4C0nAoSFwoYVZ3mtOPaseGmg_mZvcQ8I/edit?usp=sharing .. image:: _static/data-structures-xarray.jpg :width: 600 :align: center The worked examples below show how to access and visualize data like this using Xarray. They also demonstrate several of the sgkit conventions in place for representing common genetic quantities. .. ipython:: python import sgkit as sg ds = sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=250, n_contig=23, missing_pct=.1) ds The presence of a single-nucleotide variant (SNV) is indicated above by the ``call_genotype`` variable, which contains an integer value corresponding to the index of the associated allele present (i.e. index into the ``variant_allele`` variable) for a sample at a given genome coordinate and chromosome. Every sgkit variable has a set of fixed semantics like this. For more information on this specific variable and any others, see :ref:`genetic_variables`. Data subsets can be accessed as shown here, using the features described in `Indexing and Selecting Xarray Data `_: .. ipython:: python import sgkit as sg ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, n_contig=23, missing_pct=.1) # Subset the entire dataset to the first 10 variants/samples ds.isel(variants=slice(10), samples=slice(10)) # Subset to a specific set of variables ds[['variant_allele', 'call_genotype']] # Extract a single variable ds.call_genotype[:3, :3] # Access the array underlying a single variable (this would return dask.array.Array if chunked) ds.call_genotype.data[:3, :3] # Access the alleles corresponding to the calls for the first variant and sample allele_indexes = ds.call_genotype[0, 0] allele_indexes ds.variant_allele[0, allele_indexes] # Get a single item from an array as a Python scalar ds.sample_id.item(0) Larger subsets of data can be visualized and/or summarized through various sgkit utilities as well as the Pandas/Xarray integration: .. ipython:: python import sgkit as sg ds = sg.simulate_genotype_call_dataset(n_variant=1000, n_sample=250, missing_pct=.1) # Show genotype calls with domain-specific display logic sg.display_genotypes(ds, max_variants=8, max_samples=8) # A naive version of the above is also possible using only Xarray/Pandas and # illustrates the flexibility that comes from being able to transition into # and out of array/dataframe representations easily (ds.call_genotype[:5, :5].to_series() .unstack().where(lambda df: df >= 0, None).fillna('.') .astype(str).apply('/'.join, axis=1).unstack()) # Show call rate distribution for each variant using Pandas df = ~ds.call_genotype_mask.to_dataframe() df.head(5) call_rates = df.groupby('variants').mean() call_rates @savefig call_rate_example.png width=6in height=3in call_rates.plot(kind='hist', bins=24, title='Call Rate Distribution', figsize=(6, 3)) This last example alludes to representations of missing data that are explained further in :ref:`missing_data`. .. _genetic_methods: Genetic methods --------------- Genetic methods in sgkit are nearly always applied to individual ``Dataset`` objects. For a full list of available methods, see :ref:`api_methods`. In this example, the ``variant_stats`` method is applied to a dataset to compute a number of statistics across samples for each individual variant: .. ipython:: python import sgkit as sg ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, missing_pct=.1) sg.variant_stats(ds, merge=False) There are two ways that the results of every function are handled -- either they are merged with the provided dataset or they are returned in a separate dataset. See :ref:`dataset_merge` for more details. .. _missing_data: Missing data ------------ Missing data in sgkit is represented using a sentinel value within data arrays (``-1`` in integer arrays and ``NaN`` in float arrays) as well as a companion boolean mask array (``True`` where data is missing). These sentinel values are handled transparently in most sgkit functions and where this isn't possible, limitations related to it are documented along with potential workarounds. This example demonstrates one such function where missing calls are ignored: .. ipython:: python import sgkit as sg ds = sg.simulate_genotype_call_dataset(n_variant=1, n_sample=4, n_ploidy=2, missing_pct=.3, seed=4) ds.call_genotype # Here, you can see that the missing calls above are not included in the allele counts sg.count_variant_alleles(ds).variant_allele_count A primary design goal in sgkit is to facilitate ad hoc analysis. There are many useful functions in the library but they are not enough on their own to accomplish many analyses. To that end, it is often helpful to be able to handle missing data in your own functions or exploratory summaries. Both the sentinel values and the boolean mask array help make this possible, where the sentinel values are typically more useful when implementing compiled operations and the boolean mask array is easier to use in a higher level API like Xarray or Numpy. Only advanced users would likely ever need to worry about compiling their own functions (see :ref:`custom_computations` for more details). Using Xarray functions and the boolean mask is generally enough to accomplish most tasks, and this mask is often more efficient to operate on due to its high on-disk compression ratio. This example shows how it can be used in the context of doing something simple like counting heterozygous calls: .. ipython:: python import sgkit as sg import xarray as xr ds = sg.simulate_genotype_call_dataset(n_variant=1, n_sample=4, n_ploidy=2, missing_pct=.2, seed=2) # This array contains the allele indexes called for a sample ds.call_genotype # This array represents only locations where the above calls are missing ds.call_genotype_mask # Determine which calls are heterozygous is_heterozygous = (ds.call_genotype[..., 0] != ds.call_genotype[..., 1]) is_heterozygous # Count the number of heterozygous samples for the lone variant is_heterozygous.sum().item(0) # This is almost correct except that the calls for the first sample aren't # really heterozygous, one of them is just missing. Conditional logic like # this can be used to isolate those values and replace them in the result: xr.where(ds.call_genotype_mask.any(dim='ploidy'), False, is_heterozygous).sum().item(0) # Now the result is correct -- only the third sample is heterozygous so the count should be 1. # This how many sgkit functions handle missing data internally: sg.variant_stats(ds).variant_n_het.values.item(0) Windowing --------- It is common to compute statistics in windows along the genome. Some :ref:`api_methods` in sgkit are "windowing aware" and will compute values for windows defined in a dataset. If no windows are defined then the values will typically be computed for each variant. It is therefore important to define windows *before* computing statistics on a dataset. Windows are intervals that span the ``variants`` dimension in a dataset, and they are defined using the :func:`sgkit.window_by_variant`, :func:`sgkit.window_by_position`, and :func:`sgkit.window_by_genome` functions. The first function, :func:`sgkit.window_by_variant`, is the simplest and produces windows with a fixed number of variants in each window, determined by the ``size`` argument. An optional ``step`` argument may be provided to control the spacing between windows. By default, it is the same as the ``size``, giving contiguous windows. The second function, :func:`sgkit.window_by_position`, produces windows whose size is measured by genomic position (base pairs). See the API documentation for usage and examples. The third function, :func:`sgkit.window_by_genome`, produces a single window spanning the whole genome, which can be used to compute whole-genome statistics. This example shows the effect of computing the diversity statistic: first with no windows defined, then with windows. .. ipython:: python :okwarning: import sgkit as sg import xarray as xr ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50) # Define a single cohort for all samples ds["sample_cohort"] = xr.DataArray(np.full(ds.dims['samples'], 0), dims="samples") # The diversity statistic is computed for every variant since no windows are defined sg.diversity(ds, merge=False) # Define windows of size 20 variants. This creates a new dimension called `windows`, and # some new variables for internal use. ds = sg.window_by_variant(ds, size=20) # The diversity statistic is now computed for every window sg.diversity(ds, merge=False) Cohorts ------- During analysis we often want to be able to group samples into populations, and compute statistics based on these groups. Groups of samples are referred to as *cohorts* in sgkit. Cohorts are defined by a mapping from samples to cohort index. The following example creates a ``sample_cohort`` variable to group a dataset of ten samples into three cohorts. Note that first value is ``-1``, which means the corresponding sample is not in any of the three cohorts, and will be ignored when computing cohort statistics. .. ipython:: python :okwarning: import sgkit as sg import xarray as xr ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=10) ds["sample_cohort"] = xr.DataArray(np.array([-1, 0, 1, 1, 1, 1, 0, 2, 2, 2]), dims="samples") Typically the ``sample_cohort`` variable is derived from a dataframe that has the sample/cohort mapping. Cohort-level statistics can have repeated ``cohorts`` dimensions. :func:`sgkit.Fst`, for example, produces statistics for *pairs* of cohorts, which is represented as a variable with dimensions ``(windows, cohorts_0, cohorts_1)``, making it possible to read off the value of the statistic for any pair of cohorts. It's convenient to name cohorts, to avoid errors that can occur when using index values. This example shows how to give cohorts names. .. ipython:: python :okwarning: ds = sg.window_by_variant(ds, size=20) ds = sg.Fst(ds) cohort_names = ["Africa", "Asia", "Europe"] ds = ds.assign_coords({"cohorts_0": cohort_names, "cohorts_1": cohort_names}) ds.stat_Fst.sel(cohorts_0="Africa", cohorts_1="Asia").values Methods that work with cohorts will, by default, operate over all cohorts at once. Sometimes however you might only want to run the computation for a subset of cohorts, in which case you can explicitly specify the cohorts when calling the function. Chaining operations ------------------- `Method chaining `_ is a common practice with Python data tools that improves code readability and reduces the probability of introducing accidental namespace collisions. Sgkit functions are compatible with this idiom by default and this example shows to use it in conjunction with Xarray and Pandas operations in a single pipeline: .. ipython:: python :okwarning: import sgkit as sg ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, missing_pct=.1) # Use `pipe` to apply a single sgkit function to a dataset ds_qc = ds.pipe(sg.variant_stats).drop_dims('samples') ds_qc # Show statistics for one of the arrays to be used as a filter ds_qc.variant_call_rate.to_series().describe() # Build a pipeline that filters on call rate and computes Fst between two cohorts # for windows of size 20 variants ( ds # Add and compute call rate and other statistics .pipe(sg.variant_stats).compute() # Apply filter to include variants present across > 80% of samples .pipe(lambda ds: ds.sel(variants=ds.variant_call_rate > .8)) # Create windows of size 20 variants .pipe(lambda ds: sg.window_by_variant(ds, size=20)) # Assign a "cohort" variable that splits samples into two groups .assign(sample_cohort=np.repeat([0, 1], ds.dims['samples'] // 2)) # Compute Fst between the groups .pipe(sg.Fst) # Extract the Fst values for cohort pairs .stat_Fst.values ) This is possible because sgkit functions nearly always take a ``Dataset`` as the first argument, create new variables, and then merge these new variables into a copy of the provided dataset in the returned value. See :ref:`dataset_merge` for more details. Chunked arrays -------------- Chunked arrays are required when working on large datasets. Libraries for managing chunked arrays such as `Dask Array `_ and `Zarr `_ make it possible to implement blockwise algorithms that operate on subsets of arrays (in parallel) without ever requiring them to fit entirely in memory. By design, they behave almost identically to in-memory (typically Numpy) arrays within Xarray and can be interchanged freely when provided to sgkit functions. The most notable difference in behavior though is that operations on chunked arrays are `evaluated lazily `_. This means that if an Xarray ``Dataset`` contains only chunked arrays, no actual computations will be performed until one of the following occurs: - `Dataset.compute `_ is called - `DataArray.compute `_ is called - The ``DataArray.values`` attribute is referenced - Individual dask arrays are retrieved through the ``DataArray.data`` attribute and forced to evaluate via `Client.compute `_, `dask.array.Array.compute `_ or by coercing them to another array type (e.g. using `np.asarray `_) This example shows a few of these features: .. ipython:: python import sgkit as sg ds = sg.simulate_genotype_call_dataset(n_variant=100, n_sample=50, missing_pct=.1) # Chunk our original in-memory dataset using a blocksize of 50 in all dimensions. ds = ds.chunk(chunks=50) ds # Show the chunked array representing base pair position ds.variant_position # Call compute via the dask.array API ds.variant_position.data.compute()[:5] # Coerce to numpy via Xarray ds.variant_position.values[:5] # Compute without unboxing from xarray.DataArray ds.variant_position.compute()[:5] Unlike this simplified example, real datasets often contain a mixture of chunked and unchunked arrays. Sgkit will often load smaller arrays directly into memory while leaving large arrays chunked as a trade-off between convenience and resource usage. This can always be modified by users though and sgkit functions that operate on a ``Dataset`` should work regardless of the underlying array backend. See `Parallel computing with Dask in Xarray `_ for more examples and information, as well as the Dask tutorials on `delayed array execution `_ and `lazy execution in Dask graphs `_.