GWAS Tutorial#

This notebook is an sgkit port of Hail’s GWAS Tutorial, which demonstrates how to run a genome-wide SNP association test. Readers are encouraged to read the Hail tutorial alongside this one for more background, and to see the motivation behind some of the steps.

Note that some of the results do not exactly match the output from Hail. Also, since sgkit is still a 0.x release, its API is subject to backwards-incompatible changes.

import sgkit as sg

Before using sgkit, we import some standard Python libraries and set the Xarray display options to not show all the attributes in a dataset by default.

import numpy as np
import pandas as pd
import xarray as xr
xr.set_options(display_expand_attrs=False, display_expand_data_vars=True);

Download public 1000 Genomes data#

We use the same small (20MB) portion of the public 1000 Genomes data that Hail uses.

First, download the file locally:

from pathlib import Path
import requests

if not Path("1kg.vcf.bgz").exists():
    response = requests.get("https://storage.googleapis.com/sgkit-data/tutorial/1kg.vcf.bgz")
    with open("1kg.vcf.bgz", "wb") as f:
        f.write(response.content)
if not Path("1kg.vcf.bgz.tbi").exists():
    response = requests.get("https://storage.googleapis.com/sgkit-data/tutorial/1kg.vcf.bgz.tbi")
    with open("1kg.vcf.bgz.tbi", "wb") as f:
        f.write(response.content)

Importing data from VCF#

Next, convert the VCF file to Zarr using the vcf2zarr command from bio2zarr, storing the output on the local filesystem in a directory called 1kg.vcz.

%%bash
vcf2zarr explode --force 1kg.vcf.bgz 1kg.icf
# vcf2zarr mkschema 1kg.icf > 1kg.schema.json # then edit 1kg.schema.json by hand
vcf2zarr encode --force -s 1kg.schema.json 1kg.icf 1kg.vcz
[W::bcf_hdr_check_sanity] PL should be declared as Number=G
    Scan: 100%|██████████| 1.00/1.00 [00:00<00:00, 1.57files/s]
 Explode: 100%|██████████| 10.9k/10.9k [00:21<00:00, 510vars/s]
  Encode: 100%|██████████| 28.2M/28.2M [00:02<00:00, 12.4MB/s]
Finalise: 100%|██████████| 13.0/13.0 [00:00<00:00, 1.26karray/s]

We used the vcf2zarr explode command to first convert the VCF to an “intermediate columnar format” (ICF), then the vcf2zarr encode command to convert the ICF to Zarr, which by convention is stored in a directory with a vcz extension.

Note that we specified a JSON schema file that was created with the vcf2zarr mkschema command (commented out above), then edited by hand to drop some fields that are not needed for this tutorial (such as FORMAT/PL). It was also updated to name the call_AD field’s third dimension alleles, which vcf2zarr could not infer, since the dataset we are using declares FORMAT/AD with Number=. (meaning “unknown”) rather than Number=R.
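As an illustration of the kind of edit involved, the schema could also be modified programmatically rather than by hand. The following is only a sketch: it assumes the mkschema output is a JSON document with a top-level "fields" list whose entries have "name" and "dimensions" keys, so check the actual output of your bio2zarr version before using it.

import json

with open("1kg.schema.json") as f:
    schema = json.load(f)

# drop fields not needed for this tutorial (e.g. FORMAT/PL)
schema["fields"] = [fld for fld in schema["fields"] if fld["name"] != "call_PL"]

# name the third dimension of call_AD "alleles"
for fld in schema["fields"]:
    if fld["name"] == "call_AD":
        fld["dimensions"][2] = "alleles"

with open("1kg.schema.json", "w") as f:
    json.dump(schema, f, indent=4)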

For more information about using vcf2zarr, see the tutorial in the bio2zarr documentation.

Now that the data has been written as Zarr, all downstream operations will be much faster. Note that sgkit uses an Xarray dataset to represent the VCF data, whereas Hail uses a MatrixTable.

ds = sg.load_dataset("1kg.vcz")

Getting to know our data#

To start with, we’ll look at some summary data from the dataset.

The simplest thing is to look at the dimensions and data variables in the Xarray dataset.

ds
<xarray.Dataset> Size: 28MB
Dimensions:               (variants: 10879, samples: 284, alleles: 2,
                           ploidy: 2, contigs: 84, filters: 1)
Dimensions without coordinates: variants, samples, alleles, ploidy, contigs,
                                filters
Data variables: (12/17)
    call_AD               (variants, samples, alleles) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_DP               (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    call_GQ               (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    call_genotype         (variants, samples, ploidy) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_genotype_mask    (variants, samples, ploidy) bool 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_genotype_phased  (variants, samples) bool 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    ...                    ...
    variant_contig        (variants) int8 11kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_filter        (variants, filters) bool 11kB dask.array<chunksize=(10000, 1), meta=np.ndarray>
    variant_id            (variants) object 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_id_mask       (variants) bool 11kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_position      (variants) int32 44kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_quality       (variants) float32 44kB dask.array<chunksize=(10000,), meta=np.ndarray>
Attributes: (3)

Next we’ll use display_genotypes to show the first and last few variants and samples.

Note: sgkit does not store the contig names in an easily accessible form, so we compute a new variable, variant_contig_name, in the same dataset to store them for later use, and set an index so we can see each variant’s contig name, position, and ID.

ds["variant_contig_name"] = ds.contig_id[ds.variant_contig]
ds2 = ds.set_index({"variants": ("variant_contig_name", "variant_position", "variant_id")})
sg.display_genotypes(ds2, max_variants=10, max_samples=5)
samples HG00096 HG00099 ... NA21133 NA21143
variants
(1, 904165, .) 0/0 0/0 ... 0/0 0/0
(1, 909917, .) 0/0 0/0 ... 0/0 0/0
(1, 986963, .) 0/0 0/0 ... 0/0 0/0
(1, 1563691, .) ./. 0/0 ... 0/0 0/0
(1, 1707740, .) 0/1 0/1 ... 0/1 0/0
... ... ... ... ... ...
(X, 152660491, .) ./. 0/0 ... 1/1 0/0
(X, 153031688, .) 0/0 0/0 ... 0/0 0/0
(X, 153674876, .) 0/0 0/0 ... 0/0 0/0
(X, 153706320, .) ./. 0/0 ... 0/0 0/0
(X, 154087368, .) 0/0 1/1 ... 1/1 1/1

10879 rows x 284 columns

We can show the alleles too.

Note: this needs work to make it easier to do

df_variant = ds[[v for v in ds.data_vars if v.startswith("variant_")]].to_dataframe()
df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"]).agg({"variant_allele": lambda x: list(x)}).head(5)
variant_allele
variant_contig_name variant_position variant_id
1 904165 . [G, A]
909917 . [G, A]
986963 . [C, T]
1563691 . [T, G]
1707740 . [T, G]

Show the first five sample IDs by referencing the dataset variable directly:

ds.sample_id[:5].values
array(['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129'],
      dtype=object)

Adding column fields#

Xarray datasets can have any number of variables added to them, possibly loaded from different sources. Next we’ll take a tab-separated text file containing annotations, and use it to annotate the samples in the dataset.

First we load the annotation data using regular Pandas.

ANNOTATIONS_FILE = "https://storage.googleapis.com/sgkit-gwas-tutorial/1kg_annotations.txt"
df = pd.read_csv(ANNOTATIONS_FILE, sep="\t", index_col="Sample")
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3500 entries, HG00096 to NA21144
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Population           3500 non-null   object
 1   SuperPopulation      3500 non-null   object
 2   isFemale             3500 non-null   bool  
 3   PurpleHair           3500 non-null   bool  
 4   CaffeineConsumption  3500 non-null   int64 
dtypes: bool(2), int64(1), object(2)
memory usage: 116.2+ KB
df
Population SuperPopulation isFemale PurpleHair CaffeineConsumption
Sample
HG00096 GBR EUR False False 4
HG00097 GBR EUR True True 4
HG00098 GBR EUR False False 5
HG00099 GBR EUR True False 4
HG00100 GBR EUR True False 5
... ... ... ... ... ...
NA21137 GIH SAS True False 1
NA21141 GIH SAS True True 2
NA21142 GIH SAS True True 2
NA21143 GIH SAS True True 5
NA21144 GIH SAS True False 3

3500 rows × 5 columns

To join the annotation data with the genetic data, we convert it to Xarray, then do a join.

ds_annotations = pd.DataFrame.to_xarray(df).rename({"Sample":"samples"})
ds = ds.set_index({"samples": "sample_id"})
ds = ds.merge(ds_annotations, join="left")
ds = ds.reset_index("samples").reset_coords(drop=True)
ds
<xarray.Dataset> Size: 28MB
Dimensions:               (samples: 284, variants: 10879, alleles: 2,
                           ploidy: 2, contigs: 84, filters: 1)
Dimensions without coordinates: samples, variants, alleles, ploidy, contigs,
                                filters
Data variables: (12/22)
    call_AD               (variants, samples, alleles) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_DP               (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    call_GQ               (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    call_genotype         (variants, samples, ploidy) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_genotype_mask    (variants, samples, ploidy) bool 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_genotype_phased  (variants, samples) bool 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    ...                    ...
    variant_contig_name   (variants) object 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    Population            (samples) object 2kB 'GBR' 'GBR' 'GBR' ... 'GIH' 'GIH'
    SuperPopulation       (samples) object 2kB 'EUR' 'EUR' 'EUR' ... 'SAS' 'SAS'
    isFemale              (samples) bool 284B False True False ... False True
    PurpleHair            (samples) bool 284B False False False ... True True
    CaffeineConsumption   (samples) int64 2kB 4 4 4 3 6 2 4 2 ... 6 4 6 4 6 5 5
Attributes: (3)

Query functions#

We can look at some statistics of the data by converting the relevant dataset variable to a Pandas series, then using its built-in summary functions. Annotation data is usually small enough to load into memory, which is why it’s OK to use Pandas here.

Here’s the population distribution by continent:

ds_annotations.SuperPopulation.to_series().value_counts()
SuperPopulation
AFR    1018
EUR     669
SAS     661
EAS     617
AMR     535
Name: count, dtype: int64

The distribution of the CaffeineConsumption variable:

ds_annotations.CaffeineConsumption.to_series().describe()
count    3500.000000
mean        3.983714
std         1.702349
min        -1.000000
25%         3.000000
50%         4.000000
75%         5.000000
max        10.000000
Name: CaffeineConsumption, dtype: float64

There are far fewer samples in our dataset than in the full 1000 Genomes dataset, as we can see from the following queries.

len(ds_annotations.samples)
3500
len(ds.samples)
284
ds.SuperPopulation.to_series().value_counts()
SuperPopulation
AFR    76
EAS    72
SAS    55
EUR    47
AMR    34
Name: count, dtype: int64
ds.CaffeineConsumption.to_series().describe()
count    284.000000
mean       4.415493
std        1.580549
min        0.000000
25%        3.000000
50%        4.000000
75%        5.000000
max        9.000000
Name: CaffeineConsumption, dtype: float64

Here’s an example of doing an ad hoc query to uncover a biological insight from the data: counting each of the 12 possible distinct SNP types (4 choices of reference base × 3 choices of alternate base).

df_variant.groupby(["variant_contig_name", "variant_position", "variant_id"])["variant_allele"].apply(tuple).value_counts()
variant_allele
(C, T)    2418
(G, A)    2367
(A, G)    1929
(T, C)    1864
(C, A)     494
(G, T)     477
(T, G)     466
(A, C)     451
(C, G)     150
(G, C)     111
(T, A)      77
(A, T)      75
Name: count, dtype: int64

Often we want to plot the data, to get a feel for how it’s distributed. Xarray has some convenience functions for plotting, which we use here to show the distribution of the DP field.

dp = ds.call_DP.where(ds.call_DP >= 0) # filter out missing
dp.attrs["long_name"] = "DP"
xr.plot.hist(dp, range=(0, 30), bins=30, size=8, edgecolor="black");
[figure: histogram of DP]

Quality control#

QC is the process of filtering out poor-quality data before running an analysis. This is usually an iterative process.

The sample_stats function in sgkit computes a collection of useful metrics for each sample and stores them in new variables. (The Hail equivalent is sample_qc.)

Here’s the dataset before running sample_stats.

ds
<xarray.Dataset> Size: 28MB
Dimensions:               (samples: 284, variants: 10879, alleles: 2,
                           ploidy: 2, contigs: 84, filters: 1)
Dimensions without coordinates: samples, variants, alleles, ploidy, contigs,
                                filters
Data variables: (12/22)
    call_AD               (variants, samples, alleles) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_DP               (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    call_GQ               (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    call_genotype         (variants, samples, ploidy) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_genotype_mask    (variants, samples, ploidy) bool 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>
    call_genotype_phased  (variants, samples) bool 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>
    ...                    ...
    variant_contig_name   (variants) object 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    Population            (samples) object 2kB 'GBR' 'GBR' 'GBR' ... 'GIH' 'GIH'
    SuperPopulation       (samples) object 2kB 'EUR' 'EUR' 'EUR' ... 'SAS' 'SAS'
    isFemale              (samples) bool 284B False True False ... False True
    PurpleHair            (samples) bool 284B False False False ... True True
    CaffeineConsumption   (samples) int64 2kB 4 4 4 3 6 2 4 2 ... 6 4 6 4 6 5 5
Attributes: (3)

We can see the new variables (with names beginning sample_) after we run sample_stats:

ds = sg.sample_stats(ds)
ds
<xarray.Dataset> Size: 28MB
Dimensions:               (samples: 284, variants: 10879, alleles: 2,
                           ploidy: 2, contigs: 84, filters: 1)
Dimensions without coordinates: samples, variants, alleles, ploidy, contigs,
                                filters
Data variables: (12/28)
    sample_n_called       (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>
    sample_call_rate      (samples) float64 2kB dask.array<chunksize=(284,), meta=np.ndarray>
    sample_n_het          (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>
    sample_n_hom_ref      (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>
    sample_n_hom_alt      (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>
    sample_n_non_ref      (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>
    ...                    ...
    variant_contig_name   (variants) object 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    Population            (samples) object 2kB 'GBR' 'GBR' 'GBR' ... 'GIH' 'GIH'
    SuperPopulation       (samples) object 2kB 'EUR' 'EUR' 'EUR' ... 'SAS' 'SAS'
    isFemale              (samples) bool 284B False True False ... False True
    PurpleHair            (samples) bool 284B False False False ... True True
    CaffeineConsumption   (samples) int64 2kB 4 4 4 3 6 2 4 2 ... 6 4 6 4 6 5 5
Attributes: (3)

We can plot the metrics next.

ds.sample_call_rate.attrs["long_name"] = "Sample call rate"
xr.plot.hist(ds.sample_call_rate, range=(.88,1), bins=50, size=8, edgecolor="black");
[figure: histogram of sample call rate]
gq = ds.call_GQ.where(ds.call_GQ >= 0) # filter out missing
sample_gq_mean = gq.mean(dim="variants")
sample_gq_mean.attrs["long_name"] = "Mean Sample GQ"
xr.plot.hist(sample_gq_mean, range=(10,70), bins=60, size=8, edgecolor="black");
[figure: histogram of mean sample GQ]
dp = ds.call_DP.where(ds.call_DP >= 0) # filter out missing
sample_dp_mean = dp.mean(dim="variants")
sample_dp_mean.attrs["long_name"] = "Mean Sample DP"
ds["sample_dp_mean"] = sample_dp_mean # add new data array to dataset
# Following does not work with recent versions of xarray, see https://github.com/sgkit-dev/sgkit/issues/934
#ds.plot.scatter(x="sample_dp_mean", y="sample_call_rate", size=8, s=10);
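Until that issue is fixed, a workaround sketch is to pull the two arrays into NumPy and draw the same scatter plot with matplotlib directly:

import matplotlib.pyplot as plt

# workaround sketch: draw the scatter plot with matplotlib directly
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(ds.sample_dp_mean.values, ds.sample_call_rate.values, s=10)
ax.set_xlabel("Mean Sample DP")
ax.set_ylabel("Sample call rate");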

The following removes outliers using an arbitrary cutoff.

ds = ds.sel(samples=((ds.sample_dp_mean >= 4) & (ds.sample_call_rate >= 0.97)).compute())
print(f"After filter, {len(ds.samples)}/284 samples remain.")
After filter, 250/284 samples remain.

Genotype QC is more complicated. First we calculate a variable ab, which is the fraction of reads that were the alternate allele.

# set values to nan where the alt allele count is missing or where the total read count is zero
ad1 = ds.call_AD.sel(dict(alleles=1)).pipe(lambda v: v.where(v >= 0))
adsum = ds.call_AD.sum(dim="alleles").pipe(lambda v: v.where(v != 0))
# compute alternate allele read fraction
ab = ad1 / adsum

Then we can use the ab variable in the filter condition, to filter out homozygous reference calls with more than 10% alternate reads, homozygous alternate calls with more than 10% reference reads, and heterozygous calls whose ref/alt read balance is far from 50%.

GT = ds.call_genotype
hom_ref = (GT == 0).all(dim="ploidy")
het = GT[..., 0] != GT[..., 1]
hom_alt = ((GT > 0) & (GT[..., 0] == GT)).all(dim="ploidy")
filter_condition_ab = ((hom_ref & (ab <= 0.1)) |
                        (het & (ab >= 0.25) & (ab <= 0.75)) |
                        (hom_alt & (ab >= 0.9)))
filter_mask = xr.where(ds.call_genotype_mask, True, filter_condition_ab)
fraction_filtered = GT.where(~filter_mask).count().values / GT.size
print(f"Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.")
Filtering 3.65% entries out of downstream analysis.

Note: genotype QC filters out a slightly different number of entries here compared to the Hail tutorial.
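The tutorial only reports this fraction and carries on with the unfiltered calls. If we did want to exclude the failing entries, one approach (a sketch, not part of the original tutorial) is to mark them as missing, since sgkit uses -1 as its missing-genotype sentinel:

# sketch: mark genotype calls that fail the ab filter as missing (-1)
ds_qc = ds.copy()
ds_qc["call_genotype"] = ds.call_genotype.where(filter_mask, -1)
# keep the boolean mask consistent with the new missing entries
ds_qc["call_genotype_mask"] = ds_qc.call_genotype < 0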

Variant QC is similar. This time we use the variant_stats function, but we won’t do any filtering on these variables.

ds = sg.variant_stats(ds)
ds
<xarray.Dataset> Size: 26MB
Dimensions:                   (variants: 10879, alleles: 2, samples: 250,
                               ploidy: 2, contigs: 84, filters: 1)
Dimensions without coordinates: variants, alleles, samples, ploidy, contigs,
                                filters
Data variables: (12/38)
    variant_n_called          (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_call_rate         (variants) float64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_n_het             (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_n_hom_ref         (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_n_hom_alt         (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_n_non_ref         (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    ...                        ...
    Population                (samples) object 2kB 'GBR' 'GBR' ... 'GIH' 'GIH'
    SuperPopulation           (samples) object 2kB 'EUR' 'EUR' ... 'SAS' 'SAS'
    isFemale                  (samples) bool 250B False True ... False True
    PurpleHair                (samples) bool 250B False False ... True True
    CaffeineConsumption       (samples) int64 2kB 4 4 4 3 6 2 2 ... 4 6 4 6 5 5
    sample_dp_mean            (samples) float32 1kB dask.array<chunksize=(250,), meta=np.ndarray>
Attributes: (3)

Let’s do a GWAS!#

First we need to restrict to common variants (using a 1% allele frequency cutoff) that are not far from Hardy-Weinberg equilibrium.

ds = sg.hardy_weinberg_test(ds)

(If a warning is shown, it is telling us that some variables are being regenerated, and can be safely ignored.)

ds
<xarray.Dataset> Size: 26MB
Dimensions:                   (variants: 10879, genotypes: 3, alleles: 2,
                               samples: 250, ploidy: 2, contigs: 84, filters: 1)
Coordinates:
  * genotypes                 (genotypes) <U3 36B '0/0' '0/1' '1/1'
Dimensions without coordinates: variants, alleles, samples, ploidy, contigs,
                                filters
Data variables: (12/41)
    variant_hwe_p_value       (variants) float64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_genotype_count    (variants, genotypes) uint64 261kB dask.array<chunksize=(10000, 3), meta=np.ndarray>
    genotype_id               (genotypes) <U3 36B dask.array<chunksize=(3,), meta=np.ndarray>
    variant_n_called          (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_call_rate         (variants) float64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_n_het             (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>
    ...                        ...
    Population                (samples) object 2kB 'GBR' 'GBR' ... 'GIH' 'GIH'
    SuperPopulation           (samples) object 2kB 'EUR' 'EUR' ... 'SAS' 'SAS'
    isFemale                  (samples) bool 250B False True ... False True
    PurpleHair                (samples) bool 250B False False ... True True
    CaffeineConsumption       (samples) int64 2kB 4 4 4 3 6 2 2 ... 4 6 4 6 5 5
    sample_dp_mean            (samples) float32 1kB dask.array<chunksize=(250,), meta=np.ndarray>
Attributes: (3)
ds = ds.sel(variants=((ds.variant_allele_frequency[:,1] > 0.01) & (ds.variant_hwe_p_value > 1e-6)).compute())

Note: again, the number of variants differs from the Hail tutorial, but the final results work out to be very similar.

print(f"Samples: {len(ds.samples)}  Variants: {len(ds.variants)}")
Samples: 250  Variants: 8394

Run a linear regression of dosage (number of alt alleles) against the CaffeineConsumption trait.

ds["call_dosage"] = ds.call_genotype.sum(dim="ploidy")
ds_lr = sg.gwas_linear_regression(ds, dosage="call_dosage",
                                  add_intercept=True, covariates=[], traits=["CaffeineConsumption"])
ds_lr
<xarray.Dataset> Size: 37MB
Dimensions:                   (variants: 8394, traits: 1, genotypes: 3,
                               alleles: 2, samples: 250, ploidy: 2,
                               contigs: 84, filters: 1)
Coordinates:
  * genotypes                 (genotypes) <U3 36B '0/0' '0/1' '1/1'
Dimensions without coordinates: variants, traits, alleles, samples, ploidy,
                                contigs, filters
Data variables: (12/45)
    variant_linreg_beta       (variants, traits) float64 67kB dask.array<chunksize=(5439, 1), meta=np.ndarray>
    variant_linreg_t_value    (variants, traits) float64 67kB dask.array<chunksize=(5439, 1), meta=np.ndarray>
    variant_linreg_p_value    (variants, traits) float64 67kB dask.array<chunksize=(5439, 1), meta=np.ndarray>
    variant_hwe_p_value       (variants) float64 67kB dask.array<chunksize=(5439,), meta=np.ndarray>
    variant_genotype_count    (variants, genotypes) uint64 201kB dask.array<chunksize=(5439, 3), meta=np.ndarray>
    genotype_id               (genotypes) <U3 36B dask.array<chunksize=(3,), meta=np.ndarray>
    ...                        ...
    SuperPopulation           (samples) object 2kB 'EUR' 'EUR' ... 'SAS' 'SAS'
    isFemale                  (samples) bool 250B False True ... False True
    PurpleHair                (samples) bool 250B False False ... True True
    CaffeineConsumption       (samples) int64 2kB 4 4 4 3 6 2 2 ... 4 6 4 6 5 5
    sample_dp_mean            (samples) float32 1kB dask.array<chunksize=(250,), meta=np.ndarray>
    call_dosage               (variants, samples) int64 17MB dask.array<chunksize=(5439, 250), meta=np.ndarray>
Attributes: (3)

You can see that new variables have been added for variant_linreg_p_value, variant_linreg_t_value, and variant_linreg_beta.
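A quick way to eyeball the strongest associations (not in the original tutorial) is to pull the p-values into Pandas and sort:

df_hits = ds_lr[["variant_contig_name", "variant_position", "variant_linreg_p_value"]].to_dataframe()
df_hits.sort_values("variant_linreg_p_value").head()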

Since sgkit doesn’t have any plotting utilities, we implement Manhattan plots and QQ plots here using Seaborn.

import seaborn as sns
%matplotlib inline
def manhattan_plot(ds):
    df = ds[["variant_contig_name", "variant_contig", "variant_position", "variant_linreg_p_value"]].to_dataframe()
    df["variant_linreg_log_p_value"] = -np.log10(df["variant_linreg_p_value"])
    df = df.astype({"variant_position": np.int64}) # to avoid overflow in cumulative_pos
    
    # from https://github.com/mojones/video_notebooks/blob/master/Manhattan%20plots%20in%20Python.ipynb, cell 20
    running_pos = 0
    cumulative_pos = []
    for chrom, group_df in df.groupby("variant_contig"):  
        cumulative_pos.append(group_df["variant_position"] + running_pos)
        running_pos += group_df["variant_position"].max()
    df["cumulative_pos"] = pd.concat(cumulative_pos)
    
    df["color group"] = df["variant_contig"].apply(lambda x : "A" if x % 2 == 0 else "B")
    
    g = sns.relplot(
        data = df,
        x = "cumulative_pos",
        y = "variant_linreg_log_p_value",
        hue = "color group",
        palette = ["blue", "orange"],
        linewidth=0,
        s=10,
        legend=None,
        aspect=3
    )
    g.ax.set_xlabel("Chromosome")
    g.ax.set_xticks(df.groupby("variant_contig")["cumulative_pos"].median())
    g.ax.set_xticklabels(df["variant_contig_name"].unique())
manhattan_plot(ds_lr)
[figure: Manhattan plot]
import math
import matplotlib.pyplot as plt

def qq_plot(ds):
    p = ds["variant_linreg_p_value"].squeeze().values
    p.sort()
    n = len(p)
    expected_p = -np.log10(np.arange(1, n + 1) / n)
    observed_p = -np.log10(p)
    max_val = math.ceil(max(np.max(expected_p), np.max(observed_p)))

    df = pd.DataFrame({"Expected -log10(p)": expected_p, "Observed -log10(p)": observed_p})

    fig, ax = plt.subplots(figsize=(12, 12));
    g = sns.scatterplot(data=df, x="Expected -log10(p)", y="Observed -log10(p)", ax=ax, linewidth=0)

    x_pred = np.linspace(0, max_val, 50)
    sns.lineplot(x=x_pred, y=x_pred, ax=ax)
    
    g.set(xlim=(0, max_val), ylim=(0, max_val))
qq_plot(ds_lr)
[figure: QQ plot]

Confounded!#

As explained in the Hail tutorial, the data contains a confounder, so it is necessary to include ancestry as a covariate in the linear regression.

Rather than just use the reported ancestry, it’s better to use principal components from running a PCA on the data.

ds_pca = sg.stats.pca.count_call_alternate_alleles(ds)
# To run PCA we need to filter out variants with any missing alt allele counts,
# or with zero variance in the counts across samples
variant_mask = (((ds_pca.call_alternate_allele_count < 0).any(dim="samples")) | \
    (ds_pca.call_alternate_allele_count.std(dim="samples") <= 0.0)).compute()
ds_pca = ds_pca.sel(variants=~variant_mask)
ds_pca = sg.pca(ds_pca)
ds_pca.sample_pca_projection.values
array([[ -8.453594 , -26.128805 , -11.008168 , ..., -14.8012085,
         24.537296 ,  -1.0794808],
       [ -9.496218 , -26.31961  , -10.116499 , ...,   1.6827672,
          7.6817017,  -5.957277 ],
       [ -7.873404 , -25.404314 ,  -9.85919  , ...,   4.382873 ,
         -9.368455 ,   6.384331 ],
       ...,
       [-10.974406 , -11.576627 ,  20.124643 , ...,  -4.4210625,
         -0.5393088,   1.0124567],
       [-10.754401 , -11.414771 ,  15.358787 , ...,   1.7951679,
          3.4263666,  -7.985675 ],
       [-13.062882 , -11.688103 ,  16.351276 , ...,  -7.205051 ,
         -1.7339797,   5.17502  ]], dtype=float32)
ds_pca
<xarray.Dataset> Size: 17MB
Dimensions:                              (samples: 250, components: 10,
                                          variants: 3491, genotypes: 3,
                                          alleles: 2, ploidy: 2, contigs: 84,
                                          filters: 1)
Coordinates:
  * genotypes                            (genotypes) <U3 36B '0/0' '0/1' '1/1'
Dimensions without coordinates: samples, components, variants, alleles, ploidy,
                                contigs, filters
Data variables: (12/48)
    sample_pca_projection                (samples, components) float32 10kB dask.array<chunksize=(250, 10), meta=np.ndarray>
    sample_pca_component                 (variants, components) float32 140kB dask.array<chunksize=(3491, 10), meta=np.ndarray>
    sample_pca_explained_variance        (components) float32 40B dask.array<chunksize=(10,), meta=np.ndarray>
    sample_pca_explained_variance_ratio  (components) float32 40B dask.array<chunksize=(10,), meta=np.ndarray>
    sample_pca_loading                   (variants, components) float32 140kB dask.array<chunksize=(3491, 10), meta=np.ndarray>
    call_alternate_allele_count          (variants, samples) int16 2MB dask.array<chunksize=(3491, 250), meta=np.ndarray>
    ...                                   ...
    SuperPopulation                      (samples) object 2kB 'EUR' ... 'SAS'
    isFemale                             (samples) bool 250B False True ... True
    PurpleHair                           (samples) bool 250B False ... True
    CaffeineConsumption                  (samples) int64 2kB 4 4 4 3 ... 4 6 5 5
    sample_dp_mean                       (samples) float32 1kB dask.array<chunksize=(250,), meta=np.ndarray>
    call_dosage                          (variants, samples) int64 7MB dask.array<chunksize=(3491, 250), meta=np.ndarray>
Attributes: (3)

Let’s plot the first two components. Notice how they cluster by ancestry.

ds_pca["sample_pca_projection_0"] = ds_pca.sample_pca_projection[:,0]
ds_pca["sample_pca_projection_1"] = ds_pca.sample_pca_projection[:,1]
ds_pca["sample_pca_projection_2"] = ds_pca.sample_pca_projection[:,2]
# Following does not work with recent versions of xarray, see https://github.com/sgkit-dev/sgkit/issues/934
#ds_pca.plot.scatter(x="sample_pca_projection_0", y="sample_pca_projection_1", hue="SuperPopulation", size=8, s=10);
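As with the earlier scatter plot, a workaround sketch is to convert the projections to a dataframe and draw the plot with Seaborn directly:

# workaround sketch: draw the PCA scatter plot with seaborn
df_pca = ds_pca[["sample_pca_projection_0", "sample_pca_projection_1", "SuperPopulation"]].to_dataframe()
sns.scatterplot(data=df_pca, x="sample_pca_projection_0", y="sample_pca_projection_1", hue="SuperPopulation", s=10);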

Now we can rerun our linear regression, controlling for sample sex and the first few principal components.

# copy pca components back to dataset with full set of variants to run linear regression again
ds["sample_pca_projection_0"] = ds_pca.sample_pca_projection[:,0]
ds["sample_pca_projection_1"] = ds_pca.sample_pca_projection[:,1]
ds["sample_pca_projection_2"] = ds_pca.sample_pca_projection[:,2]
ds_lr = sg.gwas_linear_regression(ds, dosage="call_dosage",
                                  add_intercept=True,
                                  covariates=["isFemale", "sample_pca_projection_0", "sample_pca_projection_1", "sample_pca_projection_2"],
                                  traits=["CaffeineConsumption"])
qq_plot(ds_lr)
[figure: QQ plot]
manhattan_plot(ds_lr)
[figure: Manhattan plot]

The “caffeine consumption” locus on chromosome 8 is clearly apparent, just as in the Hail tutorial.
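As a final sanity check (hypothetical, not in the original tutorial), we can confirm programmatically that the most significant variant sits on chromosome 8:

# find the position of the smallest p-value (argmin skips NaNs for float data)
i = int(ds_lr.variant_linreg_p_value.squeeze().argmin().values)
print(ds_lr.variant_contig_name.values[i], ds_lr.variant_position.values[i])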