{ "cells": [ { "cell_type": "markdown", "id": "accurate-breakdown", "metadata": {}, "source": [ "# GWAS Tutorial" ] }, { "cell_type": "markdown", "id": "written-product", "metadata": {}, "source": [ "This notebook is an [sgkit](https://sgkit-dev.github.io/sgkit) port of Hail's [GWAS Tutorial](https://nbviewer.jupyter.org/github/tomwhite/sgkit/blob/86753e814c6d56982b6950ec3de727f3b1bfad7d/docs/examples/01-genome-wide-association-study.ipynb), which demonstrates how to run a genome-wide SNP association test. Readers are encouraged to read the Hail tutorial alongside this one for more background, and to see the motivation behind some of the steps.\n", "\n", "_Note that some of the results do not exactly match the output from Hail. Also, since sgkit is still a 0.x release, its API is still subject to non-backwards compatible changes._" ] }, { "cell_type": "code", "execution_count": 1, "id": "valid-pride", "metadata": {}, "outputs": [], "source": [ "import sgkit as sg" ] }, { "cell_type": "markdown", "id": "scheduled-spelling", "metadata": {}, "source": [ "Before using sgkit, we import some standard Python libraries and set the Xarray display options to not show all the attributes in a dataset by default." 
] }, { "cell_type": "code", "execution_count": 2, "id": "compound-activity", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import xarray as xr\n", "xr.set_options(display_expand_attrs=False, display_expand_data_vars=True);" ] }, { "cell_type": "markdown", "id": "color-trinity", "metadata": {}, "source": [ "## Download public 1000 Genomes data" ] }, { "cell_type": "markdown", "id": "useful-imperial", "metadata": {}, "source": [ "We use the same small (20MB) portion of the public 1000 Genomes data that Hail uses.\n", "\n", "First, download the file locally:" ] }, { "cell_type": "code", "execution_count": 3, "id": "taken-banana", "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "import requests\n", "\n", "if not Path(\"1kg.vcf.bgz\").exists():\n", " response = requests.get(\"https://storage.googleapis.com/sgkit-data/tutorial/1kg.vcf.bgz\")\n", " with open(\"1kg.vcf.bgz\", \"wb\") as f:\n", " f.write(response.content)\n", "if not Path(\"1kg.vcf.bgz.tbi\").exists():\n", " response = requests.get(\"https://storage.googleapis.com/sgkit-data/tutorial/1kg.vcf.bgz.tbi\")\n", " with open(\"1kg.vcf.bgz.tbi\", \"wb\") as f:\n", " f.write(response.content)" ] }, { "cell_type": "markdown", "id": "headed-restoration", "metadata": {}, "source": [ "## Importing data from VCF" ] }, { "cell_type": "markdown", "id": "elementary-college", "metadata": {}, "source": [ "Next, convert the VCF file to Zarr using the `vcf2zarr` command in [bio2zarr](https://sgkit-dev.github.io/bio2zarr/), stored on the local filesystem in a directory called _1kg.vcz_." 
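] }, { "cell_type": "markdown", "id": "schema-edit-sketch", "metadata": {}, "source": [ "The conversion cell below passes a schema file, _1kg.schema.json_, that was generated with the `vcf2zarr mkschema` command and then edited by hand. As a rough sketch of what such an edit involves (the `fields`, `name`, and `dimensions` keys here are assumptions about the layout of the `mkschema` output, not its documented form; inspect the generated file before relying on them), the same changes could be applied programmatically:\n", "\n", "```python\n", "# Hypothetical sketch: drop an unneeded field and retarget a dimension in a\n", "# vcf2zarr mkschema JSON; assumes a 'fields' list of dicts with 'name' and\n", "# 'dimensions' keys.\n", "def edit_schema(schema):\n", "    schema['fields'] = [f for f in schema['fields'] if f['name'] != 'call_PL']\n", "    for f in schema['fields']:\n", "        if f['name'] == 'call_AD':\n", "            f['dimensions'] = ['variants', 'samples', 'alleles']\n", "    return schema\n", "```\n", "\n", "The hand-edited _1kg.schema.json_ used in this tutorial is the authoritative version; this sketch only illustrates the shape of the change."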
] }, { "cell_type": "code", "execution_count": 4, "id": "78effc1d-b45e-4af5-85ae-e0e7a40ca049", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " Scan: 0%| | 0.00/1.00 [00:00, ?files/s][W::bcf_hdr_check_sanity] PL should be declared as Number=G\n", " Scan: 100%|██████████| 1.00/1.00 [00:01<00:00, 1.00s/files]\n", " Explode: 0%| | 0.00/10.9k [00:00, ?vars/s][W::bcf_hdr_check_sanity] PL should be declared as Number=G\n", " Explode: 100%|██████████| 10.9k/10.9k [00:19<00:00, 547vars/s]\n", " Encode: 100%|██████████| 28.2M/28.2M [00:02<00:00, 12.1MB/s]\n", "Finalise: 100%|██████████| 13.0/13.0 [00:00<00:00, 1.01karray/s]\n" ] } ], "source": [ "%%bash\n", "vcf2zarr explode --force 1kg.vcf.bgz 1kg.icf\n", "# vcf2zarr mkschema 1kg.icf > 1kg.schema.json # then edit 1kg.schema.json by hand\n", "vcf2zarr encode --force -s 1kg.schema.json 1kg.icf 1kg.vcz" ] }, { "cell_type": "markdown", "id": "plastic-running", "metadata": {}, "source": [ "We used the `vcf2zarr explode` command to first convert the VCF to an \"intermediate columnar format\" (ICF), then the `vcf2zarr encode` command to convert the ICF to Zarr, which by convention is stored in a directory with a `vcz` extension.\n", "\n", "Note that we specified a JSON schema file that was created with the `vcf2zarr mkschema` command (commented out above), then edited to drop some fields that are not needed for this tutorial (such as `FORMAT/PL`). It was also updated to change the `call_AD` field's third dimension to be `alleles`, which was not set by `vcf2zarr` since the dataset we are using defines `FORMAT/AD` as `.`, meaning \"unknown\", rather than `R`.\n", "\n", "For more information about using `vcf2zarr`, see the tutorial in the [bio2zarr documentation](https://sgkit-dev.github.io/bio2zarr/).\n", "\n", "Now that the data has been written as Zarr, all downstream operations on it will be much faster. Note that sgkit uses an [Xarray](http://xarray.pydata.org/en/stable/) dataset to represent the VCF data, whereas Hail uses a MatrixTable." ] }, { "cell_type": "code", "execution_count": 5, "id": "blind-sunset", "metadata": {}, "outputs": [], "source": [ "ds = sg.load_dataset(\"1kg.vcz\")" ] }, { "cell_type": "markdown", "id": "imperial-california", "metadata": {}, "source": [ "## Getting to know our data" ] }, { "cell_type": "markdown", "id": "otherwise-constant", "metadata": {}, "source": [ "To start with, we'll look at some summary data from the dataset.\n", "\n", "The simplest thing is to look at the dimensions and data variables in the Xarray dataset." ] }, { "cell_type": "code", "execution_count": 6, "id": "hundred-manitoba", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset> Size: 28MB\n", "Dimensions: (variants: 10879, samples: 284, alleles: 2,\n", " ploidy: 2, contigs: 84, filters: 1)\n", "Dimensions without coordinates: variants, samples, alleles, ploidy, contigs,\n", " filters\n", "Data variables: (12/17)\n", " call_AD (variants, samples, alleles) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_DP (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " call_GQ (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " call_genotype (variants, samples, ploidy) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_genotype_mask (variants, samples, ploidy) bool 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_genotype_phased (variants, samples) bool 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " ... ...\n", " variant_contig (variants) int8 11kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_filter (variants, filters) bool 11kB dask.array<chunksize=(10000, 1), meta=np.ndarray>\n", " variant_id (variants) object 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_id_mask (variants) bool 11kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_position (variants) int32 44kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_quality (variants) float32 44kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", "Attributes: (3)
samples | \n", "HG00096 | \n", "HG00099 | \n", "... | \n", "NA21133 | \n", "NA21143 | \n", "
---|---|---|---|---|---|
variants | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
(1, 904165, .) | \n", "0/0 | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "
(1, 909917, .) | \n", "0/0 | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "
(1, 986963, .) | \n", "0/0 | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "
(1, 1563691, .) | \n", "./. | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "
(1, 1707740, .) | \n", "0/1 | \n", "0/1 | \n", "... | \n", "0/1 | \n", "0/0 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
(X, 152660491, .) | \n", "./. | \n", "0/0 | \n", "... | \n", "1/1 | \n", "0/0 | \n", "
(X, 153031688, .) | \n", "0/0 | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "
(X, 153674876, .) | \n", "0/0 | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "
(X, 153706320, .) | \n", "./. | \n", "0/0 | \n", "... | \n", "0/0 | \n", "0/0 | \n", "
(X, 154087368, .) | \n", "0/0 | \n", "1/1 | \n", "... | \n", "1/1 | \n", "1/1 | \n", "
10879 rows x 284 columns
\n", " | \n", " | \n", " | variant_allele | \n", "
---|---|---|---|
variant_contig_name | \n", "variant_position | \n", "variant_id | \n", "\n", " |
1 | \n", "904165 | \n", ". | \n", "[G, A] | \n", "
909917 | \n", ". | \n", "[G, A] | \n", "|
986963 | \n", ". | \n", "[C, T] | \n", "|
1563691 | \n", ". | \n", "[T, G] | \n", "|
1707740 | \n", ". | \n", "[T, G] | \n", "
\n", " | Population | \n", "SuperPopulation | \n", "isFemale | \n", "PurpleHair | \n", "CaffeineConsumption | \n", "
---|---|---|---|---|---|
Sample | \n", "\n", " | \n", " | \n", " | \n", " | \n", " |
HG00096 | \n", "GBR | \n", "EUR | \n", "False | \n", "False | \n", "4 | \n", "
HG00097 | \n", "GBR | \n", "EUR | \n", "True | \n", "True | \n", "4 | \n", "
HG00098 | \n", "GBR | \n", "EUR | \n", "False | \n", "False | \n", "5 | \n", "
HG00099 | \n", "GBR | \n", "EUR | \n", "True | \n", "False | \n", "4 | \n", "
HG00100 | \n", "GBR | \n", "EUR | \n", "True | \n", "False | \n", "5 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
NA21137 | \n", "GIH | \n", "SAS | \n", "True | \n", "False | \n", "1 | \n", "
NA21141 | \n", "GIH | \n", "SAS | \n", "True | \n", "True | \n", "2 | \n", "
NA21142 | \n", "GIH | \n", "SAS | \n", "True | \n", "True | \n", "2 | \n", "
NA21143 | \n", "GIH | \n", "SAS | \n", "True | \n", "True | \n", "5 | \n", "
NA21144 | \n", "GIH | \n", "SAS | \n", "True | \n", "False | \n", "3 | \n", "
3500 rows × 5 columns
\n", "<xarray.Dataset> Size: 28MB\n", "Dimensions: (samples: 284, variants: 10879, alleles: 2,\n", " ploidy: 2, contigs: 84, filters: 1)\n", "Dimensions without coordinates: samples, variants, alleles, ploidy, contigs,\n", " filters\n", "Data variables: (12/22)\n", " call_AD (variants, samples, alleles) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_DP (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " call_GQ (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " call_genotype (variants, samples, ploidy) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_genotype_mask (variants, samples, ploidy) bool 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_genotype_phased (variants, samples) bool 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " ... ...\n", " variant_contig_name (variants) object 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " Population (samples) object 2kB 'GBR' 'GBR' 'GBR' ... 'GIH' 'GIH'\n", " SuperPopulation (samples) object 2kB 'EUR' 'EUR' 'EUR' ... 'SAS' 'SAS'\n", " isFemale (samples) bool 284B False True False ... False True\n", " PurpleHair (samples) bool 284B False False False ... True True\n", " CaffeineConsumption (samples) int64 2kB 4 4 4 3 6 2 4 2 ... 6 4 6 4 6 5 5\n", "Attributes: (3)
<xarray.Dataset> Size: 28MB\n", "Dimensions: (samples: 284, variants: 10879, alleles: 2,\n", " ploidy: 2, contigs: 84, filters: 1)\n", "Dimensions without coordinates: samples, variants, alleles, ploidy, contigs,\n", " filters\n", "Data variables: (12/22)\n", " call_AD (variants, samples, alleles) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_DP (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " call_GQ (variants, samples) int8 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " call_genotype (variants, samples, ploidy) int8 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_genotype_mask (variants, samples, ploidy) bool 6MB dask.array<chunksize=(10000, 284, 2), meta=np.ndarray>\n", " call_genotype_phased (variants, samples) bool 3MB dask.array<chunksize=(10000, 284), meta=np.ndarray>\n", " ... ...\n", " variant_contig_name (variants) object 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " Population (samples) object 2kB 'GBR' 'GBR' 'GBR' ... 'GIH' 'GIH'\n", " SuperPopulation (samples) object 2kB 'EUR' 'EUR' 'EUR' ... 'SAS' 'SAS'\n", " isFemale (samples) bool 284B False True False ... False True\n", " PurpleHair (samples) bool 284B False False False ... True True\n", " CaffeineConsumption (samples) int64 2kB 4 4 4 3 6 2 4 2 ... 6 4 6 4 6 5 5\n", "Attributes: (3)
<xarray.Dataset> Size: 28MB\n", "Dimensions: (samples: 284, variants: 10879, alleles: 2,\n", " ploidy: 2, contigs: 84, filters: 1)\n", "Dimensions without coordinates: samples, variants, alleles, ploidy, contigs,\n", " filters\n", "Data variables: (12/28)\n", " sample_n_called (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>\n", " sample_call_rate (samples) float64 2kB dask.array<chunksize=(284,), meta=np.ndarray>\n", " sample_n_het (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>\n", " sample_n_hom_ref (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>\n", " sample_n_hom_alt (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>\n", " sample_n_non_ref (samples) int64 2kB dask.array<chunksize=(284,), meta=np.ndarray>\n", " ... ...\n", " variant_contig_name (variants) object 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " Population (samples) object 2kB 'GBR' 'GBR' 'GBR' ... 'GIH' 'GIH'\n", " SuperPopulation (samples) object 2kB 'EUR' 'EUR' 'EUR' ... 'SAS' 'SAS'\n", " isFemale (samples) bool 284B False True False ... False True\n", " PurpleHair (samples) bool 284B False False False ... True True\n", " CaffeineConsumption (samples) int64 2kB 4 4 4 3 6 2 4 2 ... 6 4 6 4 6 5 5\n", "Attributes: (3)
<xarray.Dataset> Size: 26MB\n", "Dimensions: (variants: 10879, alleles: 2, samples: 250,\n", " ploidy: 2, contigs: 84, filters: 1)\n", "Dimensions without coordinates: variants, alleles, samples, ploidy, contigs,\n", " filters\n", "Data variables: (12/38)\n", " variant_n_called (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_call_rate (variants) float64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_n_het (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_n_hom_ref (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_n_hom_alt (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_n_non_ref (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " ... ...\n", " Population (samples) object 2kB 'GBR' 'GBR' ... 'GIH' 'GIH'\n", " SuperPopulation (samples) object 2kB 'EUR' 'EUR' ... 'SAS' 'SAS'\n", " isFemale (samples) bool 250B False True ... False True\n", " PurpleHair (samples) bool 250B False False ... True True\n", " CaffeineConsumption (samples) int64 2kB 4 4 4 3 6 2 2 ... 4 6 4 6 5 5\n", " sample_dp_mean (samples) float32 1kB dask.array<chunksize=(250,), meta=np.ndarray>\n", "Attributes: (3)
<xarray.Dataset> Size: 26MB\n", "Dimensions: (variants: 10879, genotypes: 3, alleles: 2,\n", " samples: 250, ploidy: 2, contigs: 84, filters: 1)\n", "Coordinates:\n", " * genotypes (genotypes) <U3 36B '0/0' '0/1' '1/1'\n", "Dimensions without coordinates: variants, alleles, samples, ploidy, contigs,\n", " filters\n", "Data variables: (12/41)\n", " variant_hwe_p_value (variants) float64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_genotype_count (variants, genotypes) uint64 261kB dask.array<chunksize=(10000, 3), meta=np.ndarray>\n", " genotype_id (genotypes) <U3 36B dask.array<chunksize=(3,), meta=np.ndarray>\n", " variant_n_called (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_call_rate (variants) float64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " variant_n_het (variants) int64 87kB dask.array<chunksize=(10000,), meta=np.ndarray>\n", " ... ...\n", " Population (samples) object 2kB 'GBR' 'GBR' ... 'GIH' 'GIH'\n", " SuperPopulation (samples) object 2kB 'EUR' 'EUR' ... 'SAS' 'SAS'\n", " isFemale (samples) bool 250B False True ... False True\n", " PurpleHair (samples) bool 250B False False ... True True\n", " CaffeineConsumption (samples) int64 2kB 4 4 4 3 6 2 2 ... 4 6 4 6 5 5\n", " sample_dp_mean (samples) float32 1kB dask.array<chunksize=(250,), meta=np.ndarray>\n", "Attributes: (3)
<xarray.Dataset> Size: 37MB\n", "Dimensions: (variants: 8394, traits: 1, genotypes: 3,\n", " alleles: 2, samples: 250, ploidy: 2,\n", " contigs: 84, filters: 1)\n", "Coordinates:\n", " * genotypes (genotypes) <U3 36B '0/0' '0/1' '1/1'\n", "Dimensions without coordinates: variants, traits, alleles, samples, ploidy,\n", " contigs, filters\n", "Data variables: (12/45)\n", " variant_linreg_beta (variants, traits) float64 67kB dask.array<chunksize=(5439, 1), meta=np.ndarray>\n", " variant_linreg_t_value (variants, traits) float64 67kB dask.array<chunksize=(5439, 1), meta=np.ndarray>\n", " variant_linreg_p_value (variants, traits) float64 67kB dask.array<chunksize=(5439, 1), meta=np.ndarray>\n", " variant_hwe_p_value (variants) float64 67kB dask.array<chunksize=(5439,), meta=np.ndarray>\n", " variant_genotype_count (variants, genotypes) uint64 201kB dask.array<chunksize=(5439, 3), meta=np.ndarray>\n", " genotype_id (genotypes) <U3 36B dask.array<chunksize=(3,), meta=np.ndarray>\n", " ... ...\n", " SuperPopulation (samples) object 2kB 'EUR' 'EUR' ... 'SAS' 'SAS'\n", " isFemale (samples) bool 250B False True ... False True\n", " PurpleHair (samples) bool 250B False False ... True True\n", " CaffeineConsumption (samples) int64 2kB 4 4 4 3 6 2 2 ... 4 6 4 6 5 5\n", " sample_dp_mean (samples) float32 1kB dask.array<chunksize=(250,), meta=np.ndarray>\n", " call_dosage (variants, samples) int64 17MB dask.array<chunksize=(5439, 250), meta=np.ndarray>\n", "Attributes: (3)
<xarray.Dataset> Size: 17MB\n", "Dimensions: (samples: 250, components: 10,\n", " variants: 3491, genotypes: 3,\n", " alleles: 2, ploidy: 2, contigs: 84,\n", " filters: 1)\n", "Coordinates:\n", " * genotypes (genotypes) <U3 36B '0/0' '0/1' '1/1'\n", "Dimensions without coordinates: samples, components, variants, alleles, ploidy,\n", " contigs, filters\n", "Data variables: (12/48)\n", " sample_pca_projection (samples, components) float32 10kB dask.array<chunksize=(250, 10), meta=np.ndarray>\n", " sample_pca_component (variants, components) float32 140kB dask.array<chunksize=(3491, 10), meta=np.ndarray>\n", " sample_pca_explained_variance (components) float32 40B dask.array<chunksize=(10,), meta=np.ndarray>\n", " sample_pca_explained_variance_ratio (components) float32 40B dask.array<chunksize=(10,), meta=np.ndarray>\n", " sample_pca_loading (variants, components) float32 140kB dask.array<chunksize=(3491, 10), meta=np.ndarray>\n", " call_alternate_allele_count (variants, samples) int16 2MB dask.array<chunksize=(3491, 250), meta=np.ndarray>\n", " ... ...\n", " SuperPopulation (samples) object 2kB 'EUR' ... 'SAS'\n", " isFemale (samples) bool 250B False True ... True\n", " PurpleHair (samples) bool 250B False ... True\n", " CaffeineConsumption (samples) int64 2kB 4 4 4 3 ... 4 6 5 5\n", " sample_dp_mean (samples) float32 1kB dask.array<chunksize=(250,), meta=np.ndarray>\n", " call_dosage (variants, samples) int64 7MB dask.array<chunksize=(3491, 250), meta=np.ndarray>\n", "Attributes: (3)