API reference#
This page provides an auto-generated summary of sgkit’s API.
IO/imports and exports#
See Reading and writing genetic data
BGEN#
|
Convert a BGEN file to a Zarr on-disk store. |
|
Read BGEN dataset. |
|
Rechunk BGEN dataset as Zarr. |
PLINK#
|
Convert a PLINK file to a Zarr on-disk store. |
|
Read PLINK dataset. |
|
Convert a dataset to a PLINK file. |
|
Convert a Zarr on-disk store to a PLINK file. |
VCF#
Functions for reading and writing VCF have been removed from sgkit; use the bio2zarr and vcztools packages instead.
Dataset#
|
Load a dataset from Zarr storage. |
|
Save a dataset to Zarr storage. |
Methods#
Basic genetics statistics#
|
Compute per sample allele frequencies from genotype calls. |
|
Compute allele frequencies for each cohort. |
|
Compute per sample allele counts from genotype calls. |
|
Compute per cohort allele counts from per-sample allele counts, or genotype calls. |
|
Compute allele count from per-sample allele counts, or genotype calls. |
|
Count the number of calls of each possible genotype, at each variant. |
|
Compute quality control variant statistics from genotype calls. |
|
Compute quality control sample statistics from genotype calls. |
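As a rough illustration of the allele count and allele frequency entries above, the following plain-Python sketch tallies alleles per variant from genotype calls. It is not sgkit's implementation: the function names, the nested-list layout (`[variant][sample][allele position]`), and the use of negative values for missing alleles are assumptions made for this example only.

```python
# Illustrative only: plain-Python stand-ins, not sgkit's implementation.
# Genotype calls are nested lists indexed [variant][sample][allele position];
# negative values mark missing alleles (an assumption of this sketch).


def count_variant_alleles(call_genotype, n_alleles):
    """Count how often each allele index occurs at each variant."""
    counts = []
    for variant in call_genotype:
        row = [0] * n_alleles
        for call in variant:
            for allele in call:
                if allele >= 0:  # skip missing alleles
                    row[allele] += 1
        counts.append(row)
    return counts


def allele_frequencies(counts):
    """Divide each variant's allele counts by its total of called alleles."""
    freqs = []
    for row in counts:
        total = sum(row)
        # A variant with no called alleles has an undefined frequency.
        freqs.append([c / total if total else float("nan") for c in row])
    return freqs


calls = [
    [[0, 0], [0, 1], [1, 1]],    # variant 0: all three samples called
    [[0, 1], [-1, -1], [1, 1]],  # variant 1: second sample missing
]
variant_counts = count_variant_alleles(calls, n_alleles=2)
variant_freqs = allele_frequencies(variant_counts)
print(variant_counts)
print(variant_freqs)
```

Missing calls are excluded from both the counts and the denominator, so the frequencies at the second variant are based on four called alleles rather than six.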
Population genetic statistics#
|
Compute divergence between pairs of cohorts. |
|
Compute diversity from cohort allele counts. |
|
Compute Fst between pairs of cohorts. |
|
Compute the H1, H12, H123 and H2/H1 statistics for detecting signatures of soft sweeps, as defined in Garud et al. (2015). |
|
Compute per call individual heterozygosity. |
|
Compute per cohort observed heterozygosity. |
|
Compute the population branching statistic (PBS) between cohort triples. |
|
Compute Tajima's D for a genotype call dataset. |
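The H1, H12, H123 and H2/H1 statistics listed above are all derived from the sorted frequencies of distinct haplotypes. The sketch below computes them for a single window in plain Python. The function name `garud_h` and the tuple-based haplotype representation are hypothetical choices for this example, and the sketch ignores windowing and cohorts entirely.

```python
from collections import Counter

# Illustrative only: a single-window sketch, not sgkit's implementation.


def garud_h(haplotypes):
    """Return (H1, H12, H123, H2/H1) for a list of hashable haplotypes."""
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    # Frequencies of distinct haplotypes, most common first.
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    freqs += [0.0] * 3  # pad so the three largest frequencies always exist
    p1, p2, p3 = freqs[:3]
    h1 = sum(p * p for p in freqs)
    # H12 pools the two most common haplotypes, H123 the three most common.
    h12 = (p1 + p2) ** 2 + sum(p * p for p in freqs[2:])
    h123 = (p1 + p2 + p3) ** 2 + sum(p * p for p in freqs[3:])
    # H2 is H1 with the most common haplotype's contribution removed.
    h2_h1 = (h1 - p1 * p1) / h1
    return h1, h12, h123, h2_h1


# Eight haplotypes over three sites: one at frequency 0.5, two at 0.25.
haps = [(0, 0, 1)] * 4 + [(0, 1, 1)] * 2 + [(1, 1, 1)] * 2
h1, h12, h123, h2_h1 = garud_h(haps)
print(h1, h12, h123, h2_h1)
```

With only three distinct haplotypes in the sample, pooling the top three accounts for every haplotype, which is why H123 reaches its maximum here.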
Linkage Disequilibrium#
Genetic Association and Regression#
|
Run linear regression to identify continuous trait associations with genetic variants. |
|
Regenie trait transformation. |
|
Exact test for HWE as described in Wigginton et al. 2005 [1]. |
|
Compute gene-ε as described in Cheng et al. 2020 [1]. |
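To make the linear regression entry above more concrete, here is a minimal single-variant sketch that regresses a continuous trait on allele dosage and reports the effect size (beta) and its t statistic. It assumes no covariates, no missing values, and one variant at a time. The function name and inputs are hypothetical, and sgkit's own routine may differ substantially.

```python
import math

# Illustrative only: ordinary least squares for one variant and one trait,
# with an intercept and no covariates. Not sgkit's implementation.


def simple_linear_regression(dosage, trait):
    """Return (beta, t_statistic) for trait ~ intercept + beta * dosage."""
    n = len(dosage)
    x_mean = sum(dosage) / n
    y_mean = sum(trait) / n
    sxx = sum((x - x_mean) ** 2 for x in dosage)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(dosage, trait))
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosage, trait))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return beta, beta / standard_error


# Six samples: alternate allele dosage and a made-up continuous trait.
dosage = [0, 1, 2, 0, 1, 2]
trait = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
beta, t_stat = simple_linear_regression(dosage, trait)
print(beta, t_stat)
```

A p-value would follow from comparing the t statistic against a t distribution with `n - 2` degrees of freedom, which is left out here to keep the sketch dependency-free.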
Miscellaneous#
|
Compute a maximal independent set of variants. |
|
Calculate the pairwise distance between all pairs of row vectors in the given two-dimensional array x. |
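The pairwise distance entry above can be pictured as filling a square matrix whose element (i, j) is the distance between rows i and j of the input. The sketch below does this with the Euclidean metric in plain Python. The function name and the choice of metric are assumptions for illustration, and a real implementation would normally operate on chunked arrays rather than lists.

```python
import math

# Illustrative only: Euclidean distance between every pair of rows.


def pairwise_euclidean(x):
    """Return an n-by-n nested list of distances between the rows of x."""
    n = len(x)
    return [[math.dist(x[i], x[j]) for j in range(n)] for i in range(n)]


rows = [[0, 0], [3, 4], [6, 8]]
distances = pairwise_euclidean(rows)
for row in distances:
    print(row)
```

The result is symmetric with zeros on the diagonal, so an implementation concerned with cost could compute only the upper triangle and mirror it.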
Utilities#
|
Convert each call genotype to a single integer value. |
|
Convert genotype probabilities to hard calls. |
|
Display genotype calls. |
|
Display a pedigree dataset as a directed acyclic graph. |
|
Replace partial genotype calls with missing values. |
|
Infer the ploidy of each call genotype based on the number of fill (non-allele) values in each call genotype. |
|
Infer the ploidy of each sample across all variants based on the number of fill (non-allele) values in call genotypes. |
|
Infer the ploidy at each variant across all samples based on the number of fill (non-allele) values in call genotypes. |
|
Calculate the inverse relationship (sub-) matrix. |
|
Calculate the integer indices for the parents of each sample within the samples dimension. |
|
Return a new dataset with each array indexed along the 'samples' dimension using a subset of samples and the optional inclusion of their relatives. |
|
Generate progeny genotypes via a gene-drop simulation (MacCluer et al. 1986 [1]). |
|
Simulate genotype calls and variant/sample data. |
|
Add a window spanning the whole genome to a dataset. |
|
Add window information to a dataset, using arbitrary intervals. |
|
Add window information to a dataset, measured by distance along the genome. |
|
Add window information to a dataset, measured by number of variants. |
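Windowing by number of variants, as in the last entry above, amounts to computing start and stop indexes along the variants dimension. The following sketch shows one way to do that for a single contig. The function name, the `size` and `step` parameters, and the half-open `[start, stop)` convention are assumptions of this example, and it does not handle contig boundaries.

```python
# Illustrative only: fixed-size windows over one contig's variants.


def windows_by_variant(n_variants, size, step=None):
    """Return (starts, stops) index lists for half-open windows."""
    if size <= 0:
        raise ValueError("size must be positive")
    step = size if step is None else step  # default: non-overlapping windows
    if step <= 0:
        raise ValueError("step must be positive")
    starts = list(range(0, n_variants, step))
    # The final window is truncated at the last variant.
    stops = [min(start + size, n_variants) for start in starts]
    return starts, stops


window_start, window_stop = windows_by_variant(10, size=4)
print(window_start)
print(window_stop)
```

Passing a `step` smaller than `size` gives overlapping (sliding) windows, while the default produces adjacent windows with a shorter final one when the variant count is not a multiple of `size`.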
Variables#
By convention, variable names are singular in sgkit. For example, genotype_count, not genotype_counts.
Allele counts. |
|
Allele frequencies. |
|
Dosages, encoded as floats, with NaN indicating a missing value. |
|
A flag for each call indicating which values are missing. |
|
Call genotypes in which partial genotype calls are replaced with completely missing genotype calls. |
|
A flag for each call indicating which values are missing. |
|
Call genotype. |
|
A flag for each call indicating which values are missing. |
|
A flag for each allele position within mixed ploidy call genotypes indicating fill (non-allele) values of lower ploidy calls. |
|
A flag for each call indicating if it is phased or not. |
|
Genotype probabilities. |
|
A flag for each call indicating which values are missing. |
|
Single integer values encoding each call genotype. |
|
A flag for each call genotype index indicating missing indices (-1) which correspond to call genotypes with missing alleles. |
|
Observed heterozygosity of each call genotype. |
|
Call genotype ploidy. |
|
Cohort allele counts. |
|
Cohort allele frequencies. |
|
Covariate variable names. |
|
Interval contig name. |
|
Interval start position (inclusive). |
|
Interval stop position (exclusive). |
|
Variant indexes to drop for LD prune. |
|
REGENIE's base prediction (blocks, alphas, samples, outcomes). |
|
REGENIE's regenie_loco_prediction (contigs, samples, outcomes). |
|
REGENIE's regenie_meta_prediction (samples, outcomes). |
|
Indices of parent samples with negative values indicating unknown parents. |
|
Unique identifiers of parent samples matching those in |
|
PC Relate kinship coefficient matrix. |
|
The fraction of variants with called genotypes. |
|
The index of the cohort that each sample belongs to. |
|
The unique identifier of the sample. |
|
The number of variants with called genotypes. |
|
The number of variants with heterozygous calls. |
|
The number of variants with homozygous alternate calls. |
|
The number of variants with homozygous reference calls. |
|
The number of variants that are not homozygous reference calls. |
|
Principal axes defined as eigenvectors for sample covariance matrix. |
|
Variance explained by each principal component. |
|
Ratio of variance explained to total variance for each principal component, i.e. |
|
PCA loadings defined as principal axes scaled by square root of eigenvalues. |
|
Projection of samples onto principal axes. |
|
Ploidy of each sample calculated from call genotypes across all variants with -1 indicating variable ploidy. |
|
Genetic divergence between pairs of cohorts. |
|
Genetic diversity (also known as "Tajima’s pi") for cohorts. |
|
Fixation index (Fst) between pairs of cohorts. |
|
Garud H1 statistic for cohorts. |
|
Garud H12 statistic for cohorts. |
|
Garud H123 statistic for cohorts. |
|
Garud H2/H1 statistic for cohorts. |
|
Genomic kinship matrix with self-kinship values on the diagonal. |
|
Genomic relationship matrix (GRM). |
|
The probability that two (randomly chosen without replacement) homologues, inherited from a single parent, were derived from a single chromosomal copy within that parent. |
|
Numerical contribution of each parent, for each individual, which must sum to the ploidy of the individual. |
|
The hybrid relationship matrix or "H-matrix". |
|
Inverse of the hybrid relationship matrix or "H-matrix". |
|
Pairwise IBS probabilities among all samples. |
|
The inverse of a relationship matrix. |
|
Observed heterozygosity for cohorts. |
|
Population branching statistic for cohort triples. |
|
Expected proportional genomic contribution to and from each pair of samples based on pedigree structure. |
|
Expected inbreeding coefficients of samples based on pedigree structure. |
|
Inverse of a kinship matrix calculated from pedigree structure. |
|
Inverse of a relationship matrix calculated from pedigree structure. |
|
Pairwise estimates of expected kinship among samples based on pedigree structure with self-kinship values on the diagonal. |
|
Relationship matrix derived from pedigree structure. |
|
Tajima’s D for cohorts. |
|
Pairwise Weir Goudet beta statistic among all samples. |
|
Trait (for example phenotype) variable names. |
|
The possible alleles for the variant. |
|
Variant allele counts. |
|
The frequency of the occurrence of each allele. |
|
The number of occurrences of all alleles. |
|
The number of observations for each possible genotype at each variant. |
|
Beta values associated with each variant and trait. |
|
The fraction of samples with called genotypes. |
|
Index corresponding to contig name for each variant. |
|
P values from HWE test for each variant as float in [0, 1]. |
|
The unique identifier of the variant. |
|
The number of samples with called genotypes. |
|
The number of samples with heterozygous calls. |
|
The number of samples with homozygous alternate calls. |
|
The number of samples with homozygous reference calls. |
|
The number of samples that are not homozygous reference calls. |
|
P values as float in [0, 1]. |
|
Ploidy of each variant calculated from call genotypes across all samples with -1 indicating variable ploidy. |
|
The reference position of the variant. |
|
Scores to prioritize variant selection when constructing an LD matrix. |
|
T statistics for each beta. |
|
The contig index of each window. |
|
The index values of window start positions along the |
|
The index values of window stop positions along the |