API reference#

This page provides an auto-generated summary of sgkit’s API.

IO/imports and exports#

See Reading and writing genetic data

BGEN#

bgen_to_zarr(input, output[, region, ...])

Convert a BGEN file to a Zarr on-disk store.

read_bgen(path[, metafile_path, ...])

Read BGEN dataset.

rechunk_bgen(ds, output, *[, chunk_length, ...])

Rechunk BGEN dataset as Zarr.

VCF (reading)#

Deprecated since version 0.9.0: Functions for reading VCF are deprecated; please use the bio2zarr package instead.

read_vcf(input, *[, target_part_size, ...])

Read VCF dataset.

vcf_to_zarr(input, output, *[, ...])

Convert VCF files to a single Zarr on-disk store.

For more low-level control:

partition_into_regions(vcf_path, *[, ...])

Calculate genomic region strings to partition a compressed VCF or BCF file into roughly equal parts.

vcf_to_zarrs(input, output, regions[, ...])

Convert VCF files to multiple Zarr on-disk stores, one per region.

concat_zarrs(urls, output, *[, ...])

Concatenate multiple Zarr stores into a single Zarr store.

zarr_array_sizes(input, *[, regions, ...])

Make a pass through a VCF/BCF file to determine sizes for storage in Zarr.

For converting from scikit-allel’s VCF Zarr representation to sgkit’s Zarr representation:

read_scikit_allel_vcfzarr(path[, field_defs])

Read a VCF Zarr file created using scikit-allel.

VCF (writing)#

write_vcf(input, output, *[, vcf_header])

Convert a dataset to a VCF file.

zarr_to_vcf(input, output, *[, vcf_header])

Convert a Zarr file to a VCF file.

Dataset#

load_dataset(store[, storage_options])

Load a dataset from Zarr storage.

save_dataset(ds, store[, storage_options, ...])

Save a dataset to Zarr storage.

Methods#

Basic genetics statistics#

call_allele_frequencies(ds, *[, ...])

Compute per-sample allele frequencies from genotype calls.

cohort_allele_frequencies(ds, *[, ...])

Compute allele frequencies for each cohort.

count_call_alleles(ds, *[, call_genotype, merge])

Compute per-sample allele counts from genotype calls.

count_cohort_alleles(ds, *[, ...])

Compute per-cohort allele counts from per-sample allele counts or genotype calls.

count_variant_alleles(ds, *[, ...])

Compute allele counts from per-sample allele counts or genotype calls.

count_variant_genotypes(ds, *[, ...])

Count the number of calls of each possible genotype, at each variant.

variant_stats(ds, *[, call_genotype, ...])

Compute quality control variant statistics from genotype calls.

sample_stats(ds, *[, call_genotype, merge])

Compute quality control sample statistics from genotype calls.
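
As a rough illustration of what the counting functions above compute, the following sketch tallies alleles within each call and then sums those tallies per variant, in plain Python. It is not sgkit's implementation: the helper name `count_alleles_in_call`, the nested-list layout, and the use of negative integers for missing alleles are assumptions made for this example.

```python
from typing import List


def count_alleles_in_call(call: List[int], n_allele: int) -> List[int]:
    """Count how often each allele index occurs in one genotype call.

    Negative values are treated as missing and skipped (an assumption
    for this sketch).
    """
    counts = [0] * n_allele
    for allele in call:
        if allele >= 0:
            counts[allele] += 1
    return counts


# genotypes[variant][sample] is a list of allele indices (hypothetical data).
genotypes = [
    [[0, 1], [1, 1], [-1, 0]],
    [[0, 0], [0, 2], [2, 2]],
]
n_allele = 3

# Per-call counts, shaped (variants, samples, alleles).
call_counts = [
    [count_alleles_in_call(call, n_allele) for call in variant]
    for variant in genotypes
]

# Per-variant counts: sum the per-call counts over samples.
variant_counts = [[sum(column) for column in zip(*variant)] for variant in call_counts]
```

Per-cohort counts follow the same pattern, with the sum restricted to the samples of each cohort.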

Population genetic statistics#

divergence(ds, *[, cohort_allele_count, merge])

Compute divergence between pairs of cohorts.

diversity(ds, *[, cohort_allele_count, merge])

Compute diversity from cohort allele counts.

Fst(ds, *[, estimator, stat_divergence, merge])

Compute Fst between pairs of cohorts.

Garud_H(ds, *[, call_genotype, ...])

Compute the H1, H12, H123 and H2/H1 statistics for detecting signatures of soft sweeps, as defined in Garud et al. (2015).

individual_heterozygosity(ds, *[, ...])

Compute per-call individual heterozygosity.

observed_heterozygosity(ds, *[, ...])

Compute per-cohort observed heterozygosity.

pbs(ds, *[, stat_Fst, cohorts, merge])

Compute the population branching statistic (PBS) between cohort triples.

Tajimas_D(ds, *[, variant_allele_count, ...])

Compute Tajima's D for a genotype call dataset.
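
To make the diversity statistic above more concrete, here is a minimal sketch of the usual mean-pairwise-difference estimator computed from allele counts at each site. The function name `pairwise_diversity` and the example counts are hypothetical, and the sketch makes no claim about how sgkit handles windows, cohorts, or missing data.

```python
from typing import Sequence


def pairwise_diversity(allele_counts: Sequence[int]) -> float:
    """Fraction of allele pairs at one site that carry different alleles."""
    n = sum(allele_counts)
    n_pairs = n * (n - 1) // 2
    if n_pairs == 0:
        # Fewer than two observed alleles: the statistic is undefined.
        return float("nan")
    n_same = sum(c * (c - 1) // 2 for c in allele_counts)
    return (n_pairs - n_same) / n_pairs


# Allele counts for one cohort at three sites (hypothetical data).
site_counts = [[2, 3, 0], [6, 0, 0], [3, 0, 3]]
per_site = [pairwise_diversity(counts) for counts in site_counts]

# Summing per-site values gives the diversity over the region they span.
region_diversity = sum(per_site)
```

A monomorphic site contributes zero, so only segregating sites add to the regional total.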

Relatedness#

Weir_Goudet_beta(ds, *[, ...])

Estimate pairwise beta between all pairs of samples as described in Weir and Goudet 2017 [1].

genomic_relationship(ds, *[, call_dosage, ...])

Compute a genomic relationship matrix (AKA the GRM or G-matrix).

pc_relate(ds, *[, maf, call_genotype, ...])

Compute PC-Relate as described in Conomos, et al. 2016 [1].

pedigree_contribution(ds, *[, method, ...])

Calculate the expected genomic contribution of each sample to each other sample based on pedigree structure.

pedigree_inbreeding(ds, *[, method, parent, ...])

Estimate expected inbreeding coefficients from pedigree structure.

pedigree_inverse_kinship(ds, *[, method, ...])

Calculate the inverse of the kinship matrix from pedigree structure.

pedigree_kinship(ds, *[, method, parent, ...])

Estimate expected pairwise kinship coefficients from pedigree structure.

hybrid_inverse_relationship(ds, *, ...[, ...])

Compute the inverse of a hybrid relationship matrix (AKA the HRM or H-matrix) combining pedigree and genomic information.

hybrid_relationship(ds, *, genomic_relationship)

Compute a hybrid relationship matrix (AKA the HRM or H-matrix) combining pedigree and genomic information.

identity_by_state(ds, *[, call_genotype, ...])

Compute identity by state (IBS) probabilities between all pairs of samples.
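
One common way to define an identity by state (IBS) probability is the chance that an allele drawn at random from one sample matches an allele drawn at random from another, averaged over variants. The sketch below follows that definition in plain Python. The helper names, the treatment of negative values as missing, and the rule of skipping variants where either sample has no called alleles are assumptions of this example rather than a description of `identity_by_state` itself.

```python
from collections import Counter
from typing import Dict, List


def allele_frequencies(call: List[int]) -> Dict[int, float]:
    """Within-call allele frequencies, ignoring negative (missing) values."""
    called = [a for a in call if a >= 0]
    return {a: c / len(called) for a, c in Counter(called).items()}


def ibs_probability(call_a: List[int], call_b: List[int]) -> float:
    """Probability that one random allele from each call is identical."""
    freq_a = allele_frequencies(call_a)
    freq_b = allele_frequencies(call_b)
    return sum(p * freq_b.get(allele, 0.0) for allele, p in freq_a.items())


def mean_ibs(sample_a: List[List[int]], sample_b: List[List[int]]) -> float:
    """Average IBS probability over variants where both samples are called."""
    values = [
        ibs_probability(a, b)
        for a, b in zip(sample_a, sample_b)
        if any(x >= 0 for x in a) and any(x >= 0 for x in b)
    ]
    return sum(values) / len(values) if values else float("nan")


# Calls for two diploid samples at three variants (hypothetical data).
sample_a = [[0, 0], [0, 1], [1, 1]]
sample_b = [[0, 0], [1, 1], [0, 1]]
ibs_ab = mean_ibs(sample_a, sample_b)
ibs_aa = mean_ibs(sample_a, sample_a)
```

Under this definition a heterozygous diploid call has a self-IBS of 0.5, so the diagonal of the resulting matrix is not necessarily 1.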

Linkage Disequilibrium#

ld_matrix(ds, *[, dosage, threshold, ...])

Compute a sparse linkage disequilibrium (LD) matrix.

ld_prune(ds, *[, dosage, threshold, ...])

Prune variants in linkage disequilibrium (LD).
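
Linkage disequilibrium between two variants is often summarised as r², the squared correlation of their dosage vectors across samples. The following sketch computes that quantity in plain Python, assuming complete data. The function name `r_squared` and the dosage vectors are hypothetical, and the exact estimator, windowing, and thresholding used by `ld_matrix` and `ld_prune` are not shown here.

```python
from math import fsum
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mean_x, mean_y = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A variant with no variation has no defined correlation.
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


# Dosages for three variants across five samples (hypothetical data).
variant_1 = [0, 1, 2, 1, 0]
variant_2 = [0, 1, 2, 1, 0]  # identical to variant_1
variant_3 = [0, 2, 0, 2, 1]  # uncorrelated with variant_1

r2_linked = r_squared(variant_1, variant_2)
r2_unlinked = r_squared(variant_1, variant_3)
```

Pruning then amounts to keeping only a subset of variants such that no retained pair within a window exceeds a chosen r² threshold.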

Genetic Association and Regression#

gwas_linear_regression(ds, *, dosage, ...[, ...])

Run linear regression to identify continuous trait associations with genetic variants.

regenie(ds, *, dosage, covariates, traits[, ...])

Regenie trait transformation.

hardy_weinberg_test(ds, *[, genotype_count, ...])

Exact test for HWE as described in Wigginton et al. 2005 [1].

genee(ds, ld, *[, reg_covar])

Compute gene-ε as described in Cheng, et al. 2020 [1].
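
For the Hardy-Weinberg exact test listed above, the p-value in Wigginton et al. 2005 is the total probability of all heterozygote counts that are no more likely than the observed one, given the sample size and allele counts. The sketch below enumerates those probabilities directly with exact fractions for a biallelic variant. It is intended for small examples only, the function names are hypothetical, and it is not the optimised routine a library would use.

```python
from fractions import Fraction
from math import factorial


def het_probability(n_het: int, n_total: int, n_a: int) -> Fraction:
    """P(n_het heterozygotes | n_total individuals, n_a copies of allele A)."""
    n_b = 2 * n_total - n_a
    n_aa = (n_a - n_het) // 2
    n_bb = (n_b - n_het) // 2
    numerator = 2**n_het * factorial(n_total) * factorial(n_a) * factorial(n_b)
    denominator = (
        factorial(n_aa) * factorial(n_het) * factorial(n_bb) * factorial(2 * n_total)
    )
    return Fraction(numerator, denominator)


def hwe_exact_p(n_aa: int, n_het: int, n_bb: int) -> Fraction:
    """Exact HWE p-value from observed genotype counts."""
    n_total = n_aa + n_het + n_bb
    n_a = 2 * n_aa + n_het
    n_b = 2 * n_bb + n_het
    # Feasible heterozygote counts share the parity of the allele count.
    feasible = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probabilities = {h: het_probability(h, n_total, n_a) for h in feasible}
    observed = probabilities[n_het]
    return sum((p for p in probabilities.values() if p <= observed), Fraction(0))


# Two individuals, both homozygous for different alleles (hypothetical data).
p_no_hets = hwe_exact_p(n_aa=1, n_het=0, n_bb=1)
# Two individuals, both heterozygous.
p_all_hets = hwe_exact_p(n_aa=0, n_het=2, n_bb=0)
```

Using `Fraction` keeps the comparison `p <= observed` exact, which avoids the tie-breaking problems that floating-point rounding can introduce in this test.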

Miscellaneous#

maximal_independent_set(df)

Compute a maximal independent set of variants.

pairwise_distance(x[, metric, split_every, ...])

Calculate the pairwise distance between all pairs of row vectors in the given two-dimensional array x.

Utilities#

convert_call_to_index(ds, *[, ...])

Convert each call genotype to a single integer value.

convert_probability_to_call(ds[, ...])

Convert genotype probabilities to hard calls.

display_genotypes(ds[, max_variants, ...])

Display genotype calls.

display_pedigree(ds[, parent, graph_attrs, ...])

Display a pedigree dataset as a directed acyclic graph.

filter_partial_calls(ds, *[, call_genotype, ...])

Replace partial genotype calls with missing values.

infer_call_ploidy(ds, *[, call_genotype, ...])

Infer the ploidy of each call genotype based on the number of fill (non-allele) values in each call genotype.

infer_sample_ploidy(ds, *[, call_genotype, ...])

Infer the ploidy of each sample across all variants based on the number of fill (non-allele) values in call genotypes.

infer_variant_ploidy(ds, *[, call_genotype, ...])

Infer the ploidy at each variant across all samples based on the number of fill (non-allele) values in call genotypes.

invert_relationship_matrix(ds, *, relationship)

Calculate the inverse relationship (sub-) matrix.

parent_indices(ds, *[, sample_id, ...])

Calculate the integer indices for the parents of each sample within the samples dimension.

pedigree_sel(ds, *, samples[, ...])

Return a new dataset with each array indexed along the 'samples' dimension using a subset of samples and the optional inclusion of their relatives.

simulate_genedrop(ds, *[, method, ...])

Generate progeny genotypes via a gene-drop simulation (MacCluer et al. 1986 [1]).

simulate_genotype_call_dataset(n_variant, ...)

Simulate genotype calls and variant/sample data.

window_by_genome(ds, *[, merge])

Add a window spanning the whole genome to a dataset.

window_by_interval(ds, *[, variant_contig, ...])

Add window information to a dataset, using arbitrary intervals.

window_by_position(ds, *, size[, step, ...])

Add window information to a dataset, measured by distance along the genome.

window_by_variant(ds, *, size[, step, ...])

Add window information to a dataset, measured by number of variants.
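
The ploidy inference utilities listed above work from the number of fill (non-allele) values in each call. A minimal sketch of that idea follows, assuming calls are padded to a fixed length with a fill sentinel of -2 and that -1 marks a missing allele. Both sentinel values and the helper names are assumptions of this example. Returning -1 for variable ploidy mirrors the convention described for the ploidy variables on this page.

```python
from typing import List

FILL = -2  # assumed sentinel for fill (non-allele) positions


def call_ploidy(call: List[int]) -> int:
    """Ploidy of one call: every position that is not a fill value."""
    return sum(1 for allele in call if allele != FILL)


def consensus_ploidy(calls: List[List[int]]) -> int:
    """Shared ploidy of a set of calls, or -1 if the ploidy varies."""
    ploidies = {call_ploidy(call) for call in calls}
    return ploidies.pop() if len(ploidies) == 1 else -1


# Calls padded to length 4 so diploid and tetraploid calls can be mixed
# (hypothetical data).
diploid_sample = [[0, 1, FILL, FILL], [-1, -1, FILL, FILL]]
mixed_sample = [[0, 1, FILL, FILL], [0, 1, 1, 0]]

ploidy_diploid = consensus_ploidy(diploid_sample)
ploidy_mixed = consensus_ploidy(mixed_sample)
```

Missing alleles still count toward ploidy in this sketch because they occupy a real allele position. Only fill values are excluded.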

Variables#

By convention, variable names are singular in sgkit. For example, genotype_count, not genotype_counts.

variables.call_allele_count_spec

Allele counts.

variables.call_allele_frequency_spec

Allele frequencies.

variables.call_dosage_spec

Dosages, encoded as floats, with NaN indicating a missing value.

variables.call_dosage_mask_spec

A flag for each call indicating which values are missing.

variables.call_genotype_complete_spec

Call genotypes in which partial genotype calls are replaced with completely missing genotype calls.

variables.call_genotype_complete_mask_spec

A flag for each call indicating which values are missing.

variables.call_genotype_spec

Call genotype.

variables.call_genotype_mask_spec

A flag for each call indicating which values are missing.

variables.call_genotype_fill_spec

A flag for each allele position within mixed ploidy call genotypes indicating fill (non-allele) values of lower ploidy calls.

variables.call_genotype_phased_spec

A flag for each call indicating if it is phased or not.

variables.call_genotype_probability_spec

Genotype probabilities.

variables.call_genotype_probability_mask_spec

A flag for each call indicating which values are missing.

variables.call_genotype_index_spec

Single integer values encoding each call genotype.

variables.call_genotype_index_mask_spec

A flag for each call genotype index indicating missing indices (-1) which correspond to call genotypes with missing alleles.

variables.call_heterozygosity_spec

Observed heterozygosity of each call genotype.

variables.call_ploidy_spec

Call genotype ploidy.

variables.cohort_allele_count_spec

Cohort allele counts.

variables.cohort_allele_frequency_spec

Cohort allele frequencies.

variables.covariates_spec

Covariate variable names.

variables.interval_contig_name_spec

Interval contig name.

variables.interval_start_spec

Interval start position (inclusive).

variables.interval_stop_spec

Interval stop position (exclusive).

variables.ld_prune_index_to_drop_spec

Variant indexes to drop for LD pruning.

variables.regenie_base_prediction_spec

REGENIE's base prediction (blocks, alphas, samples, outcomes).

variables.regenie_loco_prediction_spec

REGENIE's LOCO prediction (contigs, samples, outcomes).

variables.regenie_meta_prediction_spec

REGENIE's meta prediction (samples, outcomes).

variables.parent_spec

Indices of parent samples with negative values indicating unknown parents.

variables.parent_id_spec

Unique identifiers of parent samples matching those in sgkit.variables.sample_id_spec.

variables.pc_relate_phi_spec

PC Relate kinship coefficient matrix.

variables.sample_call_rate_spec

The fraction of variants with called genotypes.

variables.sample_cohort_spec

The index of the cohort that each sample belongs to.

variables.sample_id_spec

The unique identifier of the sample.

variables.sample_n_called_spec

The number of variants with called genotypes.

variables.sample_n_het_spec

The number of variants with heterozygous calls.

variables.sample_n_hom_alt_spec

The number of variants with homozygous alternate calls.

variables.sample_n_hom_ref_spec

The number of variants with homozygous reference calls.

variables.sample_n_non_ref_spec

The number of variants that are not homozygous reference calls.

variables.sample_pca_component_spec

Principal axes defined as eigenvectors for sample covariance matrix.

variables.sample_pca_explained_variance_spec

Variance explained by each principal component.

variables.sample_pca_explained_variance_ratio_spec

Ratio of variance explained to total variance for each principal component, i.e. dask_ml.decomposition.TruncatedSVD.explained_variance_ratio_.

variables.sample_pca_loading_spec

PCA loadings defined as principal axes scaled by square root of eigenvalues.

variables.sample_pca_projection_spec

Projection of samples onto principal axes.

variables.sample_ploidy_spec

Ploidy of each sample calculated from call genotypes across all variants with -1 indicating variable ploidy.

variables.stat_divergence_spec

Genetic divergence between pairs of cohorts.

variables.stat_diversity_spec

Genetic diversity (also known as "Tajima’s pi") for cohorts.

variables.stat_Fst_spec

Fixation index (Fst) between pairs of cohorts.

variables.stat_Garud_h1_spec

Garud H1 statistic for cohorts.

variables.stat_Garud_h12_spec

Garud H12 statistic for cohorts.

variables.stat_Garud_h123_spec

Garud H123 statistic for cohorts.

variables.stat_Garud_h2_h1_spec

Garud H2/H1 statistic for cohorts.

variables.stat_genomic_kinship_spec

Genomic kinship matrix with self-kinship values on the diagonal.

variables.stat_genomic_relationship_spec

Genomic relationship matrix (GRM).

variables.stat_Hamilton_Kerr_lambda_spec

The probability that two (randomly chosen without replacement) homologues, inherited from a single parent, were derived from a single chromosomal copy within that parent.

variables.stat_Hamilton_Kerr_tau_spec

Numerical contribution of each parent, for each individual, which must sum to the ploidy of the individual.

variables.stat_hybrid_relationship_spec

The hybrid relationship matrix or "H-matrix".

variables.stat_hybrid_inverse_relationship_spec

Inverse of the hybrid relationship matrix or "H-matrix".

variables.stat_identity_by_state_spec

Pairwise IBS probabilities among all samples.

variables.stat_inverse_relationship_spec

The inverse of a relationship matrix.

variables.stat_observed_heterozygosity_spec

Observed heterozygosity for cohorts.

variables.stat_pbs_spec

Population branching statistic for cohort triples.

variables.stat_pedigree_contribution_spec

Expected proportional genomic contribution to and from each pair of samples based on pedigree structure.

variables.stat_pedigree_inbreeding_spec

Expected inbreeding coefficients of samples based on pedigree structure.

variables.stat_pedigree_inverse_kinship_spec

Inverse of a kinship matrix calculated from pedigree structure.

variables.stat_pedigree_inverse_relationship_spec

Inverse of a relationship matrix calculated from pedigree structure.

variables.stat_pedigree_kinship_spec

Pairwise estimates of expected kinship among samples based on pedigree structure with self-kinship values on the diagonal.

variables.stat_pedigree_relationship_spec

Relationship matrix derived from pedigree structure.

variables.stat_Tajimas_D_spec

Tajima’s D for cohorts.

variables.stat_Weir_Goudet_beta_spec

Pairwise Weir Goudet beta statistic among all samples.

variables.traits_spec

Trait (for example phenotype) variable names.

variables.variant_allele_spec

The possible alleles for the variant.

variables.variant_allele_count_spec

Variant allele counts.

variables.variant_allele_frequency_spec

The frequency of the occurrence of each allele.

variables.variant_allele_total_spec

The number of occurrences of all alleles.

variables.variant_genotype_count_spec

The number of observations for each possible genotype at each variant.

variables.variant_linreg_beta_spec

Beta values associated with each variant and trait.

variables.variant_call_rate_spec

The fraction of samples with called genotypes.

variables.variant_contig_spec

Index corresponding to contig name for each variant.

variables.variant_hwe_p_value_spec

P values from HWE test for each variant as float in [0, 1].

variables.variant_id_spec

The unique identifier of the variant.

variables.variant_n_called_spec

The number of samples with called genotypes.

variables.variant_n_het_spec

The number of samples with heterozygous calls.

variables.variant_n_hom_alt_spec

The number of samples with homozygous alternate calls.

variables.variant_n_hom_ref_spec

The number of samples with homozygous reference calls.

variables.variant_n_non_ref_spec

The number of samples that are not homozygous reference calls.

variables.variant_linreg_p_value_spec

P values as float in [0, 1].

variables.variant_ploidy_spec

Ploidy of each variant calculated from call genotypes across all samples with -1 indicating variable ploidy.

variables.variant_position_spec

The reference position of the variant.

variables.variant_score_spec

Scores to prioritize variant selection when constructing an LD matrix.

variables.variant_linreg_t_value_spec

T statistics for each beta.

variables.window_contig_spec

The contig index of each window.

variables.window_start_spec

The index values of window start positions along the variants dimension.

variables.window_stop_spec

The index values of window stop positions along the variants dimension.
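
The window start and stop variables hold index values along the variants dimension. The sketch below shows how such indexes could be derived for fixed-size windows counted in variants. It assumes that stop indexes are exclusive and that a trailing window may be shorter than the requested size, and the function name `windows_by_variant_count` is hypothetical. sgkit's own windowing functions may treat these edge cases differently.

```python
from typing import List, Optional, Tuple


def windows_by_variant_count(
    n_variant: int, size: int, step: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Return (starts, stops) index lists for windows of `size` variants."""
    step = size if step is None else step
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    starts = list(range(0, n_variant, step))
    # Clip the final windows so they never run past the last variant.
    stops = [min(start + size, n_variant) for start in starts]
    return starts, stops


# Ten variants in non-overlapping windows of four.
starts, stops = windows_by_variant_count(10, size=4)

# The same variants in overlapping windows that advance by two.
overlap_starts, overlap_stops = windows_by_variant_count(10, size=4, step=2)
```

With exclusive stops, the variants in window `i` are exactly those at indexes `starts[i]` up to but not including `stops[i]`, which maps directly onto Python slicing.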