API reference#

This page provides an auto-generated summary of sgkit’s API.

IO/imports and exports#

BGEN#

`bgen_to_zarr`(input, output[, region, ...])	Convert a BGEN file to a Zarr on-disk store.
`read_bgen`(path[, metafile_path, ...])	Read BGEN dataset.
`rechunk_bgen`(ds, output, *[, chunk_length, ...])	Rechunk BGEN dataset as Zarr.

PLINK#

Deprecated since version 0.11.0: Functions for reading and writing PLINK files are deprecated; please use the bio2zarr and vcztools packages instead.

`plink_to_zarr`(*[, path, bed_path, bim_path, ...])	Convert a PLINK file to a Zarr on-disk store.
`read_plink`(*[, path, bed_path, bim_path, ...])	Read PLINK dataset.
`write_plink`(ds, *[, path, bed_path, ...])	Convert a dataset to a PLINK file.
`zarr_to_plink`(input, *[, path, bed_path, ...])	Convert a Zarr on-disk store to a PLINK file.

VCF#

Functions for reading and writing VCF have been removed from sgkit; please use the bio2zarr and vcztools packages instead.

Dataset#

`load_dataset`(store[, storage_options])	Load a dataset from Zarr storage.
`save_dataset`(ds, store[, storage_options, ...])	Save a dataset to Zarr storage.

Methods#

Basic genetics statistics#

`call_allele_frequencies`(ds, *[, ...])	Compute per-sample allele frequencies from genotype calls.
`cohort_allele_frequencies`(ds, *[, ...])	Compute allele frequencies for each cohort.
`count_call_alleles`(ds, *[, call_genotype, merge])	Compute per-sample allele counts from genotype calls.
`count_cohort_alleles`(ds, *[, ...])	Compute per-cohort allele counts from per-sample allele counts or genotype calls.
`count_variant_alleles`(ds, *[, ...])	Compute per-variant allele counts from per-sample allele counts or genotype calls.
`count_variant_genotypes`(ds, *[, ...])	Count the number of calls of each possible genotype, at each variant.
`variant_stats`(ds, *[, call_genotype, ...])	Compute quality control variant statistics from genotype calls.
`sample_stats`(ds, *[, call_genotype, merge])	Compute quality control sample statistics from genotype calls.
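To make the counting functions above concrete, here is a minimal NumPy sketch of what per-call and per-variant allele counting means. It is an illustration only, not sgkit's implementation: the helper `count_call_alleles_sketch` is hypothetical, and it assumes genotype calls are held in a plain (variants, samples, ploidy) integer array in which negative values mark missing alleles.

```python
import numpy as np


def count_call_alleles_sketch(call_genotype, n_allele):
    """Count occurrences of each allele within every genotype call.

    Hypothetical helper for illustration; negative entries are assumed
    to be missing alleles and are never counted.
    """
    gt = np.asarray(call_genotype)
    counts = np.zeros(gt.shape[:2] + (n_allele,), dtype=np.int64)
    for allele in range(n_allele):
        # Sum over the ploidy axis to count copies of this allele per call.
        counts[..., allele] = (gt == allele).sum(axis=-1)
    return counts


# Two variants, three diploid samples; -1 marks a missing allele.
call_genotype = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],
        [[0, 1], [-1, -1], [1, 0]],
    ]
)

call_allele_count = count_call_alleles_sketch(call_genotype, n_allele=2)

# Summing over the samples axis gives one count vector per variant.
variant_allele_count = call_allele_count.sum(axis=1)
```

The per-call array has shape (variants, samples, alleles); summing it over samples, or over the samples in each cohort, yields the per-variant and per-cohort counts respectively.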

Population genetic statistics#

`divergence`(ds, *[, cohort_allele_count, merge])	Compute divergence between pairs of cohorts.
`diversity`(ds, *[, cohort_allele_count, merge])	Compute diversity as the average number of differences between a pair of sequences from cohort allele counts.
`Fst`(ds, *[, estimator, stat_divergence, merge])	Compute Fst between pairs of cohorts.
`Garud_H`(ds, *[, call_genotype, ...])	Compute the H1, H12, H123 and H2/H1 statistics for detecting signatures of soft sweeps, as defined in Garud et al. (2015).
`individual_heterozygosity`(ds, *[, ...])	Compute per-call individual heterozygosity.
`observed_heterozygosity`(ds, *[, ...])	Compute per-cohort observed heterozygosity.
`pbs`(ds, *[, stat_Fst, cohorts, merge])	Compute the population branching statistic (PBS) between cohort triples.
`Tajimas_D`(ds, *[, variant_allele_count, ...])	Compute Tajima’s D for a genotype call dataset.
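The description of `diversity` above ("the average number of differences between a pair of sequences") can be sketched for a single cohort as follows. This is a hedged illustration of the standard pairwise-difference estimator applied to allele counts, one value per variant; the function name `diversity_per_variant` is hypothetical, and how sgkit aggregates values over windows is not shown here.

```python
import numpy as np


def diversity_per_variant(allele_count):
    """Fraction of allele pairs that differ, computed per variant.

    Hypothetical helper: `allele_count` has shape (variants, alleles).
    Variants with fewer than two observed alleles yield NaN.
    """
    ac = np.asarray(allele_count, dtype=float)
    n = ac.sum(axis=-1)
    n_pairs = n * (n - 1) / 2                  # all unordered allele pairs
    n_same = (ac * (ac - 1) / 2).sum(axis=-1)  # pairs carrying the same allele
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(n_pairs > 0, (n_pairs - n_same) / n_pairs, np.nan)


# Allele counts for one cohort at three biallelic variants.
cohort_allele_count = np.array([[3, 3], [2, 2], [4, 0]])
pi = diversity_per_variant(cohort_allele_count)
```

A monomorphic variant (the last row) contributes zero, because every pair of sampled alleles is identical.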

Relatedness#

`Weir_Goudet_beta`(ds, *[, ...])	Estimate pairwise beta between all pairs of samples as described in Weir and Goudet 2017 [1].
`genomic_relationship`(ds, *[, call_dosage, ...])	Compute a genomic relationship matrix (AKA the GRM or G-matrix).
`pc_relate`(ds, *[, maf, call_genotype, ...])	Compute PC-Relate as described in Conomos, et al. 2016 [1].
`pedigree_contribution`(ds, *[, method, ...])	Calculate the expected genomic contribution of each sample to each other sample based on pedigree structure.
`pedigree_inbreeding`(ds, *[, method, parent, ...])	Estimate expected inbreeding coefficients from pedigree structure.
`pedigree_inverse_kinship`(ds, *[, method, ...])	Calculate the inverse of the kinship matrix from pedigree structure.
`pedigree_kinship`(ds, *[, method, parent, ...])	Estimate expected pairwise kinship coefficients from pedigree structure.
`hybrid_inverse_relationship`(ds, *, ...[, ...])	Compute the inverse of a hybrid relationship matrix (AKA the HRM or H-matrix) combining pedigree and genomic information.
`hybrid_relationship`(ds, *, genomic_relationship)	Compute a hybrid relationship matrix (AKA the HRM or H-matrix) combining pedigree and genomic information.
`identity_by_state`(ds, *[, call_genotype, ...])	Compute identity by state (IBS) probabilities between all pairs of samples.

Linkage Disequilibrium#

`ld_matrix`(ds, *[, dosage, threshold, ...])	Compute a sparse linkage disequilibrium (LD) matrix.
`ld_prune`(ds, *[, dosage, threshold, ...])	Prune variants in linkage disequilibrium (LD).

Genetic Association and Regression#

`gwas_linear_regression`(ds, *, dosage, ...[, ...])	Run linear regression to identify continuous trait associations with genetic variants.
`regenie`(ds, *, dosage, covariates, traits[, ...])	Regenie trait transformation.
`hardy_weinberg_test`(ds, *[, genotype_count, ...])	Exact test for HWE as described in Wigginton et al. 2005 [1].
`genee`(ds, ld, *[, reg_covar])	Compute gene-ε as described in Cheng, et al. 2020 [1].

Miscellaneous#

`maximal_independent_set`(df)	Compute a maximal independent set of variants.
`pairwise_distance`(x[, metric, split_every, ...])	Calculates the pairwise distance between all pairs of row vectors in the given two dimensional array x.
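As a small illustration of what "the pairwise distance between all pairs of row vectors" produces, the sketch below computes a dense Euclidean distance matrix with NumPy broadcasting. The Euclidean metric is chosen here only for illustration, the helper `pairwise_euclidean` is hypothetical, and this is not how sgkit evaluates the computation on chunked arrays.

```python
import numpy as np


def pairwise_euclidean(x):
    """Return the (n_rows, n_rows) Euclidean distance matrix of a 2-D array.

    Hypothetical helper; it materialises an (n, n, m) intermediate array,
    so it is only suitable for small inputs.
    """
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]  # every row minus every other row
    return np.sqrt((diff ** 2).sum(axis=-1))


x = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d = pairwise_euclidean(x)
```

The result is symmetric with zeros on the diagonal, and entry `d[i, j]` is the distance between rows `i` and `j` of `x`.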

Utilities#

`convert_call_to_index`(ds, *[, ...])	Convert each call genotype to a single integer value.
`convert_probability_to_call`(ds[, ...])	Convert genotype probabilities to hard calls.
`display_genotypes`(ds[, max_variants, ...])	Display genotype calls.
`display_pedigree`(ds[, parent, graph_attrs, ...])	Display a pedigree dataset as a directed acyclic graph.
`filter_partial_calls`(ds, *[, call_genotype, ...])	Replace partial genotype calls with missing values.
`infer_call_ploidy`(ds, *[, call_genotype, ...])	Infer the ploidy of each call genotype based on the number of fill (non-allele) values in each call genotype.
`infer_sample_ploidy`(ds, *[, call_genotype, ...])	Infer the ploidy of each sample across all variants based on the number of fill (non-allele) values in call genotypes.
`infer_variant_ploidy`(ds, *[, call_genotype, ...])	Infer the ploidy at each variant across all samples based on the number of fill (non-allele) values in call genotypes.
`invert_relationship_matrix`(ds, *, relationship)	Calculate the inverse relationship (sub-) matrix.
`parent_indices`(ds, *[, sample_id, ...])	Calculate the integer indices for the parents of each sample within the samples dimension.
`pedigree_sel`(ds, *, samples[, ...])	Return a new dataset with each array indexed along the 'samples' dimension using a subset of samples and the optional inclusion of their relatives.
`simulate_genedrop`(ds, *[, method, ...])	Generate progeny genotypes via a gene-drop simulation (MacCluer et al. 1986 [1]).
`simulate_genotype_call_dataset`(n_variant, ...)	Simulate genotype calls and variant/sample data.
`window_by_genome`(ds, *[, merge])	Add a window spanning the whole genome to a dataset.
`window_by_interval`(ds, *[, variant_contig, ...])	Add window information to a dataset, using arbitrary intervals.
`window_by_position`(ds, *, size[, step, ...])	Add window information to a dataset, measured by distance along the genome.
`window_by_variant`(ds, *, size[, step, ...])	Add window information to a dataset, measured by number of variants.
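Windows measured by number of variants can be pictured as pairs of start and stop indexes along the variants dimension. The following is a minimal sketch under stated assumptions, not sgkit's implementation: the helper `fixed_size_windows` is hypothetical, stop indexes are treated as exclusive, and a trailing partial window is assumed to be kept and truncated at the last variant.

```python
import numpy as np


def fixed_size_windows(n_variant, size, step=None):
    """Return (starts, stops) index arrays for windows of `size` variants.

    Hypothetical helper: `step` defaults to `size` (non-overlapping windows),
    and stop indexes are exclusive.
    """
    step = size if step is None else step
    starts = np.arange(0, n_variant, step)
    stops = np.minimum(starts + size, n_variant)  # truncate the final window
    return starts, stops


# Ten variants in non-overlapping windows of four variants each.
window_start, window_stop = fixed_size_windows(10, size=4)

# The same variants with overlapping windows that advance two at a time.
overlap_start, overlap_stop = fixed_size_windows(10, size=4, step=2)
```

Each window then covers the variants in `range(start, stop)`, so per-window statistics can be computed by slicing along the variants dimension.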

Variables#

By convention, variable names are singular in sgkit. For example, genotype_count, not genotype_counts.

`variables.call_allele_count_spec`	Allele counts.
`variables.call_allele_frequency_spec`	Allele frequencies.
`variables.call_dosage_spec`	Dosages, encoded as floats, with NaN indicating a missing value.
`variables.call_dosage_mask_spec`	A flag for each call indicating which values are missing.
`variables.call_genotype_complete_spec`	Call genotypes in which partial genotype calls are replaced with completely missing genotype calls.
`variables.call_genotype_complete_mask_spec`	A flag for each call indicating which values are missing.
`variables.call_genotype_spec`	Call genotype.
`variables.call_genotype_mask_spec`	A flag for each call indicating which values are missing.
`variables.call_genotype_fill_spec`	A flag for each allele position within mixed ploidy call genotypes indicating fill (non-allele) values of lower ploidy calls.
`variables.call_genotype_phased_spec`	A flag for each call indicating if it is phased or not.
`variables.call_genotype_probability_spec`	Genotype probabilities.
`variables.call_genotype_probability_mask_spec`	A flag for each call indicating which values are missing.
`variables.call_genotype_index_spec`	Single integer values encoding each call genotype.
`variables.call_genotype_index_mask_spec`	A flag for each call genotype index indicating missing indices (-1) which correspond to call genotypes with missing alleles.
`variables.call_heterozygosity_spec`	Observed heterozygosity of each call genotype.
`variables.call_ploidy_spec`	Call genotype ploidy.
`variables.cohort_allele_count_spec`	Cohort allele counts.
`variables.cohort_allele_frequency_spec`	Cohort allele frequencies.
`variables.covariates_spec`	Covariate variable names.
`variables.interval_contig_name_spec`	Interval contig name.
`variables.interval_start_spec`	Interval start position (inclusive).
`variables.interval_stop_spec`	Interval stop position (exclusive).
`variables.ld_prune_index_to_drop_spec`	Variant indexes to drop for LD prune.
`variables.regenie_base_prediction_spec`	REGENIE's base prediction (blocks, alphas, samples, outcomes).
`variables.regenie_loco_prediction_spec`	REGENIE's LOCO prediction (contigs, samples, outcomes).
`variables.regenie_meta_prediction_spec`	REGENIE's meta prediction (samples, outcomes).
`variables.parent_spec`	Indices of parent samples with negative values indicating unknown parents.
`variables.parent_id_spec`	Unique identifiers of parent samples matching those in `sgkit.variables.sample_id_spec`.
`variables.pc_relate_phi_spec`	PC Relate kinship coefficient matrix.
`variables.sample_call_rate_spec`	The fraction of variants with called genotypes.
`variables.sample_cohort_spec`	The index of the cohort that each sample belongs to.
`variables.sample_id_spec`	The unique identifier of the sample.
`variables.sample_n_called_spec`	The number of variants with called genotypes.
`variables.sample_n_het_spec`	The number of variants with heterozygous calls.
`variables.sample_n_hom_alt_spec`	The number of variants with homozygous alternate calls.
`variables.sample_n_hom_ref_spec`	The number of variants with homozygous reference calls.
`variables.sample_n_non_ref_spec`	The number of variants that are not homozygous reference calls.
`variables.sample_pca_component_spec`	Principal axes defined as eigenvectors for sample covariance matrix.
`variables.sample_pca_explained_variance_spec`	Variance explained by each principal component.
`variables.sample_pca_explained_variance_ratio_spec`	Ratio of variance explained to total variance for each principal component, i.e. `sgkit.stats.truncated_svd.TruncatedSVD.explained_variance_ratio_`.
`variables.sample_pca_loading_spec`	PCA loadings defined as principal axes scaled by square root of eigenvalues.
`variables.sample_pca_projection_spec`	Projection of samples onto principal axes.
`variables.sample_ploidy_spec`	Ploidy of each sample calculated from call genotypes across all variants with -1 indicating variable ploidy.
`variables.stat_divergence_spec`	Genetic divergence between pairs of cohorts.
`variables.stat_diversity_spec`	Genetic diversity (also known as "Tajima’s pi") for cohorts.
`variables.stat_Fst_spec`	Fixation index (Fst) between pairs of cohorts.
`variables.stat_Garud_h1_spec`	Garud H1 statistic for cohorts.
`variables.stat_Garud_h12_spec`	Garud H12 statistic for cohorts.
`variables.stat_Garud_h123_spec`	Garud H123 statistic for cohorts.
`variables.stat_Garud_h2_h1_spec`	Garud H2/H1 statistic for cohorts.
`variables.stat_genomic_kinship_spec`	Genomic kinship matrix with self-kinship values on the diagonal.
`variables.stat_genomic_relationship_spec`	Genomic relationship matrix (GRM).
`variables.stat_Hamilton_Kerr_lambda_spec`	The probability that two (randomly chosen without replacement) homologues, inherited from a single parent, were derived from a single chromosomal copy within that parent.
`variables.stat_Hamilton_Kerr_tau_spec`	Numerical contribution of each parent, for each individual, which must sum to the ploidy of the individual.
`variables.stat_hybrid_relationship_spec`	The hybrid relationship matrix or "H-matrix".
`variables.stat_hybrid_inverse_relationship_spec`	Inverse of the hybrid relationship matrix or "H-matrix".
`variables.stat_identity_by_state_spec`	Pairwise IBS probabilities among all samples.
`variables.stat_inverse_relationship_spec`	The inverse of a relationship matrix.
`variables.stat_observed_heterozygosity_spec`	Observed heterozygosity for cohorts.
`variables.stat_pbs_spec`	Population branching statistic for cohort triples.
`variables.stat_pedigree_contribution_spec`	Expected proportional genomic contribution to and from each pair of samples based on pedigree structure.
`variables.stat_pedigree_inbreeding_spec`	Expected inbreeding coefficients of samples based on pedigree structure.
`variables.stat_pedigree_inverse_kinship_spec`	Inverse of a kinship matrix calculated from pedigree structure.
`variables.stat_pedigree_inverse_relationship_spec`	Inverse of a relationship matrix calculated from pedigree structure.
`variables.stat_pedigree_kinship_spec`	Pairwise estimates of expected kinship among samples based on pedigree structure with self-kinship values on the diagonal.
`variables.stat_pedigree_relationship_spec`	Relationship matrix derived from pedigree structure.
`variables.stat_Tajimas_D_spec`	Tajima’s D for cohorts.
`variables.stat_Weir_Goudet_beta_spec`	Pairwise Weir Goudet beta statistic among all samples.
`variables.traits_spec`	Trait (for example phenotype) variable names.
`variables.variant_allele_spec`	The possible alleles for the variant.
`variables.variant_allele_count_spec`	Variant allele counts.
`variables.variant_allele_frequency_spec`	The frequency of the occurrence of each allele.
`variables.variant_allele_total_spec`	The number of occurrences of all alleles.
`variables.variant_genotype_count_spec`	The number of observations for each possible genotype at each variant.
`variables.variant_linreg_beta_spec`	Beta values associated with each variant and trait.
`variables.variant_call_rate_spec`	The fraction of samples with called genotypes.
`variables.variant_contig_spec`	Index corresponding to contig name for each variant.
`variables.variant_hwe_p_value_spec`	P values from HWE test for each variant as float in [0, 1].
`variables.variant_id_spec`	The unique identifier of the variant.
`variables.variant_n_called_spec`	The number of samples with called genotypes.
`variables.variant_n_het_spec`	The number of samples with heterozygous calls.
`variables.variant_n_hom_alt_spec`	The number of samples with homozygous alternate calls.
`variables.variant_n_hom_ref_spec`	The number of samples with homozygous reference calls.
`variables.variant_n_non_ref_spec`	The number of samples that are not homozygous reference calls.
`variables.variant_linreg_p_value_spec`	P values as float in [0, 1].
`variables.variant_ploidy_spec`	Ploidy of each variant calculated from call genotypes across all samples with -1 indicating variable ploidy.
`variables.variant_position_spec`	The reference position of the variant.
`variables.variant_score_spec`	Scores to prioritize variant selection when constructing an LD matrix.
`variables.variant_linreg_t_value_spec`	T statistics for each beta.
`variables.window_contig_spec`	The contig index of each window.
`variables.window_start_spec`	The index values of window start positions along the `variants` dimension.
`variables.window_stop_spec`	The index values of window stop positions along the `variants` dimension.
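Several of the variables above are related by simple reductions, for example a genotype mask, the number of called genotypes, and the call rate. The sketch below shows one way these could be derived with plain NumPy arrays. The array names mirror the variable names in the list, but this is an illustration, not sgkit's code, and it assumes that negative values mark missing alleles and that a call with any missing allele counts as not called.

```python
import numpy as np

# Two variants, three diploid samples; -1 marks a missing allele.
call_genotype = np.array(
    [
        [[0, 1], [-1, -1], [1, 1]],
        [[0, 0], [0, -1], [-1, -1]],
    ]
)

# One flag per allele position: True where the value is missing.
call_genotype_mask = call_genotype < 0

# Assumption: a call is "called" only if none of its alleles is missing.
call_is_called = ~call_genotype_mask.any(axis=-1)

n_variant, n_sample = call_is_called.shape

# Per-variant summaries reduce over the samples axis.
variant_n_called = call_is_called.sum(axis=1)
variant_call_rate = variant_n_called / n_sample

# Per-sample summaries reduce over the variants axis.
sample_n_called = call_is_called.sum(axis=0)
sample_call_rate = sample_n_called / n_variant
```

Note that the singular naming convention carries through: `variant_n_called` holds one count per variant, and `sample_call_rate` holds one fraction per sample.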