Changelog#

0.11.0 (unreleased)#

Deprecations#

Functions for reading and writing PLINK files are deprecated; please use the bio2zarr and vcztools packages instead. (tomwhite, PR 1324, GH 1318)
Using rechunker for rechunking in sgkit is deprecated; please use the Xarray chunk function directly. (tomwhite, PR 1325)
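
As a rough illustration of the rechunking entry above, the sketch below rechunks a dataset with Xarray alone. The toy dataset, its dimension names and the chunk size are assumptions made for the example, and `Dataset.chunk` needs Dask to be installed.

```python
import numpy as np
import xarray as xr

# Toy stand-in for a genotype dataset; shape and names are illustrative only.
ds = xr.Dataset(
    {
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.zeros((5, 4, 2), dtype="i1"),
        )
    }
)

# Rechunk along the variants dimension; other dimensions stay in one chunk.
ds_rechunked = ds.chunk({"variants": 2})

# Inspect the resulting block lengths along the variants dimension.
print(ds_rechunked.chunks["variants"])
```

The chunk size that works well depends on the data and the downstream computation, so the value used here is only a placeholder.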

0.10.0 (7 April 2025)#

New Features#

Add ‘matching’ method to identity_by_state() function. (timothymillar, PR 1229, GH 1227)
Add ‘chunks’ option to pedigree_kinship() function. (timothymillar, PR 1282, GH 1280)

Breaking changes#

Functions for reading and writing VCF were removed from sgkit; please use the bio2zarr and vcztools packages instead. (tomwhite, PR 1264)
Drop support for Python 3.9. (tomwhite, PR 1276)

Improvements#

Add support for NumPy 2. (tomwhite, PR 1297)
Run tests against Zarr Python 3. (tomwhite, PR 1285, GH 1226)
Switch to pyproject.toml. (ianhi, PR 1300)

Documentation#

Update GWAS tutorial to use vcf2zarr. (tomwhite, PR 1258, GH 1257)

0.9.0 (24 June 2024)#

Deprecations#

Deprecate functions for reading VCF, in favour of bio2zarr. (tomwhite, PR 1233, GH 1232)

Improvements#

Pin to Zarr < 3. (tomwhite, PR 1230)
Pin numpy and cbgen to numpy < 2. (timothymillar, PR 1228, GH 1225)

0.8.0 (1 May 2024)#

New Features#

Add pedigree_contribution() function. (timothymillar, PR 1093, GH 963)
Add pedigree_sel() function. (timothymillar, PR 1087, GH 1033)
Add ‘Endelman-Jannink’ estimator to genomic_relationship() function. (timothymillar, PR 1100, GH 1062)
Add display_pedigree() function. (timothymillar, PR 1104, GH 1097)
Add option to count variant alleles directly from call genotypes in function count_variant_alleles(). (timothymillar, PR 1119, GH 1116)
Add simulate_genedrop() function. (timothymillar, PR 1139, GH 1107)

Improvements#

Improve performance of variant_stats() and sample_stats() functions. (timothymillar, PR 1119, GH 1116)

Bug fixes#

Fix error in missing data handling for VCF. Missing values for most fields were marked as the corresponding “fill” value. For example, missing string values were stored as the empty string (string fill value) rather than “.” (string missing value). Similarly for integer fields, missing values were stored as -2 (int fill) rather than -1 (int missing). (jeromekelleher, PR 1190, GH 1192)
Change max_str_len utility function to return a Python int rather than an array scalar. (tomwhite, PR 1086, GH 1085)
Turn off numba cache by default. (jeromekelleher, PR 1191, GH 1156)
Fix xarray repr in doctests. (tomwhite, PR 1205, GH 1204)
Avoid disabling jit on guvectorised functions. (timothymillar, PR 1199, GH 1194)
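
To make the missing-data fix above concrete, the sketch below tells the two sentinels apart in plain NumPy. The constant names are hypothetical; only the values (-1 and -2 for integers, "." and the empty string for strings) come from the entry above. Treating the fill value as padding for rows with fewer values is an assumption about the convention.

```python
import numpy as np

# Hypothetical names; the values follow the changelog entry above.
INT_MISSING, INT_FILL = -1, -2
STR_MISSING, STR_FILL = ".", ""

# One integer field with up to three values per variant.
# Row 0: two real values, then one fill value (assumed to act as padding).
# Row 1: the field is missing, followed by fill values.
field = np.array([[7, 3, INT_FILL], [INT_MISSING, INT_FILL, INT_FILL]])

is_missing = field == INT_MISSING  # value recorded as missing
is_fill = field == INT_FILL        # slot holds no value at all

# The same distinction for a string field.
names = np.array(["rs1", STR_MISSING, STR_FILL], dtype=object)
str_missing = names == STR_MISSING
str_fill = names == STR_FILL
```

Before the fix, the missing entry in row 1 would have been stored as the fill value, so the two masks could not distinguish the cases.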

Documentation#

Add tutorial on relationship matrices. (timothymillar, PR 1072, GH 956)
Add pairwise_distance to docs. (benjeffery, PR 1145, GH 1142)
Format matrix in pairwise docs. (benjeffery, PR 1148, GH 1147)

0.7.0 (2 May 2023)#

New Features#

Add sgkit.io.plink.write_plink() function. (tomwhite, PR 1003, GH 926)
Add phased option to simulate_genotype_call_dataset() function. (tomwhite, PR 1022, GH 973)
Add sgkit.io.plink.plink_to_zarr() and sgkit.io.plink.zarr_to_plink() convenience functions. (tomwhite, PR 1047, GH 1004)
Add sgkit.convert_call_to_index() method. (timothymillar, PR 1050, GH 1048)
Add read_chunk_length option to sgkit.io.vcf.vcf_to_zarr and sgkit.io.vcf.vcf_to_zarrs functions. This option is useful for reducing memory usage with large sample counts or a large chunk_length. (benjeffery, PR 1044, GH 1042)
Add retain_temp_files to sgkit.io.vcf.vcf_to_zarr function. (benjeffery, PR 1046, GH 1036)
Add sgkit.io.vcf.read_vcf convenience function. (tomwhite, PR 1052, GH 1004)
Add sgkit.hybrid_relationship(), sgkit.hybrid_inverse_relationship() and invert_relationship_matrix() methods. (timothymillar, PR 1053, GH 993)
Add sgkit.io.vcf.zarr_array_sizes for determining array sizes for storage in Zarr. (tomwhite, PR 1073, GH 734)
Add skipna option to genomic_relationship() function. (timothymillar, PR 1078, GH 1076)
Add additional_variant_fields to sgkit.simulate_genotype_call_dataset() function. (benjeffery, PR 1056)

Breaking changes#

Generate VCF header by default when writing VCF using sgkit.io.vcf.write_vcf or sgkit.io.vcf.zarr_to_vcf. Previously, the dataset had to contain a vcf_header attribute. (tomwhite, PR 1021, GH 1020)

Deprecations#

Store contig names in a contig_id array, and filters in a filter_id array. This change duplicates the information stored in dataset attributes, which should now be considered deprecated, and will be removed in a future release. (tomwhite, PR 1054, GH 1035)
Parameter founder_indices in sgkit.pedigree_kinship() is now considered deprecated, and will be removed in a future release. Instead, the founder_kinship parameter now expects a full sized kinship matrix in which non-founder values are ignored. (timothymillar, PR 1075, GH 1061)

Improvements#

Improve performance scaling of method sgkit.identity_by_state() with number of samples. (timothymillar, PR 1028, GH 1026)
Add skipna option to method sgkit.identity_by_state(). (timothymillar, PR 1028, GH 1027)
Importing sgkit is now much faster due to deferred numba compilation. (tomwhite, PR 1039, GH 939)

Bug fixes#

vcf_to_zarrs was outputting Zarr files with temp_chunk_length sized chunks. This has now been fixed so that the chunks are chunk_length sized. (benjeffery, PR 1044)
vcf_to_zarrs failed with ZeroDivisionError when VCFs with no samples were converted. (benjeffery, PR 1069, GH 1068)
Correct formatting of mixed-ploidy data in sgkit.display_genotypes(). (timothymillar, PR 1030, GH 571)

0.6.0 (1 February 2023)#

New Features#

Add support for Python 3.10. (tomwhite, PR 813, GH 801)
Add pedigree support. This allows parent-child relationships to be stored in sgkit, and provides a number of new pedigree methods: pedigree_inbreeding(), pedigree_inverse_kinship(), and pedigree_kinship(). (timothymillar, GH 786)
Implement a function to calculate the VanRaden genomic relationship matrix, genomic_relationship(). (timothymillar, PR 903, GH 874)
Generic functions for cohort sums and means. (timothymillar, PR 867, GH 730)
Toggle numba caching by environment variable SGKIT_DISABLE_NUMBA_CACHE. (timothymillar, PR 870, GH 869)
Add window_by_genome() for computing whole-genome statistics. (tomwhite, PR 945, GH 893)
Add window_by_interval() to create windows from arbitrary intervals. (tomwhite, PR 974)
Add contig_lengths dataset attribute if found in the VCF file. (tomwhite, PR 946, GH 464)
Add VCF export functions. (tomwhite, PR 953, GH 924)
Add auto_rechunk option to sgkit.save_dataset to automatically rechunk the dataset before saving it to disk, if necessary, as zarr requires equal chunk sizes. (benjeffery, PR 988, GH 981)
Implement gene-ε for gene set association analysis. (tomwhite, PR 975, GH 692)
Add count_variant_genotypes() to count the occurrence of each possible genotype. (timothymillar, GH 911, PR 1002)
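
For readers unfamiliar with the VanRaden genomic relationship matrix mentioned above, here is a minimal NumPy sketch of the textbook first-method estimator for diploid dosages. It is not sgkit's implementation of genomic_relationship(); the function name, argument names and toy data are hypothetical.

```python
import numpy as np


def vanraden_grm(dosage, allele_freq):
    """Textbook VanRaden (method 1) estimator for diploids.

    dosage: (variants, samples) alternate-allele counts in {0, 1, 2}.
    allele_freq: (variants,) alternate-allele frequencies.
    """
    dosage = np.asarray(dosage, dtype=float)
    p = np.asarray(allele_freq, dtype=float)
    centred = dosage - 2.0 * p[:, None]   # subtract the expected dosage per variant
    scale = 2.0 * np.sum(p * (1.0 - p))   # expected dosage variance summed over variants
    return centred.T @ centred / scale    # (samples, samples) matrix


dosage = np.array([[0, 1, 2], [2, 1, 0], [1, 1, 1]])
freq = dosage.mean(axis=1) / 2.0          # sample frequencies used as a stand-in
G = vanraden_grm(dosage, freq)
```

The sketch assumes complete data with no missing calls, and it uses frequencies estimated from the sample itself purely to keep the example short.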

Breaking changes#

Remove support for Python 3.7. (tomwhite, PR 927, GH 802)
The count_a1 parameter to sgkit.io.plink.read_plink() previously defaulted to True but now defaults to False. Furthermore, True is no longer supported since it is not clear how it should behave. (tomwhite, PR 952, GH 947)
The dosage variable specification has been removed and all references to it have been replaced with sgkit.variables.call_dosage_spec which has been generalized to include integer encodings. Additionally, the default value for the dosage parameter in ld_matrix() and ld_prune() has been changed from 'dosage' to 'call_dosage'. (timothymillar, PR 995, GH 875)
The genotype_count variable has been removed in favour of sgkit.variables.variant_genotype_count_spec which follows VCF ordering (i.e., homozygous reference, heterozygous, homozygous alternate for biallelic, diploid genotypes). hardy_weinberg_test() now defaults to using sgkit.variables.variant_genotype_count_spec for the genotype_count parameter. (timothymillar, GH 911, PR 1002)
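
The VCF ordering described in the last entry above can be illustrated with a small NumPy sketch for biallelic, diploid calls. This is not how sgkit computes the counts; the array names mirror sgkit conventions but the data and code are made up for the example, and missing calls are assumed to be absent.

```python
import numpy as np

# Toy diploid, biallelic calls with shape (variants, samples, ploidy).
call_genotype = np.array(
    [
        [[0, 0], [0, 1], [1, 1], [1, 0]],
        [[1, 1], [1, 1], [0, 0], [0, 1]],
    ]
)

# For biallelic diploids the VCF-order index equals the alternate-allele count:
# 0 -> homozygous reference, 1 -> heterozygous, 2 -> homozygous alternate.
alt_count = call_genotype.sum(axis=-1)

# Count samples per genotype class for each variant: shape (variants, 3).
genotype_count = np.stack(
    [(alt_count == k).sum(axis=1) for k in range(3)], axis=1
)
print(genotype_count)
```

Each row holds the counts for one variant in the order homozygous reference, heterozygous, homozygous alternate.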

Improvements#

Improvements to VCF parsing performance. (benjeffery, PR 933)
Improve default VCF compression. (tomwhite, PR 937, GH 925)
Ensure chunking is not excessive in samples dimension. (tomwhite, PR 943)
Add asv benchmarks for VCF performance. (tomwhite, PR 976)
Add asv benchmarks for VCF compression size. (tomwhite, PR 978)

Bug fixes#

Allow chunking in the samples dimension for identity_by_state(). (timothymillar, PR 837, GH 836)
Remove VLenUTF8 from filters to avoid double encoding error. (tomwhite, PR 852, GH 785)
Fix numpy input for Weir_Goudet_beta. (timothymillar, PR 865, GH 861)
Fix get_region_start to work with contig names that have colons and dashes. (d-laub, PR 883, GH 882)
Fixes to VCF reading and writing found by hypothesis testing. (tomwhite, PR 972)
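
The get_region_start fix above concerns region strings whose contig names themselves contain colons and dashes. The sketch below shows one way such strings can be parsed; it is a hypothetical helper, not sgkit's implementation, and the default start of 1 for a bare contig name is an assumption.

```python
import re

# Hypothetical helper, not sgkit's implementation. The optional ":start-end"
# suffix is anchored at the end of the string, so colons and dashes that
# appear earlier are kept as part of the contig name.
_REGION = re.compile(r"^(?P<contig>.+?)(?::(?P<start>\d+)(?:-(?P<end>\d+))?)?$")


def region_start(region: str) -> int:
    """Return the start of 'contig', 'contig:start' or 'contig:start-end'."""
    match = _REGION.match(region)
    if match is None:
        raise ValueError(f"cannot parse region: {region!r}")
    start = match.group("start")
    # Assumption: a region without an explicit start begins at position 1.
    return int(start) if start is not None else 1


starts = [
    region_start("chr1"),
    region_start("chr1:500-1000"),
    region_start("HLA-DRB1*15:01:01:02:500-1000"),
]
```

A bare contig name that ends in a colon followed by digits remains ambiguous under this approach; resolving it would require checking against the list of known contig names.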
