Changelog#
0.10.0 (unreleased)#
New Features#
Add ‘matching’ method to identity_by_state() function. (timothymillar, PR 1229, GH 1227)
Breaking changes#
Documentation#
0.9.0 (24 June 2024)#
Deprecations#
Improvements#
Pin numpy and cbgen to numpy < 2. (timothymillar, PR 1228, GH 1225)
0.8.0 (1 May 2024)#
New Features#
Add pedigree_contribution() function. (timothymillar, PR 1093, GH 963)
Add pedigree_sel() function. (timothymillar, PR 1087, GH 1033)
Add ‘Endelman-Jannink’ estimator to genomic_relationship() function. (timothymillar, PR 1100, GH 1062)
Add display_pedigree() function. (timothymillar, PR 1104, GH 1097)
Add option to count variant alleles directly from call genotypes in function count_variant_alleles(). (timothymillar, PR 1119, GH 1116)
Add simulate_genedrop() function. (timothymillar, PR 1139, GH 1107)
Improvements#
Improve performance of variant_stats() and sample_stats() functions. (timothymillar, PR 1119, GH 1116)
Bug fixes#
Fix error in missing data handling for VCF. Missing values for most fields were marked as the corresponding “fill” value. For example, missing string values were stored as the empty string (string fill value) rather than “.” (string missing value). Similarly for integer fields, missing values were stored as -2 (int fill) rather than -1 (int missing). (jeromekelleher, PR 1190, GH 1192)
Change max_str_len utility function to return a Python int rather than an array scalar. (tomwhite, PR 1086, GH 1085)
Turn off numba cache by default. (jeromekelleher, PR 1191, GH 1156)
Avoid disabling jit on guvectorised functions. (timothymillar, PR 1199, GH 1194)
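The VCF missing-data fix above distinguishes two sentinel values per type. The following is a minimal sketch of that distinction for integer fields, using plain NumPy rather than sgkit itself. The constant names INT_MISSING and INT_FILL and the array contents are made up for illustration, and the reading of "fill" as padding for records with fewer values is an assumption.

```python
import numpy as np

# Sentinel values as described in the changelog entry; names are illustrative.
INT_MISSING = -1  # value recorded as missing (".") in the VCF
INT_FILL = -2     # assumed: padding where a record has fewer values than the array width

# Hypothetical integer field with up to three values per variant.
values = np.array(
    [
        [10, 20, 30],
        [5, INT_MISSING, INT_FILL],
        [INT_MISSING, INT_FILL, INT_FILL],
    ]
)

is_missing = values == INT_MISSING
is_fill = values == INT_FILL

n_missing = int(is_missing.sum())
n_fill = int(is_fill.sum())

# Mask both sentinels before computing statistics on observed values only.
masked = np.ma.masked_where(is_missing | is_fill, values)
mean_observed = float(masked.mean())
```

Keeping the two sentinels separate lets downstream code tell "the caller reported no value" apart from "there was no slot for a value".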
Documentation#
Add tutorial on relationship matrices. (timothymillar, PR 1072, GH 956)
Add pairwise_distance to docs. (benjeffery, PR 1145, GH 1142)
Format matrix in pairwise docs. (benjeffery, PR 1148, GH 1147)
0.7.0 (2 May 2023)#
New Features#
Add sgkit.io.plink.write_plink() function. (tomwhite, PR 1003, GH 926)
Add phased option to simulate_genotype_call_dataset() function. (tomwhite, PR 1022, GH 973)
Add sgkit.io.plink.plink_to_zarr() and sgkit.io.plink.zarr_to_plink() convenience functions. (tomwhite, PR 1047, GH 1004)
Add sgkit.convert_call_to_index() method. (timothymillar, PR 1050, GH 1048)
Add read_chunk_length option to sgkit.io.vcf.vcf_to_zarr and sgkit.io.vcf.vcf_to_zarrs functions. This option is useful for reducing memory usage with large sample counts or a large chunk_length. (benjeffery, PR 1044, GH 1042)
Add retain_temp_files to sgkit.io.vcf.vcf_to_zarr function. (benjeffery, PR 1046, GH 1036)
Add sgkit.io.vcf.read_vcf convenience function. (tomwhite, PR 1052, GH 1004)
Add sgkit.hybrid_relationship(), sgkit.hybrid_inverse_relationship() and invert_relationship_matrix() methods. (timothymillar, PR 1053, GH 993)
Add sgkit.io.vcf.zarr_array_sizes for determining array sizes for storage in Zarr. (tomwhite, PR 1073, GH 734)
Add skipna option to genomic_relationship() function. (timothymillar, PR 1078, GH 1076)
Add additional_variant_fields to sgkit.simulate_genotype_call_dataset() function. (benjeffery, PR 1056)
Breaking changes#
Deprecations#
Store contig names in a contig_id array, and filters in a filter_id array. This change duplicates the information stored in dataset attributes, which should now be considered deprecated, and will be removed in a future release. (tomwhite, PR 1054, GH 1035)
Parameter founder_indices in sgkit.pedigree_kinship() is now considered deprecated, and will be removed in a future release. Instead, the founder_kinship parameter now expects a full sized kinship matrix in which non-founder values are ignored. (timothymillar, PR 1075, GH 1061)
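For code that previously passed a founders-only kinship matrix together with founder_indices, the deprecation above implies building a full-sized matrix instead. The following is a minimal NumPy sketch of that conversion. The helper name expand_founder_kinship is hypothetical and not part of sgkit, and filling non-founder entries with NaN is an assumption based only on the statement that those values are ignored.

```python
import numpy as np


def expand_founder_kinship(founder_kinship, founder_indices, n_samples):
    """Place a founders-only kinship matrix into an n_samples x n_samples matrix.

    Hypothetical helper for illustration; non-founder entries are set to NaN.
    """
    founder_indices = np.asarray(founder_indices)
    full = np.full((n_samples, n_samples), np.nan)
    # np.ix_ selects the founder rows and columns as a block.
    full[np.ix_(founder_indices, founder_indices)] = founder_kinship
    return full


# Two founders (samples 0 and 2) in a pedigree of four samples.
small = np.array([[0.5, 0.1], [0.1, 0.5]])
full = expand_founder_kinship(small, founder_indices=[0, 2], n_samples=4)
```

The resulting matrix has the same shape as the pedigree's sample dimension, so it needs no separate index array to say which rows are founders.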
Improvements#
Improve performance scaling of method sgkit.identity_by_state() with number of samples. (timothymillar, PR 1028, GH 1026)
Add skipna option to method sgkit.identity_by_state(). (timothymillar, PR 1028, GH 1027)
Importing sgkit is now much faster due to deferred numba compilation. (tomwhite, PR 1039, GH 939)
Bug fixes#
vcf_to_zarrs was outputting zarr files with temp_chunk_length sized chunks. This has now been fixed so that the chunks are chunk_length sized. (benjeffery, PR 1044)
vcf_to_zarrs failed with ZeroDivisionError when VCFs with no samples were converted. (benjeffery, PR 1069, GH 1068)
Correct formatting of mixed-ploidy data in sgkit.display_genotypes(). (timothymillar, PR 1030, GH 571)
0.6.0 (1 February 2023)#
New Features#
Add pedigree support. This allows parent-child relationships to be stored in sgkit, and provides a number of new pedigree methods: pedigree_inbreeding(), pedigree_inverse_kinship(), and pedigree_kinship(). (timothymillar, GH 786)
Implement a function to calculate the VanRaden genomic relationship matrix, genomic_relationship(). (timothymillar, PR 903, GH 874)
Generic functions for cohort sums and means. (timothymillar, PR 867, GH 730)
Toggle numba caching by environment variable SGKIT_DISABLE_NUMBA_CACHE. (timothymillar, PR 870, GH 869)
Add window_by_genome() for computing whole-genome statistics. (tomwhite, PR 945, GH 893)
Add window_by_interval() to create windows from arbitrary intervals. (tomwhite, PR 974)
Add contig_lengths dataset attribute if found in the VCF file. (tomwhite, PR 946, GH 464)
Add auto_rechunk option to sgkit.save_dataset to automatically rechunk the dataset before saving it to disk, if necessary, as zarr requires equal chunk sizes. (benjeffery, PR 988, GH 981)
Implement gene-ε for gene set association analysis. (tomwhite, PR 975, GH 692)
Add count_variant_genotypes() to count the occurrence of each possible genotype. (timothymillar, GH 911, PR 1002)
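The auto_rechunk entry above exists because zarr stores arrays in regular chunks. The following is a rough sketch of the kind of check involved, in plain Python. The helper name has_regular_chunks is hypothetical and this is not sgkit's implementation. It treats a dimension as regular when every chunk has the same size, except possibly a smaller final chunk.

```python
def has_regular_chunks(chunks):
    """Return True if chunk sizes along one dimension are uniform.

    A smaller final chunk is allowed. Hypothetical helper for illustration.
    """
    if len(chunks) <= 1:
        return True
    first = chunks[0]
    # All chunks before the last must match the first chunk exactly.
    body_ok = all(size == first for size in chunks[:-1])
    return body_ok and 0 < chunks[-1] <= first


# Chunk sizes along one dimension, in the tuple form dask uses.
regular = has_regular_chunks((100, 100, 50))
irregular = has_regular_chunks((100, 50, 100))
```

A dataset that fails such a check along any dimension would need rechunking before it could be written, which is the step auto_rechunk is described as automating.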
Breaking changes#
The count_a1 parameter to sgkit.io.plink.read_plink() previously defaulted to True but now defaults to False. Furthermore, True is no longer supported since it is not clear how it should behave. (tomwhite, PR 952, GH 947)
The dosage variable specification has been removed and all references to it have been replaced with sgkit.variables.call_dosage_spec which has been generalized to include integer encodings. Additionally, the default value for the dosage parameter in ld_matrix() and ld_prune() has been changed from 'dosage' to 'call_dosage'. (timothymillar, PR 995, GH 875)
The genotype_count variable has been removed in favour of sgkit.variables.variant_genotype_count_spec which follows VCF ordering (i.e., homozygous reference, heterozygous, homozygous alternate for biallelic, diploid genotypes). hardy_weinberg_test() now defaults to using sgkit.variables.variant_genotype_count_spec for the genotype_count parameter. (timothymillar, GH 911, PR 1002)
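The VCF ordering mentioned above can be illustrated for biallelic, diploid calls with plain NumPy. This is a sketch, not the sgkit implementation. The function name count_genotypes_vcf_order is hypothetical, calls are assumed to be encoded as allele indices with shape (variants, samples, 2), and any call containing a negative allele is treated as missing and skipped.

```python
import numpy as np


def count_genotypes_vcf_order(calls):
    """Count biallelic diploid genotypes per variant in VCF order.

    Columns of the result: homozygous reference, heterozygous, homozygous alternate.
    Hypothetical helper for illustration.
    """
    calls = np.asarray(calls)
    lo = calls.min(axis=-1)
    hi = calls.max(axis=-1)
    valid = lo >= 0  # assumed: negative alleles mark missing calls
    # VCF genotype index for an unphased diploid call a/b with a <= b.
    index = hi * (hi + 1) // 2 + lo
    n_genotypes = 3  # 0/0, 0/1, 1/1
    counts = np.zeros((calls.shape[0], n_genotypes), dtype=np.int64)
    for g in range(n_genotypes):
        counts[:, g] = ((index == g) & valid).sum(axis=1)
    return counts


calls = np.array(
    [
        [[0, 0], [0, 1], [1, 0], [1, 1]],
        [[0, 0], [0, 0], [-1, -1], [1, 1]],
    ]
)
counts = count_genotypes_vcf_order(calls)
```

Code that indexed the old genotype_count variable by position should check which column it expects to hold heterozygous counts, since in VCF order that is the middle column.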
Improvements#
Bug fixes#
Allow chunking in the samples dimension for identity_by_state(). (timothymillar, PR 837, GH 836)
Remove VLenUTF8 from filters to avoid double encoding error. (tomwhite, PR 852, GH 785)
Fix numpy input for Weir_Goudet_beta. (timothymillar, PR 865, GH 861)
Fix get_region_start to work with contig names that have colons and dashes. (d-laub, PR 883, GH 882)
Fixes to VCF reading and writing found by hypothesis testing. (tomwhite, PR 972)