.. currentmodule:: sgkit

Changelog
=========

.. _changelog.0.7.0:

0.7.0 (unreleased)
-----------------------

New Features
~~~~~~~~~~~~

- Add :func:`sgkit.io.plink.write_plink` function.
  (:user:`tomwhite`, :pr:`1003`, :issue:`926`)

- Add ``phased`` option to :func:`simulate_genotype_call_dataset` function.
  (:user:`tomwhite`, :pr:`1022`, :issue:`973`)

- Add :func:`sgkit.io.plink.plink_to_zarr` and
  :func:`sgkit.io.plink.zarr_to_plink` convenience functions
  (:user:`tomwhite`, :pr:`1047`, :issue:`1004`)

- Add :func:`sgkit.convert_call_to_index` method.
  (:user:`timothymillar`, :pr:`1050`, :issue:`1048`)

- Add ``read_chunk_length`` option to :func:`sgkit.io.vcf.vcf_to_zarr` and
  :func:`sgkit.io.vcf.vcf_to_zarrs` functions. These are useful to reduce memory usage
  with large sample counts or a large ``chunk_length``.
  (:user:`benjeffery`, :pr:`1044`, :issue:`1042`)

- Add ``retain_temp_files`` to :func:`sgkit.io.vcf.vcf_to_zarr` function.
  (:user:`benjeffery`, :pr:`1046`, :issue:`1036`)

- Add :func:`sgkit.io.vcf.read_vcf` convenience function.
  (:user:`tomwhite`, :pr:`1052`, :issue:`1004`)

- Add :func:`sgkit.hybrid_relationship`, :func:`sgkit.hybrid_inverse_relationship`
  and :func:`invert_relationship_matrix` methods.
  (:user:`timothymillar`, :pr:`1053`, :issue:`993`)

- Add :func:`sgkit.io.vcf.zarr_array_sizes` for determining array sizes for storage in Zarr.
  (:user:`tomwhite`, :pr:`1073`, :issue:`734`)

- Add ``skipna`` option to :func:`genomic_relationship` function.
  (:user:`timothymillar`, :pr:`1078`, :issue:`1076`)

- Add `additional_variant_fields` to :func:`sgkit.simulate_genotype_call_dataset` function.
  (:user:`benjeffery`, :pr:`1056`)

Bug fixes
~~~~~~~~~

- ``vcf_to_zarrs`` was outputting zarr files with ``temp_chunk_length`` sized chunks. This has now been fixed
  so that the chunks are ``chunk_length`` sized.
  (:user:`benjeffery`, :pr:`1044`,)

- ``vcf_to_zarrs`` failed with ``ZeroDivisionError`` when VCFs with no samples were converted.
  (:user:`benjeffery`, :pr:`1069`, :issue:`1068`)

Breaking changes
~~~~~~~~~~~~~~~~

- Generate VCF header by default when writing VCF using :func:`sgkit.io.vcf.write_vcf` or :func:`sgkit.io.vcf.zarr_to_vcf`.
  Previously, the dataset had to contain a ``vcf_header`` attribute.
  (:user:`tomwhite`, :pr:`1021`, :issue:`1020`)

Deprecations
~~~~~~~~~~~~

- Store contig names in a ``contig_id`` array, and filters in a ``filter_id`` array.
  This change duplicates the information stored in dataset attributes, which should
  now be considered deprecated, and will be removed in a future release.
  (:user:`tomwhite`, :pr:`1054`, :issue:`1035`)
- Parameter ``founder_indices`` in :func:`sgkit.pedigree_kinship` is now considered
  deprecated, and will be removed in a future release. Instead, the ``founder_kinship``
  parameter now expects a full sized kinship matrix in which non-founder values are
  ignored.
  (:user:`timothymillar`, :pr:`1075`, :issue:`1061`)
  
Improvements
~~~~~~~~~~~~

- Improve performance scaling of method :func:`sgkit.identity_by_state`
  with number of samples.
  (:user:`timothymillar`, :pr:`1028`, :issue:`1026`)
- Add ``skipna`` option to method :func:`sgkit.identity_by_state`.
  (:user:`timothymillar`, :pr:`1028`, :issue:`1027`)
- Importing ``sgkit`` is now much faster due to deferred numba compilation.
  (:user:`tomwhite`, :pr:`1039`, :issue:`939`)

Bug fixes
~~~~~~~~~

- Correct formatting of mixed-ploidy data in :func:`sgkit.display_genotypes`.
  (:user:`timothymillar`, :pr:`1030`, :issue:`571`)

.. Documentation
.. ~~~~~~~~~~~~~

.. _changelog.0.6.0:

0.6.0 (1 February 2023)
-----------------------

New Features
~~~~~~~~~~~~

- Add support for Python 3.10.
  (:user:`tomwhite`, :pr:`813`, :issue:`801`)
- Add pedigree support. This allows parent-child relationships to be
  stored in sgkit, and provides a number of new pedigree methods:
  :func:`pedigree_inbreeding`, :func:`pedigree_inverse_kinship`,
  and :func:`pedigree_kinship`.
  (:user:`timothymillar`, :issue:`786`)
- Implement a function to calculate the VanRaden genomic relationship matrix,
  :func:`genomic_relationship`.
  (:user:`timothymillar`, :pr:`903`, :issue:`874`)
- Generic functions for cohort sums and means.
  (:user:`timothymillar`, :pr:`867`, :issue:`730`)
- Toggle numba caching by environment variable ``SGKIT_DISABLE_NUMBA_CACHE``.
  (:user:`timothymillar`, :pr:`870`, :issue:`869`)
- Add :func:`window_by_genome` for computing whole-genome statistics.
  (:user:`tomwhite`, :pr:`945`, :issue:`893`)
- Add :func:`window_by_interval` to create windows from arbitrary intervals.
  (:user:`tomwhite`, :pr:`974`)
- Add ``contig_lengths`` dataset attribute if found in the VCF file.
  (:user:`tomwhite`, :pr:`946`, :issue:`464`)
- Add VCF export functions.
  (:user:`tomwhite`, :pr:`953`, :issue:`924`)
- Add ``auto_rechunk`` option to ``sgkit.save_dataset`` to automatically rechunk
  the dataset before saving it to disk, if necessary, as zarr requires equal chunk
  sizes. (:user:`benjeffery`, :pr:`988`, :issue:`981`)
- Implement gene-ε for gene set association analysis.
  (:user:`tomwhite`, :pr:`975`, :issue:`692`)
- Add :func:`count_variant_genotypes` to count the occurrence of each possible
  genotype.
  (:user:`timothymillar`, :issue:`911`, :pr:`1002`)

Breaking changes
~~~~~~~~~~~~~~~~

- Remove support for Python 3.7.
  (:user:`tomwhite`, :pr:`927`, :issue:`802`)
- The ``count_a1`` parameter to :func:`sgkit.io.plink.read_plink` previously
  defaulted to ``True`` but now defaults to ``False``. Furthermore, ``True``
  is no longer supported since it is not clear how it should behave.
  (:user:`tomwhite`, :pr:`952`, :issue:`947`)
- The ``dosage`` variable specification has been removed and all references
  to it have been replaced with :data:`sgkit.variables.call_dosage_spec`
  which has been generalized to include integer encodings. Additionally,
  the default value for the ``dosage`` parameter in :func:`ld_matrix` and
  :func:`ld_prune` has been changed from ``'dosage'`` to ``'call_dosage'``.
  (:user:`timothymillar`, :pr:`995`, :issue:`875`)
- The ``genotype_count`` variable has been removed in favour of
  :data:`sgkit.variables.variant_genotype_count_spec` which follows VCF ordering
  (i.e., homozygous reference, heterozygous, homozygous alternate for biallelic, 
  diploid genotypes).
  :func:`hardy_weinberg_test` now defaults to using 
  :data:`sgkit.variables.variant_genotype_count_spec` for the ``genotype_count``
  parameter. (:user:`timothymillar`, :issue:`911`, :pr:`1002`)

.. Deprecations
.. ~~~~~~~~~~~~

Improvements
~~~~~~~~~~~~

- Improvements to VCF parsing performance.
  (:user:`benjeffery`, :pr:`933`)
- Improve default VCF compression.
  (:user:`tomwhite`, :pr:`937`, :issue:`925`)
- Ensure chunking is not excessive in samples dimension.
  (:user:`tomwhite`, :pr:`943`)
- Add asv benchmarks for VCF performance.
  (:user:`tomwhite`, :pr:`976`)
- Add asv benchmarks for VCF compression size.
  (:user:`tomwhite`, :pr:`978`)

Bug fixes
~~~~~~~~~

- Allow chunking in the samples dimension for :func:`identity_by_state`.
  (:user:`timothymillar`, :pr:`837`, :issue:`836`)
- Remove VLenUTF8 from filters to avoid double encoding error.
  (:user:`tomwhite`, :pr:`852`, :issue:`785`)
- Fix numpy input for ``Weir_Goudet_beta``.
  (:user:`timothymillar`, :pr:`865`, :issue:`861`)
- Fix ``get_region_start`` to work with contig names that have colons and dashes.
  (:user:`d-laub`, :pr:`883`, :issue:`882`)
- Fixes to VCF reading and writing found by hypothesis testing.
  (:user:`tomwhite`, :pr:`972`)

.. Documentation
.. ~~~~~~~~~~~~~