sgkit.simulate_genedrop#
- sgkit.simulate_genedrop(ds, *, method='diploid', call_genotype='call_genotype', parent='parent', stat_Hamilton_Kerr_tau='stat_Hamilton_Kerr_tau', stat_Hamilton_Kerr_lambda='stat_Hamilton_Kerr_lambda', seed=None, merge=True)#
Generate progeny genotypes via a gene-drop simulation (MacCluer et al. 1986 [1]).
Simulate Mendelian inheritance of founder alleles throughout a pedigree. Founders are identified as those individuals with unrecorded parents.
- Parameters:
- ds
Dataset
Dataset containing genotype calls and pedigree structure.
- method {‘diploid’, ‘Hamilton-Kerr’} |
None
Optional
[Literal
['diploid'
,'Hamilton-Kerr'
]] (default:'diploid'
) The method used for gene-drop simulation. Defaults to “diploid” which is only suitable for pedigrees in which all samples are diploids resulting from sexual reproduction. The “Hamilton-Kerr” method is suitable for autopolyploid and mixed-ploidy datasets following Kerr et al. 2012 [2] and Hamilton and Kerr 2017 [3].
- call_genotype
Hashable
(default:'call_genotype'
) Input variable name holding call_genotype as defined by
sgkit.variables.call_genotype_spec
. Must be present inds
.- parent
Hashable
(default:'parent'
) Input variable name holding parents of each sample as defined by
sgkit.variables.parent_spec
. If the variable is not present inds
, it will be computed usingparent_indices()
.- stat_Hamilton_Kerr_tau
Hashable
(default:'stat_Hamilton_Kerr_tau'
) Input variable name holding stat_Hamilton_Kerr_tau as defined by
sgkit.variables.stat_Hamilton_Kerr_tau_spec
. This variable is only required for the “Hamilton-Kerr” method.- stat_Hamilton_Kerr_lambda
Hashable
(default:'stat_Hamilton_Kerr_lambda'
) Input variable name holding stat_Hamilton_Kerr_lambda as defined by
sgkit.variables.stat_Hamilton_Kerr_lambda_spec
. This variable is only required for the “Hamilton-Kerr” method.- seed
ndarray
|Array
|None
Union
[ndarray
,Array
,None
] (default:None
) Optionally specify a random seed to initialise gene-drop simulations. This may be a single integer value or an array of unsigned 32 bit integers used to specify the random seed for each variant.
- merge
bool
(default:True
) If True (the default), merge the input dataset and the computed output variables into a single dataset, otherwise return only the computed output variables. See Dataset merge behavior for more details.
- ds
- Return type:
- Returns:
: A dataset containing the following variables:
- Raises:
ValueError – If an unknown method is specified.
ValueError – If the pedigree contains half-founders.
ValueError – If the diploid method is used with a non-diploid dataset.
ValueError – If the diploid method is used and the parents dimension does not have a length of two.
ValueError – If the Hamilton-Kerr method is used and a sample has more than two contributing parents.
ValueError – If the Hamilton-Kerr method is used and the number of alleles in a founder genotype does not match the sum of its tau values (i.e., ploidy).
NotImplementedError – If the Hamilton-Kerr method is used and a tau value exceeds the parental ploidy.
NotImplementedError – If the Hamilton-Kerr method is used and a non-zero lambda value is specified when tau is not 2.
Note
Linkage between variant loci is not simulated. However, variants will have identical inheritance patterns if initialized with identical random seeds when using an array of seeds (see the example).
Examples
Dataset with founder genotypes
>>> import sgkit as sg >>> import numpy as np >>> ds = sg.simulate_genotype_call_dataset(n_variant=3, n_sample=5, n_allele=4) >>> ds["parent_id"] = ["samples", "parents"], [ ... [".", "."], ... [".", "."], ... ["S0", "S1"], ... ["S0", "S1"], ... ["S2", "S3"], ... ] >>> ds.call_genotype.data[:] = -1 >>> ds.call_genotype.data[:,0:2] = np.arange(4).reshape(2,2) >>> sg.display_genotypes(ds) samples S0 S1 S2 S3 S4 variants 0 0/1 2/3 ./. ./. ./. 1 0/1 2/3 ./. ./. ./. 2 0/1 2/3 ./. ./. ./.
Simulation with random seed
>>> sim = sg.simulate_genedrop(ds, merge=False, seed=1) >>> sim["sample_id"] = ds["sample_id"] >>> sim["variant_position"] = ds["variant_position"] >>> sim["variant_allele"] = ds["variant_allele"] >>> sg.display_genotypes(sim) samples S0 S1 S2 S3 S4 variants 0 0/1 2/3 0/3 1/3 0/3 1 0/1 2/3 0/3 0/3 0/3 2 0/1 2/3 0/2 0/3 2/0
Simulation with seed per variant (including duplicates)
>>> seeds = np.array([0,0,1], 'uint32') >>> sim = sg.simulate_genedrop(ds, merge=False, seed=seeds) >>> sim["sample_id"] = ds["sample_id"] >>> sim["variant_position"] = ds["variant_position"] >>> sim["variant_allele"] = ds["variant_allele"] >>> sg.display_genotypes(sim) samples S0 S1 S2 S3 S4 variants 0 0/1 2/3 1/2 0/3 2/3 1 0/1 2/3 1/2 0/3 2/3 2 0/1 2/3 0/2 1/3 2/3
References
[1] Jean W. MacCluer, John L. VanderBerg. Bruce Read and Oliver A. Ryder 1986. “Pedigree analysis by computer simulation.” Zoo Biology 5: 147-160.
[2] - Richard J. Kerr, Li Li, Bruce Tier, Gregory W. Dutkowski and Thomas A. McRae 2012. “Use of the numerator relationship matrix in genetic analysis of autopolyploid species.” Theoretical and Applied Genetics 124: 1271-1282.
[3] - Matthew G. Hamilton and Richard J. Kerr 2017. “Computation of the inverse additive relationship matrix for autopolyploid and multiple-ploidy populations.” Theoretical and Applied Genetics 131: 851-860.