sgkit.io.plink.read_plink
- sgkit.io.plink.read_plink(*, path=None, bed_path=None, bim_path=None, fam_path=None, chunks='auto', fam_sep=' ', bim_sep='\t', bim_int_contig=False, count_a1=True, lock=False, persist=True)
Read PLINK dataset.
Loads a single PLINK dataset as dask arrays within a Dataset from bed, bim, and fam files.
- Parameters
- path : Union[str, Path, None] (default: None)
Path to PLINK file set. This should not include a suffix, i.e. if the files are at data.{bed,fam,bim} then only ‘data’ should be provided (suffixes are added internally). Either this path must be provided or all 3 of bed_path, bim_path and fam_path.
- bed_path : Union[str, Path, None] (default: None)
Path to PLINK bed file. This should be a full path including the .bed extension and cannot be specified in conjunction with path.
- bim_path : Union[str, Path, None] (default: None)
Path to PLINK bim file. This should be a full path including the .bim extension and cannot be specified in conjunction with path.
- fam_path : Union[str, Path, None] (default: None)
Path to PLINK fam file. This should be a full path including the .fam extension and cannot be specified in conjunction with path.
- chunks : Union[str, int, tuple] (default: 'auto')
Chunk size for genotype (i.e. .bed) data, by default "auto".
- fam_sep : str (default: ' ')
Delimiter for .fam file, by default " " (a single space).
- bim_sep : str (default: '\t')
Delimiter for .bim file, by default "\t" (a tab).
- bim_int_contig : bool (default: False)
Whether or not the contig/chromosome name in the .bim file should be interpreted as an integer, by default False. If False, then the variant/contig field in the resulting dataset will contain the indexes of corresponding strings encountered in the first .bim field.
- count_a1 : bool (default: True)
Whether allele counts should be for A1 (True) or A2 (False), by default True. Typically A1 is the minor allele and should be counted instead of A2. PLINK does not enforce this, however, and it is up to the data generating process to ensure that A1 is in fact an alternate/minor/effect allele. See https://www.cog-genomics.org/plink/1.9/formats for more details.
- lock : bool (default: False)
Whether or not to synchronize concurrent reads of .bed file blocks, by default False. This is passed through to [dask.array.from_array](https://docs.dask.org/en/latest/array-api.html#dask.array.from_array).
- persist : bool (default: True)
Whether or not to persist .fam and .bim information in memory, by default True. This is an important performance consideration as the plain text files for this data will be read multiple times when False. This can lead to load times that are upwards of 10x slower.
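To illustrate the bim_int_contig=False behaviour described above, the following is a minimal sketch of encoding contig names as integer indexes. It is not sgkit's implementation: the helper name `encode_contigs` is hypothetical, and the sketch assumes that indexes are assigned in order of first appearance.

```python
from typing import Dict, List, Tuple


def encode_contigs(contigs: List[str]) -> Tuple[List[int], List[str]]:
    """Map each contig name to an integer index.

    Hypothetical helper for illustration only. Assigning indexes in
    order of first appearance is an assumption of this sketch.
    """
    names: List[str] = []          # unique contig names, in order seen
    index_of: Dict[str, int] = {}  # contig name -> assigned index
    codes: List[int] = []          # one index per variant
    for contig in contigs:
        if contig not in index_of:
            index_of[contig] = len(names)
            names.append(contig)
        codes.append(index_of[contig])
    return codes, names


# First .bim field for five variants (made-up example values)
codes, names = encode_contigs(["1", "1", "X", "2", "X"])
print(codes)  # [0, 0, 1, 2, 1]
print(names)  # ['1', 'X', '2']
```

With bim_int_contig=True, by contrast, a contig name such as "1" would be interpreted directly as the integer 1, which only works when every contig name is numeric.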
- Return type
Dataset
- Returns
A dataset containing genotypes as 3 dimensional calls along with all accompanying pedigree and variant information. The content of this dataset includes:
- sgkit.variables.variant_id_spec (variants)
- sgkit.variables.variant_contig_spec (variants)
- sgkit.variables.variant_position_spec (variants)
- sgkit.variables.variant_allele_spec (variants)
- sgkit.variables.sample_id_spec (samples)
- sgkit.variables.call_genotype_spec (variants, samples, ploidy)
- sgkit.variables.call_genotype_mask_spec (variants, samples, ploidy)
The following pedigree-specific fields are also included:
- sample_family_id: Family identifier commonly referred to as FID
- sample_id: Within-family identifier for sample
- sample_paternal_id: Within-family identifier for father of sample
- sample_maternal_id: Within-family identifier for mother of sample
- sample_sex: Sex code equal to 1 for male, 2 for female, and -1 for missing
- sample_phenotype: Phenotype code equal to 1 for control, 2 for case, and -1 for missing
See https://www.cog-genomics.org/plink/1.9/formats#fam for more details.
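As a rough illustration of the pedigree fields listed above, the sketch below parses one line of a .fam file into those fields. The function `parse_fam_line` is hypothetical and is not part of sgkit. It assumes the six-column PLINK 1.9 .fam layout with a case/control phenotype, and it maps any sex or phenotype code other than 1 or 2 to -1, following the missing-value convention documented here.

```python
from typing import Dict, Union


def _code_or_missing(raw: str) -> int:
    """Return 1 or 2 unchanged; treat any other value as missing (-1)."""
    return int(raw) if raw in ("1", "2") else -1


def parse_fam_line(line: str, sep: str = " ") -> Dict[str, Union[str, int]]:
    """Parse one .fam line into the documented pedigree fields.

    Hypothetical helper for illustration only; assumes exactly six
    fields separated by `sep`.
    """
    fid, iid, father, mother, sex, phenotype = line.strip().split(sep)
    return {
        "sample_family_id": fid,
        "sample_id": iid,
        "sample_paternal_id": father,
        "sample_maternal_id": mother,
        "sample_sex": _code_or_missing(sex),
        "sample_phenotype": _code_or_missing(phenotype),
    }


# Made-up example line: female sample with a missing phenotype (-9)
record = parse_fam_line("FAM1 S1 0 0 2 -9")
print(record["sample_sex"])        # 2
print(record["sample_phenotype"])  # -1
```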
- Raises
ValueError – If path and one of bed_path, bim_path or fam_path are provided.
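The rules for path, bed_path, bim_path and fam_path can be sketched as a small validation routine. This is not sgkit's code: `resolve_plink_paths` is a hypothetical helper written from the parameter descriptions above. Raising ValueError when neither path nor all three individual paths are given is an assumption of this sketch, since only the conflicting case is documented.

```python
from pathlib import Path
from typing import Optional, Tuple, Union

PathType = Union[str, Path]


def resolve_plink_paths(
    path: Optional[PathType] = None,
    bed_path: Optional[PathType] = None,
    bim_path: Optional[PathType] = None,
    fam_path: Optional[PathType] = None,
) -> Tuple[Path, Path, Path]:
    """Return (bed, bim, fam) paths from either a prefix or three full paths.

    Hypothetical helper for illustration only.
    """
    individual = (bed_path, bim_path, fam_path)
    if path is not None:
        # Documented error: path combined with any individual file path
        if any(p is not None for p in individual):
            raise ValueError(
                "Provide either path or all of bed_path, bim_path and "
                "fam_path, not both"
            )
        # Suffixes are appended to the prefix
        return (
            Path(f"{path}.bed"),
            Path(f"{path}.bim"),
            Path(f"{path}.fam"),
        )
    # Assumption: an incomplete set of individual paths is also an error
    if bed_path is None or bim_path is None or fam_path is None:
        raise ValueError(
            "All of bed_path, bim_path and fam_path are required "
            "when path is not given"
        )
    return Path(bed_path), Path(bim_path), Path(fam_path)


print([str(p) for p in resolve_plink_paths(path="data")])
# ['data.bed', 'data.bim', 'data.fam']
```

Under these rules, a call such as read_plink(path="data", bed_path="data.bed") corresponds to the conflicting case that the Raises entry describes.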