Data Formats
Gelex supports standard bioinformatics formats for genomic analysis. Most inputs are PLINK binaries or tabular text files, and outputs are command-specific TSV-style summaries.
Quick Reference
Use this section to quickly map your files to Gelex commands and the relevant detail section.
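| Data type | Files | Used by | Details |
|---|---|---|---|
| Genotype | .bed, .bim, .fam | | Genotype Data (PLINK Binary) |
| Phenotype | TSV or space-separated text | | Phenotype Data |
| Quantitative covariates | TSV or space-separated text | | Quantitative Covariates (qcovar) |
| Discrete covariates | TSV or space-separated text | | Discrete Covariates (dcovar) |
| GRM | .grm.bin, .grm.id | | Genomic Relationship Matrix (GRM) |

| Generated by | Output file | Typical content | Details |
|---|---|---|---|
| fit | .snp.eff | SNP posterior effects | SNP Effects (.snp.eff) |
| fit | .param | Model-level posterior summaries | Model Parameters (.param) |
| assoc | .gwas.tsv | GWAS summary statistics | GWAS Results (.gwas.tsv) |
| predict | .pred.tsv | Individual predictions | Prediction Results (.pred.tsv) |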
Input Data Formats
Genotype Data (PLINK Binary)
Gelex uses PLINK 1 binary filesets, the .bed/.bim/.fam format written by PLINK 1.9, as its primary genotype input.
| Extension | Description |
|---|---|
| .bed | Binary file containing genotype calls. |
| .bim | Text file containing SNP metadata (chromosome, ID, position, alleles). |
| .fam | Text file containing sample metadata (FID, IID, parents, sex, phenotype). |
Important
When specifying genotype input with --bfile, provide the prefix only.
Gelex automatically searches for .bed, .bim, and .fam.
- Example: --bfile mydata loads mydata.bed, mydata.bim, and mydata.fam.
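
Before passing a prefix to --bfile, it can help to confirm that all three files are present. The following is a minimal sketch, not part of Gelex: `check_bfile_prefix` is a hypothetical helper, and the magic-byte check assumes the variant-major .bed layout that PLINK 1.9 writes.

```python
from pathlib import Path

# First three bytes of a PLINK 1 .bed file: two magic bytes, then 0x01 for
# the variant-major layout that PLINK 1.9 writes.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def check_bfile_prefix(prefix):
    """Return the .bed/.bim/.fam paths for a prefix, or raise if one is unusable."""
    # Append the extension rather than replacing a suffix, so prefixes that
    # contain dots (for example "cohort.v2") are handled correctly.
    paths = {ext: Path(f"{prefix}.{ext}") for ext in ("bed", "bim", "fam")}
    missing = [str(p) for p in paths.values() if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing PLINK files: " + ", ".join(missing))
    with open(paths["bed"], "rb") as handle:
        if handle.read(3) != BED_MAGIC:
            raise ValueError(f"{paths['bed']} is not a variant-major PLINK 1 .bed file")
    return paths
```

Calling `check_bfile_prefix("mydata")` returns the three paths when mydata.bed, mydata.bim, and mydata.fam all exist, and raises otherwise.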
Phenotype Data
Phenotype files should be tab-separated (TSV) or space-separated text with a required header row.
| Component | Requirement |
|---|---|
| Format | Tab-separated (TSV) or space-separated text. |
| Header | A header row is required. |
| Missing values | Use |
Tip
By default, Gelex reads the 3rd column (index 2) as the trait.
Use --pheno-col to select a different trait column
(1-based index or column name).
Example:
FID IID Height Weight
1001 1001 175.5 70.2
1002 1002 168.0 -9
1003 1003 NA 65.5
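
To illustrate how a trait column can be selected by index or name, here is a small sketch that parses the example above. It is not Gelex's parser: `read_trait` is a hypothetical helper, the `pheno_col` argument only mimics the 1-based index or column name accepted by --pheno-col, and treating only `NA` as missing is an assumption.

```python
import io


def read_trait(handle, pheno_col=3, missing=("NA",)):
    """Return (column name, {(FID, IID): value or None}) for one trait column.

    pheno_col is a 1-based column index or a header name; the default of 3
    is the first column after FID and IID. The set of missing-value tokens
    is an assumption of this sketch.
    """
    header = handle.readline().split()  # splits on tabs or spaces
    if isinstance(pheno_col, int):
        col = pheno_col - 1
    else:
        col = header.index(pheno_col)  # raises ValueError for unknown names
    if not 2 <= col < len(header):
        raise ValueError(f"trait column {pheno_col!r} is out of range")
    values = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        raw = fields[col]
        values[(fields[0], fields[1])] = None if raw in missing else float(raw)
    return header[col], values


example = (
    "FID IID Height Weight\n"
    "1001 1001 175.5 70.2\n"
    "1002 1002 168.0 -9\n"
    "1003 1003 NA 65.5\n"
)
name, trait = read_trait(io.StringIO(example), pheno_col="Height")
```

In this sketch, sample 1003 ends up with `None` for Height, while the other two samples keep their numeric values.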
Covariate Data
Gelex supports two covariate types. Both use the same base layout as
phenotype files: FID, IID, followed by one or more covariate columns.
| Type | Use case | CLI option |
|---|---|---|
| Quantitative covariates | Continuous variables (age, BMI, PCs) | |
| Discrete covariates | Categorical variables (sex, batch, site) | |
Quantitative Covariates (qcovar)
Use for continuous variables.
FID IID Age PC1 PC2
1001 1001 45 -0.012 0.045
1002 1002 32 0.005 -0.021
Discrete Covariates (dcovar)
Use for categorical variables. Gelex internally converts them into dummy variables (one-hot encoding).
FID IID Sex Site
1001 1001 1 A
1002 1002 2 B
1003 1003 1 B
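
The dummy-variable conversion can be pictured with a short sketch. This shows generic one-hot coding, not Gelex's internal routine: `one_hot` is a hypothetical helper, and whether Gelex drops a reference level (the `drop_first` switch here) is an assumption.

```python
def one_hot(values, drop_first=True):
    """Dummy-code one categorical column.

    Returns (kept levels, rows), where each row holds one 0/1 indicator per
    kept level. With drop_first=True the first sorted level acts as the
    reference and gets no column of its own.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


# The Site column from the example above.
kept, rows = one_hot(["A", "B", "B"])
# kept == ["B"], rows == [[0], [1], [1]]
```

Dropping one level per covariate avoids perfect collinearity with an intercept, which is why many tools code categorical covariates this way.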
Genomic Relationship Matrix (GRM)
Gelex uses GRM files in the GCTA binary format.
| Extension | Description |
|---|---|
| .grm.bin | Binary file containing the lower triangle of the relationship matrix. |
| .grm.id | Text file containing the sample identifiers, one sample per line. |
Note
For --grm, you can pass either a prefix (for example my_grm)
or the full path to the binary file.
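
For inspecting a GRM outside Gelex, the sketch below reads the pair of files into a full symmetric matrix. It assumes the usual GCTA layout (4-byte floats, lower triangle including the diagonal, stored row by row) and little-endian byte order. `read_grm` is a hypothetical helper, and the way it accepts either a prefix or the .grm.bin path only mimics the behavior described in the note.

```python
import struct
from pathlib import Path


def read_grm(prefix_or_bin):
    """Read <prefix>.grm.id and <prefix>.grm.bin into (ids, full matrix)."""
    path = str(prefix_or_bin)
    # Accept either "my_grm" or "my_grm.grm.bin".
    prefix = path[: -len(".grm.bin")] if path.endswith(".grm.bin") else path

    id_lines = Path(prefix + ".grm.id").read_text().splitlines()
    ids = [tuple(line.split()) for line in id_lines if line.strip()]
    n = len(ids)

    count = n * (n + 1) // 2  # entries in the lower triangle with diagonal
    raw = Path(prefix + ".grm.bin").read_bytes()
    if len(raw) != 4 * count:
        raise ValueError(f"expected {4 * count} bytes for {n} samples, found {len(raw)}")
    flat = struct.unpack(f"<{count}f", raw)  # assumption: little-endian float32

    grm = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1):
            grm[i][j] = grm[j][i] = flat[k]  # mirror into the upper triangle
            k += 1
    return ids, grm
```

The size check against the number of IDs is a cheap way to catch a .grm.bin and .grm.id pair that do not belong together.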
Output Data Formats
Gelex generates structured output files based on the command you run.
Command-to-Output Mapping
| Command | Main output file(s) | Description |
|---|---|---|
| fit | .snp.eff, .param | Posterior SNP effects and model-level summaries. |
| assoc | .gwas.tsv | SNP-level association statistics. |
| predict | .pred.tsv | Individual-level predicted values and components. |
SNP Effects (.snp.eff)
Generated by fit. Contains posterior estimates for each SNP.
| Column | Description |
|---|---|
| | 1-based SNP index. |
| | SNP identifier. |
| | Chromosome name. |
| | Base-pair position. |
| | Effect allele and reference allele. |
| | Frequency of the A1 allele. |
| | Posterior mean and standard deviation of additive effect. |
| | Proportion of variance explained by additive effect. |
| | Posterior Inclusion Probability (for mixture models). |
| | (Optional) Dominance effect posterior mean and standard deviation. |
| | (Optional) Dominance-specific PVE and PIP. |
| | (Optional) Posterior probability for the k-th mixture component. |
Model Parameters (.param)
Generated by fit. Summarizes model-wide parameters.
| Column | Description |
|---|---|
| | Parameter name (for example |
| | Posterior mean. |
| | Posterior standard deviation. |
| | 90% credible interval boundaries. |
| | Effective Sample Size. |
| | Potential Scale Reduction Factor (Gelman-Rubin diagnostic). |
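
The Gelman-Rubin diagnostic compares between-chain and within-chain variance; values close to 1 suggest the chains agree. Below is a minimal sketch of the classic (non-split) formula for several equal-length chains of draws of one parameter. `psrf` is a hypothetical helper, and Gelex's own computation may use a different variant.

```python
from statistics import mean, variance


def psrf(chains):
    """Classic Gelman-Rubin potential scale reduction factor.

    chains is a list of equal-length lists of posterior draws for one
    parameter, one list per MCMC chain.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length >= 2")

    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W
    between = n * variance(chain_means)                   # B
    pooled = (n - 1) / n * within + between / n           # pooled variance estimate
    return (pooled / within) ** 0.5


# Two chains centered far apart give a value well above 1.
r_hat = psrf([[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]])
```

A common rule of thumb treats values above roughly 1.1 as a sign that the chains have not yet mixed, although stricter thresholds are also in use.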
GWAS Results (.gwas.tsv)
Generated by assoc. Standard summary statistics for association testing.
| Column | Description |
|---|---|
| | Chromosome, SNP ID, and base-pair position. |
| | Effect allele and other allele. |
| | Effect allele frequency. |
| | Effect size estimate (Wald test). |
| | Standard error of the effect size. |
| | P-value from the Wald test. |
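
The three statistics in the last rows are linked: the Wald statistic is the effect estimate divided by its standard error. As a sketch, assuming a two-sided test against the standard normal distribution (Gelex may use a different reference distribution), the p-value can be recomputed from the other two columns. `wald_p_value` is a hypothetical helper.

```python
import math


def wald_p_value(beta, se):
    """Two-sided p-value for z = beta / se under a standard normal reference."""
    z = beta / se
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)); erfc stays accurate
    # for the very small p-values that are common in GWAS.
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For example, an estimate that is 1.96 standard errors from zero gives a p-value of about 0.05.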
Prediction Results (.pred.tsv)
Generated by predict. Contains individual prediction components.
| Column | Description |
|---|---|
| | Individual identifiers (FID omitted if |
| | Total predicted value (Additive + Dominance + Covariates). |
| | Estimated contribution from each covariate. |
| | Predicted additive genetic value (Estimated Breeding Value). |
| | (Optional) Predicted dominance genetic value. |
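
The total in the table is the sum of the listed components. A small sketch of that arithmetic follows, useful as a sanity check on a prediction row. `total_prediction` and the covariate names are hypothetical, and whether Gelex reports further terms (such as an intercept) among the covariate columns is not assumed here.

```python
def total_prediction(additive, covariates, dominance=0.0):
    """Sum additive value, optional dominance value, and covariate contributions.

    covariates maps each covariate name to its estimated contribution for
    one individual.
    """
    return additive + dominance + sum(covariates.values())


# Hypothetical component values for one individual.
total = total_prediction(1.5, {"Age": 0.25, "Sex": -0.5}, dominance=0.25)
# total == 1.5
```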