Data Formats

Gelex supports standard bioinformatics formats for genomic analysis. Most inputs are PLINK binaries or tabular text files, and outputs are command-specific TSV-style summaries.

Quick Reference

Use this section to quickly map your files to Gelex commands and the relevant detail section.

Input files at a glance

| Data type | Files | Used by | Details |
| --- | --- | --- | --- |
| Genotype | .bed + .bim + .fam | fit, assoc, predict | Genotype Data (PLINK Binary) |
| Phenotype | .tsv or space-separated text | fit, assoc | Phenotype Data |
| Quantitative covariates | .tsv or space-separated text | fit, predict | Covariate Data |
| Discrete covariates | .tsv or space-separated text | fit, predict | Covariate Data |
| GRM | .grm.bin + .grm.id | gblup-style workflows | Genomic Relationship Matrix (GRM) |

Output files at a glance

| Generated by | Output file | Typical content | Details |
| --- | --- | --- | --- |
| fit | .snp.eff | SNP posterior effects | SNP Effects (.snp.eff) |
| fit | .param | Model-level posterior summaries | Model Parameters (.param) |
| assoc | .gwas.tsv | GWAS summary statistics | GWAS Results (.gwas.tsv) |
| predict | .pred.tsv | Individual predictions | Prediction Results (.pred.tsv) |

Input Data Formats

Phenotype Data

Phenotype files should be tab-separated (TSV) or space-separated text with a required header row.

Phenotype file requirements

| Component | Requirement |
| --- | --- |
| Format | FID IID Trait1 Trait2 ... |
| Header | A header row is required. |
| Missing values | Use -9, NA, or nan. |

Tip

By default, Gelex reads the 3rd column (index 2) as the trait. Use --pheno-col to select a different trait column (1-based index or column name).

Example:

FID    IID    Height    Weight
1001   1001   175.5     70.2
1002   1002   168.0     -9
1003   1003   NA        65.5
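To check a phenotype file before passing it to Gelex, the layout above can be parsed with a few lines of Python. This is a minimal sketch, not Gelex's own parser: the function name `read_phenotype` is hypothetical, the trait is selected by header name only, and the missing-value codes are matched case-sensitively as an assumption.

```python
import io

# Missing-value codes listed in the requirements table.
MISSING = {"-9", "NA", "nan"}

def read_phenotype(handle, trait):
    """Return {(FID, IID): value or None} for one trait, selected by header name."""
    header = handle.readline().split()
    col = header.index(trait)  # raises ValueError if the trait is not in the header
    values = {}
    for line in handle:
        fields = line.split()  # splits on tabs or runs of spaces
        if not fields:
            continue  # skip blank lines
        raw = fields[col]
        values[(fields[0], fields[1])] = None if raw in MISSING else float(raw)
    return values

example = io.StringIO(
    "FID    IID    Height    Weight\n"
    "1001   1001   175.5     70.2\n"
    "1002   1002   168.0     -9\n"
    "1003   1003   NA        65.5\n"
)
weights = read_phenotype(example, "Weight")
print(weights)
```

Individuals with a missing value map to `None`, which makes it easy to count how many records a given trait would lose.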

Related command docs: fit, assoc.

Covariate Data

Gelex supports two covariate types. Both use the same base layout as phenotype files: FID, IID, followed by one or more covariate columns.

Covariate types

| Type | Use case | CLI option |
| --- | --- | --- |
| Quantitative covariates | Continuous variables (age, BMI, PCs) | --qcovar |
| Discrete covariates | Categorical variables (sex, batch, site) | --dcovar |

Quantitative Covariates (qcovar)

Use for continuous variables.

FID    IID    Age    PC1       PC2
1001   1001   45     -0.012    0.045
1002   1002   32     0.005     -0.021

Discrete Covariates (dcovar)

Use for categorical variables. Gelex internally converts them into dummy variables (one-hot encoding).

FID    IID    Sex    Site
1001   1001   1      A
1002   1002   2      B
1003   1003   1      B
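The dummy-variable conversion can be illustrated with a short Python sketch. It shows plain one-hot encoding of a single column; the helper `one_hot` is hypothetical, and whether Gelex drops a reference level or orders levels differently is not specified here, so this version simply keeps every level in sorted order.

```python
def one_hot(values):
    """Expand one categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

site = ["A", "B", "B"]  # the Site column from the example above
levels, rows = one_hot(site)
print(levels)
print(rows)
```

Each individual gets exactly one 1 per original column, so a covariate with k levels becomes k indicator columns in this sketch.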

Related command docs: fit, predict.

Genomic Relationship Matrix (GRM)

Gelex uses GRM files in the GCTA binary format.

GRM files

| Extension | Description |
| --- | --- |
| .grm.bin | Binary file containing the lower triangle of the relationship matrix. |
| .grm.id | Text file containing the FID and IID of matrix samples. |

Note

For --grm, you can pass either a prefix (for example my_grm) or the full path to the binary file.
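To inspect a GRM outside Gelex, the two files can be read back into a full symmetric matrix. The sketch below assumes the usual GCTA convention: 4-byte floats, lower triangle stored row by row with the diagonal included, written little-endian. The function name `read_grm` and the toy file contents are hypothetical and exist only for illustration.

```python
import os
import struct
import tempfile

def read_grm(prefix):
    """Read <prefix>.grm.id and <prefix>.grm.bin into (ids, full symmetric matrix)."""
    with open(prefix + ".grm.id") as fh:
        ids = [tuple(line.split()[:2]) for line in fh if line.strip()]
    n = len(ids)
    count = n * (n + 1) // 2  # lower triangle including the diagonal
    with open(prefix + ".grm.bin", "rb") as fh:
        flat = struct.unpack(f"<{count}f", fh.read(count * 4))
    grm = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1):
            grm[i][j] = grm[j][i] = flat[k]  # mirror into the upper triangle
            k += 1
    return ids, grm

# Toy data: three individuals, six lower-triangle values.
tmpdir = tempfile.mkdtemp()
prefix = os.path.join(tmpdir, "toy")
with open(prefix + ".grm.id", "w") as fh:
    fh.write("1001 1001\n1002 1002\n1003 1003\n")
with open(prefix + ".grm.bin", "wb") as fh:
    fh.write(struct.pack("<6f", 1.0, 0.25, 1.0, 0.5, 0.125, 1.0))

ids, grm = read_grm(prefix)
print(ids)
print(grm)
```

The number of individuals comes from the `.grm.id` line count, so a size mismatch between the two files surfaces as a `struct.error` rather than a silently misaligned matrix.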

Output Data Formats

Gelex generates structured output files based on the command you run.

Command-to-Output Mapping

| Command | Main output file(s) | Description |
| --- | --- | --- |
| fit | <out>.snp.eff, <out>.param | Posterior SNP effects and model-level summaries. |
| assoc | <out>.gwas.tsv | SNP-level association statistics. |
| predict | <out>.pred.tsv | Individual-level predicted values and components. |

See fit, assoc, and predict for command options.

SNP Effects (.snp.eff)

Generated by fit. Contains posterior estimates for each SNP.

| Column | Description |
| --- | --- |
| Index | 1-based SNP index. |
| ID | SNP identifier. |
| Chrom | Chromosome name. |
| Position | Base-pair position. |
| A1, A2 | Effect allele and reference allele. |
| A1Freq | Frequency of the A1 allele. |
| Add, AddSE | Posterior mean and standard deviation of the additive effect. |
| AddPVE | Proportion of variance explained by the additive effect. |
| PIP | Posterior inclusion probability (for mixture models). |
| Dom, DomSE | (Optional) Posterior mean and standard deviation of the dominance effect. |
| DomPVE, PIP | (Optional) Dominance-specific PVE and PIP. |
| pi_k | (Optional) Posterior probability for the k-th mixture component. |

Model Parameters (.param)

Generated by fit. Summarizes model-wide parameters.

| Column | Description |
| --- | --- |
| term | Parameter name (for example σ²_add, Intercept). |
| mean | Posterior mean. |
| stddev | Posterior standard deviation. |
| 5%, 95% | Lower and upper bounds of the 90% credible interval. |
| ess | Effective sample size. |
| rhat | Potential scale reduction factor (Gelman-Rubin diagnostic). |
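The ess and rhat columns are typically used as a quick convergence check. The sketch below flags terms that look unconverged; it assumes the file is tab-delimited with the column names listed above, and both the thresholds (rhat above 1.01, ess below 400) and the toy values are illustrative assumptions rather than Gelex defaults or real output.

```python
import csv
import io

def flag_unconverged(handle, rhat_max=1.01, ess_min=400):
    """Return the terms whose rhat is too high or whose ess is too low."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["term"]
        for row in reader
        if float(row["rhat"]) > rhat_max or float(row["ess"]) < ess_min
    ]

# Made-up rows for illustration only.
toy_param = io.StringIO(
    "term\tmean\tstddev\t5%\t95%\tess\trhat\n"
    "σ²_add\t0.42\t0.05\t0.34\t0.50\t1800\t1.00\n"
    "Intercept\t170.1\t0.90\t168.6\t171.6\t120\t1.08\n"
)
flagged = flag_unconverged(toy_param)
print(flagged)
```

A flagged term usually means the chain should be run longer before its posterior summaries are interpreted.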

GWAS Results (.gwas.tsv)

Generated by assoc. Standard summary statistics for association testing.

| Column | Description |
| --- | --- |
| CHR, SNP, BP | Chromosome, SNP ID, and base-pair position. |
| A1, A2 | Effect allele and other allele. |
| A1FREQ | Effect allele frequency. |
| BETA | Effect size estimate (Wald test). |
| SE | Standard error of the effect size. |
| P | P-value from the Wald test. |
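The relationship between BETA, SE, and P can be sketched as follows. This assumes a two-sided Wald test with a standard normal reference distribution (equivalent to a 1-df chi-square on z²); the exact reference distribution Gelex uses is not stated here, and `wald_p_value` is a hypothetical helper.

```python
import math

def wald_p_value(beta, se):
    """Two-sided p-value for z = beta / se under a standard normal."""
    z = beta / se
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))

print(wald_p_value(0.10, 0.02))  # z = 5, a strongly significant effect
```

Recomputing P this way is a quick sanity check when merging results from several runs or tools.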

Prediction Results (.pred.tsv)

Generated by predict. Contains individual prediction components.

| Column | Description |
| --- | --- |
| FID, IID | Individual identifiers (FID is omitted when --iid-only is set). |
| prediction | Total predicted value (Additive + Dominance + Covariates). |
| {covar} | Estimated contribution from each covariate. |
| additive | Predicted additive genetic value (Estimated Breeding Value). |
| dominant | (Optional) Predicted dominance genetic value. |
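The decomposition of the prediction column can be checked row by row. The sketch below recomputes the total from its parts, treating a missing dominant column as zero; the covariate names and numbers are made up for illustration, and whether an intercept is reported as its own column is an open assumption.

```python
def total_prediction(row, covariate_columns):
    """Sum the additive, optional dominance, and covariate contributions."""
    total = row["additive"] + row.get("dominant", 0.0)
    return total + sum(row[name] for name in covariate_columns)

# Hypothetical parsed row; Age and Sex stand in for the {covar} columns.
row = {"additive": 1.5, "dominant": 0.25, "Age": 0.5, "Sex": -0.25}
print(total_prediction(row, ["Age", "Sex"]))
```

Comparing this sum with the reported prediction value is a simple way to confirm that a file was parsed with the right column mapping.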