fit

Train SNP effect models for genomic prediction using BayesAlphabet methods.

Use this command to learn marker effects from training data, then reuse those effects with gelex predict.

Basic Syntax

Minimum Working Command

gelex fit -b train_data -p phenotypes.tsv -m RR -o model_rr

Full Syntax Template

gelex fit --pheno <pheno_file> --bfile <genotype_prefix> --method <method> [OPTIONS]

The phenotype file (--pheno) and genotype prefix (--bfile) are required. The model method (--method) is listed under Options with a default of RR, but it is normally set explicitly.

Method Selection

Choose a method based on your goal before tuning other parameters.

| Method | Use when | Trade-off |
| --- | --- | --- |
| `RR` | All SNPs are assumed to have non-zero effects; use as a baseline. | Stable and simple, but weak variable selection. |
| `R` | You expect a mixture of effect sizes and want flexible shrinkage. | Better accuracy in many traits, with moderate runtime. |
| `B` / `C` | You expect many near-zero SNP effects and want explicit variable selection. | Stronger sparsity, but more sensitive to prior settings. |
| `A` | You want all SNPs included with SNP-specific shrinkage. | More MCMC sampling cost than `RR`. |
| `Rd` | You want to model dominance effects alongside additive effects. | More parameters and longer runtime. |
| `Bpi` / `Cpi` / `Rpi` | You want the model to estimate mixture proportions from the data. | More adaptive, but may require longer chains for stable estimates. |

If you are unsure, start with RR to establish a baseline, then try R as a stronger default for production runs.
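The baseline-then-R workflow can be sketched as a small dry-run helper that prints one command per method instead of running it. The function name print_fit_cmds and the model_<method> output prefixes are illustrative assumptions; train_data and phenotypes.tsv are the placeholder inputs used elsewhere on this page.

```shell
#!/bin/sh
# Print (do not run) one fit command per method so the plan can be reviewed first.
# print_fit_cmds and the model_<method> prefixes are hypothetical names.
print_fit_cmds() {
    for method in "$@"; do
        printf 'gelex fit -b train_data -p phenotypes.tsv -m %s -o model_%s\n' \
            "$method" "$method"
    done
}

# Baseline first, then the accuracy-oriented method.
print_fit_cmds RR R
```

Piping the printed lines to sh would run them in order; keeping a separate output prefix per method avoids overwriting the baseline results.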

Options

Quick Start Options

-p, --pheno (required): Phenotype TSV file (FID IID trait1 ...).
-b, --bfile (required): PLINK binary prefix (.bed/.bim/.fam).
-m, --method (default: RR): Modeling method. Start with RR (baseline) or R (accuracy-oriented).
-o, --out (default: gelex): Output prefix for generated files.

Input Options

-p, --pheno (required): Phenotype TSV file with columns FID IID trait1 ...
--pheno-col (default: 2): 0-based index of the trait column in the phenotype file. With the layout above, 2 selects the first trait.
-b, --bfile (required): PLINK binary prefix (.bed/.bim/.fam).
--qcovar: Quantitative covariate TSV with columns FID IID covar1 ...
--dcovar: Categorical covariate TSV with columns FID IID factor1 ...
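Because --pheno-col is 0-based, it can help to confirm which trait an index points at before fitting. The sketch below is an assumption-laden helper, not part of gelex: it assumes the phenotype file is tab-separated and that its first line is a header (FID, IID, then trait names). The name pheno_col_name is hypothetical.

```shell
#!/bin/sh
# Print the header name found at a 0-based column index of a tab-separated file.
# Assumes the first line is a header; exits non-zero if the index is out of range.
# Usage: pheno_col_name phenotypes.tsv 2
pheno_col_name() {
    awk -F'\t' -v idx="$2" '
        NR == 1 {
            if (idx + 1 <= NF) { print $(idx + 1); exit 0 }
            exit 1
        }' "$1"
}
```

With a header of FID, IID, height, weight, index 2 would name height and index 3 would name weight.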

Model Options

-m, --method (default: RR): BayesAlphabet method. Supported: A/B/C/R/RR; add d for dominance (for example Rd); add pi to estimate mixture proportions (for example Cpi).
--geno-method (default: OrthStandardizeHWE): Genotype processing method. Available methods: StandardizeHWE (SH), CenterHWE (CH), OrthStandardizeHWE (OSH), OrthCenterHWE (OCH), Standardize (S), Center (C), OrthStandardize (OS), OrthCenter (OC). Abbreviations are accepted. See Genotype Processing Methods.
--scale (default: 0 0.001 0.01 0.1 1): Additive variance scales, typically used in BayesR-style models.
--pi (default: 0.99 0.01): Additive mixture proportions. For BayesR, the default is 0.99 0.005 0.003 0.001 0.001.
--dscale (default: 0 0.001 0.01 0.1 1): Dominance variance scales for dominance-enabled models.
--dpi (default: 0.99 0.01): Dominance mixture proportions. For BayesR dominance models, the default is 0.99 0.005 0.003 0.001 0.001.
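Mixture proportions describe shares of SNPs across components, so the values passed to --pi or --dpi would normally sum to 1, as both defaults listed above do. The following pre-flight check is a sketch under that assumption; whether gelex itself rejects or renormalizes other inputs is not stated here. The name props_sum_to_one is hypothetical.

```shell
#!/bin/sh
# Succeed (exit 0) if the given proportions sum to 1 within a small tolerance.
# The tolerance absorbs floating-point rounding in the sum.
props_sum_to_one() {
    printf '%s\n' "$@" | awk '
        { s += $1 }
        END {
            d = s - 1
            if (d < 0) d = -d
            if (d < 1e-9) exit 0
            exit 1
        }'
}

# Check the BayesR default before passing it to --pi.
props_sum_to_one 0.99 0.005 0.003 0.001 0.001 && echo "proportions ok"
```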

MCMC Options

--iters (default: 3000): Total MCMC iterations.
--burnin (default: 2000): Initial iterations discarded before sampling.
--thin (default: 1): Keep one sample every --thin iterations.
--seed (default: 42): Random seed for reproducible MCMC.
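To see how these three settings interact, the number of retained posterior samples can be estimated as (iters - burnin) / thin, rounded down. This is a back-of-the-envelope sketch that assumes --iters includes the burn-in iterations, as the descriptions above suggest; gelex's exact counting may differ by one. The name retained_samples is hypothetical.

```shell
#!/bin/sh
# Estimate retained samples as floor((iters - burnin) / thin).
# Usage: retained_samples ITERS BURNIN THIN
retained_samples() {
    [ "$3" -gt 0 ] || return 1          # thin must be a positive integer
    [ "$1" -ge "$2" ] || return 1       # burn-in cannot exceed total iterations
    echo $(( ($1 - $2) / $3 ))
}

retained_samples 3000 2000 1     # the defaults listed above
retained_samples 50000 10000 5   # a longer, thinned chain
```

Under these assumptions the defaults keep 1000 samples, while 50000 iterations with a 10000 burn-in and thinning of 5 keep 8000.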

Performance and Output

-c, --chunk-size (default: 10000): Number of SNPs per processing chunk. Lower values reduce peak memory.
-t, --threads (default: 12): Number of CPU threads to use.
--mmap (default: false): Enable memory-mapped I/O. Usually lowers RAM pressure and may reduce speed.
-o, --out (default: gelex): Output prefix for all generated files.

Output Files

After a successful run, look first for the files that start with your output prefix.

| File pattern | Contents | Typical next step |
| --- | --- | --- |
| `<out>.snp.eff` | Estimated SNP effects | Use with `gelex predict --snp-eff` |
| `<out>.param` | Estimated fixed/covariate effects and model parameters | Optional input for `gelex predict --covar-eff` |
| `<out>*` | Run logs and model-specific artifacts | Review convergence and configuration used |
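In a scripted pipeline, it can be useful to confirm that the two named outputs exist before moving on to prediction. The helper below is a minimal sketch; check_fit_outputs is a hypothetical name, and it only tests that the files are present and non-empty, not that their contents are valid.

```shell
#!/bin/sh
# Return 0 if <prefix>.snp.eff and <prefix>.param both exist and are non-empty.
# Usage: check_fit_outputs model_rr
check_fit_outputs() {
    prefix=$1
    status=0
    for suffix in snp.eff param; do
        if [ ! -s "${prefix}.${suffix}" ]; then
            echo "missing or empty: ${prefix}.${suffix}" >&2
            status=1
        fi
    done
    return "$status"
}
```

A typical use would be to call it with the same prefix that was passed to -o and stop the pipeline if it returns non-zero.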

Warnings and Notes

Note

For many datasets, a practical starting point is --burnin around 20%-50% of --iters. Increase --iters when posterior summaries are unstable across runs.
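The percentage rule in this note can be turned into a concrete --burnin value with integer arithmetic. This is only a convenience sketch; burnin_for is a hypothetical helper name, and the percentage to use remains a judgment call for your data.

```shell
#!/bin/sh
# Compute a burn-in length as a whole-number percentage of the total iterations.
# Usage: burnin_for ITERS PERCENT
burnin_for() {
    echo $(( $1 * $2 / 100 ))
}

iters=50000
burnin=$(burnin_for "$iters" 20)
# Print the options that could be appended to a gelex fit command.
echo "--iters $iters --burnin $burnin"
```

At 20% of 50000 iterations this gives a burn-in of 10000.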

Note

If memory is limited, reduce --chunk-size first, then enable --mmap. This usually lowers RAM usage with a possible runtime penalty.

Examples

Quick Start Baseline (RR)

gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m RR \
   -o model_rr

Expected outputs: model_rr.snp.eff, model_rr.param.

Accuracy-Oriented Training (R)

gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m R \
   -o model_bayesr

Expected outputs: model_bayesr.snp.eff, model_bayesr.param.

Sparse Effects with Variable Selection (B)

gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m B \
   --pi 0.99 0.01 \
   -o model_bayesb

Add Fixed Effects (qcovar + dcovar)

gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m R \
   --dcovar sex.tsv \
   --qcovar age.tsv \
   -o model_covar

Longer MCMC for Stable Posterior Estimates

gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m R \
   --iters 50000 \
   --burnin 10000 \
   --thin 5 \
   -o model_high_prec

Additive + Dominance Model (Rd)

gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m Rd \
   --dscale 0.0001 0.001 0.01 0.1 1.0 \
   --dpi 0.95 0.05 \
   -o model_dom

Estimate Mixture Proportions (Cpi)

gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m Cpi \
   --pi 0.9 0.1 \
   --scale 0.01 0.1 \
   -o model_cpi
