fit

Train SNP effect models for genomic prediction using BayesAlphabet methods.

Use this command when you want to learn marker effects from training data, then reuse those effects in predict.

Basic Syntax

Minimum Working Command
gelex fit -b train_data -p phenotypes.tsv -m RR -o model_rr
Full Syntax Template
gelex fit --pheno <pheno_file> --bfile <genotype_prefix> --method <method> [OPTIONS]

Required inputs are phenotype file (--pheno), genotype prefix (--bfile), and model method (--method).

Method Selection

Choose a method based on your goal before tuning other parameters.

Method

Use when

Trade-off

RR

All SNPs are assumed to have non-zero effects; use as a baseline.

Stable and simple, but weak variable selection.

R

You expect a mixture of effect sizes and want flexible shrinkage.

Better accuracy in many traits, with moderate runtime.

B / C

You expect many near-zero SNP effects and want explicit variable selection.

Stronger sparsity, but more sensitive to prior settings.

A

You want all SNPs included with SNP-specific shrinkage.

More MCMC sampling cost than RR.

Rd

You want to model dominance effects alongside additive effects.

More parameters and longer runtime.

Bpi / Cpi / Rpi

You want the model to estimate mixture proportions from the data.

More adaptive, but may require longer chains for stable estimates.

If you are unsure, start with RR to establish a baseline, then try R as a stronger default for production runs.

Options

Quick Start Options

-p, --pheno required

Phenotype TSV file (FID IID trait1 ...).

-b, --bfile required

PLINK binary prefix (.bed/.bim/.fam).

-m, --method RR

Modeling method. Start with RR (baseline) or R (accuracy-oriented).

-o, --out gelex

Output prefix for generated files.

Input Options

-p, --pheno required

Phenotype TSV file in format FID IID trait1 ....

--pheno-col 2

0-based trait column index in the phenotype file.

-b, --bfile required

PLINK binary prefix (.bed/.bim/.fam).

--qcovar

Quantitative covariate TSV in format FID IID covar1 ....

--dcovar

Categorical covariate TSV in format FID IID factor1 ....

Model Options

-m, --method RR

BayesAlphabet method. Supported: A/B/C/R/RR; add d for dominance (for example Rd); add pi to estimate mixture proportions (for example Cpi).

--geno-method OrthStandardizeHWE

Genotype processing method. Available methods: StandardizeHWE (SH), CenterHWE (CH), OrthStandardizeHWE (OSH), OrthCenterHWE (OCH), Standardize (S), Center (C), OrthStandardize (OS), OrthCenter (OC). Abbreviations accepted. See Genotype Processing Methods.

--scale 0 0.001 0.01 0.1 1

Additive variance scales, typically used in BayesR-style models.

--pi 0.99 0.01

Additive mixture proportions. For BayesR, default is 0.99 0.005 0.003 0.001 0.001.

--dscale 0 0.001 0.01 0.1 1

Dominance variance scales for dominance-enabled models.

--dpi 0.99 0.01

Dominance mixture proportions. For BayesR dominance models, default is 0.99 0.005 0.003 0.001 0.001.

MCMC Options

--iters 3000

Total MCMC iterations.

--burnin 2000

Initial iterations discarded before sampling.

--thin 1

Keep one sample every thin iterations.

--seed 42

Random seed for reproducible MCMC.

Performance and Output

-c, --chunk-size 10000

Number of SNPs per processing chunk. Lower values reduce peak memory.

-t, --threads 12

Number of CPU threads to use.

--mmap false

Enable memory-mapped I/O. Usually lowers RAM pressure and may reduce speed.

-o, --out gelex

Output prefix for all generated files.

Output Files

After a successful run, check files with your output prefix first.

File pattern

Contents

Typical next step

<out>.snp.eff

Estimated SNP effects

Use with gelex predict --snp-eff

<out>.param

Estimated fixed/covariate effects and model parameters

Optional input for gelex predict --covar-eff

<out>*

Run logs and model-specific artifacts

Review convergence and configuration used

Warnings and Notes

Note

For many datasets, a practical starting point is --burnin around 20%-50% of --iters. Increase --iters when posterior summaries are unstable across runs.

Note

If memory is limited, reduce --chunk-size first, then enable --mmap. This usually lowers RAM usage with a possible runtime penalty.

Examples

Quick Start Baseline (RR)
gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m RR \
   -o model_rr

Expected outputs: model_rr.snp.eff, model_rr.param.

Accuracy-Oriented Training (R)
gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m R \
   -o model_bayesr

Expected outputs: model_bayesr.snp.eff, model_bayesr.param.

Sparse Effects with Variable Selection (B)
gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m B \
   --pi 0.99 0.01 \
   -o model_bayesb
Add Fixed Effects (qcovar + dcovar)
gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m R \
   --dcovar sex.tsv \
   --qcovar age.tsv \
   -o model_covar
Longer MCMC for Stable Posterior Estimates
gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m R \
   --iters 50000 \
   --burnin 10000 \
   --thin 5 \
   -o model_high_prec
Additive + Dominance Model (Rd)
gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m Rd \
   --dscale 0.0001 0.001 0.01 0.1 1.0 \
   --dpi 0.95 0.05 \
   -o model_dom
Estimate Mixture Proportions (Cpi)
gelex fit \
   -b train_data \
   -p phenotypes.tsv \
   -m Cpi \
   --pi 0.9 0.1 \
   --scale 0.01 0.1 \
   -o model_cpi

See Also

  • predict for applying trained effects to target samples.

  • assoc for SNP-wise association analysis.

  • grm for genomic relationship matrix construction.