Genomic Selection Tutorial¶
Quick Start
For a standard Genomic Selection pipeline using BayesR:
gelex fit \
--bfile train_data \
--pheno train_pheno.tsv \
--method R \
--iters 20000 --burnin 5000 \
--out trained_model
gelex predict \
--bfile target_data \
--snp-eff trained_model.snp.eff \
--out predictions.tsv
Background¶
Genomic Selection (GS) uses genome-wide markers to predict complex traits. Unlike GWAS, which focuses on identifying specific significant variants, GS aims to capture the total genetic value of an individual by simultaneously estimating the effects of all markers, even those with small effects.
Gelex implements the BayesAlphabet family of models (BayesA, BayesB, BayesC, BayesR, BayesRR), which use different prior distributions to model the genetic architecture of traits ranging from simple (oligogenic) to complex (polygenic).
Workflow Overview¶
A typical GS analysis involves two main steps:
Model Fitting (`fit`): Train a Bayesian model on a reference population with both genotypes and phenotypes to estimate marker effects.
Prediction (`predict`): Apply the estimated marker effects to the genotypes of a target population (candidates) to predict their Genomic Estimated Breeding Values (GEBVs) or phenotypic values.
Step 1: Model Fitting¶
The first step is to fit a model to your training data. This estimates the effect size of each SNP and any fixed covariates.
See also
See fit for a full list of options.
Choose a Method¶
BayesRR / Ridge Regression: Assumes all SNPs have non-zero effects drawn from a normal distribution. Good for highly polygenic traits.
BayesA: Assumes all SNPs have non-zero effects but allows for different variances (heavy-tailed).
BayesB / BayesC: Variable selection models that assume a proportion of SNPs have zero effect (pi parameter).
BayesR: A flexible mixture model that assumes SNP effects come from a mixture of normal distributions with different variances (e.g., zero, small, medium, large). Often provides the highest accuracy across diverse genetic architectures.
Basic Usage¶
To fit a BayesR model:
gelex fit \
--bfile train_genotypes \
--pheno phenotypes.tsv \
--method R \
--chains 4 \
--out model_output
Handling Covariates¶
You can include fixed effects such as sex (discrete) or age (quantitative):
gelex fit \
--bfile train_genotypes \
--pheno phenotypes.tsv \
--dcovar sex.tsv \
--qcovar age.tsv \
--method R \
--out model_with_covars
Outputs¶
The fit command generates several files:
<out>.snp.eff: Estimated SNP effects (used for prediction).<out>.param: Estimated hyper-parameters and covariate effects.<out>.log: Log of the MCMC process.
Step 2: Genomic Prediction¶
Once you have the estimated SNP effects (and optionally covariate effects), you can predict phenotypes for a target population.
See also
See predict for detailed options.
Basic Usage¶
gelex predict \
--bfile target_genotypes \
--snp-eff model_output.snp.eff \
--out predicted_values.tsv
Using Covariate Effects¶
If your training model included covariates, you can also use those estimated effects for prediction, provided the target population has the corresponding covariate data:
gelex predict \
--bfile target_genotypes \
--snp-eff model_with_covars.snp.eff \
--covar-eff model_with_covars.param \
--qcovar target_age.tsv \
--dcovar target_sex.tsv \
--out predicted_values_with_covars.tsv
Output Format¶
The output file is a TSV containing the predicted values:
FID IID PRS Total_Value
fam1 ind1 1.234e-01 1.543e-01
fam1 ind2 -0.456e-01 -0.123e-01
...
PRS: Polygenic Risk Score (sum of SNP effects).
Total_Value: PRS + Covariate Effects (if provided).