Genomic Selection Tutorial

Quick Start

For a standard Genomic Selection pipeline using BayesR:

Model Fitting (Train)

gelex fit \
  --bfile train_data \
  --pheno train_pheno.tsv \
  --method R \
  --iters 20000 --burnin 5000 \
  --out trained_model

Genomic Prediction (Predict)

gelex predict \
  --bfile target_data \
  --snp-eff trained_model.snp.eff \
  --out predictions.tsv

Background

Genomic Selection (GS) uses genome-wide markers to predict complex traits. Unlike GWAS, which focuses on identifying specific significant variants, GS aims to capture the total genetic value of an individual by simultaneously estimating the effects of all markers, even those with small effects.
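
To make "simultaneously estimating the effects of all markers" concrete, here is a minimal sketch using ridge regression solved by Gauss-Seidel sweeps. It is not Gelex's MCMC sampler: the function name `ridge_gauss_seidel`, the shrinkage value `lam`, and the simulated genotypes are all assumptions chosen for illustration.

```python
import random

def ridge_gauss_seidel(X, y, lam, sweeps=200):
    """Jointly estimate all marker effects by Gauss-Seidel sweeps on (X'X + lam*I) b = X'y."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    resid = list(y)  # running residual y - X*beta
    xtx = [sum(X[i][j] ** 2 for i in range(n)) for j in range(m)]
    for _ in range(sweeps):
        for j in range(m):
            # Right-hand side with marker j's own contribution added back in
            rhs = sum(X[i][j] * resid[i] for i in range(n)) + xtx[j] * beta[j]
            new = rhs / (xtx[j] + lam)
            delta = new - beta[j]
            for i in range(n):
                resid[i] -= X[i][j] * delta
            beta[j] = new
    return beta

# Simulated toy data (hypothetical): 40 individuals, 10 markers coded 0/1/2
rng = random.Random(1)
n_ind, n_snp = 40, 10
raw = [[rng.choice((0, 1, 2)) for _ in range(n_snp)] for _ in range(n_ind)]
means = [sum(row[j] for row in raw) / n_ind for j in range(n_snp)]
X = [[row[j] - means[j] for j in range(n_snp)] for row in raw]  # centred dosages
true_beta = [rng.gauss(0.0, 0.3) for _ in range(n_snp)]
y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0.0, 0.5) for row in X]

beta_hat = ridge_gauss_seidel(X, y, lam=5.0)
```

Every marker receives an estimate in the same fit, shrunk toward zero by `lam`, rather than being tested one at a time as in GWAS.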

Gelex implements the BayesAlphabet family of models (BayesA, BayesB, BayesC, BayesR, BayesRR), which use different prior distributions to model the genetic architecture of traits ranging from simple (oligogenic) to complex (polygenic).

Workflow Overview

A typical GS analysis involves two main steps:

1. Model Fitting (`fit`): Train a Bayesian model on a reference population with both genotypes and phenotypes to estimate marker effects.
2. Prediction (`predict`): Apply the estimated marker effects to the genotypes of a target population (candidates) to predict their Genomic Estimated Breeding Values (GEBVs) or phenotypic values.

Step 1: Model Fitting

The first step is to fit a model to your training data. This estimates the effect of each SNP and of any fixed-effect covariates.

Choose a Method

BayesRR / Ridge Regression: Assumes every SNP has a non-zero effect drawn from a single normal distribution with a common variance. Good for highly polygenic traits.
BayesA: Assumes every SNP has a non-zero effect but gives each SNP its own variance, which yields a heavy-tailed prior that tolerates a few large effects.
BayesB / BayesC: Variable selection models in which a proportion of SNPs, governed by the pi parameter, have exactly zero effect. BayesB gives each non-zero SNP its own variance; BayesC uses a common variance.
BayesR: A flexible mixture model in which each SNP effect is drawn from one of several zero-mean normal distributions with different variances (e.g., zero, small, medium, large). It often performs well across diverse genetic architectures.
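
The mixture idea behind BayesR can be sketched by drawing effects from such a prior. The variance scales below follow the four-class setup commonly described for BayesR (0, 1e-4, 1e-3 and 1e-2 times the genetic variance); the class proportions are arbitrary illustrative values, and neither is confirmed as a Gelex default.

```python
import random

# Assumed illustrative settings, not taken from Gelex
MIXTURE_PROPS = (0.95, 0.02, 0.02, 0.01)    # class probabilities, sum to 1
VARIANCE_SCALES = (0.0, 1e-4, 1e-3, 1e-2)   # multiples of the genetic variance

def draw_bayesr_effects(n_snps, genetic_var, rng):
    """Draw SNP effects from a mixture of zero-mean normals, one class per SNP."""
    effects, classes = [], []
    for _ in range(n_snps):
        k = rng.choices(range(len(MIXTURE_PROPS)), weights=MIXTURE_PROPS)[0]
        sd = (VARIANCE_SCALES[k] * genetic_var) ** 0.5
        # Class 0 has zero variance, so its effect is exactly zero
        effects.append(rng.gauss(0.0, sd) if sd > 0 else 0.0)
        classes.append(k)
    return effects, classes

rng = random.Random(42)
effects, classes = draw_bayesr_effects(10_000, genetic_var=1.0, rng=rng)
zero_fraction = classes.count(0) / len(classes)
```

Setting one class probability to 1 collapses this to a single normal, as in ridge regression; keeping only the zero class and one non-zero class gives a BayesC-like prior.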

Basic Usage

To fit a BayesR model:

Fit BayesR Model

gelex fit \
  --bfile train_genotypes \
  --pheno phenotypes.tsv \
  --method R \
  --chains 4 \
  --out model_output

Handling Covariates

You can include fixed effects: discrete covariates such as sex via `--dcovar`, and quantitative covariates such as age via `--qcovar`:

Fit Model with Covariates

gelex fit \
  --bfile train_genotypes \
  --pheno phenotypes.tsv \
  --dcovar sex.tsv \
  --qcovar age.tsv \
  --method R \
  --out model_with_covars
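
The practical difference between discrete and quantitative covariates lies in how they enter the design matrix. The sketch below assumes reference-level dummy coding for the discrete covariate and an intercept column; how Gelex encodes `--dcovar` internally is not documented here, and `build_covariate_matrix` is a hypothetical helper.

```python
def build_covariate_matrix(sex, age):
    """Return (column_names, rows): intercept, dummy-coded sex, and age as a number."""
    levels = sorted(set(sex))
    others = levels[1:]  # the first level is the reference, absorbed by the intercept
    names = ["intercept"] + [f"sex={lv}" for lv in others] + ["age"]
    rows = []
    for s, a in zip(sex, age):
        dummies = [1.0 if s == lv else 0.0 for lv in others]
        rows.append([1.0] + dummies + [float(a)])
    return names, rows

# Hypothetical toy records for four individuals
names, rows = build_covariate_matrix(["F", "M", "M", "F"], [3.5, 4.0, 2.0, 5.5])
```

A discrete covariate with k levels contributes k - 1 indicator columns, while a quantitative covariate contributes a single numeric column with one slope.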

Outputs

The fit command generates several files:

<out>.snp.eff: Estimated SNP effects (used for prediction).
<out>.param: Estimated hyper-parameters and covariate effects.
<out>.log: Log of the MCMC process.

Step 2: Genomic Prediction

Once you have the estimated SNP effects (and optionally covariate effects), you can predict phenotypes for a target population.
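
Conceptually, the genetic part of a prediction is a weighted sum of each candidate's allele dosages. The sketch below uses hypothetical effects and raw 0/1/2 dosages; whether Gelex centres or scales genotypes before applying the effects is not stated in this tutorial.

```python
def polygenic_scores(dosages, snp_effects):
    """Sum of allele dosage times estimated effect, one score per individual."""
    scores = []
    for row in dosages:
        if len(row) != len(snp_effects):
            raise ValueError("each individual needs one dosage per SNP effect")
        scores.append(sum(d * b for d, b in zip(row, snp_effects)))
    return scores

# Hypothetical toy data: 3 candidates x 4 SNPs
snp_effects = [0.10, -0.05, 0.00, 0.20]
dosages = [
    [0, 1, 2, 1],
    [2, 2, 0, 0],
    [1, 0, 1, 2],
]
scores = polygenic_scores(dosages, snp_effects)
```

The length check mirrors a practical requirement: the target genotypes must cover the same SNPs, in the same allele coding, as the effects file.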

Basic Usage

Basic Prediction

gelex predict \
  --bfile target_genotypes \
  --snp-eff model_output.snp.eff \
  --out predicted_values.tsv

Using Covariate Effects

If your training model included covariates, you can also use those estimated effects for prediction, provided the target population has the corresponding covariate data:

Prediction with Covariates

gelex predict \
  --bfile target_genotypes \
  --snp-eff model_with_covars.snp.eff \
  --covar-eff model_with_covars.param \
  --qcovar target_age.tsv \
  --dcovar target_sex.tsv \
  --out predicted_values_with_covars.tsv
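
As a rough sketch of what adding covariate effects means, the function below adds fixed-effect contributions to a polygenic score. The intercept, the effect values, and the name `total_value` are hypothetical stand-ins for estimates read from a fitted model; the layout of the `.param` file is not assumed.

```python
def total_value(prs, intercept, sex, age, sex_effects, age_effect):
    """Add fixed-effect contributions to a polygenic score."""
    return prs + intercept + sex_effects[sex] + age_effect * age

# Hypothetical estimates; "F" is treated as the reference level
sex_effects = {"F": 0.0, "M": 0.25}
total = total_value(prs=0.15, intercept=1.0, sex="M", age=4.0,
                    sex_effects=sex_effects, age_effect=0.05)
```

This is why the target population needs the same covariates as the training data: a sex level or age value with no matching estimate cannot contribute to the prediction.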

Output Format

The output file is a TSV containing the predicted values:

Example Output

FID    IID    PRS          Total_Value
fam1   ind1   1.234e-01    1.543e-01
fam1   ind2   -0.456e-01   -0.123e-01
...

PRS: Polygenic Risk Score, the sum over SNPs of each individual's allele dosage multiplied by the estimated SNP effect.
Total_Value: PRS plus the covariate effects, when these are provided.
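
For downstream use, the prediction table can be loaded and ranked with a few lines of Python. This sketch assumes the file is tab-delimited with the four column names shown above; `SAMPLE` reproduces the two example rows, and `read_predictions` is a hypothetical helper.

```python
import csv
import io

# The two example rows from above, written as tab-separated text
SAMPLE = (
    "FID\tIID\tPRS\tTotal_Value\n"
    "fam1\tind1\t1.234e-01\t1.543e-01\n"
    "fam1\tind2\t-0.456e-01\t-0.123e-01\n"
)

def read_predictions(handle):
    """Parse the prediction table into dicts with numeric PRS and Total_Value."""
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["PRS"] = float(row["PRS"])
        row["Total_Value"] = float(row["Total_Value"])
        records.append(row)
    return records

records = read_predictions(io.StringIO(SAMPLE))
# Rank candidates from highest to lowest predicted value
ranked = sorted(records, key=lambda r: r["Total_Value"], reverse=True)
# The difference between the two columns is the covariate contribution
covariate_part = [r["Total_Value"] - r["PRS"] for r in records]
```

To read a real file, replace `io.StringIO(SAMPLE)` with `open("predicted_values.tsv", newline="")`.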