Data Formats
============

Gelex supports standard bioinformatics formats for genomic analysis. Most inputs are PLINK binaries or tabular text files, and outputs are command-specific TSV-style summaries.

Quick Reference
---------------

Use this section to quickly map your files to Gelex commands and the relevant detail section.

.. list-table:: Input files at a glance
   :header-rows: 1
   :widths: 22 28 25 25

   * - Data type
     - Files
     - Used by
     - Details
   * - Genotype
     - ``.bed`` + ``.bim`` + ``.fam``
     - ``fit``, ``assoc``, ``predict``
     - :ref:`genotype-format`
   * - Phenotype
     - ``.tsv`` or space-separated text
     - ``fit``, ``assoc``
     - :ref:`phenotype-format`
   * - Quantitative covariates
     - ``.tsv`` or space-separated text
     - ``fit``, ``predict``
     - :ref:`covariate-format`
   * - Discrete covariates
     - ``.tsv`` or space-separated text
     - ``fit``, ``predict``
     - :ref:`covariate-format`
   * - GRM
     - ``.grm.bin`` + ``.grm.id``
     - ``gblup``-style workflows
     - :ref:`grm-format`

.. list-table:: Output files at a glance
   :header-rows: 1
   :widths: 22 28 25 25

   * - Generated by
     - Output file
     - Typical content
     - Details
   * - ``fit``
     - ``.snp.eff``
     - SNP posterior effects
     - :ref:`snp-eff-format`
   * - ``fit``
     - ``.param``
     - Model-level posterior summaries
     - :ref:`param-format`
   * - ``assoc``
     - ``.gwas.tsv``
     - GWAS summary statistics
     - :ref:`gwas-output-format`
   * - ``predict``
     - ``.pred.tsv``
     - Individual predictions
     - :ref:`predict-output-format`

Input Data Formats
------------------

.. _genotype-format:

Genotype Data (PLINK Binary)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gelex uses PLINK 1.x binary files as its primary genotype input (see the PLINK 1.9 file format documentation).

.. list-table:: Required files for ``--bfile``
   :header-rows: 1
   :widths: 20 80

   * - Extension
     - Description
   * - **.bed**
     - Binary file containing genotype calls.
   * - **.bim**
     - Text file containing SNP metadata (chromosome, ID, position, alleles).
   * - **.fam**
     - Text file containing sample metadata (FID, IID, parents, sex, phenotype).

.. important::

   When specifying genotype input with ``--bfile``, provide the **prefix only**. Gelex automatically searches for ``.bed``, ``.bim``, and ``.fam``.

   Example: ``--bfile mydata`` loads ``mydata.bed``, ``mydata.bim``, and ``mydata.fam``.

Related command docs: :ref:`fit-command`, :ref:`assoc-command`, :ref:`predict-command`.

.. _phenotype-format:

Phenotype Data
~~~~~~~~~~~~~~

Phenotype files should be tab-separated (TSV) or space-separated text with a required header row.

.. list-table:: Phenotype file requirements
   :header-rows: 1
   :widths: 20 80

   * - Component
     - Requirement
   * - **Format**
     - ``FID IID Trait1 Trait2 ...``
   * - **Header**
     - A header row is required.
   * - **Missing values**
     - Use ``-9``, ``NA``, or ``nan``.

.. tip::

   By default, Gelex reads the **3rd column** (index 2) as the trait. Use ``--pheno-col`` to select a different trait column (1-based index or column name).

Example:

.. code-block:: text

   FID   IID   Height  Weight
   1001  1001  175.5   70.2
   1002  1002  168.0   -9
   1003  1003  NA      65.5

Related command docs: :ref:`fit-command`, :ref:`assoc-command`.

.. _covariate-format:

Covariate Data
~~~~~~~~~~~~~~

Gelex supports two covariate types. Both use the same base layout as phenotype files: ``FID``, ``IID``, followed by one or more covariate columns.

.. list-table:: Covariate types
   :header-rows: 1
   :widths: 24 38 38

   * - Type
     - Use case
     - CLI option
   * - Quantitative covariates
     - Continuous variables (age, BMI, PCs)
     - ``--qcovar``
   * - Discrete covariates
     - Categorical variables (sex, batch, site)
     - ``--dcovar``

Quantitative Covariates (qcovar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use for continuous variables.

.. code-block:: text

   FID   IID   Age  PC1     PC2
   1001  1001  45   -0.012  0.045
   1002  1002  32   0.005   -0.021

Discrete Covariates (dcovar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use for categorical variables. Gelex internally converts them into dummy variables (one-hot encoding).

.. code-block:: text

   FID   IID   Sex  Site
   1001  1001  1    A
   1002  1002  2    B
   1003  1003  1    B

Related command docs: :ref:`fit-command`, :ref:`predict-command`.

.. _grm-format:

Genomic Relationship Matrix (GRM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gelex uses GRM files in the **GCTA binary format**.

.. list-table:: GRM files
   :header-rows: 1
   :widths: 20 80

   * - Extension
     - Description
   * - **.grm.bin**
     - Binary file containing the lower triangle of the relationship matrix.
   * - **.grm.id**
     - Text file containing the ``FID`` and ``IID`` of matrix samples.

.. note::

   For ``--grm``, you can pass either a prefix (for example ``my_grm``) or the full path to the binary file.

Output Data Formats
-------------------

Gelex generates structured output files based on the command you run.

Command-to-Output Mapping
~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 20 35 45

   * - Command
     - Main output file(s)
     - Description
   * - ``fit``
     - ``.snp.eff``, ``.param``
     - Posterior SNP effects and model-level summaries.
   * - ``assoc``
     - ``.gwas.tsv``
     - SNP-level association statistics.
   * - ``predict``
     - ``.pred.tsv``
     - Individual-level predicted values and components.

See :ref:`fit-command`, :ref:`assoc-command`, and :ref:`predict-command` for command options.

.. _snp-eff-format:

SNP Effects (.snp.eff)
~~~~~~~~~~~~~~~~~~~~~~

Generated by ``fit``. Contains posterior estimates for each SNP.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``Index``
     - 1-based SNP index.
   * - ``ID``
     - SNP identifier.
   * - ``Chrom``
     - Chromosome name.
   * - ``Position``
     - Base-pair position.
   * - ``A1, A2``
     - Effect allele and reference allele.
   * - ``A1Freq``
     - Frequency of the A1 allele.
   * - ``Add, AddSE``
     - Posterior mean and standard deviation of additive effect.
   * - ``AddPVE``
     - Proportion of variance explained by additive effect.
   * - ``PIP``
     - Posterior Inclusion Probability (for mixture models).
   * - ``Dom, DomSE``
     - (Optional) Dominance effect posterior mean and standard deviation.
   * - ``DomPVE, PIP``
     - (Optional) Dominance-specific PVE and PIP.
   * - ``pi_k``
     - (Optional) Posterior probability for the k-th mixture component.

.. _param-format:

Model Parameters (.param)
~~~~~~~~~~~~~~~~~~~~~~~~~

Generated by ``fit``. Summarizes model-wide parameters.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``term``
     - Parameter name (for example ``σ²_add``, ``h²``, ``Intercept``).
   * - ``mean``
     - Posterior mean.
   * - ``stddev``
     - Posterior standard deviation.
   * - ``5%, 95%``
     - 90% credible interval boundaries.
   * - ``ess``
     - Effective Sample Size.
   * - ``rhat``
     - Potential Scale Reduction Factor (Gelman-Rubin diagnostic).

.. _gwas-output-format:

GWAS Results (.gwas.tsv)
~~~~~~~~~~~~~~~~~~~~~~~~

Generated by ``assoc``. Standard summary statistics for association testing.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``CHR, SNP, BP``
     - Chromosome, SNP ID, and base-pair position.
   * - ``A1, A2``
     - Effect allele and other allele.
   * - ``A1FREQ``
     - Effect allele frequency.
   * - ``BETA``
     - Effect size estimate (Wald test).
   * - ``SE``
     - Standard error of the effect size.
   * - ``P``
     - P-value from the Wald test.

.. _predict-output-format:

Prediction Results (.pred.tsv)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generated by ``predict``. Contains individual prediction components.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``FID, IID``
     - Individual identifiers (FID omitted if ``--iid-only``).
   * - ``prediction``
     - Total predicted value (Additive + Dominance + Covariates).
   * - ``{covar}``
     - Estimated contribution from each covariate.
   * - ``additive``
     - Predicted additive genetic value (Estimated Breeding Value).
   * - ``dominant``
     - (Optional) Predicted dominance genetic value.
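
Because ``prediction`` is described above as the sum of the additive, dominance, and covariate components, a quick sanity check on a ``.pred.tsv`` file can be sketched in Python. This is only an illustration under stated assumptions: the file is tab-separated with a header row, every column other than ``FID``, ``IID``, and ``prediction`` is a numeric component, and the sample rows, the ``Age`` column, and the helper name ``check_components`` are hypothetical and not part of Gelex.

.. code-block:: python

   import csv
   import io
   import math

   # Hypothetical .pred.tsv content for illustration only; real files are
   # produced by the ``predict`` command and may contain other columns.
   SAMPLE = (
       "FID\tIID\tprediction\tAge\tadditive\tdominant\n"
       "1001\t1001\t3.75\t2.5\t1.5\t-0.25\n"
       "1002\t1002\t2.0\t1.0\t0.75\t0.25\n"
   )

   # Columns that are identifiers or the total, not components (assumption).
   NON_COMPONENT = {"FID", "IID", "prediction"}


   def check_components(handle, tol=1e-6):
       """Return the IIDs whose component columns do not sum to ``prediction``."""
       reader = csv.DictReader(handle, delimiter="\t")
       mismatched = []
       for row in reader:
           total = float(row["prediction"])
           # Treat every remaining column as a numeric component.
           parts = [float(v) for k, v in row.items() if k not in NON_COMPONENT]
           if not math.isclose(sum(parts), total, abs_tol=tol):
               mismatched.append(row["IID"])
       return mismatched


   print(check_components(io.StringIO(SAMPLE)))  # prints []

To run the same check on a real file, pass an open file handle instead of the in-memory sample. If the output contains further non-numeric columns, or if ``--iid-only`` changes the layout in ways not covered here, the ``NON_COMPONENT`` set would need adjusting.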