Data Formats
============

Gelex reads standard genomics file formats. Most inputs are PLINK 1.x
binary filesets or tabular text files, and each command writes its own
TSV-style summary files.

Quick Reference
---------------

Use this section to quickly map your files to Gelex commands and the
relevant detail section.

.. list-table:: Input files at a glance
   :header-rows: 1
   :widths: 22 28 25 25

   * - Data type
     - Files
     - Used by
     - Details
   * - Genotype
     - ``.bed`` + ``.bim`` + ``.fam``
     - ``fit``, ``assoc``, ``predict``
     - :ref:`genotype-format`
   * - Phenotype
     - ``.tsv`` or space-separated text
     - ``fit``, ``assoc``
     - :ref:`phenotype-format`
   * - Quantitative covariates
     - ``.tsv`` or space-separated text
     - ``fit``, ``predict``
     - :ref:`covariate-format`
   * - Discrete covariates
     - ``.tsv`` or space-separated text
     - ``fit``, ``predict``
     - :ref:`covariate-format`
   * - GRM
     - ``.grm.bin`` + ``.grm.id``
     - ``gblup``-style workflows
     - :ref:`grm-format`

.. list-table:: Output files at a glance
   :header-rows: 1
   :widths: 22 28 25 25

   * - Generated by
     - Output file
     - Typical content
     - Details
   * - ``fit``
     - ``.snp.eff``
     - SNP posterior effects
     - :ref:`snp-eff-format`
   * - ``fit``
     - ``.param``
     - Model-level posterior summaries
     - :ref:`param-format`
   * - ``assoc``
     - ``.gwas.tsv``
     - GWAS summary statistics
     - :ref:`gwas-output-format`
   * - ``predict``
     - ``.pred.tsv``
     - Individual predictions
     - :ref:`predict-output-format`

Input Data Formats
------------------

.. _genotype-format:

Genotype Data (PLINK Binary)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gelex uses PLINK 1.x binary files as its primary genotype input
(`PLINK 1.9 formats <https://www.cog-genomics.org/plink/1.9/formats#bed>`_).

.. list-table:: Required files for ``--bfile``
   :header-rows: 1
   :widths: 20 80

   * - Extension
     - Description
   * - **.bed**
     - Binary file containing genotype calls.
   * - **.bim**
     - Text file containing SNP metadata (chromosome, ID, position, alleles).
   * - **.fam**
     - Text file containing sample metadata (FID, IID, parents, sex, phenotype).

.. important::
   When specifying genotype input with ``--bfile``, provide the **prefix only**.
   Gelex automatically searches for ``.bed``, ``.bim``, and ``.fam``.

Example:
   ``--bfile mydata`` loads ``mydata.bed``, ``mydata.bim``, and
   ``mydata.fam``.
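
Before a long run, it can help to confirm that all three files exist for a
prefix. The following is a minimal Python sketch and is not part of Gelex;
the helper name ``check_bfile`` is hypothetical. It also checks the
three-byte header that PLINK 1.x writes at the start of a SNP-major
``.bed`` file.

.. code-block:: python

   from pathlib import Path

   # Magic bytes at the start of a PLINK 1.x .bed file in SNP-major mode.
   BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


   def check_bfile(prefix):
       """Return a list of problems found for a PLINK prefix (empty if none)."""
       problems = []
       for ext in (".bed", ".bim", ".fam"):
           path = Path(f"{prefix}{ext}")
           if not path.is_file():
               problems.append(f"missing file: {path}")
       bed = Path(f"{prefix}.bed")
       if bed.is_file():
           with bed.open("rb") as handle:
               if handle.read(3) != BED_MAGIC:
                   problems.append(f"{bed} lacks the PLINK 1.x SNP-major header")
       return problems

Calling ``check_bfile("mydata")`` returns an empty list when ``mydata.bed``,
``mydata.bim``, and ``mydata.fam`` are all present and the ``.bed`` header
matches.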

Related command docs: :ref:`fit-command`, :ref:`assoc-command`,
:ref:`predict-command`.

.. _phenotype-format:

Phenotype Data
~~~~~~~~~~~~~~

Phenotype files should be tab-separated (TSV) or space-separated text
with a required header row.

.. list-table:: Phenotype file requirements
   :header-rows: 1
   :widths: 20 80

   * - Component
     - Requirement
   * - **Format**
     - ``FID IID Trait1 Trait2 ...``
   * - **Header**
     - A header row is required.
   * - **Missing values**
     - Use ``-9``, ``NA``, or ``nan``.

.. tip::
   By default, Gelex reads the **3rd column** (the first trait column after
   ``FID`` and ``IID``; index 2 when counting from zero) as the trait.
   Use ``--pheno-col`` to select a different trait column
   (1-based index or column name).

Example:

.. code-block:: text

   FID    IID    Height    Weight
   1001   1001   175.5     70.2
   1002   1002   168.0     -9
   1003   1003   NA        65.5
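
The layout and missing-value codes above can be checked with a short Python
sketch. This is an illustration only, not Gelex's own parser; the names
``read_trait`` and ``MISSING_TOKENS`` are hypothetical, and the sketch
assumes the missing codes are matched as exact strings.

.. code-block:: python

   import io

   # Missing-value codes listed in the requirements table (exact match assumed).
   MISSING_TOKENS = {"-9", "NA", "nan"}


   def read_trait(handle, trait):
       """Map (FID, IID) to a float trait value, or None when it is missing."""
       header = handle.readline().split()
       col = header.index(trait)  # raises ValueError if the trait is absent
       values = {}
       for line in handle:
           fields = line.split()  # splits on tabs or runs of spaces
           if not fields:
               continue  # skip blank lines
           token = fields[col]
           key = (fields[0], fields[1])  # first two columns are FID and IID
           values[key] = None if token in MISSING_TOKENS else float(token)
       return values


   example = io.StringIO(
       "FID IID Height Weight\n"
       "1001 1001 175.5 70.2\n"
       "1002 1002 168.0 -9\n"
       "1003 1003 NA 65.5\n"
   )
   weights = read_trait(example, "Weight")

Here ``weights[("1002", "1002")]`` is ``None`` because ``-9`` marks a missing
value, while the other two samples keep their numeric weights.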

Related command docs: :ref:`fit-command`, :ref:`assoc-command`.

.. _covariate-format:

Covariate Data
~~~~~~~~~~~~~~

Gelex supports two covariate types. Both use the same base layout as
phenotype files: ``FID``, ``IID``, followed by one or more covariate columns.

.. list-table:: Covariate types
   :header-rows: 1
   :widths: 24 38 38

   * - Type
     - Use case
     - CLI option
   * - Quantitative covariates
     - Continuous variables (age, BMI, PCs)
     - ``--qcovar``
   * - Discrete covariates
     - Categorical variables (sex, batch, site)
     - ``--dcovar``

Quantitative Covariates (qcovar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use ``--qcovar`` for continuous variables.

.. code-block:: text

   FID    IID    Age    PC1       PC2
   1001   1001   45     -0.012    0.045
   1002   1002   32     0.005     -0.021

Discrete Covariates (dcovar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use ``--dcovar`` for categorical variables. Gelex converts each one
internally into dummy variables (one-hot encoding).

.. code-block:: text

   FID    IID    Sex    Site
   1001   1001   1      A
   1002   1002   2      B
   1003   1003   1      B
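
To make the encoding concrete, the following Python sketch expands one
categorical column into one 0/1 column per level. It illustrates the general
technique only; the function name ``one_hot`` is hypothetical, and whether
Gelex drops a reference level or orders levels differently is not specified
here.

.. code-block:: python

   def one_hot(values):
       """Expand a list of category labels into one 0/1 column per level."""
       levels = sorted(set(values))
       return {level: [1 if v == level else 0 for v in values] for level in levels}


   # The Site column from the example above.
   site = ["A", "B", "B"]
   encoded = one_hot(site)
   # encoded == {"A": [1, 0, 0], "B": [0, 1, 1]}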

Related command docs: :ref:`fit-command`, :ref:`predict-command`.

.. _grm-format:

Genomic Relationship Matrix (GRM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gelex uses GRM files in the **GCTA binary format**.

.. list-table:: GRM files
   :header-rows: 1
   :widths: 20 80

   * - Extension
     - Description
   * - **.grm.bin**
     - Binary file containing the lower triangle (including the diagonal) of
       the relationship matrix.
   * - **.grm.id**
     - Text file containing the ``FID`` and ``IID`` of matrix samples.

.. note::
   For ``--grm``, you can pass either a prefix (for example ``my_grm``)
   or the full path to the binary file.
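
For inspection outside Gelex, a GRM in this layout can be loaded with a few
lines of Python. The sketch below assumes the standard GCTA convention:
little-endian 4-byte floats, stored row by row over the lower triangle
including the diagonal. The function name ``read_grm`` is hypothetical, and
the sketch accepts a prefix only.

.. code-block:: python

   import struct
   from pathlib import Path


   def read_grm(prefix):
       """Read <prefix>.grm.id and <prefix>.grm.bin into (ids, symmetric matrix)."""
       id_lines = Path(f"{prefix}.grm.id").read_text().splitlines()
       ids = [tuple(line.split()[:2]) for line in id_lines if line.strip()]
       n = len(ids)
       n_values = n * (n + 1) // 2  # lower triangle including the diagonal
       raw = Path(f"{prefix}.grm.bin").read_bytes()
       if len(raw) != 4 * n_values:
           raise ValueError(f"expected {4 * n_values} bytes, found {len(raw)}")
       flat = struct.unpack(f"<{n_values}f", raw)  # little-endian 4-byte floats
       grm = [[0.0] * n for _ in range(n)]
       k = 0
       for i in range(n):
           for j in range(i + 1):
               grm[i][j] = grm[j][i] = flat[k]  # mirror into the upper triangle
               k += 1
       return ids, grm

The size check catches the most common mismatch, a ``.grm.id`` file that does
not belong to the ``.grm.bin`` file next to it.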

Output Data Formats
-------------------

Gelex generates structured output files based on the command you run.

Command-to-Output Mapping
~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1
   :widths: 20 35 45

   * - Command
     - Main output file(s)
     - Description
   * - ``fit``
     - ``<out>.snp.eff``, ``<out>.param``
     - Posterior SNP effects and model-level summaries.
   * - ``assoc``
     - ``<out>.gwas.tsv``
     - SNP-level association statistics.
   * - ``predict``
     - ``<out>.pred.tsv``
     - Individual-level predicted values and components.

See :ref:`fit-command`, :ref:`assoc-command`, and :ref:`predict-command`
for command options.

.. _snp-eff-format:

SNP Effects (.snp.eff)
~~~~~~~~~~~~~~~~~~~~~~

Generated by ``fit``. Contains posterior estimates for each SNP.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``Index``
     - 1-based SNP index.
   * - ``ID``
     - SNP identifier.
   * - ``Chrom``
     - Chromosome name.
   * - ``Position``
     - Base-pair position.
   * - ``A1, A2``
     - Effect allele and reference allele.
   * - ``A1Freq``
     - Frequency of the A1 allele.
   * - ``Add, AddSE``
     - Posterior mean and standard deviation of the additive effect.
   * - ``AddPVE``
     - Proportion of variance explained by the additive effect.
   * - ``PIP``
     - Posterior Inclusion Probability (for mixture models).
   * - ``Dom, DomSE``
     - (Optional) Posterior mean and standard deviation of the dominance effect.
   * - ``DomPVE, PIP``
     - (Optional) Dominance-specific PVE and PIP.
   * - ``pi_k``
     - (Optional) Posterior probability for the k-th mixture component.

.. _param-format:

Model Parameters (.param)
~~~~~~~~~~~~~~~~~~~~~~~~~

Generated by ``fit``. Summarizes model-wide parameters.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``term``
     - Parameter name (for example ``σ²_add``, ``h²``, ``Intercept``).
   * - ``mean``
     - Posterior mean.
   * - ``stddev``
     - Posterior standard deviation.
   * - ``5%, 95%``
     - Lower and upper bounds of the 90% credible interval
       (5th and 95th posterior percentiles).
   * - ``ess``
     - Effective Sample Size.
   * - ``rhat``
     - Potential Scale Reduction Factor (Gelman-Rubin diagnostic); values
       close to 1 indicate convergence.
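
The ``ess`` and ``rhat`` columns can be screened automatically. The Python
sketch below is an illustration under stated assumptions: it treats the file
as tab-separated with the column names listed above, and the thresholds
(``rhat`` above 1.01, ``ess`` below 400) are common rules of thumb rather
than Gelex defaults. The function name ``flag_unconverged`` is hypothetical.

.. code-block:: python

   def flag_unconverged(handle, max_rhat=1.01, min_ess=400.0):
       """Return the terms whose rhat is too high or whose ess is too low."""
       header = handle.readline().rstrip("\n").split("\t")
       term_col = header.index("term")
       ess_col = header.index("ess")
       rhat_col = header.index("rhat")
       flagged = []
       for line in handle:
           fields = line.rstrip("\n").split("\t")
           if len(fields) < len(header):
               continue  # skip blank or truncated lines
           rhat = float(fields[rhat_col])
           ess = float(fields[ess_col])
           if rhat > max_rhat or ess < min_ess:
               flagged.append(fields[term_col])
       return flagged

Pass an open text handle for the ``.param`` file; an empty result means every
parameter passed both checks under the chosen thresholds.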

.. _gwas-output-format:

GWAS Results (.gwas.tsv)
~~~~~~~~~~~~~~~~~~~~~~~~

Generated by ``assoc``. Standard summary statistics for association testing.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``CHR, SNP, BP``
     - Chromosome, SNP ID, and base-pair position.
   * - ``A1, A2``
     - Effect allele and other allele.
   * - ``A1FREQ``
     - Effect allele frequency.
   * - ``BETA``
     - Effect size estimate used in the Wald test.
   * - ``SE``
     - Standard error of the effect size.
   * - ``P``
     - P-value from the Wald test.
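
The relationship between ``BETA``, ``SE``, and ``P`` in a Wald test can be
sketched in Python. This example assumes a two-sided test against a standard
normal reference distribution; whether Gelex uses a normal or a
t distribution is not stated here, so results may differ slightly. The
function name ``wald_p_value`` is hypothetical.

.. code-block:: python

   import math


   def wald_p_value(beta, se):
       """Two-sided p-value for H0: beta = 0 under a standard normal reference."""
       z = beta / se  # Wald statistic
       # P(|Z| > |z|) for Z ~ N(0, 1), via the complementary error function.
       return math.erfc(abs(z) / math.sqrt(2.0))

For instance, ``wald_p_value(1.96, 1.0)`` is approximately 0.05, the familiar
two-sided 5% threshold.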

.. _predict-output-format:

Prediction Results (.pred.tsv)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generated by ``predict``. Contains individual prediction components.

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``FID, IID``
     - Individual identifiers (``FID`` is omitted when ``--iid-only`` is set).
   * - ``prediction``
     - Total predicted value (Additive + Dominance + Covariates).
   * - ``{covar}``
     - Estimated contribution from each covariate.
   * - ``additive``
     - Predicted additive genetic value (Estimated Breeding Value).
   * - ``dominant``
     - (Optional) Predicted dominance genetic value.
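
Because ``prediction`` is described as the sum of its components, a row can
be checked with a short Python sketch. The assumptions are loud here: every
column other than the identifiers and ``prediction`` is treated as a
component, the function name ``component_sum`` is hypothetical, and the
values in the example row are made up for illustration.

.. code-block:: python

   def component_sum(row, id_columns=("FID", "IID"), total_column="prediction"):
       """Sum every component column in one parsed .pred.tsv row."""
       return sum(
           float(value)
           for name, value in row.items()
           if name not in id_columns and name != total_column
       )


   # Illustrative row with one covariate column named Age (values are made up).
   row = {
       "FID": "1001", "IID": "1001", "prediction": "1.75",
       "Age": "0.5", "additive": "1.0", "dominant": "0.25",
   }
   matches = abs(component_sum(row) - float(row["prediction"])) < 1e-9

If the real output also carries terms that are not written as separate
columns, the sum will not match and the comparison should be adjusted
accordingly.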