How to Rebuild the MCPH Protocol for 2026

The inherited R script answers an old, very specific question: what happens when a microcephaly gene list is intersected with the frozen bundle's expression matrices, disease gene lists, MAGMA p-values, and GWAS gene sets? The next project is bigger: build a modern, modular evidence pipeline that keeps the old plot reproducible while adding current gene nomenclature, clinical validity, developmental brain expression, cross-condition genetics, and mechanism-level interpretation.

1 old protocol kept reproducible

7 new evidence layers proposed

33 MCPH genes as the first test set

3 plot modes: layer, comparison, mega atlas

Core principle: Do not jump straight to a mega plot.

Run each new database separately first, understand what it contributes, then combine only the layers that survive biological and methodological scrutiny.

Scientific goal: Turn intersections into interpretation.

A gene overlap is only a starting clue. The modern pipeline should explain evidence strength, developmental timing, disease specificity, and mechanism.

Why update the protocol?

The old heatmap is useful, but it is not the final biological answer.

The original bundle is valuable because it is a real inherited analysis: it contains frozen databases, a defined R script, precomputed expression resources, disease gene sets, MAGMA gene-level values, and a visual grammar for compressing many evidence types into one heatmap. We should preserve that. Reproducibility matters: if Mario, Gabriel, or a thesis reader asks what the original protocol did, we need to be able to rerun and explain it faithfully.

But the bundle is also old. Gene symbols changed; database coverage improved; clinical gene panels evolved; cross-disorder psychiatric genetics expanded; single-cell and developmental brain atlases became richer; and the field's vocabulary for gene-disease validity became more rigorous. So the correct move is not to throw away the old script. The correct move is to wrap it: old protocol underneath, modern evidence layers above.

The roadmap below separates the work into layers. Each layer answers a distinct scientific question and can produce its own plot. That lets us learn what each database contributes before we attempt a new all-in-one figure.

Seven evidence layers

A zero-to-master pipeline for the 33 MCPH genes.

Layer 01

Gene Identity and Alias Normalization

First we make a canonical gene table. Every later database depends on this. If the symbol layer is wrong, the whole analysis becomes quietly wrong.

Question answered

Are all 33 genes being recognized under their current and historical names?

Candidate sources

HGNC REST API: approved symbols, previous symbols, aliases, HGNC IDs.
Ensembl BioMart or REST: Ensembl stable IDs and genomic coordinates.
NCBI Gene: Entrez IDs and older literature compatibility.

Expected output

current_symbol | bundle_symbol | aliases | HGNC_ID | Ensembl_ID | Entrez_ID
KNL1           | CASC5         | D40...  | ...     | ...        | ...
TRAPPC14       | C7orf43       | MAP11   | ...     | ...        | ...

This is the layer that prevents KNL1 and TRAPPC14 from disappearing just because an older database still knows them as CASC5 and C7orf43.
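A minimal sketch of the renaming step, assuming the alias table has already been fetched and flattened into two columns. The function name `normalize_symbols`, the column names, and the table layout are illustrative choices, not part of the inherited script. Only the KNL1 and TRAPPC14 rows come from the table above; ASPM is included only as an example of a symbol that needs no renaming.

```r
# Hypothetical alias table: one row per historical name or alias.
alias_table <- data.frame(
  current_symbol = c("KNL1", "TRAPPC14", "TRAPPC14"),
  known_as       = c("CASC5", "C7orf43", "MAP11"),
  stringsAsFactors = FALSE
)

normalize_symbols <- function(symbols, alias_table) {
  cleaned <- trimws(symbols)
  # Case-insensitive lookup so "c7orf43" and "C7orf43" both resolve.
  hit <- match(toupper(cleaned), toupper(alias_table$known_as))
  data.frame(
    bundle_symbol  = cleaned,
    # Unmatched symbols pass through unchanged; a real pipeline should still
    # confirm them against the list of HGNC-approved symbols.
    current_symbol = ifelse(is.na(hit), cleaned, alias_table$current_symbol[hit]),
    was_renamed    = !is.na(hit),
    stringsAsFactors = FALSE
  )
}

gene_table <- normalize_symbols(c("CASC5", "C7orf43", "ASPM"), alias_table)
print(gene_table)
```

Keeping `bundle_symbol` next to `current_symbol` means the old bundle can still be queried under its own names while every newer database is queried under the current ones.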

Layer 02

Clinical Gene-Disease Validity

This layer asks whether a gene is clinically supported as a microcephaly gene, not merely present in a historical list. It is the most important first update because it separates established MCPH genes from candidates, syndromic microcephaly genes, and genes that need cautious interpretation.

Question answered

Which of the 33 genes are currently strong, moderate, weak, or uncurated microcephaly genes?

Candidate sources

PanelApp Severe microcephaly: green/amber/red clinical evidence and inheritance.
Gene2Phenotype: allelic requirement, mutation consequence, mechanism, confidence.
GenCC: harmonized gene-disease validity assertions from multiple curators.
ClinGen: conservative expert validity curations where available.
OMIM: phenotype names, inheritance, original gene-disease history; use with licensing awareness.

Plot proposal

A clinical validity heatmap with genes as rows and evidence sources as columns.

gene x evidence source
PanelApp: green / amber / red / absent
G2P: definitive / strong / moderate / limited / absent (older releases used confirmed / probable / possible)
GenCC: definitive / strong / moderate / limited / disputed
ClinGen: definitive / strong / moderate / limited / no curation
OMIM: phenotype record present / absent

This plot would tell us which parts of Mario's list are clinically mature and which parts are still research-facing.
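Before plotting, the categorical ratings have to become a gene x source matrix. The sketch below shows one way to do that in base R, assuming the ratings arrive as a long table with one row per gene and source. The genes `GENE_A` and `GENE_B`, their ratings, and the function name `encode_rating` are invented placeholders, not real curation results.

```r
# Ordered scales, weakest to strongest, following the category lists above.
scales <- list(
  PanelApp = c("absent", "red", "amber", "green"),
  ClinGen  = c("no curation", "limited", "moderate", "strong", "definitive")
)

# Placeholder long table: the ratings are illustrative, not real curations.
records <- data.frame(
  gene   = c("GENE_A", "GENE_A", "GENE_B", "GENE_B"),
  source = c("PanelApp", "ClinGen", "PanelApp", "ClinGen"),
  rating = c("green", "definitive", "amber", "no curation"),
  stringsAsFactors = FALSE
)

# Convert each rating to its rank within that source's own scale (0 = weakest).
# Unknown sources or ratings become NA instead of being silently guessed.
encode_rating <- function(source, rating, scales) {
  mapply(
    function(s, r) match(r, scales[[s]]) - 1L,
    source, rating, USE.NAMES = FALSE
  )
}

records$score <- encode_rating(records$source, records$rating, scales)

# Pivot to genes (rows) x sources (columns).
validity_matrix <- tapply(records$score, list(records$gene, records$source), max)
print(validity_matrix)
```

The scores are ordinal within each source only. A 3 in the PanelApp column and a 3 in the ClinGen column do not mean the same thing, so each column should get its own colour legend.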

Layer 03

Phenotype Ontology and Disease Vocabulary

Disease labels are messy. “Primary microcephaly,” “severe microcephaly,” “microcephalic dwarfism,” “cortical malformation,” “neurodevelopmental disorder with microcephaly,” and “intellectual disability with microcephaly” overlap but are not identical. Before plotting cross-condition intersections, we need a controlled vocabulary.

Question answered

What phenotype or disease entity is each database actually talking about?

Candidate sources

HPO: phenotype terms such as microcephaly, seizures, intellectual disability, cortical malformation.
MONDO: disease entities and cross-database mappings.
Orphadata: rare disease classifications and Orphanet identifiers.

Plot proposal

A phenotype map or nested disease-tree annotation beside the gene table.

This layer protects us from treating every database hit containing the word “microcephaly” as if it meant the same biological condition.

Layer 04

Variant, Inheritance, and Constraint Evidence

A gene can be associated with disease in different ways: biallelic loss-of-function, dominant-negative variants, dosage sensitivity, missense-specific mechanisms, or uncertain candidate variants. This layer asks what kind of genetic damage the gene tolerates and what variant classes are reported in patients.

Question answered

What variant logic explains each gene-disease relationship?

Candidate sources

ClinVar: pathogenic and likely pathogenic variants, with review status caveats.
gnomAD: LOEUF, pLI, missense Z, observed/expected constraint.
ClinGen dosage sensitivity: haploinsufficiency and triplosensitivity where curated.
DECIPHER: patient-level phenotype-genotype evidence where accessible.

Plot proposal

gene | inheritance | variant class | LOEUF | pLI | missense Z | ClinVar P/LP count | dosage sensitivity

This layer helps us distinguish “this gene is generally important” from “this kind of mutation in this gene causes this kind of microcephaly.”
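One way to assemble part of this table is sketched below, assuming the constraint metrics and ClinVar counts have already been downloaded into two data frames. All genes and values are invented placeholders. The LOEUF cutoff of 0.35 is an assumption that should be checked against the guidance for whichever gnomAD release is used.

```r
# Placeholder inputs: values are illustrative only.
constraint <- data.frame(
  gene  = c("GENE_A", "GENE_B", "GENE_C"),
  loeuf = c(0.21, 0.88, NA),
  stringsAsFactors = FALSE
)
clinvar <- data.frame(
  gene      = c("GENE_A", "GENE_B"),
  plp_count = c(14L, 3L),
  stringsAsFactors = FALSE
)

summarise_variant_evidence <- function(constraint, clinvar, loeuf_cutoff = 0.35) {
  # Left join keeps every gene that has a constraint row.
  out <- merge(constraint, clinvar, by = "gene", all.x = TRUE)
  # Assumes the ClinVar table covers every queried gene, so a missing row
  # means zero P/LP variants rather than "not queried".
  out$plp_count[is.na(out$plp_count)] <- 0L
  # Genes without a LOEUF value are reported as not constrained, not dropped.
  out$lof_constrained <- !is.na(out$loeuf) & out$loeuf < loeuf_cutoff
  out
}

variant_table <- summarise_variant_evidence(constraint, clinvar)
print(variant_table)
```

LOEUF reflects intolerance to heterozygous loss of function in a reference population. A high LOEUF therefore does not argue against a biallelic disease mechanism, which is why this column should be read next to the inheritance column rather than on its own.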

Layer 05

Developmental Brain Expression

This is the neuroscience heart of the update. The old bundle includes developmental expression blocks, but modern brain atlases can ask more precise questions: which cortical progenitors express these genes, at what developmental stages, and how do expression profiles differ between radial glia, intermediate progenitors, neurons, organoids, and non-human primate comparisons?

Question answered

When and where in brain development are these genes likely to matter?

Candidate sources

BrainSpan: foundational human developmental brain transcriptome downloads.
NeMO Analytics: recent neocortical development compendium, highly relevant to Gabriel Santpere's line of work.
PsychENCODE: neuropsychiatric brain regulatory and transcriptomic resources.
CZ CELLxGENE Census: programmatic single-cell access in Python and R.

Plot proposal

genes x developmental cell states
apical radial glia
basal radial glia
intermediate progenitors
early neurons
cortical plate neurons
organoid progenitors
non-human primate / human comparative stages

This layer would transform the atlas from a gene-list exercise into a developmental neuroscience argument.
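A genes x cell states heatmap usually shows relative expression per gene rather than raw values. The sketch below standardizes each row and reports the cell state where each gene peaks. It assumes a numeric matrix of average expression per cell state is already available. The matrix here is random placeholder data, and `row_zscore` is a hypothetical helper name.

```r
set.seed(1)  # placeholder data only, not real expression values

cell_states <- c(
  "apical radial glia", "basal radial glia", "intermediate progenitors",
  "early neurons", "cortical plate neurons"
)
expr <- matrix(
  rexp(3 * length(cell_states)),
  nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), cell_states)
)

# Standardize each gene across cell states: mean 0, standard deviation 1.
row_zscore <- function(mat) {
  centered <- sweep(mat, 1, rowMeans(mat), "-")
  sds <- apply(mat, 1, sd)
  sds[sds == 0] <- NA  # a flat row has no defined z-score
  sweep(centered, 1, sds, "/")
}

z <- row_zscore(expr)

# Cell state with the highest expression for each gene (first one on ties).
peak_state <- colnames(expr)[max.col(expr, ties.method = "first")]
names(peak_state) <- rownames(expr)
print(peak_state)
```

Row scaling answers "where does this gene peak?" but hides absolute expression level, so a lowly expressed gene can look as striking as a highly expressed one. Showing mean expression as a side annotation is one way to keep that information visible.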

Layer 06

Cross-Condition Genetic Associations

The old plot already points in this direction with MAGMA and GWAS-like disease columns. The modern version should be more explicit about what kind of evidence is being shown: Mendelian gene causality, common-variant association, locus-to-gene prioritization, literature co-mention, pathway evidence, or functional regulatory evidence.

Question answered

Do MCPH genes intersect with ASD, schizophrenia, cognition, ADHD, bipolar disorder, Alzheimer disease, Parkinson disease, or other brain-related traits?

Candidate sources

Open Targets Platform: integrated target-disease associations through GraphQL and downloads.
NHGRI-EBI GWAS Catalog: curated GWAS associations and REST API.
PGC cross-disorder resources: psychiatric common-variant signals and gene-set analyses.
FinnGen, UK Biobank, Genebass: broader disease/trait associations where mapping is appropriate.

Plot proposal

gene x condition
Open Targets association score
GWAS Catalog mapped traits
MAGMA or gene-based signal
PGC membership
old bundle MAGMA comparison

This layer must be labeled carefully: GWAS evidence does not mean “this MCPH gene causes schizophrenia.” It means the gene may sit near, or be prioritized from, broader trait-associated biology.
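One way to enforce that labeling in code is to require every source to declare its evidence type before it can enter the plot. The sketch below uses type names adapted from the categories listed at the start of this layer. The source-to-type assignments, the function name `label_evidence`, and the example rows are illustrative assumptions.

```r
# Controlled list of evidence types, adapted from the categories named above.
evidence_types <- c(
  "mendelian_causality", "common_variant_association",
  "locus_to_gene_prioritization", "literature_co_mention",
  "pathway", "functional_regulatory"
)

# Hypothetical declaration of what each source is allowed to claim.
source_to_type <- c(
  "PanelApp"           = "mendelian_causality",
  "GWAS Catalog"       = "common_variant_association",
  "MAGMA (old bundle)" = "common_variant_association"
)
stopifnot(all(source_to_type %in% evidence_types))

# Attach the evidence type to each row; stop if a source has no declaration.
label_evidence <- function(df, source_to_type) {
  df$evidence_type <- unname(source_to_type[df$source])
  unknown <- is.na(df$evidence_type)
  if (any(unknown)) {
    stop("No evidence type declared for source(s): ",
         paste(unique(df$source[unknown]), collapse = ", "))
  }
  df
}

# Placeholder rows: genes and conditions are illustrative, not real findings.
hits <- data.frame(
  gene      = c("GENE_A", "GENE_A", "GENE_B"),
  condition = c("microcephaly", "schizophrenia", "ADHD"),
  source    = c("PanelApp", "GWAS Catalog", "MAGMA (old bundle)"),
  stringsAsFactors = FALSE
)
labeled <- label_evidence(hits, source_to_type)
print(labeled)
```

With the type carried on every row, the plot can facet or colour by `evidence_type`, so a common-variant signal is never drawn with the same visual weight as a Mendelian assertion.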

Layer 07

Mechanism, Pathway, and Network Interpretation

Once clinical and association evidence is mapped, we need to ask what the genes do together. This is where the project becomes understandable for a neuroscience reader: not 33 names, but a set of biological modules that explain why neural progenitors are vulnerable.

Question answered

What biological systems organize the 33 genes?

Candidate sources

Gene Ontology: biological process, cellular component, molecular function.
Reactome: curated pathways and reactions.
STRING and BioGRID: protein interaction networks.
g:Profiler and Enrichr: enrichment analyses.
MSigDB and KEGG: useful but check licensing/redistribution constraints.

Plot proposal

A module map: centrosome, centriole, kinetochore, spindle checkpoint, cytokinesis, nuclear envelope, DNA repair, chromatin regulation, RNA/ribosome biology, trafficking, autophagy, lipid transport, and blood-brain barrier (BBB) biology.

This is the interpretive layer that lets the thesis say: “microcephaly genes converge on progenitor proliferation, genome integrity, and developmental timing.”
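Before claiming convergence on a module, it helps to check that the overlap is larger than chance would give. The sketch below is a generic one-sided hypergeometric over-representation test in base R. It illustrates the idea behind enrichment analysis and is not a description of what g:Profiler or Enrichr compute internally. The counts (6 of 33 genes in a module of 150, background of 20000) are invented for illustration.

```r
# Probability of seeing at least `overlap` module genes in a list of
# `list_size` genes drawn at random from `background_size` genes.
enrichment_p <- function(overlap, module_size, list_size, background_size) {
  phyper(
    overlap - 1,                    # P(X > overlap - 1) = P(X >= overlap)
    module_size,                    # genes in the module
    background_size - module_size,  # genes outside the module
    list_size,                      # genes in our list
    lower.tail = FALSE
  )
}

# Illustrative numbers only.
p <- enrichment_p(overlap = 6, module_size = 150,
                  list_size = 33, background_size = 20000)
print(p)
```

The background matters as much as the list. Using all protein-coding genes as the background will make almost any brain-expressed list look enriched for cell-cycle terms, so a brain-expressed or progenitor-expressed background is worth testing as a sensitivity check. With many modules tested, the p-values also need multiple-testing correction, for example with `p.adjust`.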

Plot strategy

Three ways to visualize the update.

Option A

One Database at a Time

This is the safest scientific path. Run the 33 genes through one database, produce one plot, and interpret exactly what changed. For example: first PanelApp, then BrainSpan, then Open Targets, then GWAS Catalog.

Best for: learning, thesis explanation, debugging, avoiding conceptual overload.

Risk: many separate plots may feel fragmented unless we write clear connecting text.

Option B

Layer Comparison Plots

Put old and new evidence side by side. For example, compare old bundle disease-list membership against PanelApp/G2P/GenCC, or compare old CoGAPS expression peaks against BrainSpan/NeMO developmental expression.

Best for: showing what the old protocol got right, missed, or could not know.

Risk: requires careful normalization so rows, symbols, and evidence scales are comparable.

Option C

New Mega Plot

Combine selected layers into a modern successor to the inherited heatmap. This should only happen after the individual layers are understood, because a beautiful mega plot can easily hide methodological confusion.

Best for: final thesis figure, team presentation, publication-style atlas.

Risk: too many tracks can become unreadable unless grouped into biological chapters.

Implementation plan

How the updated R pipeline should be written.

I would not write one giant R script. The original script is already hard to understand because many tasks are entangled: reading inputs, loading databases, joining evidence, transforming matrices, plotting annotations, and exporting figures. The new version should be modular from the beginning.

Proposed file structure: R + targets + cached downloads

00_config.yml
01_normalize_genes.R
02_fetch_hgnc_ids.R
03_fetch_panelapp.R
04_fetch_g2p_gencc_clingen_omim_links.R
05_fetch_opentargets.R
06_fetch_gwas_catalog.R
07_fetch_expression_brainspan_nemo.R
08_fetch_constraint_clinvar_gnomad.R
09_score_evidence.R
10_plot_clinical_validity.R
11_plot_developmental_expression.R
12_plot_cross_condition.R
13_plot_mechanism_modules.R
14_plot_mega_atlas.R
R/
  helpers_api.R
  helpers_symbols.R
  helpers_plot_tracks.R
  scoring_rules.R
data_raw/
data_cache/
data_processed/
figures/
reports/

Reproducibility

Use `renv` and cached raw downloads

Every API response and downloaded table should be saved with date, source URL, version, and query parameters. This makes the thesis reproducible even if a database changes next month.
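A minimal sketch of such a cache writer, using base R only. The function name `cache_raw_download`, the file naming scheme, and the metadata columns are illustrative assumptions. The URL in the example call is a placeholder, not a real endpoint.

```r
# Save a raw response next to a small metadata file describing where it came from.
cache_raw_download <- function(content, source_name, source_url,
                               query = "", version = NA_character_,
                               cache_dir = "data_cache") {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  stamp <- format(Sys.time(), "%Y%m%dT%H%M%S", tz = "UTC")
  base  <- file.path(cache_dir, paste0(source_name, "_", stamp))
  raw_path <- paste0(base, ".raw")

  # Write the response exactly as received, before any parsing.
  writeLines(content, raw_path)

  meta <- data.frame(
    source        = source_name,
    url           = source_url,
    query         = query,
    version       = version,
    retrieved_utc = stamp,
    md5           = unname(tools::md5sum(raw_path)),  # detects later edits
    stringsAsFactors = FALSE
  )
  write.csv(meta, paste0(base, ".meta.csv"), row.names = FALSE)
  invisible(base)
}

# Example call with placeholder content and a placeholder URL.
path <- cache_raw_download(
  content     = '{"demo": true}',
  source_name = "panelapp",
  source_url  = "https://example.org/placeholder",
  query       = "panel=severe-microcephaly",
  cache_dir   = file.path(tempdir(), "data_cache")
)
```

Downstream scripts then read only from the cache, never from the live API, so rerunning a plot a year later uses the same bytes that were downloaded originally.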

Pipeline control

Use `targets`

The targets package lets us rerun only the pieces that changed. If we update the PanelApp data but not BrainSpan, only the targets downstream of PanelApp, such as the clinical plot, are rebuilt; the expression plot is left untouched.
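A rough sketch of what the `_targets.R` file could look like. The helper functions (`normalize_genes`, `fetch_panelapp`, `fetch_brainspan`, `plot_clinical_validity`, `plot_developmental_expression`) and the input path `data_raw/mcph_genes.txt` are hypothetical names that would have to be defined in the project; only `tar_source()`, `tar_option_set()`, and `tar_target()` are actual targets functions.

```r
# _targets.R (sketch; helper functions are hypothetical and live in R/)
library(targets)

tar_source("R")                               # load every script in R/
tar_option_set(packages = "ComplexHeatmap")   # packages available to targets

list(
  # Tracking the gene list as a file means editing it invalidates everything downstream.
  tar_target(gene_list_file, "data_raw/mcph_genes.txt", format = "file"),
  tar_target(gene_table, normalize_genes(gene_list_file)),

  # Two independent branches: clinical validity and developmental expression.
  tar_target(panelapp_raw, fetch_panelapp(gene_table)),
  tar_target(brainspan_raw, fetch_brainspan(gene_table)),

  tar_target(clinical_plot, plot_clinical_validity(panelapp_raw)),
  tar_target(expression_plot, plot_developmental_expression(brainspan_raw))
)
```

One caveat: targets tracks changes in code and in upstream targets, not changes on a remote server. A fetch target will not rerun just because PanelApp published a new version, so the cached raw file should itself be a tracked `format = "file"` target, or the refresh should be triggered deliberately.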

Plotting

Use `ComplexHeatmap` deliberately

The old script already uses complex heatmap logic. The update should keep that strength, but split tracks into named, documented evidence modules.

Recommended path

The best next move is clinical validity first, developmental expression second.

Step 1

Build the canonical gene table

Lock HGNC-approved symbols, aliases, bundle symbols, Ensembl IDs, and Entrez IDs.

Step 2

Make the clinical validity plot

PanelApp + G2P + GenCC + ClinGen + OMIM links. This gives the strongest immediate scientific update.

Step 3

Make the developmental expression plot

BrainSpan + NeMO first. This connects the analysis to cortical progenitors and Gabriel's line of work.

Step 4

Make the cross-condition plot

Open Targets + GWAS Catalog + old MAGMA comparison, carefully labeled as association evidence.

Step 5

Build the mega atlas only after interpretation

Combine the tracks that teach something. Leave out any layer that looks impressive but does not clarify the biology.
