How to Rebuild the MCPH Protocol for 2026
The inherited R script answers an old, very specific question: what happens when a microcephaly gene list is intersected with the frozen bundle's expression matrices, disease gene lists, MAGMA p-values, and GWAS gene sets? The next project is bigger: build a modern, modular evidence pipeline that keeps the old plot reproducible while adding current gene nomenclature, clinical validity, developmental brain expression, cross-condition genetics, and mechanism-level interpretation.
Run each new database separately first, understand what it contributes, then combine only the layers that survive biological and methodological scrutiny.
A gene overlap is only a starting clue. The modern pipeline should explain evidence strength, developmental timing, disease specificity, and mechanism.
Why update the protocol?
The old heatmap is useful, but it is not the final biological answer.
The original bundle is valuable because it is a real inherited analysis: it contains frozen databases, a defined R script, precomputed expression resources, disease gene sets, MAGMA gene-level values, and a visual grammar for compressing many evidence types into one heatmap. We should preserve that. Reproducibility matters: if Mario, Gabriel, or a thesis reader asks what the original protocol did, we need to be able to rerun and explain it faithfully.
But the bundle is also old. Gene symbols changed; database coverage improved; clinical gene panels evolved; cross-disorder psychiatric genetics expanded; single-cell and developmental brain atlases became richer; and the field's vocabulary for gene-disease validity became more rigorous. So the correct move is not to throw away the old script. The correct move is to wrap it: old protocol underneath, modern evidence layers above.
The roadmap below separates the work into layers. Each layer answers a distinct scientific question and can produce its own plot. That lets us learn what each database contributes before we attempt a new all-in-one figure.
Seven evidence layers
A zero-to-master pipeline for the 33 MCPH genes.
Gene Identity and Alias Normalization
First we make a canonical gene table. Every later database depends on this. If the symbol layer is wrong, the whole analysis becomes quietly wrong.
Question answered
Are all 33 genes being recognized under their current and historical names?
Candidate sources
- HGNC REST API: approved symbols, previous symbols, aliases, HGNC IDs.
- Ensembl BioMart or REST: Ensembl stable IDs and genomic coordinates.
- NCBI Gene: Entrez IDs and older literature compatibility.
Expected output
current_symbol | bundle_symbol | aliases | HGNC_ID | Ensembl_ID | Entrez_ID
KNL1 | CASC5 | D40... | ... | ... | ...
TRAPPC14 | C7orf43 | MAP11 | ... | ... | ...
This is the layer that prevents KNL1 and TRAPPC14 from disappearing just because an older database still knows them as CASC5 and C7orf43.
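A minimal sketch of the symbol-normalization step in base R. The lookup table holds only the two renames named above. The function name `normalize_symbols()` and the column names are illustrative assumptions, and in the real pipeline the table would be filled from HGNC rather than typed by hand. ASPM stands in for any gene whose symbol did not change.

```r
# Hypothetical alias table: one row per historical symbol.
# Only the two renames mentioned in the text are included here.
alias_table <- data.frame(
  bundle_symbol  = c("CASC5", "C7orf43"),
  current_symbol = c("KNL1", "TRAPPC14"),
  stringsAsFactors = FALSE
)

# Map symbols to their current names (case-insensitive match);
# symbols without a known rename are returned unchanged.
normalize_symbols <- function(symbols, aliases = alias_table) {
  idx <- match(toupper(symbols), toupper(aliases$bundle_symbol))
  ifelse(is.na(idx), symbols, aliases$current_symbol[idx])
}

bundle_genes <- c("CASC5", "C7orf43", "ASPM")
canonical <- data.frame(
  bundle_symbol  = bundle_genes,
  current_symbol = normalize_symbols(bundle_genes),
  stringsAsFactors = FALSE
)

# Flag renamed genes so they can be checked by hand in every later join.
canonical$renamed <- canonical$bundle_symbol != canonical$current_symbol
print(canonical)
```

Keeping both the bundle symbol and the current symbol in one table means the old script can still be joined on its own names while every new database is joined on the current ones.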
Clinical Gene-Disease Validity
This layer asks whether a gene is clinically supported as a microcephaly gene, not merely present in a historical list. It is the most important first update because it separates established MCPH genes from candidates, syndromic microcephaly genes, and genes that need cautious interpretation.
Question answered
Which of the 33 genes are currently strong, moderate, weak, or uncurated microcephaly genes?
Candidate sources
- PanelApp Severe microcephaly: green/amber/red clinical evidence and inheritance.
- Gene2Phenotype: allelic requirement, mutation consequence, mechanism, confidence.
- GenCC: harmonized gene-disease validity assertions from multiple curators.
- ClinGen: conservative expert validity curations where available.
- OMIM: phenotype names, inheritance, original gene-disease history; use with licensing awareness.
Plot proposal
A clinical validity heatmap with genes as rows and evidence sources as columns.
gene x evidence source
PanelApp: green / amber / red / absent
G2P: definitive / strong / moderate / limited / absent (older releases: confirmed / probable / possible)
GenCC: definitive / strong / moderate / limited / disputed
ClinGen: definitive / strong / moderate / limited / no curation
OMIM: phenotype record present / absent
This plot would tell us which parts of Mario's list are clinically mature and which parts are still research-facing.
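One way to prepare the matrix behind such a heatmap, sketched in base R. The gene names and ratings are placeholders, not real curations, and the numeric ranks are an assumption we would need to agree on before plotting. The point is only that each source keeps its own scale and that a missing curation stays visibly missing.

```r
# Placeholder long-format evidence table. Gene names and ratings are
# invented for illustration and are NOT real curations.
evidence <- data.frame(
  gene   = c("GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C"),
  source = c("PanelApp", "GenCC", "PanelApp", "GenCC", "PanelApp"),
  rating = c("green", "definitive", "amber", "limited", "red"),
  stringsAsFactors = FALSE
)

# Assumed numeric ranks per source (higher = stronger evidence).
rank_scales <- list(
  PanelApp = c(red = 1, amber = 2, green = 3),
  GenCC    = c(disputed = 0, limited = 1, moderate = 2, strong = 3, definitive = 4)
)

# Look up the rank of each rating within its own source scale.
evidence$rank <- mapply(
  function(src, rating) unname(rank_scales[[src]][rating]),
  evidence$source, evidence$rating
)

# Pivot to a gene x source matrix; missing combinations stay NA ("absent").
genes   <- unique(evidence$gene)
sources <- names(rank_scales)
validity <- matrix(NA_real_, nrow = length(genes), ncol = length(sources),
                   dimnames = list(genes, sources))
validity[cbind(evidence$gene, evidence$source)] <- evidence$rank
print(validity)
```

Because the scales differ between sources, ranks should be compared within a column, not across columns.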
Phenotype Ontology and Disease Vocabulary
Disease labels are messy. “Primary microcephaly,” “severe microcephaly,” “microcephalic dwarfism,” “cortical malformation,” “neurodevelopmental disorder with microcephaly,” and “intellectual disability with microcephaly” overlap but are not identical. Before plotting cross-condition intersections, we need a controlled vocabulary.
Question answered
What phenotype or disease entity is each database actually talking about?
Candidate sources
- HPO: phenotype terms such as microcephaly, seizures, intellectual disability, cortical malformation.
- MONDO: disease entities and cross-database mappings.
- Orphadata: rare disease classifications and Orphanet identifiers.
Plot proposal
A phenotype map or nested disease-tree annotation beside the gene table.
This layer protects us from treating every database hit containing the word “microcephaly” as if it meant the same biological condition.
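A small sketch of what a controlled vocabulary could look like in practice. The controlled terms on the right are placeholders of our own. In the real pipeline they would be HPO or MONDO identifiers checked against the ontology. Anything that does not map is flagged for manual review instead of being guessed.

```r
# Hypothetical mapping from free-text database labels to one controlled
# term per label. The controlled terms are placeholders, not ontology IDs.
label_map <- c(
  "primary microcephaly"                      = "primary_microcephaly",
  "severe microcephaly"                       = "severe_microcephaly",
  "microcephalic dwarfism"                    = "microcephalic_dwarfism",
  "intellectual disability with microcephaly" = "id_with_microcephaly"
)

# Normalize case and whitespace, then look the label up in the map.
harmonize_label <- function(labels, map = label_map) {
  key <- tolower(trimws(labels))
  out <- unname(map[key])
  out[is.na(out)] <- "UNMAPPED"  # flag for manual review
  out
}

raw <- c("Primary microcephaly ", "Severe Microcephaly", "Microcephaly, syndromic")
harmonized <- data.frame(raw_label = raw,
                         controlled_term = harmonize_label(raw),
                         stringsAsFactors = FALSE)
print(harmonized)
```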
Variant, Inheritance, and Constraint Evidence
A gene can be associated with disease in different ways: biallelic loss-of-function, dominant-negative variants, dosage sensitivity, missense-specific mechanisms, or uncertain candidate variants. This layer asks what kind of genetic damage the gene tolerates and what variant classes are reported in patients.
Question answered
What variant logic explains each gene-disease relationship?
Candidate sources
- ClinVar: pathogenic and likely pathogenic variants, with review status caveats.
- gnomAD: LOEUF, pLI, missense Z, observed/expected constraint.
- ClinGen dosage sensitivity: haploinsufficiency and triplosensitivity where curated.
- DECIPHER: patient-level phenotype-genotype evidence where accessible.
Plot proposal
gene | inheritance | variant class | LOEUF | pLI | missense Z | ClinVar P/LP count | dosage sensitivity
This layer helps us distinguish “this gene is generally important” from “this kind of mutation in this gene causes this kind of microcephaly.”
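The join behind the table above can be sketched as follows. All values are invented placeholders. The cutoff of 0.35 is an assumption taken from common practice with gnomAD v2 LOEUF values and should be revisited for whichever gnomAD release we download.

```r
# Placeholder gene table and constraint table. All values are invented;
# real values would come from a gnomAD constraint download.
genes <- data.frame(gene = c("GENE_A", "GENE_B", "GENE_C"),
                    inheritance = c("biallelic", "biallelic", "monoallelic"),
                    stringsAsFactors = FALSE)
constraint <- data.frame(gene  = c("GENE_A", "GENE_C"),
                         LOEUF = c(0.90, 0.20),
                         stringsAsFactors = FALSE)

# Left join so genes without constraint data are kept with NA, not dropped.
variant_table <- merge(genes, constraint, by = "gene", all.x = TRUE)

# Assumed cutoff for loss-of-function constraint (see note above).
loeuf_cutoff <- 0.35
variant_table$lof_constrained <- variant_table$LOEUF < loeuf_cutoff
print(variant_table)
```

A high LOEUF should not be read as evidence against a gene here: recessive disease genes are often tolerant of heterozygous loss-of-function, which is exactly why inheritance sits next to constraint in the same table.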
Developmental Brain Expression
This is the neuroscience heart of the update. The old bundle includes developmental expression blocks, but modern brain atlases can ask more precise questions: which cortical progenitors express these genes, at what developmental stages, and how do expression profiles differ between radial glia, intermediate progenitors, neurons, organoids, and non-human primate comparisons?
Question answered
When and where in brain development are these genes likely to matter?
Candidate sources
- BrainSpan: foundational human developmental brain transcriptome downloads.
- NeMO Analytics: recent neocortical development compendium, highly relevant to Gabriel Santpere's line of work.
- PsychENCODE: neuropsychiatric brain regulatory and transcriptomic resources.
- CZ CELLxGENE Census: programmatic single-cell access in Python and R.
Plot proposal
genes x developmental cell states
apical radial glia
basal radial glia
intermediate progenitors
early neurons
cortical plate neurons
organoid progenitors
non-human primate / human comparative stages
This layer would transform the atlas from a gene-list exercise into a developmental neuroscience argument.
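Whatever atlas we start with, the plot needs a genes x cell states matrix. A minimal base R sketch, using invented expression values and assuming the atlas has already been reduced to a long table with one annotated cell state per row:

```r
# Placeholder long-format expression table (invented values):
# one row per gene per cell, with the cell's annotated state.
expr <- data.frame(
  gene  = rep(c("GENE_A", "GENE_B"), each = 4),
  state = rep(c("apical radial glia", "apical radial glia",
                "intermediate progenitors", "early neurons"), times = 2),
  value = c(8, 6, 4, 1,   2, 2, 3, 7),
  stringsAsFactors = FALSE
)

# Mean expression per gene and cell state -> genes x states matrix.
mean_expr <- tapply(expr$value, list(expr$gene, expr$state), mean)

# Row-wise z-score so each gene is compared with its own average
# across states rather than with other genes.
z <- t(scale(t(mean_expr)))
print(round(z, 2))

# State with the highest relative expression for each gene.
peak_state <- colnames(z)[apply(z, 1, which.max)]
names(peak_state) <- rownames(z)
print(peak_state)
```

Row scaling answers "where is this gene highest?" but hides absolute expression level, so the unscaled means are worth keeping as a second track.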
Cross-Condition Genetic Associations
The old plot already points in this direction with MAGMA and GWAS-like disease columns. The modern version should be more explicit about what kind of evidence is being shown: Mendelian gene causality, common-variant association, locus-to-gene prioritization, literature co-mention, pathway evidence, or functional regulatory evidence.
Question answered
Do MCPH genes intersect with ASD, schizophrenia, cognition, ADHD, bipolar disorder, Alzheimer disease, Parkinson disease, or other brain-related traits?
Candidate sources
- Open Targets Platform: integrated target-disease associations through GraphQL and downloads.
- NHGRI-EBI GWAS Catalog: curated GWAS associations and REST API.
- PGC cross-disorder resources: psychiatric common-variant signals and gene-set analyses.
- FinnGen, UK Biobank, Genebass: broader disease/trait associations where mapping is appropriate.
Plot proposal
gene x condition
Open Targets association score
GWAS Catalog mapped traits
MAGMA or gene-based signal
PGC membership
old bundle MAGMA comparison
This layer must be labeled carefully: GWAS evidence does not mean “this MCPH gene causes schizophrenia.” It means the gene may sit near, or be prioritized from, broader trait-associated biology.
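One way to enforce that labeling in code is to carry the evidence type with every row and derive the legend from it. The sketch below uses invented rows, and the column names and the `causal_types` list are assumptions of ours, not fields defined by any of the databases.

```r
# Placeholder cross-condition table (invented rows). Every row carries
# an explicit evidence_type so tracks are never mixed silently.
assoc <- data.frame(
  gene          = c("GENE_A", "GENE_A", "GENE_B"),
  condition     = c("schizophrenia", "schizophrenia", "ASD"),
  source        = c("GWAS Catalog", "old bundle MAGMA", "Open Targets"),
  evidence_type = c("common-variant association", "gene-based association",
                    "integrated association score"),
  stringsAsFactors = FALSE
)

# Evidence types that may be described as causal in a figure legend.
causal_types <- c("Mendelian gene causality")

# Build a legend label per track and flag non-causal evidence explicitly.
assoc$track_label <- paste0(assoc$source, " [", assoc$evidence_type, "]")
assoc$causal_claim_allowed <- assoc$evidence_type %in% causal_types

# One legend entry per track, ready to place under the plot.
print(unique(assoc[, c("track_label", "causal_claim_allowed")]))
```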
Mechanism, Pathway, and Network Interpretation
Once clinical and association evidence is mapped, we need to ask what the genes do together. This is where the project becomes understandable for a neuroscience reader: not 33 names, but a set of biological modules that explain why neural progenitors are vulnerable.
Question answered
What biological systems organize the 33 genes?
Candidate sources
- Gene Ontology: biological process, cellular component, molecular function.
- Reactome: curated pathways and reactions.
- STRING and BioGRID: protein interaction networks.
- g:Profiler and Enrichr: enrichment analyses.
- MSigDB and KEGG: useful but check licensing/redistribution constraints.
Plot proposal
A module map: centrosome, centriole, kinetochore, spindle checkpoint, cytokinesis, nuclear envelope, DNA repair, chromatin regulation, RNA/ribosome biology, trafficking, autophagy, lipid transport, and blood-brain barrier (BBB) biology.
This is the interpretive layer that lets the thesis say: “microcephaly genes converge on progenitor proliferation, genome integrity, and developmental timing.”
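Tools such as g:Profiler and Enrichr report over-representation statistics, and it helps to know what the basic calculation looks like. The sketch below is a plain hypergeometric test in base R with invented counts. Only the list size of 33 comes from the project. The background size and annotation counts are assumptions, and the tools themselves may apply different backgrounds and corrections.

```r
# Over-representation test: is a module annotation more frequent in the
# gene list than expected by chance? Counts other than 33 are invented.
enrichment_p <- function(hits, list_size, annotated, background) {
  # P(X >= hits) when drawing list_size genes from a background in which
  # `annotated` genes carry the annotation.
  phyper(hits - 1, annotated, background - annotated, list_size,
         lower.tail = FALSE)
}

background <- 20000  # assumed number of genes in the background
annotated  <- 200    # assumed genes annotated to the module
list_size  <- 33     # size of the gene list
hits       <- 6      # assumed list genes carrying the annotation

p <- enrichment_p(hits, list_size, annotated, background)

# Cross-check with the equivalent one-sided Fisher exact test.
tab <- matrix(c(hits, list_size - hits,
                annotated - hits, background - annotated - (list_size - hits)),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
print(c(hypergeometric = p, fisher = p_fisher))
```

With many modules tested at once, the resulting p-values would still need a multiple-testing correction, for example `p.adjust(p_values, method = "BH")`. The choice of background matters as much as the test itself.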
Plot strategy
Three ways to visualize the update.
One Database at a Time
This is the safest scientific path. Run the 33 genes through one database, produce one plot, and interpret exactly what changed. For example: first PanelApp, then BrainSpan, then Open Targets, then GWAS Catalog.
Best for: learning, thesis explanation, debugging, avoiding conceptual overload.
Risk: many separate plots may feel fragmented unless we write clear connecting text.
Layer Comparison Plots
Put old and new evidence side by side. For example, compare old bundle disease-list membership against PanelApp/G2P/GenCC, or compare old CoGAPS expression peaks against BrainSpan/NeMO developmental expression.
Best for: showing what the old protocol got right, missed, or could not know.
Risk: requires careful normalization so rows, symbols, and evidence scales are comparable.
New Mega Plot
Combine selected layers into a modern successor to the inherited heatmap. This should only happen after the individual layers are understood, because a beautiful mega plot can easily hide methodological confusion.
Best for: final thesis figure, team presentation, publication-style atlas.
Risk: too many tracks can become unreadable unless grouped into biological chapters.
Implementation plan
How the updated R pipeline should be written.
I would not write one giant R script. The original script is already hard to understand because many tasks are entangled: reading inputs, loading databases, joining evidence, transforming matrices, plotting annotations, and exporting figures. The new version should be modular from the beginning.
00_config.yml
01_normalize_genes.R
02_fetch_hgnc_ids.R
03_fetch_panelapp.R
04_fetch_g2p_gencc_clingen_omim_links.R
05_fetch_opentargets.R
06_fetch_gwas_catalog.R
07_fetch_expression_brainspan_nemo.R
08_fetch_constraint_clinvar_gnomad.R
09_score_evidence.R
10_plot_clinical_validity.R
11_plot_developmental_expression.R
12_plot_cross_condition.R
13_plot_mechanism_modules.R
14_plot_mega_atlas.R
R/
  helpers_api.R
  helpers_symbols.R
  helpers_plot_tracks.R
  scoring_rules.R
data_raw/
data_cache/
data_processed/
figures/
reports/
Use renv and cached raw downloads
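Every API response and downloaded table should be saved with its retrieval date, source URL, database version, and query parameters. renv complements this by recording the exact R package versions in a lockfile. Together they keep the thesis reproducible even if a database or a package changes next month.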
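A minimal sketch of such a cache writer in base R. The function name `cache_raw()`, the file naming scheme, and the metadata columns are assumptions for illustration. The URL and body in the usage example are placeholders, not a real endpoint or response.

```r
# Save a raw download next to a small provenance record.
# Function and column names are illustrative, not an existing helper.
cache_raw <- function(body, source_name, source_url, version, query,
                      cache_dir = "data_cache") {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  stamp <- format(Sys.Date(), "%Y-%m-%d")
  base  <- file.path(cache_dir, paste0(source_name, "_", stamp))

  # Raw response exactly as received.
  writeLines(body, paste0(base, ".raw.txt"))

  # Provenance: enough to repeat or audit the query later.
  meta <- data.frame(source = source_name, url = source_url,
                     version = version, query = query,
                     retrieved = stamp, stringsAsFactors = FALSE)
  write.csv(meta, paste0(base, ".meta.csv"), row.names = FALSE)
  invisible(base)
}

# Usage with a placeholder body and a placeholder URL,
# written to a temporary directory so nothing in the project is touched.
path <- cache_raw(body = '{"example": true}',
                  source_name = "panelapp",
                  source_url  = "https://example.org/api",
                  version     = "unknown",
                  query       = "panel=Severe microcephaly",
                  cache_dir   = file.path(tempdir(), "data_cache"))
print(read.csv(paste0(path, ".meta.csv"), stringsAsFactors = FALSE))
```

Later steps would read only from the cache, so a plot can always be traced back to one dated download.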
Use targets
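The targets package tracks dependencies between steps and reruns only the pieces that are out of date. If we update PanelApp but not BrainSpan, the clinical validity plot and anything else downstream of PanelApp rebuild, while the developmental expression plot is skipped.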
Use ComplexHeatmap deliberately
The old script already uses complex heatmap logic. The update should keep that strength, but split tracks into named, documented evidence modules.
Recommended path
The best next move is clinical validity first, developmental expression second.
Build the canonical gene table
Lock HGNC-approved symbols, aliases, bundle symbols, Ensembl IDs, and Entrez IDs.
Make the clinical validity plot
PanelApp + G2P + GenCC + ClinGen + OMIM links. This gives the strongest immediate scientific update.
Make the developmental expression plot
BrainSpan + NeMO first. This connects the analysis to cortical progenitors and Gabriel's line of work.
Make the cross-condition plot
Open Targets + GWAS Catalog + old MAGMA comparison, carefully labeled as association evidence.
Build the mega atlas only after interpretation
Combine the tracks that teach something. Leave out any layer that looks impressive but does not clarify the biology.