Sequence Alignment 101
From the mathematical rigor of Dynamic Programming to the life-saving applications of Precision Oncology.
1. Fundamental Concepts
THE BIOLOGICAL VOCABULARY
Similarity
THE OBSERVABLE FACT: A quantitative measurement. Includes Identity (exact) and Similarity (biochemical).
Homology
THE INFERENCE: A qualitative, binary conclusion: Common Ancestry. You cannot be "50% homologous".
Orthologs: Speciation. Usually the same function. (Human vs Mouse Hb)
Paralogs: Duplication. Often a new function. (Trypsins)
Analogs: Convergent Evolution. Not homology.
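The Identity and Similarity measurements above can be sketched in a few lines. This is a minimal illustration, assuming two sequences that are already aligned with "-" for gaps. The SIMILAR_GROUPS table is a hypothetical grouping of amino acids chosen for the example; real tools use a substitution matrix instead.

```python
# Hypothetical groups of biochemically similar amino acids (an assumption
# for illustration; real tools derive similarity from a substitution matrix).
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def identity_and_similarity(aln_a: str, aln_b: str) -> tuple[float, float]:
    """Return (percent identity, percent similarity) over non-gap columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = similar = 0
    for a, b in zip(aln_a, aln_b):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


# Toy alignment: 3 of 5 compared columns are identical, all 5 are similar.
print(identity_and_similarity("MKV-LD", "MRVALE"))
```

Both numbers describe the alignment only. Whether the two sequences are homologous remains a separate inference.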
2. Algorithmic_Engines
3. The Mathematics of Evolution
"Converting biological intuition into computable probabilities."
Substitution Matrices
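Scores are Log-Odds ratios: $$S_{ij} = \log_2\left(\frac{q_{ij}}{p_i p_j}\right)$$
Here $q_{ij}$ is the observed frequency of residues $i$ and $j$ aligned in trusted alignments, and $p_i p_j$ is the frequency expected by chance.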
PAM: Derived from very close homologs (>85% identity). Extrapolated to greater distances.
BLOSUM: Empirical. Observed from conserved blocks. BLOSUM62 is standard.
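A small worked example of the log-odds formula may help. The function below is a sketch that evaluates $S_{ij}$ directly. The frequencies passed in are made-up numbers for illustration and are not taken from any published matrix, which would additionally be scaled and rounded to integers.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float) -> float:
    """S_ij = log2(q_ij / (p_i * p_j)), in bits."""
    return math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies (illustration only).
# Pair seen 8x more often than chance: positive score, about 3 bits.
favored = log_odds_score(q_ij=0.02, p_i=0.05, p_j=0.05)
# Pair seen less often than chance: negative score.
disfavored = log_odds_score(q_ij=0.001, p_i=0.05, p_j=0.05)

print(round(favored, 2), round(disfavored, 2))
```

A positive score means the pair occurs more often in related sequences than chance predicts. A negative score means it occurs less often.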
Affine Gap Model
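Biology favors extending gaps over opening new ones: a single insertion or deletion event often spans several residues, so opening a gap is penalized more heavily than extending one.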
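The affine idea can be shown with simple arithmetic. The sketch below assumes one common convention, where a gap of length k costs gap_open + (k - 1) * gap_extend. Some tools charge gap_open + k * gap_extend instead. The default penalty values are hypothetical, picked only to make the comparison visible.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Penalty for ONE contiguous gap of `length` positions.

    Convention assumed here: the first gapped position pays gap_open,
    every further position pays gap_extend.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# The same three missing residues, modeled two ways:
one_long_gap = affine_gap_penalty(3)          # 10 + 2 * 1 = 12
three_short_gaps = 3 * affine_gap_penalty(1)  # 3 * 10 = 30

print(one_long_gap, three_short_gaps)
```

Under these assumed values, one gap of length three is much cheaper than three separate gaps, so the aligner prefers to group missing residues into a single event.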
4. Strategic Decisions
The Twilight Zone
Below roughly 30% identity (for proteins), sequence signal fades into statistical noise. Pairwise alignment alone becomes unreliable for inferring homology.
Matrix Navigation
5. Genome_Annotation
CONSTRUCTING THE PARTS LIST
Method A: Ab Initio
> INPUT: Raw DNA Sequence
> PROCESS: Find ORFs > 100aa
> CHECK: Codon Bias & Splice Sites
> WARNING: High false positive rate.
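The ORF-finding step of Method A can be sketched as follows. This is a minimal illustration, assuming an ORF means an ATG followed in frame by a stop codon, scanned in all six frames. The function name find_orfs and its return format are choices made for this example. It does not check codon bias or splice sites, which is one reason a scan this simple yields false positives.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna: str) -> str:
    """Reverse-complement an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna: str, min_aa: int = 100) -> list[tuple[str, int, int, int]]:
    """Return (strand, frame, start, end) for ORFs longer than min_aa codons.

    Coordinates are 0-based, end-exclusive, on the scanned strand,
    and the span includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_aa = (i - start) // 3  # codons before the stop
                    if n_aa > min_aa:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence: ATG + 120 alanine codons + stop = 121 codons.
toy = "ATG" + "GCT" * 120 + "TAA"
print(find_orfs(toy))
```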
Method B: Homology-Based
> INPUT: SwissProt / RNA-seq
> ACTION: Align to Genome (Splice-Aware)
> RESULT: High confidence gene models
6. Phylogenetics
Multiple Sequence Alignment (MSA)
Progressive: Fast. Uses "Guide Tree". Prone to "Frozen Gaps".
Iterative: Slower. Re-aligns to fix errors. Higher accuracy.
The Saururaceae Revision
Morphology said Saururus was the oldest ancestor.
Molecular Alignment (3 genes) proved it was derived.
This overturned centuries of plant taxonomy.
7. Medical Genomics
Oncology: BRCA1
Scenario: Variant of Uncertain Significance (VUS)
If residue is conserved in Chimp, Mouse, and Fish...
Mutation = LIKELY PATHOGENIC (conservation is supporting evidence, not proof on its own)
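The conservation check in this scenario amounts to inspecting one column of a multiple alignment. The sketch below assumes the alignment is given as a dict mapping species name to aligned sequence. The toy sequences are invented for illustration and are not real BRCA1 data.

```python
def column_is_conserved(msa: dict[str, str], col: int) -> bool:
    """True if every species shows the same non-gap residue at column `col`."""
    residues = {seq[col] for seq in msa.values()}
    return len(residues) == 1 and "-" not in residues


# Toy alignment (hypothetical residues, 0-based columns).
toy_msa = {
    "human": "MDLSCVE",
    "chimp": "MDLSCVE",
    "mouse": "MDLSCIE",
    "fish":  "MELSCLE",
}

print(column_is_conserved(toy_msa, 4))  # every species has C here
print(column_is_conserved(toy_msa, 5))  # V, V, I, L: variable position
```

A variant at a fully conserved column deserves closer scrutiny than one at a variable column. Clinical classification combines this with other lines of evidence.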
Pharma: CYP2D6
Scenario: Dosing Determination
Short reads from CYP2D6 map ambiguously ($MAPQ=0$) because they match the highly similar pseudogene CYP2D7 equally well.
Fix: Long Reads (PacBio)
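For readers unfamiliar with MAPQ: the SAM format defines it as the Phred-scaled probability that the mapping position is wrong. The sketch below applies that definition. The cap of 60 is an assumption borrowed from common aligner behavior, and how each aligner estimates the error probability differs. Many report 0 outright when a read has several equally good placements.

```python
import math


def mapq_from_error_prob(p_wrong: float, cap: int = 60) -> int:
    """MAPQ = -10 * log10(P(mapping position is wrong)), rounded and capped.

    `cap` is an assumed upper limit, not part of the definition.
    """
    if p_wrong <= 0.0:
        return cap
    p_wrong = min(p_wrong, 1.0)
    return min(cap, round(-10.0 * math.log10(p_wrong)))


print(mapq_from_error_prob(0.001))  # confident placement
print(mapq_from_error_prob(0.5))    # two equally good loci
print(mapq_from_error_prob(1.0))    # no information about the true locus
```

A long read that spans distinguishing sites between the gene and its pseudogene lowers the error probability, which is why the MAPQ recovers.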
8. Global Surveillance
Target: SARS-CoV-2
- 01 Align >10M genomes to Wuhan-Hu-1
- 02 Detect Spike D614G mutation
- 03 Estimate $R_0$ via Phylogeny
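The mutation-detection step above can be sketched as a walk along a pairwise alignment against the reference. This assumes the reference and query protein sequences are already aligned with "-" for gaps. The function name call_substitutions is a choice made for this example, and the toy sequences stand in for real Spike proteins.

```python
def call_substitutions(ref_aln: str, qry_aln: str) -> list[str]:
    """List substitutions such as 'D614G', numbered by 1-based reference position."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    calls = []
    ref_pos = 0
    for r, q in zip(ref_aln, qry_aln):
        if r == "-":
            continue  # insertion in the query: reference numbering does not advance
        ref_pos += 1
        if q != "-" and q != r:
            calls.append(f"{r}{ref_pos}{q}")
    return calls


# Toy data: 613 placeholder residues, then D (reference) vs G (query).
ref = "A" * 613 + "D" + "A" * 10
qry = "A" * 613 + "G" + "A" * 10
print(call_substitutions(ref, qry))
```

Numbering by ungapped reference position keeps mutation names stable even when a sample carries insertions or deletions elsewhere.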
Protocol: Host Depletion
9. Data Hygiene & QC
Low-complexity sequence such as Poly-A tails (AAAAAA) or repeats produces spurious alignments to unrelated sequences by chance.
Hard Mask: replaced with N
Soft Mask: lowercase (a,c,g,t)
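The two masking styles differ only in what they write over the flagged region. The sketch below assumes the regions to mask are already known and are given as 0-based, end-exclusive (start, end) pairs. Detecting low-complexity regions in the first place is a separate step not shown here.

```python
def hard_mask(seq: str, regions: list[tuple[int, int]]) -> str:
    """Replace every base inside the given regions with 'N'."""
    chars = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def soft_mask(seq: str, regions: list[tuple[int, int]]) -> str:
    """Lowercase every base inside the given regions, keeping the sequence."""
    chars = list(seq)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


seq = "ACGTAAAAAAGT"
print(hard_mask(seq, [(4, 10)]))  # ACGTNNNNNNGT
print(soft_mask(seq, [(4, 10)]))  # ACGTaaaaaaGT
```

Hard masking destroys the original bases. Soft masking keeps them, so a downstream tool can decide whether to ignore lowercase regions.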
10. Future Horizons
Pangenomics
The linear reference (GRCh38) is biased toward the small number of individuals it was built from. The field is moving to Genome Graphs that represent many human haplotypes in a single structure.
T2T Assembly
Telomere-to-Telomere alignment. Mapping the "Dark Matter" (Centromeres & Repeats) using ultra-long reads and graph aligners.