Understanding Scoring Matrices: BLOSUM & PAM
An exhaustive analysis of amino acid substitution models, from Markovian extrapolation to empirical block clustering.
The Scale of Complexity
Discrete Transitions
4 Bases (A, C, G, T). Every substitution is either a transition (purine↔purine or pyrimidine↔pyrimidine) or a transversion.
Physicochemical Landscape
20 Amino Acids. Size, Charge, Hydrophobicity, Flexibility.
Glycine (Gly): Tiny, flexible. Allows tight turns.
Tryptophan (Trp): Bulky, aromatic. Anchors protein cores.
Arginine (Arg): Positively charged. Exposed to solvent.
"Amino acids are not interchangeable units. Whether a substitution is tolerated depends on structural compatibility."
PAM Series
Percent Accepted Mutation
Strict Dataset
Dayhoff (1978) selected 71 families with >85% identity.
Why >85%? To ensure observed differences are single events, avoiding "Multiple Hit" noise (A→B→A).
Tree-Based Counting
Used Parsimony trees to infer ancestral sequences.
Symmetrized Counts: Substitutions are assumed equally likely in either direction over short evolutionary times, so each observed change is counted both ways: \(A_{xy} = A_{yx}\).
Markovian Extrapolation
Assumes Time-Homogeneity (rates don't change).
\( M_n = (M_1)^n \)
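PAM250 is derived from the PAM1 mutation probability matrix raised to the 250th power, \( M_{250} = (M_1)^{250} \); the scoring matrix is the log-odds form of that result. At this distance only ~20% of positions remain identical (Twilight Zone).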
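The extrapolation \( M_n = (M_1)^n \) can be sketched in a few lines. This is a minimal illustration only: the 3-letter matrix `M1` below is invented (a real PAM1 matrix is 20x20 and comes from Dayhoff's counts), and the helper names `mat_mul` and `mat_pow` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power (n >= 1) by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical 3-letter "PAM1-like" matrix: each row sums to 1 and
# roughly 1% of positions change per step (values invented for illustration).
M1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

M250 = mat_pow(M1, 250)
for row in M250:
    print([round(x, 3) for x in row])
```

Each row of `M250` still sums to 1, but the diagonal (the probability of seeing the same residue) is far smaller than in `M1`, which is the sense in which long extrapolation erodes identity.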
BLOSUM Series
Blocks Substitution Matrix
The BLOCKS Database
Henikoff (1992) used Ungapped Local Alignments of conserved domains (500+ groups).
Feature Selection: Focuses on functional cores, ignoring noisy loops.
Bias Removal (Clustering)
Sequences sharing ≥X% identity are clustered and weighted as a single sequence.
Counting Rule: Pairs within a cluster are ignored. Pairs between clusters are counted.
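The clustering and counting rule can be illustrated with a toy block. This is a sketch under assumptions, not the original BLOSUM code: single-linkage clustering and the `1 / (size_a * size_b)` pair weight are one common reading of the procedure, and the four sequences are invented.

```python
from collections import defaultdict
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length ungapped sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold):
    """Single-linkage clustering: a sequence joins a cluster if it is
    >= threshold identical to any member. Returns one label per sequence."""
    labels = list(range(len(seqs)))
    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels


def count_pairs(seqs, threshold):
    """Weighted residue-pair counts between clusters; within-cluster pairs are skipped."""
    labels = cluster(seqs, threshold)
    sizes = defaultdict(int)
    for lab in labels:
        sizes[lab] += 1
    counts = defaultdict(float)
    for i, j in combinations(range(len(seqs)), 2):
        if labels[i] == labels[j]:
            continue  # same cluster: ignored
        # Each cluster contributes as one sequence, so split its weight evenly.
        weight = 1.0 / (sizes[labels[i]] * sizes[labels[j]])
        for x, y in zip(seqs[i], seqs[j]):
            counts[tuple(sorted((x, y)))] += weight
    return dict(counts)


# Hypothetical block of four aligned, ungapped sequences.
block = ["LKWV", "LKWV", "LRWV", "IRFL"]
counts = count_pairs(block, threshold=0.62)
for pair, value in sorted(counts.items()):
    print(pair, round(value, 3))
```

Here the first three sequences collapse into one cluster, so the block behaves like two sequences and contributes four weighted column pairs in total, rather than letting the near-duplicates dominate the counts.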
Empirical Observation
Direct observation. No extrapolation. BLOSUM62 counts substitutions only between clusters built at a 62% identity threshold, so the sequence pairs it observes share less than 62% identity.
Foundations: Log-Odds & Entropy
The Hypothesis
\(H_1\) (Homology): Common ancestor. Probability \(q_{ij}\).
\(H_0\) (Chance): Random alignment. Probability \(p_i p_j\).
Odds Ratio: \( \frac{q_{ij}}{p_i p_j} \)
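Log-Odds Score
\( S_{ij} = \frac{1}{\lambda} \log \frac{q_{ij}}{p_i p_j} \), where \(\lambda\) is a scaling factor.
- \(S > 0\): Pair observed more often than chance (favored, e.g. conservative substitutions).
- \(S < 0\): Pair observed less often than chance (disfavored).
- \(S = 0\): Pair observed as often as chance (neutral).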
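As a rough sketch of how the odds ratio above becomes an integer score, the function below converts pair counts into rounded log-odds values. Several things here are assumptions for illustration: the counts are invented for a 2-letter alphabet, pairs are treated as unordered (hence the factor of 2 in the expected frequency of mismatched pairs), and the scale of 2 gives half-bit units.

```python
import math
from collections import defaultdict


def log_odds_scores(pair_counts, scale=2.0):
    """Turn unordered residue-pair counts into rounded log-odds scores.

    q_ij is the observed pair frequency and p_i the background frequency
    implied by q. Score = scale * log2(q_ij / e_ij), where
    e_ij = p_i * p_i for identical pairs and 2 * p_i * p_j otherwise.
    """
    total = sum(pair_counts.values())
    q = {pair: c / total for pair, c in pair_counts.items()}

    # Background frequency of each residue: identical pairs contribute fully,
    # mismatched pairs contribute half to each member.
    p = defaultdict(float)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


# Hypothetical counts for a 2-letter alphabet (invented for illustration).
toy_counts = {("A", "A"): 60, ("A", "B"): 10, ("B", "B"): 30}
scores = log_odds_scores(toy_counts)
print(scores)
```

Identical pairs come out positive and the rare A-B pair comes out negative, matching the sign convention in the list above.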
Entropy (H)
Relative entropy: average information per aligned position, in bits.
High H (BLOSUM80): Short, strict alignments.
Low H (BLOSUM45): Long, distant alignments.
Bit Score & Constraint
Critical Rule: \( E = \sum p_i p_j S_{ij} < 0 \)
Expected score must be negative to prevent infinite random alignment growth.
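Both quantities can be checked numerically on a toy model. The sketch below assumes a 2-letter alphabet with invented target frequencies `q`, scores in bits with no rounding, and the usual definition of relative entropy as \( H = \sum q_{ij} S_{ij} \); none of these values come from a real matrix.

```python
import math


def expected_score(p, s):
    """E = sum over ordered residue pairs of p_i * p_j * S_ij."""
    return sum(p[a] * p[b] * s[a][b] for a in p for b in p)


def relative_entropy(q, s):
    """H = sum over ordered residue pairs of q_ij * S_ij (bits if S is in bits)."""
    return sum(q[a][b] * s[a][b] for a in q for b in q[a])


# Hypothetical 2-letter model (values invented for illustration).
# q holds ordered-pair target frequencies and sums to 1.
q = {"A": {"A": 0.60, "B": 0.05}, "B": {"A": 0.05, "B": 0.30}}
p = {a: sum(q[a].values()) for a in q}  # background frequencies from q
s = {a: {b: math.log2(q[a][b] / (p[a] * p[b])) for b in q[a]} for a in q}

E = expected_score(p, s)
H = relative_entropy(q, s)
print(f"expected score E = {E:.3f}")
print(f"relative entropy H = {H:.3f} bits")
```

For exact log-odds scores, E is negative and H is positive, so random alignments lose score on average while true homologs gain it.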
The Algorithm Engine
Affine Gap Penalties
A substitution matrix alone isn't enough. A single insertion or deletion event often spans several residues, so one long gap is more plausible than many short gaps.
- Gap Open (e.g., -11): High penalty to start a gap.
- Gap Extend (e.g., -1): Low penalty to continue it.
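A small sketch shows why affine penalties favor one long gap. The convention assumed here (the opening penalty is charged once, and every gapped position including the first is charged the extension penalty) is one of several in use, and `affine_gap_cost` is a hypothetical helper name.

```python
def affine_gap_cost(length, gap_open=11, gap_extend=1):
    """Penalty for one gap spanning `length` residues.

    Assumed convention: gap_open is charged once, and every gapped
    position (including the first) is charged gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_cost(5)         # one gap of 5 residues
five_short_gaps = 5 * affine_gap_cost(1)  # five separate 1-residue gaps

print("one gap of length 5:", one_long_gap)
print("five gaps of length 1:", five_short_gaps)
```

Under these example penalties the single 5-residue gap costs 16, while five scattered 1-residue gaps cost 60.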
BLAST (Local Heuristic)
- Seeding (W=3): Breaks the query into 3-mer words.
- Neighborhood (T): Finds database words scoring at least \(T\) (threshold) against a query word.
  Example: P-Q-G scores 18. P-E-G scores 15. If T=11, both are seeds.
- Extension (X-dropoff): Extends each seed until the score drops more than \(X\) below the maximum seen so far.
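The seeding and extension steps can be sketched as follows. This is a simplified illustration, not BLAST's implementation: the score table holds only the pairs needed to reproduce the word scores quoted above, every other pair gets a hypothetical default of -1, the threshold is treated as inclusive, and extension runs to the right only.

```python
from itertools import product

# Partial score table: only the pairs needed for the example above.
# Values reproduce the quoted word scores (P-Q-G = 18, P-E-G = 15);
# any pair not listed gets a hypothetical default of -1.
SCORES = {("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6, ("E", "Q"): 2}
DEFAULT = -1


def pair_score(a, b):
    """Score one residue pair, order-independent."""
    return SCORES.get(tuple(sorted((a, b))), DEFAULT)


def word_score(w1, w2):
    """Sum of pair scores for two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def neighborhood(query_word, alphabet, threshold):
    """All words of the same length scoring >= threshold against query_word."""
    return {
        "".join(cand)
        for cand in product(alphabet, repeat=len(query_word))
        if word_score(query_word, "".join(cand)) >= threshold
    }


def extend_right(query, subject, q_start, s_start, x_drop):
    """Ungapped rightward extension. Stops once the running score falls
    more than x_drop below the best score seen. Returns (best_score, length)."""
    score = best = best_len = 0
    pairs = zip(query[q_start:], subject[s_start:])
    for k, (a, b) in enumerate(pairs, start=1):
        score += pair_score(a, b)
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


seeds = neighborhood("PQG", alphabet="PQGE", threshold=11)
print(word_score("PQG", "PQG"), word_score("PQG", "PEG"))
print("PEG" in seeds)
print(extend_right("PQGPPP", "PQGGGG", 0, 0, x_drop=2))
```

In the last call the three matching residues build the score to 18, the mismatches then erode it, and extension stops after the drop exceeds `x_drop`, keeping the best-scoring prefix of length 3.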
Visual Case Study: Strict vs. Lenient
Alignment of close homologs (Human vs Chimp).
Seq B: L-R-W-V
Score: Low (the K-R substitution is scored harshly relative to an identity)
Strict matrices give little credit even to conservative changes, which favors near-exact matching.
Alignment of distant homologs (Human vs Yeast).
Seq B: I-R-F-L
Score: High (Positives for L-I, K-R, W-F)
Allows physicochemical substitutions to detect ancient structural similarity.
Equivalence & Specialized Matrices
| BLOSUM | PAM Equiv. | Relative Entropy (Bits) |
|---|---|---|
| BLOSUM 80 | PAM 120 | ~1.0 |
| BLOSUM 62 | PAM 160 | ~0.7 |
| BLOSUM 45 | PAM 250 | ~0.4 |
Position-Specific Scoring Matrix (PSSM): Instead of a fixed 20x20 matrix, scores depend on the specific position in the alignment (e.g., Position 42 is a strictly conserved Cysteine).
PHAT: For Transmembrane proteins.
VTML: Maximum Likelihood approach.
Corrected versions (CorBLOSUM) fix the clustering bug in original BLOSUM code.
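To make the position-specific idea concrete, the sketch below builds a tiny PSSM from an ungapped alignment. Everything in it is an illustrative assumption: the 3-letter alphabet, the uniform background, the pseudocount scheme, and the function names `build_pssm` and `score_sequence`. Production tools use more elaborate sequence weighting and pseudocounts.

```python
import math
from collections import Counter


def build_pssm(alignment, background, pseudocount=1.0):
    """Per-column log-odds scores (bits) from an ungapped alignment.

    alignment: equal-length sequences; background: residue -> frequency.
    Pseudocounts are spread according to the background frequencies.
    """
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for residue, bg in background.items():
            freq = (counts[residue] + pseudocount * bg) / (n_seqs + pseudocount)
            scores[residue] = math.log2(freq / bg)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the column-specific scores for each residue of seq."""
    return sum(col[res] for col, res in zip(pssm, seq))


# Hypothetical 3-letter alphabet with uniform background (illustration only).
background = {"C": 1 / 3, "K": 1 / 3, "R": 1 / 3}
alignment = ["CK", "CR", "CK", "CR"]  # column 1 is a conserved C
pssm = build_pssm(alignment, background)

for i, col in enumerate(pssm, start=1):
    print(i, {res: round(val, 2) for res, val in col.items()})
```

The conserved first column rewards C strongly and penalizes anything else, while the variable second column gives K and R the same modest positive score, which a single fixed matrix cannot express.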
| Sequence Identity | Scenario | BLOSUM | PAM |
|---|---|---|---|
| Close (>85%) | Strict boundaries, recent divergence | BLOSUM 80 | PAM 30 |
| Moderate (60-85%) | General Purpose ("Best Guess") | BLOSUM 62 | PAM 160 |
| Distant (30-50%) | Remote homology | BLOSUM 50 | PAM 200 |
| Twilight (<25%) | "The Twilight Zone" | BLOSUM 45 | PAM 250 |