Understanding Scoring Matrices: BLOSUM & PAM

Advanced Bioinformatics Series

An exhaustive analysis of amino acid substitution models, from Markovian extrapolation to empirical block clustering.

The Scale of Complexity

DNA

Discrete Transitions

4 Bases (A, C, G, T). Each base can change to only three others, by a transition or a transversion.

PROT

Physicochemical Landscape

20 Amino Acids. Size, Charge, Hydrophobicity, Flexibility.

Glycine (Gly): Tiny, flexible. Allows tight turns.

Tryptophan (Trp): Bulky, aromatic. Anchors protein cores.

Arginine (Arg): Positively charged. Exposed to solvent.

"Not interchangeable units. Substitution is a function of structural compatibility."

Global Alignment Model

PAM Series

Percent Accepted Mutation

Strict Dataset

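Dayhoff et al. (1978) used 71 families of closely related proteins, comparing sequences at least 85% identical.
Why 85%? At this level of identity, an observed difference almost certainly reflects a single substitution rather than "Multiple Hit" noise at the same site (A→B→A).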

Tree-Based Counting

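Parsimony trees were used to infer ancestral sequences, and substitutions were counted along the branches.
Symmetrized Counts: The direction of a change cannot be determined from the data, so each substitution is counted in both directions, \(A_{xy} = A_{yx}\).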

Markovian Extrapolation

Assumes Time-Homogeneity (rates don't change).

1 PAM = 1 accepted point mutation per 100 residues
\( M_n = (M_1)^n \)

PAM250 is the PAM1 matrix raised to the 250th power. Because repeated and back mutations hit the same sites, 250 PAMs still leaves ~20% sequence identity (Twilight Zone).
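The extrapolation \( M_n = (M_1)^n \) can be sketched with a toy model. The 3-letter alphabet, the matrix `TOY_PAM1`, and the uniform frequencies below are assumptions made for illustration; they are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise matrix m to the n-th power (n >= 0) by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


def expected_identity(matrix, freqs):
    """Fraction of residues expected to be unchanged: sum_i f_i * M[i][i]."""
    return sum(f * matrix[i][i] for i, f in enumerate(freqs))


# Hypothetical one-step matrix: every row sums to 1 and 99% of residues
# stay unchanged, mimicking the PAM1 normalization.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
TOY_FREQS = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform background

toy_pam250 = mat_pow(TOY_PAM1, 250)
identity_1 = expected_identity(TOY_PAM1, TOY_FREQS)
identity_250 = expected_identity(toy_pam250, TOY_FREQS)
print(f"identity after 1 step:    {identity_1:.3f}")
print(f"identity after 250 steps: {identity_250:.3f}")
```

In this toy model the expected identity decays toward the chance level of 1/3 rather than toward zero, which is the same effect that leaves residual identity at PAM250.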

Local Blocks Model

BLOSUM Series

Blocks Substitution Matrix

The BLOCKS Database

Henikoff and Henikoff (1992) used ungapped local alignments (blocks) of conserved regions from more than 500 groups of related proteins.
Feature Selection: Focuses on conserved functional cores and excludes variable, gap-prone loops.

Bias Removal (Clustering)

Sequences sharing at least X% identity are clustered and weighted as a single entity (X = 62 for BLOSUM62).
Counting Rule: Pairs within a cluster are ignored. Pairs between clusters are counted.

Empirical Observation

Weight per sequence = 1 / \(N_{cluster}\)

Direct observation. No extrapolation. BLOSUM62 counts substitutions between clusters formed at a 62% identity threshold, so the sequence pairs that contribute share less than 62% identity.
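The clustering and counting rules can be sketched for a single block column. This is a minimal illustration, assuming the cluster assignments are already known; the function name `weighted_pair_counts` and the example column are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import combinations


def weighted_pair_counts(column, clusters):
    """Count residue pairs in one block column.

    column   -- residues, one per sequence (e.g. "AAVI")
    clusters -- cluster id of each sequence, in the same order
    Each sequence carries weight 1 / (size of its cluster); pairs inside
    a cluster are skipped, pairs between clusters add the product of the
    two weights.
    """
    sizes = Counter(clusters)
    counts = defaultdict(Fraction)
    for (res_a, cl_a), (res_b, cl_b) in combinations(zip(column, clusters), 2):
        if cl_a == cl_b:
            continue  # within-cluster pairs are ignored
        pair = tuple(sorted((res_a, res_b)))
        counts[pair] += Fraction(1, sizes[cl_a]) * Fraction(1, sizes[cl_b])
    return dict(counts)


# Four sequences: the first two are near-identical and share cluster 0.
counts = weighted_pair_counts("AAVI", [0, 0, 1, 2])
for pair, weight in sorted(counts.items()):
    print(pair, weight)
```

The two clustered sequences together contribute exactly as much as one unclustered sequence, which is the bias removal the clustering step is meant to achieve.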

Foundations: Log-Odds & Entropy

The Hypothesis

\(H_1\) (Homology): Common ancestor. Probability \(q_{ij}\).

\(H_0\) (Chance): Random alignment. Probability \(p_i p_j\).

Odds Ratio: \( \frac{q_{ij}}{p_i p_j} \)

Log-Odds Score

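\( S_{ij} = \frac{1}{\lambda} \ln \left( \frac{q_{ij}}{p_i p_j} \right) \)

\(\lambda\) is a positive scale factor that sets the units (e.g. half-bits) before scores are rounded to integers.

\(S > 0\): Pair observed more often than chance (conservative substitution).
\(S < 0\): Pair observed less often than chance (disfavored substitution).
\(S = 0\): Pair observed as often as chance predicts.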
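A minimal sketch of the log-odds calculation, scaled to half-bit units and rounded to an integer. The frequencies in the example are hypothetical round numbers chosen to make the arithmetic easy to follow, and `units_per_bit` is an illustrative parameter name.

```python
import math


def log_odds_score(q_ij, p_i, p_j, units_per_bit=2):
    """Integer log-odds score for one residue pair.

    q_ij          -- observed frequency of the aligned pair (homology model)
    p_i, p_j      -- background frequencies of the two residues (chance model)
    units_per_bit -- 2 gives half-bit units, 3 gives third-bit units
    """
    odds = q_ij / (p_i * p_j)
    return round(units_per_bit * math.log2(odds))


# Hypothetical frequencies; chance expectation is 0.05 * 0.05 = 0.0025.
favored = log_odds_score(0.02, 0.05, 0.05)         # seen 8x more than chance
neutral = log_odds_score(0.0025, 0.05, 0.05)       # seen exactly at chance
disfavored = log_odds_score(0.000625, 0.05, 0.05)  # seen 4x less than chance
print(favored, neutral, disfavored)
```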

Entropy (H)

\( H = \sum_{i,j} q_{ij} \log_2 \left( \frac{q_{ij}}{p_i p_j} \right) \)

Relative entropy, in bits per aligned position.
High H (BLOSUM80): Short, strict alignments.
Low H (BLOSUM45): Long, distant alignments.
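The relationship between conservation and entropy can be checked on a toy example. The two-letter alphabet and both pair-frequency tables below are assumptions for illustration only; they stand in for a "strict" and a "lenient" matrix.

```python
import math


def relative_entropy(q):
    """Relative entropy H in bits for a table of pair frequencies.

    q maps ordered residue pairs (a, b) to frequencies that sum to 1.
    Background frequencies p are taken as the marginals of q.
    """
    letters = sorted({a for a, _ in q})
    p = {a: sum(q[(a, b)] for b in letters) for a in letters}
    return sum(q_ab * math.log2(q_ab / (p[a] * p[b]))
               for (a, b), q_ab in q.items() if q_ab > 0)


# Hypothetical target frequencies over a two-letter alphabet.
STRICT = {("A", "A"): 0.49, ("A", "B"): 0.01,
          ("B", "A"): 0.01, ("B", "B"): 0.49}   # mismatches are rare
LENIENT = {("A", "A"): 0.40, ("A", "B"): 0.10,
           ("B", "A"): 0.10, ("B", "B"): 0.40}  # mismatches are common

h_strict = relative_entropy(STRICT)
h_lenient = relative_entropy(LENIENT)
print(f"strict:  {h_strict:.3f} bits per position")
print(f"lenient: {h_lenient:.3f} bits per position")
```

The table with fewer mismatches carries more information per aligned position, so a shorter alignment is enough to stand out from chance.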

Bit Score & Constraint

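\( S' = \frac{\lambda S - \ln K}{\ln 2} \)

\(S\) is the raw alignment score; \(\lambda\) and \(K\) are the statistical parameters of the scoring system.

Critical Rule: \( E = \sum_{i,j} p_i p_j S_{ij} < 0 \)
The expected score of a random residue pair must be negative. Otherwise, local alignments of unrelated sequences would keep growing by chance.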
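Both formulas are short enough to sketch directly. The values of `lam` and `k` and the two-letter toy matrices are illustrative assumptions, not parameters taken from the text.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_score(freqs, matrix):
    """Expected score of a random residue pair: sum of p_i * p_j * S_ij."""
    return sum(freqs[a] * freqs[b] * matrix[a][b]
               for a in freqs for b in freqs)


# Illustrative statistical parameters (assumed for this sketch).
print(f"{bit_score(100, lam=0.267, k=0.041):.1f} bits")

# Hypothetical two-letter alphabet with uniform background frequencies.
FREQS = {"A": 0.5, "B": 0.5}
VALID = {"A": {"A": 1, "B": -2}, "B": {"A": -2, "B": 1}}
INVALID = {"A": {"A": 1, "B": 0}, "B": {"A": 0, "B": 1}}

e_valid = expected_score(FREQS, VALID)      # negative: usable for local alignment
e_invalid = expected_score(FREQS, INVALID)  # positive: random alignments grow
print(e_valid, e_invalid)
```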

The Algorithm Engine

Affine Gap Penalties

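A substitution matrix alone isn't enough: gaps must be scored too. A single insertion or deletion event often spans several residues, so one long gap is more plausible than many short gaps.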

\( \text{Gap Cost} = \text{Open} + (\text{Extend} \times \text{Length}) \)

Gap Open (e.g., -11): High penalty to start a gap.
Gap Extend (e.g., -1): Low penalty to continue it.

Note: Gap penalties are expressed in the same units as the matrix. If the matrix is rescaled (e.g. to 1/3 bits), the gap penalties must be rescaled proportionally.
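A short sketch of the affine gap formula, using the example penalties above as defaults. The function name `affine_gap_score` is hypothetical.

```python
def affine_gap_score(length, gap_open=-11, gap_extend=-1):
    """Score of one gap: Open + (Extend x Length). Returns 0 for no gap."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_score(3)          # one gap covering 3 residues
three_short_gaps = 3 * affine_gap_score(1)  # three separate 1-residue gaps
print(one_long_gap, three_short_gaps)
```

Both cases delete three residues, but opening a gap three times costs far more than opening it once and extending it.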

BLAST (Local Heuristic)

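Seeding (W=3): Breaks the query into overlapping 3-mer words.
Neighborhood (T): Lists every word that scores at least \(T\) (Threshold) against a query word, then looks for exact matches to those words in the database.
Example: For query word P-Q-G under BLOSUM62, P-Q-G scores 18 and P-E-G scores 15. If T=11, both are seeds.
Extension (X-dropoff): Extends each hit in both directions until the score falls more than \(X\) below the best score seen so far.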
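The word-scoring example and the X-dropoff rule can be sketched as follows. This is a simplified illustration, not BLAST's implementation: the matrix holds only the BLOSUM62 entries needed for the P-Q-G example, extension runs in one direction only, and the per-position scores passed to `xdrop_extend` are made up.

```python
# Only the BLOSUM62 entries needed for the P-Q-G example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("E", "Q"): 2,
}


def word_score(query_word, other_word, matrix):
    """Sum of pairwise scores for two words of equal length."""
    return sum(matrix[(a, b)] for a, b in zip(query_word, other_word))


def is_seed(query_word, other_word, matrix, threshold=11):
    """A word seeds an alignment if it scores at or above the threshold T."""
    return word_score(query_word, other_word, matrix) >= threshold


def xdrop_extend(scores, x_drop):
    """Extend over per-position scores in one direction.

    Stops once the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, length_at_best).
    """
    best = running = 0
    best_len = 0
    for i, s in enumerate(scores, start=1):
        running += s
        if running > best:
            best, best_len = running, i
        elif best - running > x_drop:
            break
    return best, best_len


print(word_score("PQG", "PQG", BLOSUM62_EXCERPT))  # exact word
print(word_score("PQG", "PEG", BLOSUM62_EXCERPT))  # neighborhood word
print(xdrop_extend([4, 5, -2, -3, 6, -4, -4, -4, 9], x_drop=5))
```

In the last call the extension stops before reaching the final +9, which shows the heuristic trade-off: a larger \(X\) tolerates longer poor stretches at the cost of more computation.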

Visual Case Study: Strict vs. Lenient

Strict PAM30 / BLOSUM80

Alignment of close homologs (Human vs Chimp).

Seq A: L-K-W-V
Seq B: L-R-W-V

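Score: Lower than an exact match (K-R earns far less than K-K)

Conservative changes earn little or no credit and most other substitutions are heavily penalized, so high-scoring alignments are near-exact matches.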

Lenient PAM250 / BLOSUM45

Alignment of distant homologs (Human vs Yeast).

Seq A: L-K-W-V
Seq B: I-R-F-L

Score: Positive despite no identical residues (conservative pairs such as L-I and K-R score above zero)

Allows physicochemical substitutions to detect ancient structural similarity.

Equivalence & Specialized Matrices

BLOSUM	PAM Equiv.	Relative Entropy (Bits)
BLOSUM 80	PAM 120	~1.0
BLOSUM 62	PAM 160	~0.7
BLOSUM 45	PAM 250	~0.4

Note: Equivalence is judged by relative entropy. BLOSUM62 corresponds to PAM160, not PAM250.

PSSM (PSI-BLAST)

Position-Specific Scoring Matrix: Instead of a fixed 20x20 matrix, scores depend on the specific position in the alignment (e.g., Position 42 is strictly conserved Cysteine).
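A minimal sketch of position-specific scoring. The three-column PSSM, its scores, and the default penalty are all hypothetical; PSI-BLAST derives real PSSMs from multiple alignments of earlier hits.

```python
def pssm_score(sequence, pssm, default=-4):
    """Score a sequence against a PSSM of the same length.

    pssm    -- one dict per alignment position, mapping residue -> score
    default -- score for residues a column does not list (an assumption)
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    return sum(column.get(residue, default)
               for residue, column in zip(sequence, pssm))


# Hypothetical PSSM: position 1 is a strictly conserved cysteine,
# position 2 tolerates aliphatic residues, position 3 prefers glycine.
TOY_PSSM = [
    {"C": 9},
    {"L": 4, "I": 3, "V": 3},
    {"G": 6, "A": 1},
]

print(pssm_score("CLG", TOY_PSSM))  # matches every preference
print(pssm_score("CIA", TOY_PSSM))  # tolerated substitutions
print(pssm_score("SLG", TOY_PSSM))  # loses the conserved cysteine
```

Unlike a fixed 20x20 matrix, the same residue can score differently at each position: here leucine is rewarded in column 2 but penalized in column 1.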

Specialized Matrices

PHAT: For Transmembrane proteins.
VTML: Maximum Likelihood approach.

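Corrected BLOSUM (RBLOSUM, CorBLOSUM)

Styczynski et al. (2008) found errors in the clustering step of the original BLOSUM code and published recomputed RBLOSUM matrices. CorBLOSUM (Hess et al., 2016) is a later set of corrected matrices.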

Final Decision Matrix

Sequence Identity	Scenario	BLOSUM	PAM
Close (>85%)	Strict boundaries, recent divergence	BLOSUM 80	PAM 30
Moderate (60-85%)	General Purpose ("Best Guess")	BLOSUM 62	PAM 160
Distant (30-50%)	Remote homology	BLOSUM 50	PAM 200
Twilight (<25%)	"The Twilight Zone"	BLOSUM 45	PAM 250
