Understanding Scoring Matrices: BLOSUM & PAM
Advanced Bioinformatics Series

An exhaustive analysis of amino acid substitution models, from Markovian extrapolation to empirical block clustering.

The Scale of Complexity

DNA

Discrete Transitions

4 Bases (A, C, G, T). A substitution is a simple swap of one base for another.

PROT

Physicochemical Landscape

20 Amino Acids. Size, Charge, Hydrophobicity, Flexibility.

Glycine (Gly): Tiny, flexible. Allows tight turns.

Tryptophan (Trp): Bulky, aromatic. Anchors protein cores.

Arginine (Arg): Positively charged. Exposed to solvent.

"Not interchangeable units. Substitution is a function of structural compatibility."

Global Alignment Model

PAM Series

Percent Accepted Mutation

Strict Dataset

Dayhoff et al. (1978) selected 71 families of closely related proteins with >85% identity.
Why >85%? So that each observed difference most likely reflects a single mutation event, avoiding "Multiple Hit" noise (A→B→A).

Tree-Based Counting

Used parsimony trees to infer ancestral sequences.
Symmetrized Counts: The direction of a change is treated as reversible over short evolutionary times, so \(A_{xy} = A_{yx}\).

Markovian Extrapolation

Assumes Time-Homogeneity (rates don't change).

1 PAM = 1 accepted point mutation per 100 residues (1% divergence)
\( M_n = (M_1)^n \)

PAM250 is the PAM1 matrix raised to the 250th power, \( M_{250} = (M_1)^{250} \), not PAM1 scaled by 250. At this distance only ~20% sequence identity remains (the Twilight Zone).
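The extrapolation step can be sketched with plain matrix powers. The 3-letter alphabet and the numbers in `TOY_PAM1` below are hypothetical stand-ins for the real 20x20 PAM1 matrix; only the mechanics of \( M_n = (M_1)^n \) carry over.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise square matrix m to the n-th power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical "PAM1" for a 3-letter alphabet (assumption, not Dayhoff's data):
# each residue stays unchanged with probability 0.99 per PAM unit.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

# With uniform residue frequencies, the mean diagonal is the expected identity.
expected_identity = sum(toy_pam250[i][i] for i in range(3)) / 3
```

In this toy model the identity decays toward 1/3 (the chance level for three letters), just as real PAM matrices decay toward the background amino acid frequencies.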

Local Blocks Model

BLOSUM Series

Blocks Substitution Matrix

The BLOCKS Database

Henikoff and Henikoff (1992) used ungapped local alignments (blocks) of conserved domains from 500+ protein groups.
Feature Selection: Focuses on functional cores, ignoring noisy loops.

Bias Removal (Clustering)

Sequences sharing at least X% identity are clustered and treated as a single weighted sequence, where X is the number in the matrix name (X = 62 for BLOSUM62).
Counting Rule: Pairs within a cluster are ignored. Pairs between clusters are counted.

Empirical Observation

Weight = 1 / \(N_{cluster}\)

Direct observation. No extrapolation. BLOSUM62 counts substitutions between clusters formed at 62% identity, so the pairs that contribute are less than 62% identical.
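The counting rule and the 1/N weight can be sketched for a single block column. This is a minimal illustration, assuming cluster membership has already been decided; the function name, the column, and the cluster labels are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def weighted_pair_counts(column, cluster_ids):
    """Count residue pairs in one block column, skipping within-cluster pairs.

    Each sequence carries weight 1 / (size of its cluster), so a cluster of
    near-identical sequences contributes as much as a single sequence.
    """
    sizes = Counter(cluster_ids)
    counts = defaultdict(float)
    for a, b in combinations(range(len(column)), 2):
        if cluster_ids[a] == cluster_ids[b]:
            continue  # pairs within a cluster are ignored
        weight = (1 / sizes[cluster_ids[a]]) * (1 / sizes[cluster_ids[b]])
        pair = tuple(sorted((column[a], column[b])))
        counts[pair] += weight
    return dict(counts)


# Four sequences; the first two are assumed to fall in the same cluster.
toy_column = ["A", "A", "S", "A"]
toy_clusters = [0, 0, 1, 2]
toy_counts = weighted_pair_counts(toy_column, toy_clusters)
```

The total weight equals the number of cluster pairs (three here), which shows that the two clustered sequences together count as one.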

Foundations: Log-Odds & Entropy

The Hypothesis

\(H_1\) (Homology): Common ancestor. Probability \(q_{ij}\).

\(H_0\) (Chance): Random alignment. Probability \(p_i p_j\).

Odds Ratio: \( \frac{q_{ij}}{p_i p_j} \)

Log-Odds Score

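\( S_{ij} = \frac{1}{\lambda} \ln \left( \frac{q_{ij}}{p_i p_j} \right) \)
  • \(S > 0\): Pair occurs more often than chance (conservative substitution).
  • \(S < 0\): Pair occurs less often than chance (disfavored substitution).
  • \(S = 0\): Pair occurs as often as expected by chance.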
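As a rough sketch of how one matrix entry is produced, the function below rounds a scaled log-odds ratio to an integer. The half-bit scaling (`units_per_bit=2`) matches the units BLOSUM62 is published in; the frequencies in the example are made up for illustration.

```python
import math


def log_odds_score(q_ij, p_i, p_j, units_per_bit=2):
    """Integer log-odds score for one residue pair.

    q_ij: observed frequency of the aligned pair among homologs.
    p_i, p_j: background frequencies of the two residues.
    units_per_bit: 2 gives half-bit units (an assumption for this sketch).
    """
    return round(units_per_bit * math.log2(q_ij / (p_i * p_j)))


# Hypothetical frequencies: both residues have background frequency 0.05,
# so the chance expectation for the pair is 0.0025.
favored = log_odds_score(0.01, 0.05, 0.05)          # seen 4x more than chance
neutral = log_odds_score(0.0025, 0.05, 0.05)        # seen as often as chance
disfavored = log_odds_score(0.000625, 0.05, 0.05)   # seen 4x less than chance
```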

Entropy (H)

\( H = \sum_{i,j} q_{ij} \log_2 \left( \frac{q_{ij}}{p_i p_j} \right) \)

Bits per position.
High H (BLOSUM80): Short, strict alignments.
Low H (BLOSUM45): Long, distant alignments.
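The link between strictness and entropy can be checked on a toy case. The two pair-frequency tables below are hypothetical two-letter models, not real BLOSUM data; the background frequencies are taken as the marginals of the pair frequencies.

```python
import math


def relative_entropy(pair_freqs):
    """Relative entropy in bits of target pair frequencies vs. chance pairing."""
    letters = sorted({a for a, _ in pair_freqs})
    # Background frequency of each letter = marginal of the pair frequencies.
    background = {a: sum(pair_freqs[(a, b)] for b in letters) for a in letters}
    return sum(q * math.log2(q / (background[a] * background[b]))
               for (a, b), q in pair_freqs.items() if q > 0)


# Hypothetical "strict" model: aligned residues almost always identical.
STRICT = {("A", "A"): 0.49, ("B", "B"): 0.49, ("A", "B"): 0.01, ("B", "A"): 0.01}
# Hypothetical "lenient" model: mismatches are fairly common.
LENIENT = {("A", "A"): 0.40, ("B", "B"): 0.40, ("A", "B"): 0.10, ("B", "A"): 0.10}

h_strict = relative_entropy(STRICT)
h_lenient = relative_entropy(LENIENT)
```

The strict model carries more information per aligned position, so a shorter alignment is enough to stand out from chance.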

Bit Score & Constraint

\( S' = \frac{\lambda S - \ln K}{\ln 2} \)

Critical Rule: \( E = \sum_{i,j} p_i p_j S_{ij} < 0 \)
The expected score of a random pair must be negative; otherwise local alignments of unrelated sequences would keep growing.
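Both formulas are easy to check numerically. In the sketch below, the two-letter scoring system is hypothetical, and the values \(\lambda = 0.267\) and \(K = 0.041\) are figures often quoted for BLOSUM62 with 11/1 gap penalties; treat them as assumptions chosen for illustration.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_score(background, scores):
    """Expected score of a random pair: sum of p_i * p_j * S_ij over all pairs."""
    return sum(background[a] * background[b] * scores[(a, b)]
               for a in background for b in background)


# Hypothetical two-letter scoring system: +1 for a match, -2 for a mismatch.
TOY_BACKGROUND = {"A": 0.5, "B": 0.5}
TOY_SCORES = {("A", "A"): 1, ("B", "B"): 1, ("A", "B"): -2, ("B", "A"): -2}

toy_expectation = expected_score(TOY_BACKGROUND, TOY_SCORES)

# Assumed statistical parameters (see the note above).
example_bits = bit_score(100, lam=0.267, k=0.041)
```

A scoring system whose `expected_score` came out positive would fail the constraint and would not be usable for local alignment.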

The Algorithm Engine

Affine Gap Penalties

A simple matrix score isn't enough. Biology prefers one long gap over many short gaps.

\( \text{Gap Cost} = \text{Open} + (\text{Extend} \times \text{Length}) \)
  • Gap Open (e.g., -11): High penalty to start a gap.
  • Gap Extend (e.g., -1): Low penalty to continue it.
Note: If the matrix uses different units (e.g. 1/3-bit instead of 1/2-bit), gap penalties must be rescaled proportionally.
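A short sketch of the affine formula, using the example penalties above as positive costs (11 to open, 1 per gapped position). The function name and defaults are illustrative.

```python
def gap_cost(length, gap_open=11, gap_extend=1):
    """Affine gap cost: open + extend * length, returned as a positive cost."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = gap_cost(3)          # a single gap spanning 3 positions
three_short_gaps = 3 * gap_cost(1)  # three separate gaps of 1 position each
```

The single long gap is far cheaper than three short ones, which is the preference the affine model is meant to encode.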

BLAST (Local Heuristic)

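  • Seeding (W=3): Breaks the query into overlapping 3-mer words.
  • Neighborhood (T): Collects every word that scores at least \(T\) (the threshold) against a query word.
    Example: P-Q-G scores 18 against itself and 15 against P-E-G under BLOSUM62. With T=11, both words are seeds.
  • Extension (X-dropoff): Extends each hit until the score falls more than \(X\) below the best score seen so far.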
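The neighborhood example can be reproduced with a handful of BLOSUM62 entries. This is a sketch, not BLAST's implementation: the dictionary holds only the few scores needed here, the candidate list is supplied by hand, and the threshold is treated as inclusive (an assumption).

```python
# Only the BLOSUM62 entries needed for this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("E", "Q"): 2,
}


def word_score(query_word, other_word):
    """Sum of pairwise scores between two words of equal length."""
    return sum(BLOSUM62_SUBSET[(a, b)] for a, b in zip(query_word, other_word))


def neighborhood(query_word, candidates, threshold):
    """Keep candidate words scoring at least `threshold` against the query word."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]


exact = word_score("PQG", "PQG")
similar = word_score("PQG", "PEG")
seeds = neighborhood("PQG", ["PQG", "PEG"], threshold=11)
```

Raising the threshold to 16 would keep only the exact word, which shows how T trades sensitivity for speed.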

Visual Case Study: Strict vs. Lenient

Strict PAM30 / BLOSUM80

Alignment of close homologs (Human vs Chimp).

Seq A: L-K-W-V
Seq B: L-R-W-V
Score: Lower than an exact match (K-R earns little or no credit)

Even conservative changes earn little credit, so only near-exact matches score well.

Lenient PAM250 / BLOSUM45

Alignment of distant homologs (Human vs Yeast).

Seq A: L-K-W-V
Seq B: I-R-F-L
Score: High (Positives for L-I, K-R, W-F)

Allows physicochemical substitutions to detect ancient structural similarity.

Equivalence & Specialized Matrices

BLOSUM | PAM Equiv. | Entropy (Bits)
BLOSUM 80 | PAM 120 | ~1.0
BLOSUM 62 | PAM 160 | ~0.7
BLOSUM 45 | PAM 250 | ~0.4
Note: BLOSUM62 corresponds to PAM160, not PAM250.

PSSM (PSI-BLAST)

Position-Specific Scoring Matrix: Instead of a fixed 20x20 matrix, scores depend on the specific position in the alignment (e.g., Position 42 is strictly conserved Cysteine).
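A position-specific matrix can be sketched from a small alignment. This is a simplified illustration only: it assumes a uniform background and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The alignment is made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Score an ungapped sequence of the same length as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, seq))


# Hypothetical alignment in which the second column is a conserved cysteine.
toy_alignment = ["ACDK", "GCEK", "ACDR"]
toy_pssm = build_pssm(toy_alignment)
```

At the conserved column, cysteine gets a clearly positive score and every other residue a negative one, which a fixed 20x20 matrix cannot express.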

Specialized Matrices

PHAT: For Transmembrane proteins.
VTML: Maximum Likelihood approach.

Styczynski Correction (CorBLOSUM)

Styczynski et al. (2008) found calculation errors in the clustering step of the original BLOSUM code. Corrected matrices (RBLOSUM, and the later CorBLOSUM) fix these errors.

Final Decision Matrix
Divergence (identity) | Use case | BLOSUM | PAM
Close (>85%) | Strict boundaries, recent divergence | BLOSUM 80 | PAM 30
Moderate (60-85%) | General Purpose ("Best Guess") | BLOSUM 62 | PAM 160
Distant (30-50%) | Remote homology | BLOSUM 50 | PAM 200
Twilight (<25%) | "The Twilight Zone" | BLOSUM 45 | PAM 250
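The decision matrix can be expressed as a small lookup. The function name is hypothetical, and identities that fall between the listed ranges (50-60% and 25-30%) return `None` because the table does not assign them a matrix.

```python
def suggest_matrices(percent_identity):
    """Map an expected percent identity to the (BLOSUM, PAM) pair listed above."""
    if percent_identity > 85:
        return ("BLOSUM80", "PAM30")    # close
    if 60 <= percent_identity <= 85:
        return ("BLOSUM62", "PAM160")   # moderate, general purpose
    if 30 <= percent_identity <= 50:
        return ("BLOSUM50", "PAM200")   # distant
    if percent_identity < 25:
        return ("BLOSUM45", "PAM250")   # twilight zone
    return None  # range not covered by the table


close_choice = suggest_matrices(95)
default_choice = suggest_matrices(70)
```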