The Digital Laboratory
A comprehensive mapping of the Entrez engine, curated standards, variation frequencies, and high-throughput analytical workflows.
1. The Entrez Engine
Operating across 40+ databases, Entrez uses pre-computed statistical associations to traverse the "biological logic" of disparate datasets.
Hard Links
Logical, verifiable connections established at submission. Example: A sequence record linked to its PubMed citation via accession number.
Neighbor Links
PubMed: "Related Articles" via text-similarity weights.
Sequence: Pre-computed BLAST clusters identifying statistically relevant relatives.
LinkOut: Federated Portal
E-Utilities API
ESearch → ELink → EPost → EFetch
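The four-step pipeline above can be sketched as a chain of request URLs. This is a minimal illustration rather than a full client: the helper `eutils_url`, the query string, and every UID or WebEnv value are assumptions chosen for the example, and no request is actually sent.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(tool, **params):
    """Build the request URL for one E-utility (hypothetical helper)."""
    return f"{EUTILS_BASE}/{tool}.fcgi?{urlencode(params)}"


# 1. ESearch: text query -> list of UIDs in one database.
search_url = eutils_url(
    "esearch", db="gene",
    term="MLH1[Gene Name] AND Homo sapiens[Organism]", retmode="json",
)

# 2. ELink: follow links from those UIDs into another database.
#    "4292" is an illustrative Gene UID; a real run would use the ESearch result.
link_url = eutils_url("elink", dbfrom="gene", db="nuccore", id="4292", retmode="json")

# 3. EPost: upload the linked UIDs to the History server.
#    The id list is a placeholder, not real output.
post_url = eutils_url("epost", db="nuccore", id="UID_1,UID_2")

# 4. EFetch: download records using the WebEnv/query_key pair that EPost returns.
fetch_url = eutils_url(
    "efetch", db="nuccore", query_key="1", WebEnv="WEBENV_FROM_EPOST",
    rettype="fasta", retmode="text",
)

for url in (search_url, link_url, post_url, fetch_url):
    print(url)
```

Parking the UID list on the History server with EPost means EFetch can reference it by `WebEnv` and `query_key` instead of resending every identifier.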
2. The Gene Nexus & Standards
Anatomy of a Gene Record
MeSH: Literature Normalization
Standardizes the indexing of 36M+ PubMed citations. Queries for "Cancer" or "Tumors" are automatically mapped to the MeSH concept "Neoplasms", so synonyms retrieve the same indexed literature.
Archival vs. Curated
GenBank (The Archive)
Redundant author submissions. May contain errors, cloning artifacts, or haplotypes.
RefSeq (The Standard)
Expert-synthesized standard sequences. Non-redundant biological baseline. Optimized for analysis.
3. Variation & Evolutionary Toolbox
dbSNP: ss# vs rs#
Submission IDs (ss#) cluster into Reference SNPs (rs#). The rsID is the universal key for scientific communication; multiple studies reporting the same variant merge into one rsID.
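The ss#-to-rs# relationship can be pictured as grouping submissions by the locus they map to. The sketch below is a toy model only: the record layout, the `rs_demo_` labels, and the rule "same chromosome and position means same cluster" are simplifying assumptions, not dbSNP's actual clustering procedure.

```python
from collections import defaultdict

# Hypothetical submissions: (ss_id, chromosome, position, submitter).
submissions = [
    ("ss0000001", "3", 1000, "lab_A"),
    ("ss0000002", "3", 1000, "lab_B"),
    ("ss0000003", "7", 2500, "lab_A"),
]


def cluster_submissions(subs):
    """Group ss# records that map to the same genomic location."""
    by_locus = defaultdict(list)
    for ss_id, chrom, pos, _submitter in subs:
        by_locus[(chrom, pos)].append(ss_id)
    # Give each locus a made-up rs-style label in order of first appearance.
    return {
        f"rs_demo_{i}": ss_ids
        for i, ss_ids in enumerate(by_locus.values(), start=1)
    }


clusters = cluster_submissions(submissions)
print(clusters)
```

Two labs reporting the same locus end up under one label, which is why the rsID works as a shared key across studies.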
ALFA Aggregator
Pre-computes and aggregates allele frequencies from 1M+ dbGaP subjects across 12 major populations.
South Asian Case Study: A variant seen at <0.1% in Europeans but at 15% in ALFA South Asians is likely a benign lineage-specific polymorphism.
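The case study's reasoning can be written as a small check over per-population frequencies. This is a sketch under stated assumptions: the function name, the dictionary layout, and both cutoffs (0.1% for "rare", 5% for "common") are illustrative choices, and the frequencies are the round numbers from the example rather than real ALFA output.

```python
def lineage_enriched(freqs, focal, rare_cutoff=0.001, common_cutoff=0.05):
    """Return True if the variant is common in `focal` but rare elsewhere."""
    focal_af = freqs[focal]
    other_afs = [af for pop, af in freqs.items() if pop != focal]
    return focal_af >= common_cutoff and any(af < rare_cutoff for af in other_afs)


# Illustrative allele frequencies, not real ALFA values.
example_freqs = {"European": 0.0008, "South Asian": 0.15}

print(lineage_enriched(example_freqs, "South Asian"))
print(lineage_enriched(example_freqs, "European"))
```

A variant flagged this way is a candidate for a population-specific polymorphism; the frequency pattern alone does not settle clinical significance.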
Advanced Evolutionary Toolkit
COBALT (Alignment)
Constraint-based Multiple Alignment. Uses not only the sequences themselves but also CDD (Conserved Domain Database) domains, which gives the alignment a structural basis.
TreeViewer (Phylogeny)
Visualizes Paralogs (duplications) vs. Orthologs (speciation). Ex: the duplication history of the Creatine Kinase Muscle (M) and Brain (B) isoforms.
CD-Search (Domain Detection)
Runs automatically alongside Protein BLAST to identify functional units (e.g., ATP-binding cassette or zinc-finger domains).
MMDB Structure Neighbors
Detects relationships where sequence identity is < 20% but the 3D folds are conserved. Catches distant cousins that sequence BLAST misses.
4. BLAST Suite & Algorithms
| Algorithm | Type | Best For... |
|---|---|---|
| megablast | Nuc → Nuc | Identical sequences (same species). Optimized for speed. |
| blastp | Prot → Prot | Functional annotation and domain analysis. |
| tblastn | Prot → Nuc | Gene Prediction (e.g. Grey Whale Creatine Kinase discovery). |
| blastx | Nuc → Prot | Translating novel transcripts/ESTs into proteins. |
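The table reduces to a lookup on query type and database type. A minimal sketch, assuming the hypothetical names `BLAST_PROGRAMS` and `choose_program` and covering only the four programs listed above:

```python
# Keys are (query type, database type); values are the program from the table.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "megablast",
    ("protein", "protein"): "blastp",
    ("protein", "nucleotide"): "tblastn",
    ("nucleotide", "protein"): "blastx",
}


def choose_program(query_type, database_type):
    """Pick a BLAST program for the given query and database types."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no program listed for {query_type} -> {database_type}"
        ) from None


# A protein query against a nucleotide database, as in the gene prediction row.
print(choose_program("protein", "nucleotide"))
```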
Search Parameters
Word size: Reduce (11 for blastn vs 28 for megablast) to increase sensitivity for distant homologs.
E-value: A lower threshold (e.g. 1e-5) filters out random statistical noise in massive databases.
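Why a smaller seed word (11 vs 28) raises sensitivity can be shown with a toy count of exact word matches. The sketch below is not the BLAST algorithm: it only counts same-offset exact words between a made-up 60-base query and a copy carrying three substitutions, and all names and sequences are assumptions for illustration.

```python
SUBSTITUTE = {"A": "C", "C": "G", "G": "T", "T": "A"}

# Made-up 60-base query, written in blocks of ten for readability.
query = (
    "ATGGCGTACC" "TTGAACGGTC" "AGCTTAGGCA"
    "TCGATTCGGA" "ACCTGTAGCA" "TCGGATTACA"
)


def mutate(sequence, positions):
    """Return a copy of `sequence` with a substitution at each 0-based position."""
    bases = list(sequence)
    for pos in positions:
        bases[pos] = SUBSTITUTE[bases[pos]]
    return "".join(bases)


def count_seed_words(a, b, word_size):
    """Count offsets where `a` and `b` share an exact word of length `word_size`."""
    last_start = min(len(a), len(b)) - word_size
    return sum(
        a[i:i + word_size] == b[i:i + word_size]
        for i in range(last_start + 1)
    )


# A diverged relative: one mismatch roughly every 15 bases.
subject = mutate(query, positions=[15, 30, 45])

print("word size 11:", count_seed_words(query, subject, 11))
print("word size 28:", count_seed_words(query, subject, 28))
```

With mismatches spaced about 15 bases apart, no exact 28-base word survives, so a word size of 28 finds no seed at all, while a word size of 11 still finds 17.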
Database Bias: ClusteredNR
Standard nr is biased toward humans/mice. ClusteredNR groups sequences that share at least 90% identity into clusters.
Reduces database size by ~40% and helps find sponge/jellyfish homologs buried under mammalian data.
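The idea of collapsing near-identical sequences can be illustrated with a greedy, representative-based clustering. This is a toy sketch, not how ClusteredNR is built: it assumes equal-length sequences, uses a simple per-position identity, and the peptide strings are invented.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold=0.9):
    """Assign each sequence to the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list; its first member is the representative
    for seq in sequences:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Invented peptides: the first two differ at one of ten positions.
toy_database = ["MKTAYIAKQR", "MKTAYIAKQK", "GSHMLEDPVA"]
clusters = greedy_cluster(toy_database)
print(len(toy_database), "sequences ->", len(clusters), "clusters")
```

Searching one representative per cluster is what lets hits from sparsely sampled lineages surface instead of being crowded out by many near-duplicate entries.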
Lynch Syndrome Precision Workflow
Clinical Suspicion to Targets
Clinician suspects Lynch Syndrome (colorectal cancer, CRC). MedGen links to GeneReviews. Target MMR genes identified: MLH1, MSH2, MSH6, PMS2. GTR finds lab tests.
Standardization Baseline
Retrieve the MLH1 RefSeq transcript NM_000249.4 and map patient sequences against this curated standard to ensure consistent HGVS variant naming (e.g. c.123G>A).
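Naming against a fixed reference can be sketched as a position-by-position comparison that emits HGVS-style substitution strings. The function below is a simplified illustration: it assumes both inputs are coding sequences of equal length starting at the A of ATG, handles substitutions only (no insertions or deletions), and uses made-up sequences rather than NM_000249.4.

```python
def substitutions_hgvs(reference_cds, patient_cds):
    """List single-base differences as HGVS-style coding substitutions."""
    if len(reference_cds) != len(patient_cds):
        raise ValueError("this sketch only compares equal-length sequences")
    names = []
    for index, (ref_base, alt_base) in enumerate(zip(reference_cds, patient_cds)):
        if ref_base != alt_base:
            # Coding positions are 1-based, counted from the A of the ATG start.
            names.append(f"c.{index + 1}{ref_base}>{alt_base}")
    return names


toy_reference = "ATGGCCTTA"  # made-up CDS, not a real transcript
toy_patient = "ATGACCTTA"

print(substitutions_hgvs(toy_reference, toy_patient))
```

Because every lab counts from the same curated transcript, the same change receives the same name regardless of who sequenced it.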
Variant Interpretation & Filtering
ClinVar: Expert Panel Review status (3-star) carries more weight as pathogenicity evidence than unreviewed submissions.
Population frequency: If variant frequency is > 5% in the ancestry group, it is likely a benign polymorphism.
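The two filters above can be combined into a simple triage order. This is a hedged sketch, not a clinical rule set: the field names, the order of the checks, and the assumption that the star rating comes from ClinVar's review status are all illustrative, and the example records are invented.

```python
def triage(variant, common_cutoff=0.05):
    """Apply the frequency filter first, then the review-status filter."""
    if variant["ancestry_af"] > common_cutoff:
        return "likely benign (common in ancestry group)"
    if variant["review_stars"] >= 3:
        # Three stars or more: trust the expert-reviewed classification.
        return f"{variant['classification']} (expert-reviewed)"
    return "needs manual review"


# Invented example records.
common_variant = {"ancestry_af": 0.15, "review_stars": 0,
                  "classification": "uncertain significance"}
reviewed_variant = {"ancestry_af": 0.00001, "review_stars": 3,
                    "classification": "pathogenic"}
unreviewed_variant = {"ancestry_af": 0.0002, "review_stars": 1,
                      "classification": "pathogenic"}

for record in (common_variant, reviewed_variant, unreviewed_variant):
    print(triage(record))
```

Anything that falls through both checks goes to the evidence synthesis step rather than being classified automatically.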
Evidence Synthesis
LinkOut to PDB (Structure). If mutation disrupts the MLH1-PMS2 interface, structural evidence supports pathogenicity. PubMed search using rsID retrieves case reports.