
Memory Is All You Need

March 8, 2024

View on GitHub

Goal

This project investigates persistent memory in transformers, balancing:

  • Low compression loss (lower BPC)
  • Recall of distant tokens
  • Runtime speed

Version 1 proved the concept. Version 2 delivers the optimization.

Architecture

Base: RevenaHybridTiny

  • 70M parameters
  • 6 layers, 8 heads, 128-token context length (see the config sketch below)
  • GPT-2 tokenizer
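
For concreteness, here is roughly how those numbers translate into a config object. The class and field names are illustrative, and the hidden width is a guess chosen only so the parameter count lands near the stated 70M; none of this is taken from the actual RevenaHybridTiny code.

```python
from dataclasses import dataclass

@dataclass
class RevenaHybridTinyConfig:
    # Illustrative field names; only the values stated above come from the post.
    n_layer: int = 6         # transformer blocks
    n_head: int = 8          # attention heads per block
    block_size: int = 128    # context length in tokens
    vocab_size: int = 50257  # GPT-2 tokenizer vocabulary
    n_embd: int = 704        # assumed hidden width; puts the total parameter count near 70M
```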

Memory System (v2)

LTM Upgrade:

  • Storage: numpy ring buffer, max 1024 vectors
  • Indexing: FAISS with IndexFlatIP (cosine similarity via normalized dot product)
    • Uses GPU FAISS if available, falls back to CPU
  • Retrieval:
    • Batch query embeddings from current forward pass
    • Compute cosine similarity with all LTM vectors
    • Softmax-weighted sum over the top-k=5 most similar memories
  • Write Gating:
    • Store tokens whose surprisal exceeds 4.0 bits
    • FIFO eviction if full
  • Integration: the LTM output is residually added to the token stream before the output projection (see the sketch below)
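
Below is a minimal sketch of how such a store could look. The class name, method names, and shapes are assumptions based on the description above, not the project's actual code; and while the post mentions quantized numpy storage, the sketch keeps float32 because FAISS flat indexes consume float32 input.

```python
import faiss                      # faiss-cpu or faiss-gpu
import numpy as np
import torch
import torch.nn.functional as F


class LTMStore:
    """Sketch of the v2 long-term memory: numpy ring buffer + FAISS IndexFlatIP."""

    def __init__(self, dim: int, max_size: int = 1024, top_k: int = 5,
                 surprisal_threshold: float = 4.0):
        self.buffer = np.zeros((max_size, dim), dtype=np.float32)  # ring buffer storage
        self.max_size, self.top_k, self.threshold = max_size, top_k, surprisal_threshold
        self.write_ptr, self.count = 0, 0
        # Inner product over unit vectors == cosine similarity.
        # With GPU FAISS available, this index could be moved via faiss.index_cpu_to_gpu.
        self.index = faiss.IndexFlatIP(dim)

    @staticmethod
    def _normalize(x: np.ndarray) -> np.ndarray:
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    def write(self, embeddings: torch.Tensor, surprisal_bits: torch.Tensor) -> None:
        """Write gating: keep only tokens whose surprisal exceeds the 4.0-bit threshold."""
        keep = embeddings[surprisal_bits > self.threshold].detach().cpu().numpy()
        for vec in self._normalize(keep.astype(np.float32)):
            self.buffer[self.write_ptr] = vec                  # FIFO overwrite when full
            self.write_ptr = (self.write_ptr + 1) % self.max_size
            self.count = min(self.count + 1, self.max_size)
        # Rebuilding a 1024-entry flat index is cheap; re-add the live slice of the buffer.
        self.index.reset()
        self.index.add(self.buffer[: self.count])

    def read(self, queries: torch.Tensor) -> torch.Tensor:
        """Batched retrieval: top-k cosine search, then a softmax-weighted sum of memories."""
        if self.count == 0:
            return torch.zeros_like(queries)
        q = self._normalize(queries.detach().cpu().numpy().astype(np.float32))
        sims, idx = self.index.search(q, min(self.top_k, self.count))   # (B, k)
        weights = F.softmax(torch.from_numpy(sims), dim=-1)             # (B, k)
        neighbours = torch.from_numpy(self.buffer[idx])                 # (B, k, dim)
        out = (weights.unsqueeze(-1) * neighbours).sum(dim=1)           # (B, dim)
        return out.to(queries.device, queries.dtype)
```

The (B, dim) tensor returned by read is what would be residually added to the hidden states before the output projection.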

Code Changes (Summarized)

  • Added _normalize_embeddings, _process_ltm_batch, and _update_ltm for memory operations
  • Batched LTM lookup with FAISS
  • All memory I/O is decoupled from core transformer blocks
  • Forward pass now optionally routes through memory if model.ltm_enabled = True
  • add_memory_v2_to_model(model) upgrades any RevenaHybrid instance with enhanced memory logic
  • Note: This design keeps LTM modular and opt-in, usable in both training and inference (a wrapper sketch follows)
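
A sketch of how that opt-in wiring could look, reusing the LTMStore sketch above. Only the names add_memory_v2_to_model and ltm_enabled come from the list; the constructor arguments and attribute paths are assumptions.

```python
def add_memory_v2_to_model(model, memory_size: int = 1024, top_k: int = 5):
    """Attach the enhanced memory to an existing RevenaHybrid instance (illustrative sketch)."""
    hidden_dim = model.config.n_embd        # attribute path is an assumption
    model.ltm = LTMStore(hidden_dim, max_size=memory_size, top_k=top_k)
    model.ltm_enabled = False               # opt-in: the caller flips this on explicitly
    return model

# Usage (the constructor call is illustrative):
# model = add_memory_v2_to_model(RevenaHybridTiny(config))
# model.ltm_enabled = True   # forward pass now routes hidden states through the LTM
```

Inside the forward pass, the pre-projection hidden states would call model.ltm.read(...) and add the result residually when model.ltm_enabled is True, with model.ltm.write(...) invoked once per step using the computed surprisal.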

Dataset

  • OpenWebText, 10k samples
  • Loaded via HuggingFace streaming
  • Tokenized with the GPT-2 tokenizer
  • Sliding window, stride 64, context 128 (see the preparation sketch after this list)
  • Split:
    • 90% → train.bin
    • 10% → val.bin
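
A preparation sketch under these settings. The dataset identifier, helper calls, and uint16 packing are assumptions; only the 10k-sample count, the stride/context sizes, the 90/10 split, and the train.bin/val.bin names come from the list above.

```python
import numpy as np
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
stream = load_dataset("openwebtext", split="train", streaming=True)  # HF streaming

CONTEXT, STRIDE, NUM_SAMPLES = 128, 64, 10_000
windows = []
for i, sample in enumerate(stream):
    if i >= NUM_SAMPLES:
        break
    ids = tokenizer(sample["text"])["input_ids"]
    # Sliding window: overlapping chunks of CONTEXT tokens, advancing by STRIDE.
    for start in range(0, max(len(ids) - CONTEXT, 0) + 1, STRIDE):
        windows.append(ids[start:start + CONTEXT])

windows = [w for w in windows if len(w) == CONTEXT]   # drop partial windows
split = int(0.9 * len(windows))
np.array(windows[:split], dtype=np.uint16).tofile("train.bin")   # 90% -> train.bin
np.array(windows[split:], dtype=np.uint16).tofile("val.bin")     # 10% -> val.bin
```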

Training

| Config | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning Rate | 5e-4 |
| Batch Size | 8 |
| Steps | 10,000 |
| Scheduler | Cosine |
| Warmup | 500 steps |
| Memory Size | 1024 entries |
| Top-k Retrieval | 5 |
| Device | RTX 4060 (8 GB) |
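
In PyTorch terms, the optimizer and schedule rows map roughly onto the following. The warmup-then-cosine lambda is one common way to combine the two and is not claimed to be the project's exact code.

```python
import math
import torch

# `model` is the RevenaHybridTiny instance being trained (defined elsewhere).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

TOTAL_STEPS, WARMUP_STEPS = 10_000, 500

def lr_lambda(step: int) -> float:
    # Linear warmup for the first 500 steps, then cosine decay to zero.
    if step < WARMUP_STEPS:
        return step / max(WARMUP_STEPS, 1)
    progress = (step - WARMUP_STEPS) / max(TOTAL_STEPS - WARMUP_STEPS, 1)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Each step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```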

Results

Bits Per Character (Lower is Better)

| Model | BPC |
| --- | --- |
| Baseline | 6.30 |
| with_ltm | 1.47 |
| with_ltm_v2 | 1.40 |

Tokens/Sec (Higher is Better)

| Model | Tokens/Sec |
| --- | --- |
| Baseline | 9,489 |
| with_ltm | 1,638 |
| with_ltm_v2 | 21,366 |

V2 is more than twice as fast as the baseline while keeping V1's compression gains: full redemption.

What Changed

V1 ran cosine similarity with per-token Python loops over detached tensors, which severely limited throughput.

V2 switched to FAISS (GPU-indexed when available) and implemented the following (contrasted in the sketch after this list):

  • Batched queries
  • Quantized numpy storage
  • Fused similarity → weighted sum retrieval
  • Memory use decoupled from transformer path
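
To make the throughput gap concrete, here is a side-by-side sketch of the two retrieval styles. Both snippets are illustrative rather than the project's code; retrieve_v2 assumes the queries arrive as normalized float32 numpy arrays alongside a FAISS index and its backing buffer.

```python
import numpy as np
import torch

# V1-style (slow): a Python loop per stored memory, on detached tensors.
def retrieve_v1(query, memories, k=5):
    query = query.detach()
    sims = torch.stack([
        torch.dot(query, m) / (query.norm() * m.norm() + 1e-8)
        for m in memories                                   # one iteration per memory vector
    ])
    top = torch.topk(sims, k)
    weights = torch.softmax(top.values, dim=0)
    return sum(w * memories[int(i)] for w, i in zip(weights, top.indices))

# V2-style (fast): one batched FAISS call for every query, then a fused weighted sum.
def retrieve_v2(queries, index, buffer, k=5):
    sims, idx = index.search(queries, k)                    # (B, k) in a single call
    weights = torch.softmax(torch.from_numpy(sims), dim=-1)
    return (weights.unsqueeze(-1) * torch.from_numpy(buffer[idx])).sum(dim=1)
```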

BPC dropped slightly versus V1 (1.47 → 1.40), suggesting a cleaner retrieval signal and better compression.

Conclusion

  • Dual memory works
  • Batching + FAISS makes it viable on consumer GPUs
  • Dynamic routing via surprisal is still an effective low-cost attention heuristic
  • Recall improves, compression improves, and speed is no longer a tradeoff

Future Work

  • Query-side routing: only attend to memory for surprising tokens
  • Test on reasoning tasks (GSM8K, code completion)
  • Multimodal memory store (e.g., vision+text)
  • Try LTM prefill for few-shot adaptation

AGI is not just about scale; it's about structure. The brain doesn't attend to everything — it remembers what matters. Now so does your transformer.

— Revena