Goal
This project investigates persistent memory in transformers, balancing:
- Low compression loss (lower bits per character, BPC)
- Recall of distant tokens
- Runtime speed
Version 1 proved the concept. Version 2 delivers the optimization.
Architecture
Base: RevenaHybridTiny
- 70M parameters
- 6 layers, 8 heads, 128 context length
- GPT-2 tokenizer
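For reference, those specs map onto a small config object. The class and field names below are hypothetical (the post does not show the actual RevenaHybridTiny config), and the hidden size is omitted because it is not stated:

```python
from dataclasses import dataclass

@dataclass
class RevenaHybridTinyConfig:
    """Hypothetical config mirroring the specs above; field names are assumed."""
    n_layer: int = 6         # transformer layers
    n_head: int = 8          # attention heads
    block_size: int = 128    # context length in tokens
    vocab_size: int = 50257  # GPT-2 BPE vocabulary
    # Hidden size is not stated in the post; the total budget is ~70M parameters.
```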
Memory System (v2)
LTM (Long-Term Memory) Upgrade:
- Storage: numpy ring buffer, max 1024 vectors
- Indexing: FAISS IndexFlatIP (cosine similarity via normalized dot product); uses GPU FAISS if available, falls back to CPU
- Retrieval:
  - Batch query embeddings from the current forward pass
  - Compute cosine similarity against all LTM vectors
  - Softmax-weighted sum over the top-k=5 most similar memories
- Write gating:
  - Store tokens whose surprisal exceeds 4.0 bits
  - FIFO eviction when the buffer is full
- Integration: LTM output is residually added to the token stream before the output projection (see the sketch after this list)
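Here is a minimal sketch of that retrieval/write path, assuming a flat inner-product FAISS index over L2-normalized float32 vectors. The class and method names (`LongTermMemory`, `query`, `maybe_write`) are illustrative, not the project's actual API:

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)


class LongTermMemory:
    """Illustrative ring-buffer LTM with FAISS cosine retrieval (not the project's real class)."""

    def __init__(self, dim, max_entries=1024, top_k=5, surprisal_threshold=4.0):
        self.dim = dim
        self.max_entries = max_entries
        self.top_k = top_k
        self.surprisal_threshold = surprisal_threshold
        self.buffer = np.zeros((max_entries, dim), dtype=np.float32)  # ring-buffer storage
        self.count = 0  # total vectors ever written (slot = count % max_entries)
        self.index = faiss.IndexFlatIP(dim)  # inner product == cosine on normalized vectors
        if hasattr(faiss, "get_num_gpus") and faiss.get_num_gpus() > 0:
            self.index = faiss.index_cpu_to_all_gpus(self.index)  # GPU FAISS if available

    @staticmethod
    def _normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    def query(self, queries):
        """Batched top-k lookup; returns a softmax-weighted sum of retrieved memories."""
        if self.index.ntotal == 0:
            return np.zeros_like(queries)
        q = self._normalize(queries.astype(np.float32))
        k = min(self.top_k, self.index.ntotal)
        sims, ids = self.index.search(q, k)       # (B, k) similarities and indices
        w = np.exp(sims)
        w /= w.sum(axis=1, keepdims=True)         # softmax over the top-k
        return (w[..., None] * self.buffer[ids]).sum(axis=1)

    def maybe_write(self, vectors, surprisal_bits):
        """Write only high-surprisal tokens; FIFO overwrite once the buffer is full."""
        for v, s in zip(vectors, surprisal_bits):
            if s < self.surprisal_threshold:
                continue
            self.buffer[self.count % self.max_entries] = self._normalize(v.astype(np.float32))
            self.count += 1
        self.index.reset()                        # rebuilding 1024 flat entries is cheap
        self.index.add(self.buffer[: min(self.count, self.max_entries)])
```

At 1024 entries a brute-force rebuild after each write batch costs almost nothing, which keeps the write path simple; an approximate index only becomes worth it at far larger memory sizes.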
Code Changes (Summarized)
- Added `_normalize_embeddings`, `_process_ltm_batch`, and `_update_ltm` for memory operations
- Batched LTM lookup with FAISS
- All memory I/O is decoupled from the core transformer blocks
- Forward pass now optionally routes through memory if `model.ltm_enabled = True`
- `add_memory_v2_to_model(model)` upgrades any RevenaHybrid instance with the enhanced memory logic (usage sketch below)
- Note: this design keeps LTM modular and opt-in, usable in both training and inference
Dataset
- OpenWebText, 10k samples
- Loaded via HuggingFace streaming
- Tokenized to GPT-2
- Sliding window, stride 64, context 128
- Split:
- 90% → train.bin
- 10% → val.bin
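A minimal preprocessing sketch under those settings. The on-disk layout of `train.bin`/`val.bin` (raw uint16 token ids) is an assumption, and recent `datasets` releases may need `trust_remote_code=True` for `openwebtext`:

```python
import numpy as np
from datasets import load_dataset
from transformers import GPT2TokenizerFast

CONTEXT, STRIDE, N_SAMPLES = 128, 64, 10_000

# Stream OpenWebText and keep the first 10k documents.
stream = load_dataset("openwebtext", split="train", streaming=True)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

windows = []
for i, example in enumerate(stream):
    if i >= N_SAMPLES:
        break
    ids = tokenizer(example["text"])["input_ids"]
    # Sliding window: one 128-token example every 64 tokens.
    for start in range(0, max(len(ids) - CONTEXT, 0) + 1, STRIDE):
        chunk = ids[start:start + CONTEXT]
        if len(chunk) == CONTEXT:
            windows.append(chunk)

data = np.array(windows, dtype=np.uint16)  # GPT-2's 50,257-token vocab fits in uint16
split = int(0.9 * len(data))               # 90/10 split
data[:split].tofile("train.bin")
data[split:].tofile("val.bin")
```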
Training
Config | Value |
---|---|
Optimizer | AdamW |
Learning Rate | 5e-4 |
Batch Size | 8 |
Steps | 10,000 |
Scheduler | Cosine |
Warmup | 500 steps |
Memory Size | 1024 entries |
Top-k Retrieval | 5 |
Device | RTX 4060 8GB |
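A bare-bones loop matching that table. `get_batch` and the `model(x, targets=y)` signature are placeholders for the project's actual data loader and forward pass:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

LR, BATCH_SIZE, STEPS, WARMUP = 5e-4, 8, 10_000, 500

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # the memory-augmented RevenaHybridTiny from above

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP, num_training_steps=STEPS
)

for step in range(STEPS):
    x, y = get_batch("train.bin", BATCH_SIZE, device)  # hypothetical loader
    logits, loss = model(x, targets=y)                 # assumed forward signature
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()
```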
Results
Bits Per Character (Lower is Better)
Model | BPC |
---|---|
Baseline | 6.30 |
with_ltm | 1.47 |
with_ltm_v2 | 1.40 |
Tokens/Sec (Higher is Better)
Model | Tokens/Sec |
---|---|
Baseline | 9,489 |
with_ltm | 1,638 |
with_ltm_v2 | 21,366 |
V2 is more than twice as fast as the baseline, fully redeeming V1's slowdown.
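For reference, both metrics can be measured with a loop like the one below. The batch format and forward signature are assumptions; BPC is just the mean token NLL converted from nats to bits, normalized by raw character count:

```python
import math
import time
import torch

@torch.no_grad()
def evaluate(model, batches, total_chars, device="cuda"):
    """Measure bits-per-character and tokens/sec.

    `batches` yields (x, y) token-id tensors and `total_chars` is the number of raw
    characters they cover; both, like the forward signature, are assumptions here.
    """
    model.eval()
    nll_nats, n_tokens = 0.0, 0
    start = time.time()
    for x, y in batches:
        x, y = x.to(device), y.to(device)
        logits, loss = model(x, targets=y)   # assumed: loss is mean NLL in nats
        nll_nats += loss.item() * y.numel()
        n_tokens += y.numel()
    elapsed = time.time() - start

    bpc = (nll_nats / math.log(2)) / total_chars  # nats -> bits, then per character
    tokens_per_sec = n_tokens / elapsed
    return bpc, tokens_per_sec
```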
What Changed
V1 computed cosine similarity with per-token Python loops and detached tensor operations, which dragged throughput well below the baseline (1,638 vs. 9,489 tokens/sec).
V2 switched to FAISS (GPU-indexed if available) and implemented:
- Batched queries
- Quantized numpy storage
- Fused similarity → weighted sum retrieval
- Memory use decoupled from transformer path
BPC also dropped slightly versus V1 (1.47 → 1.40), indicating a cleaner retrieval signal and better compression.
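The heart of the speedup, shown schematically. The V1 path below reconstructs the loop-per-entry pattern described above, and float16 is assumed as the storage quantization; in the real code the search itself goes through FAISS, so this numpy version only illustrates why batching and fusing the similarity and weighted-sum steps removes the per-entry Python overhead:

```python
import numpy as np

# V1-style (slow): one Python-level dot product per memory entry, per query token.
def retrieve_v1(query, memory, k=5):
    sims = np.array([float(query @ m) for m in memory])  # per-entry loop
    top = np.argsort(sims)[-k:]                          # indices of the k largest
    w = np.exp(sims[top])
    w /= w.sum()                                         # softmax over top-k
    return sum(wi * memory[i] for wi, i in zip(w, top))

# V2-style (fast): quantized (float16, assumed) storage, one batched matmul,
# and a fused top-k softmax-weighted sum.
def retrieve_v2(queries, memory_fp16, k=5):
    mem = memory_fp16.astype(np.float32)                 # dequantize once per lookup
    sims = queries @ mem.T                               # (B, N) similarities in one matmul
    top = np.argpartition(sims, -k, axis=1)[:, -k:]      # top-k indices per query
    w = np.exp(np.take_along_axis(sims, top, axis=1))
    w /= w.sum(axis=1, keepdims=True)                    # softmax over top-k
    return np.einsum("bk,bkd->bd", w, mem[top])          # fused weighted sum
```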
Conclusion
- Dual memory works
- Batching + FAISS makes it viable on consumer GPUs
- Dynamic routing via surprisal is still an effective low-cost attention heuristic
- Recall improves, compression improves, and speed is no longer a tradeoff
Future Work
- Query-side routing: only attend to memory for surprising tokens
- Test on reasoning tasks (GSM8K, code completions)
- Multimodal memory store (e.g., vision+text)
- Try LTM prefill for few-shot adaptation
AGI is not just about scale; it's about structure. The brain doesn't attend to everything — it remembers what matters. Now so does your transformer.
— Revena