Goal
This project investigates persistent memory in transformers, balancing:
- Low compression loss (lower bits per character, BPC)
- Recall of distant tokens
- Runtime speed
Version 1 proved the concept. Version 2 delivers the optimization.
Architecture
Base: RevenaHybridTiny
- 70M parameters
- 6 layers, 8 heads, 128 context length
- GPT-2 tokenizer
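For reference, those specs map onto a small config object. The class and field names below are hypothetical (the post does not show the actual RevenaHybridTiny config), and the hidden size is omitted because it is not stated:

```python
from dataclasses import dataclass

@dataclass
class RevenaHybridTinyConfig:
    """Hypothetical config mirroring the specs above; field names are assumed."""
    n_layer: int = 6         # transformer layers
    n_head: int = 8          # attention heads
    block_size: int = 128    # context length in tokens
    vocab_size: int = 50257  # GPT-2 BPE vocabulary
    # Hidden size is not stated in the post; the total budget is ~70M parameters.
```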
Memory System (v2)
LTM (Long-Term Memory) Upgrade:
- Storage: numpy ring buffer, max 1024 vectors
- Indexing: FAISS IndexFlatIP (cosine similarity via normalized dot product); uses GPU FAISS if available, falls back to CPU
- Retrieval:
  - Batch query embeddings from the current forward pass
  - Compute cosine similarity against all LTM vectors
  - Softmax-weighted sum over the top-k=5 most similar memories
- Write gating:
  - Store tokens whose surprisal exceeds 4.0 bits
  - FIFO eviction when the buffer is full
- Integration: LTM output is residually added to the token stream before the output projection (see the sketch after this list)
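Here is a minimal sketch of that retrieval/write path, assuming a flat inner-product FAISS index over L2-normalized float32 vectors. The class and method names (`LongTermMemory`, `query`, `maybe_write`) are illustrative, not the project's actual API:

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)


class LongTermMemory:
    """Illustrative ring-buffer LTM with FAISS cosine retrieval (not the project's real class)."""

    def __init__(self, dim, max_entries=1024, top_k=5, surprisal_threshold=4.0):
        self.dim = dim
        self.max_entries = max_entries
        self.top_k = top_k
        self.surprisal_threshold = surprisal_threshold
        self.buffer = np.zeros((max_entries, dim), dtype=np.float32)  # ring-buffer storage
        self.count = 0  # total vectors ever written (slot = count % max_entries)
        self.index = faiss.IndexFlatIP(dim)  # inner product == cosine on normalized vectors
        if hasattr(faiss, "get_num_gpus") and faiss.get_num_gpus() > 0:
            self.index = faiss.index_cpu_to_all_gpus(self.index)  # GPU FAISS if available

    @staticmethod
    def _normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    def query(self, queries):
        """Batched top-k lookup; returns a softmax-weighted sum of retrieved memories."""
        if self.index.ntotal == 0:
            return np.zeros_like(queries)
        q = self._normalize(queries.astype(np.float32))
        k = min(self.top_k, self.index.ntotal)
        sims, ids = self.index.search(q, k)       # (B, k) similarities and indices
        w = np.exp(sims)
        w /= w.sum(axis=1, keepdims=True)         # softmax over the top-k
        return (w[..., None] * self.buffer[ids]).sum(axis=1)

    def maybe_write(self, vectors, surprisal_bits):
        """Write only high-surprisal tokens; FIFO overwrite once the buffer is full."""
        for v, s in zip(vectors, surprisal_bits):
            if s < self.surprisal_threshold:
                continue
            self.buffer[self.count % self.max_entries] = self._normalize(v.astype(np.float32))
            self.count += 1
        self.index.reset()                        # rebuilding 1024 flat entries is cheap
        self.index.add(self.buffer[: min(self.count, self.max_entries)])
```

At 1024 entries a brute-force rebuild after each write batch costs almost nothing, which keeps the write path simple; an approximate index only becomes worth it at far larger memory sizes.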
Code Changes (Summarized)
- Added `_normalize_embeddings`, `_process_ltm_batch`, and `_update_ltm` for memory operations
- Batched LTM lookup with FAISS
- All memory I/O is decoupled from the core transformer blocks
- Forward pass now optionally routes through memory if `model.ltm_enabled = True`
- `add_memory_v2_to_model(model)` upgrades any RevenaHybrid instance with the enhanced memory logic (usage sketch below)
- Note: this design keeps LTM modular and opt-in, usable in both training and inference
Dataset
- OpenWebText, 10k samples
- Loaded via HuggingFace streaming
- Tokenized to GPT-2
- Sliding window, stride 64, context 128
- Split:
- 90% → train.bin
- 10% → val.bin
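A minimal preprocessing sketch under those settings. The on-disk layout of `train.bin`/`val.bin` (raw uint16 token ids) is an assumption, and recent `datasets` releases may need `trust_remote_code=True` for `openwebtext`:

```python
import numpy as np
from datasets import load_dataset
from transformers import GPT2TokenizerFast

CONTEXT, STRIDE, N_SAMPLES = 128, 64, 10_000

# Stream OpenWebText and keep the first 10k documents.
stream = load_dataset("openwebtext", split="train", streaming=True)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

windows = []
for i, example in enumerate(stream):
    if i >= N_SAMPLES:
        break
    ids = tokenizer(example["text"])["input_ids"]
    # Sliding window: one 128-token example every 64 tokens.
    for start in range(0, max(len(ids) - CONTEXT, 0) + 1, STRIDE):
        chunk = ids[start:start + CONTEXT]
        if len(chunk) == CONTEXT:
            windows.append(chunk)

data = np.array(windows, dtype=np.uint16)  # GPT-2's 50,257-token vocab fits in uint16
split = int(0.9 * len(data))               # 90/10 split
data[:split].tofile("train.bin")
data[split:].tofile("val.bin")
```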
Training
Config | Value |
---|---|
Optimizer | AdamW |
Learning Rate | 5e-4 |
Batch Size | 8 |
Steps | 10,000 |
Scheduler | Cosine |
Warmup | 500 steps |
Memory Size | 1024 entries |
Top-k Retrieval | 5 |
Device | RTX 4060 8GB |
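A bare-bones loop matching that table. `get_batch` and the `model(x, targets=y)` signature are placeholders for the project's actual data loader and forward pass:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

LR, BATCH_SIZE, STEPS, WARMUP = 5e-4, 8, 10_000, 500

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # the memory-augmented RevenaHybridTiny from above

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP, num_training_steps=STEPS
)

for step in range(STEPS):
    x, y = get_batch("train.bin", BATCH_SIZE, device)  # hypothetical loader
    logits, loss = model(x, targets=y)                 # assumed forward signature
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()
```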
Results
Bits Per Character (Lower is Better)
Model | BPC |
---|---|
Baseline | 6.30 |
with_ltm | 1.47 |
with_ltm_v2 | 1.40 |
Tokens/Sec (Higher is Better)
Model | Tokens/Sec |
---|---|
Baseline | 9,489 |
with_ltm | 1,638 |
with_ltm_v2 | 21,366 |
V2 is more than twice as fast as the baseline, fully redeeming V1's slowdown.
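For reference, both metrics can be measured with a loop like the one below. The batch format and forward signature are assumptions; BPC is just the mean token NLL converted from nats to bits, normalized by raw character count:

```python
import math
import time
import torch

@torch.no_grad()
def evaluate(model, batches, total_chars, device="cuda"):
    """Measure bits-per-character and tokens/sec.

    `batches` yields (x, y) token-id tensors and `total_chars` is the number of raw
    characters they cover; both, like the forward signature, are assumptions here.
    """
    model.eval()
    nll_nats, n_tokens = 0.0, 0
    start = time.time()
    for x, y in batches:
        x, y = x.to(device), y.to(device)
        logits, loss = model(x, targets=y)   # assumed: loss is mean NLL in nats
        nll_nats += loss.item() * y.numel()
        n_tokens += y.numel()
    elapsed = time.time() - start

    bpc = (nll_nats / math.log(2)) / total_chars  # nats -> bits, then per character
    tokens_per_sec = n_tokens / elapsed
    return bpc, tokens_per_sec
```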
What Changed
V1 computed cosine similarity with per-token Python loops and detached tensor operations, which dragged throughput well below the baseline (1,638 vs. 9,489 tokens/sec).
V2 switched to FAISS (GPU-indexed if available) and implemented:
- Batched queries
- Quantized numpy storage
- Fused similarity → weighted sum retrieval
- Memory use decoupled from transformer path
BPC also dropped slightly versus V1 (1.47 → 1.40), indicating a cleaner retrieval signal and better compression.
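The heart of the speedup, shown schematically. The V1 path below reconstructs the loop-per-entry pattern described above, and float16 is assumed as the storage quantization; in the real code the search itself goes through FAISS, so this numpy version only illustrates why batching and fusing the similarity and weighted-sum steps removes the per-entry Python overhead:

```python
import numpy as np

# V1-style (slow): one Python-level dot product per memory entry, per query token.
def retrieve_v1(query, memory, k=5):
    sims = np.array([float(query @ m) for m in memory])  # per-entry loop
    top = np.argsort(sims)[-k:]                          # indices of the k largest
    w = np.exp(sims[top])
    w /= w.sum()                                         # softmax over top-k
    return sum(wi * memory[i] for wi, i in zip(w, top))

# V2-style (fast): quantized (float16, assumed) storage, one batched matmul,
# and a fused top-k softmax-weighted sum.
def retrieve_v2(queries, memory_fp16, k=5):
    mem = memory_fp16.astype(np.float32)                 # dequantize once per lookup
    sims = queries @ mem.T                               # (B, N) similarities in one matmul
    top = np.argpartition(sims, -k, axis=1)[:, -k:]      # top-k indices per query
    w = np.exp(np.take_along_axis(sims, top, axis=1))
    w /= w.sum(axis=1, keepdims=True)                    # softmax over top-k
    return np.einsum("bk,bkd->bd", w, mem[top])          # fused weighted sum
```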
Conclusion
- Dual memory works
- Batching + FAISS makes it viable on consumer GPUs
- Dynamic routing via surprisal is still an effective low-cost attention heuristic
- Recall improves, compression improves, and speed is no longer a tradeoff
Future Work
- Query-side routing: only attend to memory for surprising tokens
- Test on reasoning tasks (GSM8K, code completions)
- Multimodal memory store (e.g., vision+text)
- Try LTM prefill for few-shot adaptation
AGI is not just about scale; it's about structure. The brain doesn't attend to everything — it remembers what matters. Now so does your transformer.
— Revena