
RWKV Was a Lie

Reconstructing Attention Through True Recurrence

March 29, 2025

The RWKV architecture sold itself as a linear-time alternative to attention. People called it "the transformer killer." Papers got published. Implementations proliferated. And yet those implementations were wrong.


Not "suboptimal," not "inefficient." Wrong.


Every public implementation of RWKV—every scan, every recurrence, every supposed affine trick—fundamentally misunderstood what the recurrence relation was doing. And the consequences are real: invalid gradients, broken attention paths, and performance bottlenecks at scale.


In this post, I'm going to show you what everyone got wrong, how to fix it, and why it matters.


Figure 1: Comparison of runtime and accuracy between standard RWKV implementations and our corrected DPLR approach.


The Myth of the Affine Scan


Most RWKV variants treat the time-mixing process as an affine map scan:

state[t] = G[t] @ state[t-1] + v[t]

This looks like a recurrence relation, but it's a misrepresentation. In particular, it fails to capture the actual RWKV recurrence, which is:

output[t] = Σ_{s=0}^{t} v[s] · (k[s]ᵀ · P_{s+1→t} · k[t])

See the difference? This is not a vanilla recurrence. It's a weighted key-value summation over the entire past, with transition matrices P_{s+1→t} mediating the influence of each timestep. And those matrices change at every step.


The affine scan shortcut ignores this. Worse: it's mathematically incompatible with the intended behavior.
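
To pin down what the target actually is, here is a deliberately naive, quadratic-time reference for that summation, the kind of thing you run on tiny inputs to check a fast kernel against. The shapes and the toy transition matrices are illustrative assumptions, and I'm reading the chain P_{s+1→t} as G[t] @ … @ G[s+1]; the exact composition convention is mine, not taken from the production code.

import torch

def reference_output(G, k, v):
    """output[t] = sum_{s<=t} v[s] * (k[s]^T @ P_{s+1->t} @ k[t]),
    assuming P_{s+1->t} = G[t] @ ... @ G[s+1] (identity when s == t)."""
    T, d = k.shape
    outs = []
    for t in range(T):
        out = torch.zeros(d)
        P = torch.eye(d)                      # P_{t+1->t} is the identity
        # walk s backwards so the chain grows by one factor per step
        for s in range(t, -1, -1):
            out = out + (k[s] @ P @ k[t]) * v[s]
            if s > 0:
                P = P @ G[s]                  # now P == P_{s->t}, ready for the next s
        outs.append(out)
    return torch.stack(outs)

# toy check on random inputs with simple decaying transitions
T, d = 6, 4
torch.manual_seed(0)
k, v = torch.randn(T, d), torch.randn(T, d)
G = [0.9 * torch.eye(d) for _ in range(T)]
print(reference_output(G, k, v).shape)        # torch.Size([6, 4])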


The Fix — True Recurrence with DPLR Transitions


We rebuilt RWKV from scratch, starting from the math.


Our insight: the recurrence must explicitly model the transition matrices P_{s→t} as a chain of diagonal-plus-low-rank (DPLR) matrices. We define:

G[t] = diag(w) - z @ zᵀ * a

And compose them using exact matrix products. We cache intermediate transitions using a binary tree structure for O(log T) reuse, and support blocked matvecs, sparse diagonals, and randomized SVD compression to keep memory bounded.
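
Here is a minimal sketch of those two pieces: the DPLR transition and a binary product tree. The names dplr_transition and SegmentProductTree are mine for illustration, the rank-one term stands in for general low rank, the composition order follows the same reading as the reference above, and none of the blocking, sparsity, or SVD compression from the full kernels appears here.

import torch

def dplr_transition(w, z, a):
    """G = diag(w) - a * z z^T: a diagonal matrix plus a rank-one perturbation."""
    return torch.diag(w) - a * torch.outer(z, z)

class SegmentProductTree:
    """Caches G-products over a binary tree of index ranges, so any chain
    P_{lo->hi} = G[hi] @ ... @ G[lo] is assembled from O(log T) cached nodes."""
    def __init__(self, Gs):
        self.n = len(Gs)
        self.d = Gs[0].shape[0]
        self.cache = {}
        self._build(Gs, 0, self.n - 1)

    def _build(self, Gs, lo, hi):
        if lo == hi:
            node = Gs[lo]
        else:
            mid = (lo + hi) // 2
            left = self._build(Gs, lo, mid)          # G[mid] @ ... @ G[lo]
            right = self._build(Gs, mid + 1, hi)     # G[hi]  @ ... @ G[mid+1]
            node = right @ left                      # later steps act on the left
        self.cache[(lo, hi)] = node
        return node

    def product(self, lo, hi):
        """P_{lo->hi} = G[hi] @ ... @ G[lo]; identity for an empty range."""
        if lo > hi:
            return torch.eye(self.d)
        return self._query(0, self.n - 1, lo, hi)

    def _query(self, node_lo, node_hi, lo, hi):
        if lo == node_lo and hi == node_hi:
            return self.cache[(node_lo, node_hi)]
        mid = (node_lo + node_hi) // 2
        if hi <= mid:
            return self._query(node_lo, mid, lo, hi)
        if lo > mid:
            return self._query(mid + 1, node_hi, lo, hi)
        left = self._query(node_lo, mid, lo, mid)
        right = self._query(mid + 1, node_hi, mid + 1, hi)
        return right @ left

# P_{s+1->t} from the recurrence above is tree.product(s + 1, t)
d, T = 4, 8
torch.manual_seed(0)
Gs = [dplr_transition(torch.rand(d), torch.randn(d), 0.1) for _ in range(T)]
tree = SegmentProductTree(Gs)
P = tree.product(4, 7)          # G[7] @ G[6] @ G[5] @ G[4]

In this sketch, the tree stores one d × d product per node, roughly 2T of them in total, which is the memory you trade for O(log T) lookups of any chain.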


Yes, this is more complex than a fake affine scan. But it's also mathematically correct. Gradients flow properly. Influence decays appropriately. And inference doesn't hallucinate structure that never existed.


Performance and Scaling


You'd think all this correctness would come at the cost of speed. It doesn't.


With efficient blocking, matrix caching, and custom autograd scan logic, we get:


  • O(T log T) runtime for training
  • O(T) runtime for inference
  • Constant memory overhead via checkpointing
  • Compatibility with torch.compile, AMP, Triton, and sparse matmul

On sequences >32K, this crushes the old RWKV implementations—both in speed and accuracy.
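
As a rough illustration of the checkpointing bullet, here is the pattern in isolation: process the sequence in blocks, and let each block's activations be recomputed during backward instead of stored. block_step and run_checkpointed are toy stand-ins, not the actual DPLR kernel; torch.utils.checkpoint is the only real API used.

import torch
from torch.utils.checkpoint import checkpoint

def block_step(state, x_block, weight):
    """Toy per-block computation carrying a state; placeholder only."""
    for x in x_block:                         # a few steps inside the block
        state = torch.tanh(weight @ state + x)
    return state

def run_checkpointed(x, weight, block_size=128):
    T, d = x.shape
    state = torch.zeros(d)
    for start in range(0, T, block_size):
        block = x[start:start + block_size]
        # activations inside the block are freed and recomputed on backward
        state = checkpoint(block_step, state, block, weight, use_reentrant=False)
    return state

T, d = 512, 16
x = torch.randn(T, d)
weight = torch.randn(d, d, requires_grad=True)
out = run_checkpointed(x, weight)
out.sum().backward()                          # gradients flow through all blocks
print(weight.grad.shape)                      # torch.Size([16, 16])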


Implications


This wasn't just an implementation flaw. The entire model family was being trained and evaluated on broken assumptions.


What we've done is restore the actual recurrence that RWKV claimed to use. This isn't a "better version." It's the version that should've existed from the start.


The implications are severe:


  • Many published RWKV models are improperly trained
  • Prior scaling analyses are invalid
  • Downstream tasks relying on precise recurrence may underperform

If you're using RWKV, you should consider switching to this implementation immediately.


Check out our code here: DPLR Implementation


The Lesson


Don't trust clever approximations unless you've read the math. Especially in recurrence-heavy models, approximations tend to accumulate pathological error.


RWKV was never affine. It was always DPLR.


Now you know.