How it works
Our attention mechanism is a drop-in replacement for the scaled dot-product attention found in large language models.
Imagine being able to pre-train large language models on a more reasonable budget, without having to worry about the attention fading problem that comes with long context lengths.
True linear attention: memory and compute requirements scale linearly in both the token dimension and the context length.
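The text above does not spell out the mechanism itself, so the snippet below is only a minimal, non-causal sketch of a generic kernel-feature-map linear attention (in the style of Katharopoulos et al., 2020), written in PyTorch. The `linear_attention` name, the `elu + 1` feature map, and the tensor shapes are illustrative assumptions, not our implementation, and this generic form is linear in the number of tokens but still quadratic in the head dimension. The point it illustrates is that such a function can expose the same `(q, k, v)` interface as `torch.nn.functional.scaled_dot_product_attention`, so swapping it in is a one-line change at the call site.

```python
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps=1e-6):
    """Illustrative non-causal kernel-feature-map linear attention.

    A sketch only (not the mechanism described above): it shows how the
    same (q, k, v) interface as scaled dot-product attention can be
    served with memory and compute that grow linearly in the number of
    tokens N, instead of quadratically.
    """
    # Positive feature map phi(x) = elu(x) + 1 (an assumption for this sketch).
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0

    # Summarise keys/values once: (batch, heads, head_dim, head_dim).
    # Cost is linear in N, quadratic in head_dim.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)
    # Per-query normaliser: phi(q_i) . sum_n phi(k_n).
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    # Combine: each query reads the fixed-size summary instead of all N keys.
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)


# Same call shape as torch.nn.functional.scaled_dot_product_attention.
q = torch.randn(2, 8, 4096, 64)   # (batch, heads, tokens, head_dim)
k = torch.randn(2, 8, 4096, 64)
v = torch.randn(2, 8, 4096, 64)

out_sdpa = F.scaled_dot_product_attention(q, k, v)   # quadratic in tokens
out_linear = linear_attention(q, k, v)               # linear in tokens
assert out_sdpa.shape == out_linear.shape
```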