Visualizing interaction with softmax (dense) and α-entmax (sparse) attention.
The RoPE contribution in isolation: a sum of periodic terms with frequencies $\omega_k = \theta^{-2k/d}$.
With α-entmax, dead zones appear where frequencies interfere destructively.
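As a minimal sketch of the frequency schedule mentioned above (the function name and default base are illustrative, with `theta = 10000.0` being the base used in the original RoPE paper):

```python
import numpy as np

def rope_frequencies(d, theta=10000.0):
    """Per-pair rotation frequencies omega_k = theta^(-2k/d).

    d     : head dimension (each frequency rotates one 2-D subspace,
            so there are d // 2 of them)
    theta : RoPE base; larger bases stretch the longest wavelengths
    """
    k = np.arange(d // 2)
    return theta ** (-2.0 * k / d)
```

The frequencies decay geometrically from 1 toward $\theta^{-1}$, so low-index pairs rotate quickly (local position information) while high-index pairs rotate slowly (long-range position information).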
The α-entmax transformation generalizes softmax with a sparsity parameter α:

$$\alpha\text{-entmax}(\bm{z}) = \operatorname*{argmax}_{\bm{p} \in \triangle} \; \bm{p}^\top \bm{z} + H_\alpha^T(\bm{p}),$$

where $H_\alpha^T(\bm{p}) = \frac{1}{\alpha(\alpha-1)} \sum_j (p_j - p_j^\alpha)$ is the Tsallis entropy.
- α → 1 yields softmax (dense)
- α = 2 yields sparsemax (sparse)
- any α > 1 yields sparse solutions
For α > 1, the solution has the closed form:

$$\alpha\text{-entmax}(\bm{z})_j = \left[(\alpha - 1)\, z_j - \tau(\bm{z})\right]_+^{1/(\alpha - 1)},$$

where the threshold $\tau(\bm{z})$ ensures $\sum_j p_j = 1$, and $[x]_+ = \max(0, x)$.
Entries with $z_j < \frac{\tau(\bm{z})}{\alpha - 1}$ become exactly zero.
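For α = 2 (sparsemax) the closed form simplifies to $p_j = [z_j - \tau(\bm{z})]_+$, and the threshold can be found exactly by sorting. A minimal NumPy sketch (the function name is ours; the sort-based threshold search is the standard sparsemax algorithm):

```python
import numpy as np

def sparsemax(z):
    """Closed-form alpha-entmax for alpha = 2: p_j = max(z_j - tau, 0).

    tau is chosen so the output sums to 1, found by sorting the logits
    and locating the largest support size k with 1 + k * z_(k) > sum of
    the top-k logits.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # logits in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum  # which top-k sets are feasible
    k_star = k[support][-1]              # size of the support
    tau = (cumsum[k_star - 1] - 1) / k_star
    return np.maximum(z - tau, 0.0)
```

For example, `sparsemax([2.0, 1.0, 0.0])` assigns all mass to the first entry and returns exact zeros elsewhere, illustrating how entries below the threshold are pruned rather than merely down-weighted as in softmax.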
📖 Citation:
```bibtex
@inproceedings{vasylenko2026longcontext,
  title={Long-Context Generalization with Sparse Attention},
  author={Vasylenko, Pavlo and Pitorro, Hugo and Martins, Andr{\'e} FT and Treviso, Marcos},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=PsB6Lynznk}
}
```