Positional Encodings × Sparse Attention

An interactive visualization of how positional encodings interact with softmax (dense) and α-entmax (sparse) attention.

Interactive controls: Positional Encoding · Attention Transformation · Input Content Distribution (q/k) · Parameters.

Panels: Attention Logits (before transformation) · Full Causal Attention Matrix (color legend: Low / Medium / High / Dead entry) · Attention Weights (after transformation) · Attention vs Distance Profile.

Reported statistics: Normalized Entropy · Max Weight · Non-zero · Sparsity % · Dead Zones.
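
These statistics summarize the selected attention row. The demo does not spell out their definitions, so the NumPy sketch below uses assumed ones (normalized entropy = Shannon entropy divided by log n, sparsity % = share of exactly-zero weights, and a "dead zone" counted as a maximal contiguous run of zero-weight keys); the function name and formulas are illustrative, not taken from the demo.

```python
import numpy as np

def attention_row_stats(p):
    """Summary statistics for one row of attention weights (assumed definitions)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    nonzero = p > 0
    k = int(nonzero.sum())
    entropy = -np.sum(p[nonzero] * np.log(p[nonzero]))        # Shannon entropy
    # Mark the starts of maximal runs of zero-weight keys ("dead zones").
    zero_runs = np.diff(np.concatenate(([0], (~nonzero).astype(int), [0])))
    return {
        "normalized_entropy": entropy / np.log(n) if n > 1 else 0.0,
        "max_weight": float(p.max()),
        "non_zero": k,
        "sparsity_pct": 100.0 * (n - k) / n,
        "dead_zones": int(np.sum(zero_runs == 1)),
    }

# Example: a sparse attention row over 6 visible (causal) positions.
print(attention_row_stats([0.5, 0.3, 0.0, 0.0, 0.2, 0.0]))
```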

Math Details

The α-entmax transformation generalizes softmax with a sparsity parameter α:

$$\alpha\text{-entmax}(\bm{z}) = \arg\max_{\bm{p} \in \triangle_{n-1}} \bm{p}^\top \bm{z} + H_\alpha^T(\bm{p})$$

where $H_\alpha^T(\bm{p}) = \frac{1}{\alpha(\alpha-1)} \sum_j (p_j - p_j^\alpha)$ is the Tsallis entropy.

α → 1 recovers softmax (dense)
α = 2 gives sparsemax
any α > 1 allows exactly-zero attention weights (sparse)
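
For completeness, the two endpoints above follow directly from the Tsallis entropy: as α → 1 it reduces to the Shannon entropy (whose regularized maximizer is the softmax), and at α = 2 it becomes the Gini entropy underlying sparsemax:

$$\lim_{\alpha \to 1} H_\alpha^T(\bm{p}) = -\sum_j p_j \log p_j, \qquad H_2^T(\bm{p}) = \frac{1}{2} \sum_j p_j (1 - p_j)$$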

For α > 1, the solution has closed form:

$$[\alpha\text{-entmax}(\bm{z})]_j = \left[(\alpha - 1)z_j - \tau(\bm{z})\right]_+^{1/(\alpha-1)}$$

where $\tau(\bm{z})$ ensures $\sum_j p_j = 1$, and $[x]_+ = \max(0, x)$.

Entries with $z_j \le \frac{\tau(\bm{z})}{\alpha - 1}$ receive exactly zero weight.
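
As a concrete illustration of this closed form (a minimal NumPy sketch, not the paper's implementation; the function name and bisection bracket are ours), τ can be found numerically by bisection so that the thresholded weights sum to 1:

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax of a 1-D logit vector z (alpha > 1) via bisection on tau.

    Solves sum_j [(alpha - 1) * z_j - tau]_+^(1 / (alpha - 1)) = 1, matching
    the closed form above. alpha = 2 reproduces sparsemax; alpha close to 1
    approaches softmax.
    """
    zs = (alpha - 1.0) * np.asarray(z, dtype=float)   # scaled logits
    lo, hi = zs.max() - 1.0, zs.max()                 # tau lies in this bracket
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        total = (np.clip(zs - tau, 0.0, None) ** (1.0 / (alpha - 1.0))).sum()
        if total >= 1.0:
            lo = tau      # weights still sum to >= 1: threshold can go higher
        else:
            hi = tau      # overshot: lower the threshold
    p = np.clip(zs - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()    # small renormalization after finite bisection

z = np.array([2.0, 1.0, 0.5, -1.0])
print(entmax_bisect(z, alpha=1.5))  # the lowest logit gets exactly zero weight
print(entmax_bisect(z, alpha=2.0))  # sparsemax: even sparser on this example
```

Exact sort-based solvers exist for α = 1.5 and α = 2, and bisection handles arbitrary α; the sketch above favors simplicity over speed.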

📖 Citation:

@inproceedings{vasylenko2026longcontext,
title={Long-Context Generalization with Sparse Attention},
author={Vasylenko, Pavlo and Pitorro, Hugo and Martins, Andr{\'e} FT and Treviso, Marcos},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=PsB6Lynznk}
}