Logarithmic Attention: New Twist on Transformer Sparks AI Debate

[City, Date] – A new blog post is reigniting discussions within the AI community regarding the attention mechanism, a core component of the groundbreaking Transformer architecture introduced seven years ago. The author posits that the attention mechanism implemented in Transformers should be considered to have a logarithmic, rather than quadratic, computational complexity.

This intriguing perspective has garnered significant attention, including high praise from Andrej Karpathy, a prominent figure in the field. Karpathy commented, “Sometimes I describe this in my head as the full compute graph of a neural network as ‘breadth is free, depth is expensive.’ As far as I can tell, this was the main insight/inspiration behind the Transformer. The first time I was really struck by it was when I read the Neural GPU paper (https://arxiv.org/abs/1511.08228) a long time ago.”

The core of the debate revolves around the perceived computational cost of the attention mechanism. Standard attention, as implemented in Transformers, involves the following steps:

Dot-product calculation: This involves a matrix multiplication of the query (Q) and key (K) matrices (QK^⊤), at a cost of O(n^2 · d), where ‘n’ is the sequence length and ‘d’ is the feature dimension.
Softmax normalization: This step normalizes each position's attention scores into weights, adding a further O(n^2) operations.
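
The two steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not code from the blog post: the function name attention_weights and the operation counters are hypothetical, and the 1/sqrt(d) scaling is assumed from the standard Transformer formulation rather than stated in this report. The weighted sum over the value matrix, which standard attention applies afterwards, is left out because it is not among the steps listed.

```python
import math


def attention_weights(Q, K):
    """Compute softmax(Q K^T / sqrt(d)) on plain lists of lists.

    Q and K are n x d matrices. Returns (weights, score_macs, softmax_elems),
    where the two counters are illustrative tallies of work done.
    """
    n, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)  # assumed: standard scaled dot-product attention

    # Step 1: dot products. n * n pairs, each a length-d dot product.
    score_macs = 0
    scores = []
    for q in Q:
        row = []
        for k in K:
            row.append(scale * sum(qi * ki for qi, ki in zip(q, k)))
            score_macs += d  # d multiply-adds per (query, key) pair
        scores.append(row)

    # Step 2: row-wise softmax. One exponential and one division per entry.
    softmax_elems = 0
    weights = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
        softmax_elems += len(row)

    return weights, score_macs, softmax_elems


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    w, macs, elems = attention_weights(Q, K)
    print(macs, elems)  # n^2 * d = 32 multiply-adds, n^2 = 16 softmax entries
```

Doubling the number of rows in Q and K quadruples both counters, which is the quadratic growth in total work that the conventional analysis refers to.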

Traditionally, researchers have understood the overall complexity to scale quadratically with the sequence length ‘n’. This quadratic scaling has been a key factor in limiting the application of Transformers to very long sequences.

The blog post challenges this conventional wisdom, arguing that the inherent structure of the attention mechanism allows for a more efficient, logarithmic computation. While the specific details of the argument are beyond the scope of this brief report, the core idea seems to be that the breadth of the attention mechanism (the ability to attend to all parts of the input sequence) comes at a relatively low cost compared to the depth (the number of layers in the network).
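
One way to make the breadth-versus-depth distinction concrete is to separate total work from the number of sequential steps. The sketch below is a generic illustration and is not taken from the blog post: the helper name tree_sum is hypothetical, and it only shows that a sum over n terms (such as a dot product or a softmax normalizer) can be arranged so that n - 1 additions finish in about log2(n) rounds, assuming every addition within a round can run in parallel.

```python
def tree_sum(values):
    """Sum values by pairwise rounds; return (total, rounds).

    Within a round, every addition is independent of the others, so with
    enough parallel hardware a round costs one step regardless of width.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Add neighbours pairwise: (0, 1), (2, 3), ...
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over to the next round
        level = nxt
        rounds += 1
    total = level[0] if level else 0
    return total, rounds


if __name__ == "__main__":
    print(tree_sum(range(8)))        # (28, 3): 7 additions in 3 rounds
    print(tree_sum(range(1024))[1])  # 10 rounds for 1024 terms
```

The total number of additions still grows linearly with n; only the count of sequential rounds grows logarithmically. Whether that is the right measure of cost for attention is the point under debate.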

The implications of this new perspective are potentially significant. If the attention mechanism truly scales logarithmically, it could pave the way for more efficient and scalable Transformers, enabling them to process much longer sequences and tackle more complex tasks. This could lead to breakthroughs in areas such as natural language processing, computer vision, and other fields where Transformers are already making a significant impact.

The debate is ongoing, and further research is needed to fully validate the claims made in the blog post. However, the fact that this seven-year-old architecture is still yielding new insights underscores the profound impact and enduring relevance of the Transformer in the rapidly evolving landscape of artificial intelligence.
