Root Bottlenecks in Modern AI

Introductory note

This text is the first publication in a series devoted to the root computational challenges of modern AI. Its purpose is to set out six key problem classes in a rigorous form so that any future solution can be evaluated against the same standard.

The bottlenecks are formulated here in such a unified and strictly defined form for the first time. The text also establishes the criteria by which, in the author’s view, proposed solutions should be assessed.

Each problem is stated not as an isolated engineering limitation, but as a root mathematical constraint underlying a broad class of computational systems.

General principle

This leads directly to the main principle by which any proposed solution will be judged.

A problem is considered solved only when the root mathematical bottleneck itself is removed, rather than merely one of its particular symptoms.

That is, for example, for Problem 1 it is not enough to eliminate storage of the $(n \times n)$ matrix if all $(n^{2})$ interactions are still being computed anyway.

General criterion for all 6 problems

This principle determines the general criterion by which any claimed solution to any of the six problems must be evaluated.

For any problem, one must provide:

1. a strict mathematical formulation;

2. a new computational scheme or a new mathematical form of the problem, different from the direct baseline one;

3. a proof of correctness

— either exact equality of the result,

— or a strictly controlled deviation;

4. a strict improvement specifically in the root parameter of the problem, rather than in a secondary one;

5. a CPU demonstration on an open test.

The work does not count if:

• the definition has merely been rewritten without changing the mathematical method of computation;

• only a secondary parameter has been improved;

• the root bottleneck remains.

Problem 1. Global pairwise interactions

Mathematical core

Given $Q, K, V \in ℝ^{n \times d}$ , and the operator

S = Q K^{⊤}, P_{i j} = \frac{e^{S_{i j}}}{\sum_{m = 1}^{n} e^{S_{i m}}}, O = P V

O_{i} = \frac{\sum_{j = 1}^{n} e^{⟨ q_{i}, k_{j} ⟩} v_{j}}{\sum_{j = 1}^{n} e^{⟨ q_{i}, k_{j} ⟩}}

Required result

One must provide such a mathematical scheme for computing O that:

• either gives exactly the same result,

• or gives a result with an explicitly bounded error, and at the same time eliminates the root necessity of full all-to-all interaction.

In other words, it must no longer be mandatory to compute, or equivalently service, the entire set

{⟨ q_{i}, k_{j} ⟩ : 1 \leq i, \leq n, 1 \leq j \leq n} .

What counts as a solution

A solution counts only if it improves the interaction cost itself, rather than merely improving:

• storage of the matrix S,

• storage of the matrix P,

• memory layout,

• block wise organization of the computation.

That is, it must remove the root class: global pairwise interaction cost.

What does not count

The work does not count if:

• only storage of the $(n \times n)$ -matrix has been removed;

• but all $n^{2}$ pairwise interactions are still being computed or equivalently serviced.

Problem 2. Communication-dominated dense algebra

Mathematical core

Given a computation of the form

C = A B, c_{i j} = \sum_{k = 1}^{m} a_{i k} b_{k j}

or a more general dense tensor contraction.

Required result

One must provide a mathematical scheme in which the root cost of data movement ceases to be the determining bottleneck.

That is, the solution must remove not merely the number of FLOPs, but specifically communication cost between memory levels and/or between compute nodes.

What counts as a solution

The solution must show that the computation no longer requires such a volume of data transfer as makes the problem dependent on an “ideal processor” with infinite bandwidth.

The improvement must concern precisely:

• data movement,

• rather than only local arithmetic optimization.

What does not count

The work does not count if:

• only the constants in front of the FLOPs have been reduced;

• but the root communication complexity has not been removed.

Problem 3. Irregular sparse mathematics

Mathematical core

Given a sparse operator, for example:

y = A x, y_{i} = \sum_{j \in N (i)} a_{i j} x_{j}

where the structure of the nonzero elements is irregular.

Required result

One must provide a scheme in which irregularity and sparsity cease to be the root computational penalty.

That is, the cost must no longer be determined by the fact that data access is:

• sparse,

• irregular,

• poorly matched to the standard flow of dense operations, but instead by something substantially weaker.

What counts as a solution

The solution must eliminate precisely the class: sparse/irregular execution penalty.

In other words, sparsity must cease to penalize the system as a special kind of computation.

What does not count

The work does not count if:

• the fundamental penalty for irregular access remains;

• while only layout, caching, or local reordering heuristics are improved without removing the root cause.

Problem 4. Long causal depth

Mathematical core

Given a causally dependent process

x_{t} = F_{t} (x_{1}, \dots, x_{t - 1}), t = 1, \dots, T

or the probabilistic form

p (x_{1 : T}) = \prod_{t = 1}^{T} p (x_{t} ∣ x_{< t})

Required result

One must provide such a mathematical scheme in which sequential depth ceases to be the root computational cost.

That is, the problem is considered solved only when the mandatory computational cost of the order of the causal chain length is no longer required. In other words, the following class must be removed: serial causal depth.

What counts as a solution

The solution must remove not a secondary symptom such as latency, but the root dependence itself: the next step requires the previous step as a computational bottleneck.

What does not count

The work does not count if:

• only a single step has been accelerated;

• constants have been reduced;

• but the computation still must pass through a long sequential chain of the same nature.

Problem 5. Wide branching tree of alternatives

Mathematical core

Given candidates $r_{1}, \dots, r_{N}$ , as core $s_{i} = g (r_{i})$ , and the requirement

r^{⋆} = \underset{1 \leq i \leq N}{arg max} g (r_{i})

or the general case of a branching tree.

Required result

One must provide a scheme in which the need for massive enumeration of a large number of alternatives ceases to be the root cost of quality. That is, the following class must be removed: wide branching / many-candidate search cost.

What counts as a solution

A solution counts only if:

• it is not necessary, in essence, to maintain, generate, or equivalently service a large fraction of the alternatives;

• while exact correctness or strictly controlled quality is preserved.

What does not count

The work does not count if:

• only the ranking of already generated candidates has been accelerated;

• but the root cost of search width remains.

Problem 6. Large computational graph: width + depth + communication

If the previous problems isolate dominant bottlenecks one by one, the sixth problem addresses their joint coupling inside a single large computational DAG.

Mathematical core

Given a computation in the form of a directed acyclic graph

G = (V, E), where each vertex is an operation z_{v} = f_{v} (z_{u_{1}}, \dots, z_{u_{k}}), (u_{i}, v) \in E

The graph may simultaneously have:

• large width,

• large depth,

• heavy connectivity and exchange of intermediate results.

Required result

One must provide a mathematical scheme in which a large computational DAG ceases to require an “ideal CPU” simultaneously along all three axes:

• width,

• depth,

• communication.

In other words, the following root coupling must be removed: large DAG width + depth + communication coupling.

What counts as a solution

A solution counts only if the class of the joint bottleneck itself has been eliminated, rather than separately improving only:

• width,

• or depth,

• or communication.

What does not count

The work does not count if:

• only one of the three components has been removed,

• while the other two still continue to require the same “ideal CPU.”

Short summary

Problem 1. Remove the cost of full global interaction, rather than only storing its results.

Problem 2. Remove the cost of communication, rather than merely reducing FLOPs.

Problem 3. Remove the cost of irregular sparsity, rather than only slightly improving layout.

Problem 4. Remove the cost of causal sequential depth, rather than merely accelerating a step.

Problem 5. Remove the cost of search width, rather than merely iterating faster.

Problem 6. Remove the joint cost of width, depth, and communication in a large DAG, rather than improving only one component.

Publication path

What comes next in the series

Episode 2

Steps to solution

What approach should we take to create the solution?

Episode 3

One solution across all six

A single common solution replaces six local ones.

Episode 4

Reference code

The first implementation comes with tests.

Episode 5

To understand the root

It provides a detailed mathematical description, not just a programmatic one.

Updates

Subscribe to updates

Follow the series so you do not miss the next release.

Subscribe to updates

About the author

Sergei Sychev is a member of the TRIZ-RI Group research community (Israel, Slovakia, Czech Republic), and has been an expert in the Theory of Inventive Problem Solving (TRIZ) since 1985.

The author would prefer that the reader's attention be focused on the material itself, rather than on the author's details, as would be appropriate when reading scientific papers. If you would like to get in touch, write to or reach out on.

Six Root Bottlenecks and Contradictions

Sergei Sychev formulated these six tasks

Introductory note

General principle

General criterion for all 6 problems

Problem 1. Global pairwise interactions

Mathematical core

Required result

What counts as a solution

What does not count

Problem 2. Communication-dominated dense algebra

Mathematical core

Required result

What counts as a solution

What does not count

Problem 3. Irregular sparse mathematics

Mathematical core

Required result

What counts as a solution

What does not count

Problem 4. Long causal depth

Mathematical core

Required result

What counts as a solution

What does not count

Problem 5. Wide branching tree of alternatives

Mathematical core

Required result

What counts as a solution

What does not count

Problem 6. Large computational graph: width + depth + communication

Mathematical core

Required result

What counts as a solution

What does not count

Short summary

What comes next in the series

Steps to solution

One solution across all six

Reference code

To understand the root

Subscribe to updates