Technology · January 21, 2026

Latency Is the AI Feature Nobody Puts in the Press Release

Speed is usually framed as a compute cost problem. In AI products, it is actually a product design problem, and the distinction matters more than most teams realize.

By Theo Okafor, Staff Reporter · Technology Desk

There is a number that determines whether an AI feature gets used or abandoned, and it almost never appears in a product announcement. It is not accuracy. It is not context window size. It is the time between when a user submits a request and when the interface becomes responsive again.

Latency is boring to talk about and expensive to fix, which is probably why the coverage of AI products keeps treating it as an infrastructure footnote. It is not a footnote. It is closer to the whole story.

Here is the practical problem. Human conversation has a rhythm. Studies on interaction design have consistently put the threshold for perceived responsiveness at somewhere between 100 and 300 milliseconds for interface feedback, and at about one second for a response to feel immediate. Language models operating at the frontier frequently exceed that one-second mark by a factor of five to twenty, which puts a typical response at roughly five to twenty seconds, depending on output length and whether the provider is under load. That gap is not a minor inconvenience. It restructures how people interact with the product.

When a tool responds in under a second, users iterate. They treat it like a colleague: ask something, hear back, redirect. When it responds in eight to twelve seconds, users batch their requests. They write longer, more hedged prompts. They do not follow up. They switch tabs. The behavioral difference between these two modes is the difference between a product that changes how work gets done and a product that gets used once or twice as a demo and then sits in a browser bookmark folder.

This is not a new insight in software. The research on web page load times and conversion rates has been replicated enough times that it should be institutional knowledge by now. What is different about AI is that the latency problem is architecturally harder to solve, and the failure mode is subtler. A slow web page is obviously slow. A slow AI response gets rationalized as the cost of doing something complex, which means users are less likely to report it as a problem. They just quietly stop using the feature.

Streaming outputs partially address this. Showing tokens as they generate gives the user something to read and creates a sense of progress. This is a real improvement in perceived latency, not a fake one. But streaming is a compensation mechanism, not a solution. If the time to first token is high, streaming does not restore the iteration pattern described above. The user still has to wait before anything appears, and that wait is when attention drifts.
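
For readers who want to see the distinction in numbers, here is a minimal Python sketch that separates time to first token from total response time. It is an illustration under stated assumptions, not any provider's API: `fake_model_stream` is a hypothetical stand-in for a streaming model, and its delay values are invented for the example.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamTiming(NamedTuple):
    time_to_first_token: float  # seconds from request to first token
    total_time: float           # seconds from request to last token
    token_count: int


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and record when the first and last tokens arrive."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = clock() - start
        count += 1
    total = clock() - start
    if first_token_at is None:
        # Empty stream: the user waited the whole time and got nothing.
        first_token_at = total
    return StreamTiming(first_token_at, total, count)


def fake_model_stream(
    first_token_delay: float,
    per_token_delay: float,
    n_tokens: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model (not a real API)."""
    sleep(first_token_delay)  # the wait before anything appears on screen
    for i in range(n_tokens):
        if i > 0:
            sleep(per_token_delay)  # gap between later tokens
        yield f"tok{i}"


if __name__ == "__main__":
    # Assumed delays, chosen only to keep the demo short.
    timing = measure_stream(fake_model_stream(0.2, 0.01, 20))
    print(f"time to first token: {timing.time_to_first_token:.2f}s")
    print(f"total time:          {timing.total_time:.2f}s")
```

In this sketch, streaming changes what the user sees between the first token and the last one. It does nothing about the wait recorded in `time_to_first_token`, which is the number the iteration pattern depends on.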

The teams that have built AI products people actually integrate into daily work have generally made one of two architectural choices. Either they are operating on smaller, faster models where the task fits the capability, or they have done the engineering work to pre-compute, cache, or speculate on what the user is likely to do next. Neither of these choices shows up in a benchmark chart or a model card. Both of them show up in retention.
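
The second choice, pre-computing or caching likely requests, can be sketched in a few lines of Python. This is a simplified illustration, not a description of how any particular team does it: `PrefetchingResponder` and its methods are hypothetical names, the `slow_model` callable stands in for a real model call, and the whitespace-and-case normalization is an assumption that a real product would need to examine carefully.

```python
from typing import Callable, Dict, Iterable


class PrefetchingResponder:
    """Hypothetical wrapper that answers from a cache when it can."""

    def __init__(self, slow_model: Callable[[str], str]) -> None:
        self._slow_model = slow_model      # stand-in for an expensive model call
        self._cache: Dict[str, str] = {}
        self.model_calls = 0               # counts how often the slow path ran

    @staticmethod
    def _key(prompt: str) -> str:
        # Assumption: prompts that differ only in case or spacing are equivalent.
        return " ".join(prompt.lower().split())

    def _compute(self, prompt: str) -> str:
        self.model_calls += 1
        return self._slow_model(prompt)

    def prefetch(self, likely_prompts: Iterable[str]) -> None:
        """Pay the latency cost ahead of time for requests we expect to see."""
        for prompt in likely_prompts:
            key = self._key(prompt)
            if key not in self._cache:
                self._cache[key] = self._compute(prompt)

    def respond(self, prompt: str) -> str:
        """Return a cached answer if one exists, otherwise call the model."""
        key = self._key(prompt)
        if key not in self._cache:
            self._cache[key] = self._compute(prompt)
        return self._cache[key]
```

A product might call `prefetch` while the user is still reading the previous answer, so that a predicted follow-up returns from the cache instead of waiting on the model. The trade-off is extra compute spent on guesses that are never used.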

The broader point is that AI product quality is not reducible to model quality. A capable model with poor latency characteristics will lose to a less capable model that responds quickly, in the same way that a more accurate search engine with a two-second load time would have lost to a less accurate one that loaded in under half a second. Users do not optimize for accuracy in the abstract. They optimize for the experience of getting something useful done.

Latency is that experience. It deserves a line item in the product spec and a paragraph in the coverage.

Reporting by Theo Okafor, Staff Reporter, for the Technology desk · ETL Newswire staff