Thinking Machines Lab Challenges OpenAI with Interactive AI

Key Takeaways

1. Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, has launched an AI model that processes audio, video, and text in 200-millisecond segments.

2. According to the lab's benchmarks, the model outperforms real-time systems from OpenAI and Google on interaction quality and latency by combining rapid interaction with background reasoning.

3. The startup, valued at $12 billion, faces internal challenges with key departures and an uncertain funding round.

Why it matters: This innovation could redefine the standards of interaction for voice AIs, impacting competition with OpenAI and Google.

Thinking Machines Lab Revolutionizes AI Interaction with an Innovative Model

Thinking Machines Lab, a startup founded by Mira Murati, former CTO of OpenAI, recently unveiled its first artificial intelligence model. This model stands out for its ability to simultaneously process audio, video, and text in 200-millisecond segments, enabling smooth, real-time conversations, unlike traditional, more rigid exchanges.

According to the lab, the model surpasses the interaction quality and latency of OpenAI's GPT-Realtime-2 and Google's Gemini Live. It pairs a rapid interaction model with a background reasoning model, so that fast responses do not come at the cost of longer reasoning tasks.

However, despite these promising technical advancements, the startup faces internal pressures, particularly following the recent departure of several key employees.

An AI Model Redefining Voice Interaction

Thinking Machines Lab has published a research overview of its AI model, designed to revolutionize voice interaction by moving away from the traditional question-and-answer format. This model processes audio, video, and text in parallel in 200-millisecond segments, and the startup claims it outperforms OpenAI and Google in terms of interaction quality.

The startup introduced the concept of Interaction Models, AI models that manage interaction natively rather than through external structures. The central idea is that interactivity must evolve alongside intelligence, rather than being seen as a mere add-on.

The Limitations of Current Voice Systems

Current real-time systems, such as GPT-Realtime or Gemini Live, process audio continuously, but the language model never sees it directly. According to Thinking Machines, a "harness" of separate components sits in front of the model, including elements like a voice activity detector that decides when a speaker's turn is over. Only then is the final statement handed to the model, which generates a complete response. While the model is speaking, its perception is frozen: it receives no new information until it finishes or is interrupted.
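To make the harness idea concrete, here is a minimal sketch of turn detection by silence threshold. It is not Thinking Machines' or any vendor's implementation: the `TurnHarness` class, the `feed` helper, the 20-millisecond frame length, and the 600-millisecond silence threshold are all assumptions chosen for illustration. The point is that the model sees nothing until this simple rule decides the turn has ended.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FRAME_MS = 20          # assumed audio frame length (hypothetical)
END_OF_TURN_MS = 600   # assumed silence needed to declare the turn over


@dataclass
class TurnHarness:
    """Toy voice activity detector that gates what the model sees."""
    energy_threshold: float = 0.1
    silence_ms: int = 0
    words: List[str] = field(default_factory=list)

    def push_frame(self, energy: float, word: Optional[str] = None) -> Optional[str]:
        """Feed one frame; return the full utterance only when the turn ends."""
        if energy >= self.energy_threshold:
            # Speech detected: reset the silence timer and keep buffering.
            self.silence_ms = 0
            if word is not None:
                self.words.append(word)
            return None
        # Silence: count it, and release the buffer once it lasts long enough.
        self.silence_ms += FRAME_MS
        if self.words and self.silence_ms >= END_OF_TURN_MS:
            utterance = " ".join(self.words)
            self.words = []
            self.silence_ms = 0
            return utterance
        return None


def feed(harness: TurnHarness, frames) -> List[str]:
    """Return every utterance the harness released to the model."""
    released = []
    for energy, word in frames:
        utterance = harness.push_frame(energy, word)
        if utterance is not None:
            released.append(utterance)
    return released
```

In this sketch, three spoken frames followed by 29 silent frames release nothing, and the 30th silent frame releases the whole utterance at once. Everything the speaker did in between is invisible to the model.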

These components are far less intelligent than the model itself. This means that behaviors defining a true conversation simply do not work, according to Thinking Machines: intervening proactively ("interrupt me if I say something wrong"), reacting to visual cues ("tell me when I’ve written a bug"), or speaking simultaneously, which would be useful for something like live translation. Citing Sutton's "Bitter Lesson," the lab argues that these handcrafted systems will ultimately be surpassed by advancements in general capabilities.

Thinking Machines' Interaction Models

The Interaction Models replace the harness with a model that directly processes the audio and video stream rather than receiving pre-segmented statements. The approach resembles full-duplex models like Moshi or Nemotron VoiceChat, which operate similarly but are smaller-scale models focused on latency rather than intelligence benchmarks.

A 200-Millisecond Clock Replaces Artificial Turn Boundaries

The true break with existing architectures is what the team calls time-aligned micro-turns. The model continuously takes in 200 milliseconds of input and generates 200 milliseconds of output, with the two token streams interleaved. Input and output no longer occur sequentially. Instead, they share the same clock cycle.

This eliminates artificial turn boundaries, allowing the model to decide for itself whether to remain silent, intervene, or speak simultaneously with the user. Audio and images are not pre-processed by large standalone encoders but are fed directly into the transformer with minimal preprocessing. This reduces latency, although it may also limit the model's ability to capture fine visual details like text.
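The micro-turn clock can be illustrated with a short sketch. Only the 200-millisecond chunk size comes from the research overview; the `run_micro_turns` function, the `toy_policy` stand-in for the model, and the `<silence>` marker are hypothetical names invented here, and text strings stand in for audio and video tokens.

```python
from typing import Callable, Iterable, List, Tuple

CHUNK_MS = 200            # chunk size reported by Thinking Machines
SILENCE = "<silence>"     # hypothetical marker for "the model says nothing"

History = List[Tuple[str, str]]


def run_micro_turns(
    input_chunks: Iterable[str],
    step: Callable[[History, str], str],
) -> List[Tuple[int, str, str]]:
    """Interleave one input chunk and one output chunk per clock tick."""
    history: History = []
    timeline = []
    for tick, chunk in enumerate(input_chunks):
        # The model emits an output chunk on every tick, even if it is silence.
        out = step(history, chunk)
        history.append(("in", chunk))
        history.append(("out", out))
        timeline.append((tick * CHUNK_MS, chunk, out))
    return timeline


def toy_policy(history: History, chunk: str) -> str:
    """Toy stand-in for the model: stay silent unless the user says 'five'."""
    return "That is not right." if "five" in chunk else SILENCE
```

Running `run_micro_turns(["two plus two", "is five", "and then"], toy_policy)` yields one entry per tick at 0, 200, and 400 milliseconds. The toy policy stays silent on the first and third ticks and speaks on the second, while the user is still talking, because no external component has to declare the turn over first.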

However, the real-time model faces another challenge. If it must respond every 200 milliseconds, it cannot spend minutes reasoning or searching the web. Thinking Machines addresses this issue by pairing the interaction model with a second asynchronous background model that handles longer tasks such as reasoning, tool usage, and research.

The two models share the same conversation context. The interaction model delegates tasks while maintaining the conversation, then integrates the background model's results into the conversation as they arrive, at an appropriate moment relative to what the user is currently doing, rather than as an abrupt context switch. The goal is to combine the quick response of a fast model with the depth of a reasoning model.
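The division of labor between the two models can be sketched with Python's `asyncio`. This is an illustration under stated assumptions, not the lab's design: `background_reasoner` and `interaction_loop` are hypothetical names, the sleep durations are placeholders for real work, and the sketch simply folds the result in at the next tick rather than choosing a conversationally appropriate moment.

```python
import asyncio
from typing import List, Optional


async def background_reasoner(task: str) -> str:
    """Stand-in for the slow model: reasoning, tool use, or research."""
    await asyncio.sleep(0.1)  # placeholder for a long-running job
    return f"result for {task!r}"


async def interaction_loop(n_ticks: int, tick_s: float = 0.02) -> List[str]:
    """Keep the conversation going while a delegated task runs."""
    context: List[str] = []  # shared conversation context
    task = "look up train times"
    pending: Optional[asyncio.Task] = asyncio.create_task(background_reasoner(task))
    context.append(f"delegated: {task}")
    for tick in range(n_ticks):
        await asyncio.sleep(tick_s)  # one real-time tick of the fast model
        if pending is not None and pending.done():
            # Fold the background result into the shared context.
            context.append(f"tick {tick}: {pending.result()}")
            pending = None
        else:
            context.append(f"tick {tick}: keep talking")
    return context
```

Calling `asyncio.run(interaction_loop(10))` returns a context in which the first ticks carry on the conversation and the background result appears exactly once, a few ticks in, without the loop ever blocking on it.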

Benchmarks Suggest the Approach Works

The model is called TML-Interaction-Small, a mixture-of-experts model with 276 billion total parameters, of which 12 billion are active. On FD-bench v1.5, which measures interaction quality in scenarios such as user interruptions, backchanneling, and background speech, it significantly outperforms both OpenAI's GPT-Realtime-2 and Google's Gemini-3.1-flash-live. Its response latency is 0.40 seconds, compared to 1.18 seconds for GPT-Realtime-2 (minimum) and 0.57 seconds for Gemini.
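To put the reported figures in proportion, the short calculation below derives the active-parameter share and the latency ratios. The input numbers are the ones reported above; the variable names and the `latency_ratio` helper are introduced here for illustration only.

```python
# Figures as reported for TML-Interaction-Small and its competitors.
TOTAL_PARAMS_B = 276    # total parameters, in billions
ACTIVE_PARAMS_B = 12    # active parameters, in billions

LATENCY_S = {
    "TML-Interaction-Small": 0.40,
    "GPT-Realtime-2 (minimum)": 1.18,
    "Gemini-3.1-flash-live": 0.57,
}

# Share of the parameters counted as active.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def latency_ratio(competitor: str, baseline: str = "TML-Interaction-Small") -> float:
    """How many times longer the competitor takes than the baseline."""
    return LATENCY_S[competitor] / LATENCY_S[baseline]


if __name__ == "__main__":
    print(f"active share: {active_fraction:.1%}")
    for name in LATENCY_S:
        print(f"{name}: {latency_ratio(name):.2f}x")
```

Roughly 4.3 percent of the parameters are active. By the reported latencies, GPT-Realtime-2 takes about 2.95 times as long to respond and Gemini about 1.4 times as long.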

On Audio MultiChallenge, which tracks intelligence and instruction following, the model scores 43.4%, above the fast variants of its competitors but below GPT-Realtime-2 in "xhigh" thinking mode, which achieves 48.5%. On the lab's own benchmarks for temporal awareness (TimeSpeak, CueSpeak) and visual proactivity (RepCount-A, ProactiveVideoQA, Charades), Thinking Machines reports that no existing model can meaningfully perform any of these tasks. The tested competitors either remain silent or provide incorrect responses.

A $12 Billion Startup with Something to Prove

Thinking Machines Lab was founded in February 2025 by Mira Murati and other former OpenAI researchers. In July 2025, the company closed a $2 billion funding round at a valuation of $12 billion, all without a product. An additional funding round, reportedly being prepared at around $50 billion, had not materialized by the end of 2025, and several key employees have since left the company. The interaction model is the first in-house AI model to support Murati's claim that she can build a true competitor to OpenAI, Anthropic, and Google DeepMind.

Prior to this, the company launched Tinker, a tool that lets developers fine-tune open models using LoRAs without having to manage distributed training.