Oppo X-OmniClaw: The Open-Source AI Agent Revolutionizing Android

Key Takeaways

1. Oppo has launched X-OmniClaw, an open-source AI agent for Android that runs directly on the device rather than on a cloud-hosted virtual phone.

2. X-OmniClaw uses the camera, screen, and voice to perform complex tasks, such as comparing prices or creating photo albums.

3. Perception and control stay on the phone, which limits the data exposure that cloud phone platforms entail, and the agent learns to replicate user actions.

💡 Why it matters: X-OmniClaw could transform user interaction by making smartphones more autonomous and secure.

Oppo Introduces X-OmniClaw, a Revolutionary AI Agent

Oppo has recently launched X-OmniClaw, an open-source artificial intelligence agent designed for Android devices. The system runs directly on the phone rather than inside a cloud-hosted Android instance. It uses the camera, screen, and voice to perform tasks within applications, all while remaining on the physical device.

The system combines multiple channels of perception, processing photos from the gallery locally to transform them into a searchable text memory. It also learns by cloning the user's behavior, enabling it to autonomously replicate actions.

Demonstrated Features of X-OmniClaw

During demonstrations, X-OmniClaw showcased its ability to compare prices of products captured by the camera, act as a floating assistant to solve exercises, and independently create photo albums from a user's gallery.

The agent comes from Oppo's Multi-X team and works inside real Android applications rather than on a cloud copy of the phone. Platforms such as RedFinger, Alibaba's Wuying, and Tencent Cloud Phone instead run agents inside virtualized Android instances in a data center, which limits access to local sensors, cameras, and private data.

A Unique On-Device Approach

X-OmniClaw takes the opposite approach by running directly on the physical Android device. The core logic for perception, control, and interaction with applications resides on the phone itself. According to the technical report, a cloud language model is called only when needed, as "fuel" for higher-level reasoning. The report does not specify the local models involved, but it mentions components such as an on-device grounding model and an OCR component for detecting clickable user interface elements.

Integrated Pipeline for Camera, Screen, and Voice

The agent consolidates three perception channels (camera, screen, and voice) into a single pipeline. A vision-language model first interprets the scene and the user's request before triggering an action. The perception stack combines text, voice, camera, and screen signals, aligns them in time, and passes a structured intent to the language model.

In an example provided by the researchers, a user asks, "How much does this cost on Taobao?" while pointing the camera at a product. The system internally reformulates this to "price of Evian spray on Taobao" and only then transmits the structured intent for execution.
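The alignment step described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the `Signal` class, the `build_intent` function, and the two-second window are hypothetical and do not come from the technical report, which does not publish this logic in the article's source.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    channel: str      # "voice", "text", "camera", or "screen"
    timestamp: float  # seconds since some common clock
    content: str      # transcript or a short description of what was seen


def build_intent(signals, window=2.0):
    """Attach the nearest camera/screen observation to the latest request.

    Hypothetical helper: the window size and the dict layout are assumptions.
    """
    requests = [s for s in signals if s.channel in ("voice", "text")]
    if not requests:
        return None
    request = max(requests, key=lambda s: s.timestamp)

    context = {}
    for s in signals:
        if s.channel not in ("camera", "screen"):
            continue
        gap = abs(s.timestamp - request.timestamp)
        if gap > window:
            continue  # too far from the request to be related
        best = context.get(s.channel)
        if best is None or gap < abs(best.timestamp - request.timestamp):
            context[s.channel] = s  # keep the closest observation per channel

    return {
        "request": request.content,
        "context": {channel: s.content for channel, s in context.items()},
    }


signals = [
    Signal("camera", 3.0, "coffee mug"),
    Signal("voice", 10.0, "How much does this cost on Taobao?"),
    Signal("camera", 10.1, "Evian spray"),
]
intent = build_intent(signals)
```

In this sketch the older "coffee mug" frame is dropped because it falls outside the window, so only the product seen while the user was speaking reaches the language model alongside the request.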

Photo Memory and Behavior Cloning

For long-term memory, X-OmniClaw condenses local data into semantic entries. During idle periods, photos from the gallery are processed into compact descriptions of objects, scenes, and events, which are then stored in a Markdown file. Each entry passes through a filter designed to remove sensitive information before it is saved. The report acknowledges the upload risks of cloud-based vision models and names a move to on-device models as the next step, so that raw images never have to leave the phone.
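A rough Python sketch of such a Markdown-backed memory follows. Everything in it is an assumption made for illustration: the function names `redact`, `append_memory`, and `search_memory`, the entry layout, and the two regular expressions are hypothetical, since the report's actual filter and file format are not described here.

```python
import re
from pathlib import Path

# Hypothetical patterns; the real sensitivity filter is not specified.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{13,19}\b"),                # long digit runs, e.g. card numbers
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # e-mail addresses
]


def redact(text):
    """Replace anything matching a sensitive pattern with a placeholder."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[redacted]", text)
    return text


def append_memory(path, photo_name, description):
    """Filter a photo description and append it as one Markdown list item."""
    entry = f"- **{photo_name}**: {redact(description)}\n"
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry


def search_memory(path, keyword):
    """Return every stored entry that mentions the keyword (case-insensitive)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line for line in lines if keyword.lower() in line.lower()]
```

Keeping one entry per line means a request such as "all parrot photos" reduces to a plain text search over the file, without reopening the images themselves.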

Instead of planning each action from scratch, the agent clones the user's behavior into reusable skills. It extracts the complete launch command for an application page and opens that page directly via a deeplink the next time, rather than replaying the original tapping path. If that fails, the system falls back to progressively simpler launch methods, one at a time. To detect clickable elements, X-OmniClaw combines XML structural data with a grounding model and text recognition. This helps with ad-laden interfaces where XML alone cannot pinpoint an exact tapping target.
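The fallback behavior can be sketched in a few lines of Python. This is a hedged illustration, not the project's code: `launch_with_fallback` and the three stand-in launchers are hypothetical, and a real implementation would call into Android rather than return fixed values.

```python
def launch_with_fallback(launchers):
    """Try each (name, launch) pair in order and return the first name that works.

    A launcher counts as failed if it returns a falsy value or raises.
    """
    for name, launch in launchers:
        try:
            if launch():
                return name
        except Exception:
            continue  # move on to the next, simpler method
    return None


# Stand-in launchers (assumptions for the example), from most to least specific.
def open_recorded_deeplink():
    raise RuntimeError("no handler for recorded launch command")


def open_app_home_and_replay_taps():
    return True


used = launch_with_fallback([
    ("deeplink", open_recorded_deeplink),
    ("replay", open_app_home_and_replay_taps),
])
```

Ordering the list from the most direct method to the most generic one keeps the fast path cheap while still guaranteeing that a broken deeplink does not strand the task.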

Varied Applications of X-OmniClaw

In the first scenario, a user points the camera at a product and asks for the price. The agent opens the shopping app, scrolls, takes screenshots, and reads prices and sales figures via a vision-language model. A follow-up like "open the second item" works without the user having to restate the context.

In another example, X-OmniClaw acts as a "ScreenAvatar," a "digital prodigy" that solves on-screen tasks on command, such as working through a series of practice problems one by one.

A third demonstration shows the system responding to a request to transform all parrot photos into a highlight album. It gathers the corresponding files, accesses a one-click composition tool from a video editing app via a deeplink, and selects the images with multiple taps.

In the fourth example, the user clones the path to a deeply nested discount page once. The next time, a voice command suffices to reopen that exact subpage, even if the app does not offer public deeplinks.

Conclusion and Outlook

The project builds on the open-source codebase HermesApp and sits between OpenClaw, which focuses more on PCs, and the Hermes agent from Nous Research, which centers on emerging capabilities. The code and resources are available on GitHub.

Google recently demonstrated with Gemma 4 that a fully local model on a smartphone can already act as an agent. In the demo application "Google AI Edge Gallery," the model uses agent skills to query Wikipedia, generate QR codes, or open mood trackers with trend graphs.

In terms of methodology, the system builds on UI-TARS from ByteDance, a purely visual GUI agent that works solely from screenshots and coordinates. X-OmniClaw combines this approach with structural XML data and on-device execution to reduce the error rate that pure vision pipelines encounter with dynamic interfaces.
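One way to picture the combination of structural and visual signals is the Python sketch below. It is an assumption-laden illustration: the `Box` class, the `pick_tap_target` function, and the "XML first, visual detection second" ordering are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass


@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int
    label: str = ""  # XML text/content description, or OCR/grounding output

    def center(self):
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def pick_tap_target(query, xml_nodes, visual_boxes):
    """Prefer a labeled XML node; otherwise fall back to a visual detection."""
    query = query.lower()
    for node in xml_nodes:
        if query in node.label.lower():
            return node.center()
    # XML gave no labeled match, e.g. an ad banner rendered inside one opaque view.
    for box in visual_boxes:
        if query in box.label.lower():
            return box.center()
    return None


# An unlabeled full-screen container hides the button from the XML tree.
xml_nodes = [Box(0, 0, 1080, 1920)]
visual_boxes = [Box(100, 200, 300, 260, "Buy now")]
target = pick_tap_target("buy now", xml_nodes, visual_boxes)
```

Here the XML tree only exposes an unlabeled container, so the tap coordinates come from the visually detected "Buy now" box instead.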