AI Confronts the Challenge of Web Data: A New Infrastructure at Stake

Key Takeaways

1. The rise of AI requires infrastructure capable of handling massive and dynamic web data.

2. Companies need access to real-time data to enhance the accuracy and relevance of AI models.

3. 97% of AI organizations rely on real-time web data, but 90% face restrictions.

💡 Why it matters: Efficient access to web data is crucial for the development of responsive and reliable AI, directly influencing business decisions and customer satisfaction.

AI and the Need for Web Data Infrastructure

Artificial intelligence (AI) is experiencing explosive growth, with new applications emerging daily. To fully harness the potential of this technology, companies need access to massive volumes of data. However, this information is often inaccessible or unstructured, limiting its use by AI models.

To understand this challenge, it helps to return to the foundations of the web, which was not originally designed for the automated discovery and retrieval that modern AI applications require. Overcoming this structural limitation calls for a new infrastructure.

The next advancement in AI could rely on a new layer of web data infrastructure, allowing models to navigate and map this constantly evolving digital space. This layer must be capable of traversing hundreds of millions of existing web domains and managing the billions of new URLs created each week, while providing real-time information and overcoming technical hurdles.

Or Lenchner, CEO of Bright Data, a platform specializing in web data collection, emphasizes: “Data shows that there is much more data available. Think of the universe: it’s there, but you don’t know what you don’t know.”

Accessing Fresh, Relevant, and Reliable Data

The early advancements of AI were driven by increases in training data and model size. Today, organizations face a major obstacle: keeping up with the dynamic, unstructured, and ever-evolving nature of web data so that their outputs rest on current and verifiable information. AI performance now depends not only on model architecture but also on the system’s computing, networking, retrieval, and data engineering capabilities. In short, the system must be able to retrieve fresh, relevant, and trustworthy data quickly and reliably.

Traditional model training relies on snapshots of information collected at a given time, but training AI on static data is no longer sufficient. To keep pace with fluctuations in competitor pricing, consumer sentiment, and market trends, companies need a constant stream of new information, extracted in real time with the relevant context. Their infrastructure must therefore handle millions of simultaneous interactions across websites that vary by geography, language, format, and access rules.
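
One way to make the snapshot-versus-stream distinction concrete is a staleness check: each record carries the time it was fetched, and anything older than a set tolerance is refetched rather than served. The sketch below is a minimal illustration only. The `Record` type, the `is_stale` helper, the example URL, and the 15-minute tolerance are all assumptions made for the example, not details of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Record:
    """A hypothetical piece of web data plus the time it was fetched."""
    url: str
    value: str
    fetched_at: datetime


def is_stale(record: Record, max_age: timedelta, now: datetime) -> bool:
    """Return True when the record is older than the allowed age."""
    return now - record.fetched_at > max_age


# Illustrative values only: a price captured 2 hours ago vs. 5 minutes ago.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
snapshot = Record("https://example.com/item", "19.99", now - timedelta(hours=2))
live = Record("https://example.com/item", "18.49", now - timedelta(minutes=5))

max_age = timedelta(minutes=15)  # assumed tolerance, chosen for the example
stale_flags = [is_stale(r, max_age, now) for r in (snapshot, live)]
print(stale_flags)  # [True, False]
```

In a real system the right tolerance would likely differ by data type: prices and inventory may need minutes, while reference content may tolerate days.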

“If it can’t retrieve information in real-time, it lacks context,” explains Lenchner. “In a business setting, that is no longer acceptable. Outdated responses lead to poor decisions and disappointed consumers.”

Speed is not just a matter of convenience; it is a necessity. Today’s organizations operate in environments where prices, inventories, markets, security threats, and customer behavior are continuously changing. A delay in data retrieval can diminish the usefulness of an otherwise sophisticated model.

Using live, high-quality web data can also reduce AI hallucinations, because the model draws on a more relevant knowledge base, which in turn enhances user trust. In one survey, 56% of AI practitioners said that companies need access to real-time web data to improve trust in AI outcomes. For the model to operate effectively and efficiently, the information must also be distilled to its essentials.

Despite the introduction of retrieval-augmented generation (RAG), where models integrate external data at the time of a query, many AI systems still struggle to provide results that are current, contextually relevant, and reliable in operational environments. According to Gartner, 60% of AI projects that are not supported by AI-ready data—accurate, structured, organized, and contextualized—will be abandoned by the end of the year.
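
For readers unfamiliar with the pattern, the core of RAG can be sketched in a few lines: rank candidate documents against the query, then place the best matches in front of the question before it reaches a model. The example below is a toy illustration under stated assumptions. It scores by simple word overlap, whereas production systems typically use embeddings or a search index, and the function names and sample snippets are hypothetical.

```python
def score(query: str, document: str) -> int:
    """Count how many distinct query words also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words)


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest word-overlap score."""
    ranked = sorted(documents, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Prepend retrieved context to the question before it reaches a model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical freshly fetched snippets (invented for the example).
docs = [
    "competitor price for widget dropped to 18.49 today",
    "weather forecast for the weekend is sunny",
    "widget inventory is low at the main warehouse",
]
prompt = build_prompt("competitor widget price", docs)
print(prompt)
```

The quality of the answer is bounded by what `docs` contains: if the snippets are stale or irrelevant, the retrieval step faithfully passes stale or irrelevant context to the model.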

This is because large-scale retrieval alone does not solve the problem. As Lenchner puts it, “You need to retrieve data at scale, but also in real-time. Latency becomes an issue because the end user is waiting for the output.”
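
The latency constraint can be illustrated with a per-request time budget: query several sources concurrently and keep only the answers that arrive before the deadline. This is a sketch under stated assumptions. `fetch_source` is a stand-in that simulates network delay with `asyncio.sleep`, and the source names, delays, and 0.1-second budget are invented for the example.

```python
import asyncio


async def fetch_source(name: str, delay: float) -> str:
    """Stand-in for a network call; `delay` simulates the source's latency."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def fetch_within_budget(sources: dict[str, float], budget: float) -> dict[str, str]:
    """Query all sources concurrently and keep only those that answer in time."""

    async def guarded(name: str, delay: float):
        try:
            result = await asyncio.wait_for(fetch_source(name, delay), timeout=budget)
            return name, result
        except asyncio.TimeoutError:
            return name, None  # too slow for this request, so it is skipped

    pairs = await asyncio.gather(*(guarded(n, d) for n, d in sources.items()))
    return {name: result for name, result in pairs if result is not None}


# Simulated latencies in seconds (assumed values for illustration).
sources = {"fast_site": 0.01, "slow_site": 0.5}
results = asyncio.run(fetch_within_budget(sources, budget=0.1))
print(results)  # {'fast_site': 'fast_site: ok'}
```

Dropping slow sources trades completeness for responsiveness. Whether that trade is acceptable depends on the application.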

The Challenges of Accessing Fresh, AI-Ready Data at Scale

Accessing fresh and AI-ready data at scale introduces technical and structural challenges. In practice, many enterprise systems combine the retrieval of public web data with APIs, licensed datasets, and proprietary internal data in their AI applications. Integrating these fragmented sources into a usable and timely knowledge layer requires specialized capabilities. Some research has shown that 97% of AI organizations rely on real-time web data infrastructure, but 90% feel constrained by various restrictions. Companies are increasingly developing technical approaches to navigate these constraints.

Lenchner offers a metaphor: “Think of the trained model as intelligence and the relevant data as knowledge. A powerful intelligence layer built on a hollow knowledge layer is like a genie that knows nothing, practically useless. Intelligence and knowledge must come together.”

The Promise of a New Infrastructure

A new layer of web data infrastructure can meet the growing need for stronger AI inputs by enabling data discovery, real-time access, and adaptation to specific contexts. As Lenchner describes, “It’s all about large-scale data collection, with super low latency, without being blocked.”

Rather than relying on increased computing power, this type of platform mimics human browsing behavior to access available content and transform raw code into structured data streams. It can work with websites that traditional scraping tools handle poorly, such as JavaScript-heavy sites or those protected by aggressive anti-bot software.
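
As a small illustration of turning raw page code into a structured record, the sketch below parses a static HTML fragment with Python's standard-library `html.parser`. The markup, the `name` and `price` class names, and the `ProductParser` class are assumptions invented for the example. It does not execute JavaScript or handle anti-bot measures, which is where the platforms described here go well beyond this.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect the text of elements whose class is 'name' or 'price'."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("name", "price"):
            self._current = css_class

    def handle_data(self, data):
        # Store the first non-blank text that follows a matching start tag.
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None


# Hypothetical page fragment.
raw_html = """
<div class="product">
  <span class="name">Widget</span>
  <span class="price">18.49</span>
</div>
"""

parser = ProductParser()
parser.feed(raw_html)
record = {"name": parser.fields["name"], "price": float(parser.fields["price"])}
print(record)  # {'name': 'Widget', 'price': 18.49}
```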

As Lenchner explains, “It’s essentially about having an infrastructure capable of mimicking a web user with credentials—IP address, location, and 1,000 other parameters. And at scale. Think of doing this 80 billion times a day for millions of websites. And each time, you appear exactly as the website expects you to appear.”
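
For a sense of scale, the daily figure in the quote can be converted to an average per-second rate. This is simple arithmetic on the quoted number, and it assumes requests are spread evenly across the day, which real traffic would not be.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

requests_per_day = 80_000_000_000  # the "80 billion times a day" from the quote
requests_per_second = requests_per_day / SECONDS_PER_DAY

print(f"{requests_per_second:,.0f} requests per second on average")
# prints "925,926 requests per second on average"
```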

Of course, continuous retrieval introduces new challenges in data governance. To address this, platforms can apply strict compliance protocols aligned with global privacy frameworks, such as the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Collection may also be limited to publicly accessible information, avoiding paywalls and private logins. The networks involved can be verified and consent-based, with incentives offered to IP address owners. In this way, systems can be designed to comply with increasingly stringent regulations.

Such complex capabilities are not easy to implement. “When it becomes critical infrastructure for a business,” says Lenchner, “doing it in-house becomes a full-time engineering problem that competes with actual AI work.” Tackling this complexity demands significant resources, prompting many organizations to seek specialized platforms built for data retrieval, orchestration, and observability.

An Infrastructure for the Real World

Real-time data retrieval changes what AI systems can do within organizations. For example, a retail company can use public information to power a dynamic pricing engine, and global brands can track trademark infringements.

As the ecosystem matures, organizations that invest in this emerging layer of data infrastructure will be better positioned to build AI systems that are more responsive, more reliable, aligned with real-world conditions, and capable of continuously adapting using current web data. Over time, the distinction between AI models and the infrastructure that powers them may even begin to blur.

As Lenchner states, “The world is changing. And everything happening in the world is being uploaded to the public web. The amount of new data generated is growing and accelerating.”