Language Models Tackling Weather Forecasting Challenges

Key Takeaways

1. Language models (LLMs) fail to accurately indicate significant weather changes.

2. Chaos theory reveals that deterministic systems can remain unpredictable, complicating the use of LLMs.

3. Skygent uses a hybrid architecture for reliable weather alerts, limiting the role of LLMs to explanation.

💡 Why it matters: Critical weather decisions require clear thresholds, not probabilistic interpretations from LLMs.

Language Models and Their Limitations in Meteorology

Large language models (LLMs) have become common tools for delivering weather forecasts. However, they are not designed to indicate precisely when a significant weather change occurs. This gap may seem trivial, but it has important practical implications.

Modern numerical weather prediction systems, such as the ECMWF IFS, provide remarkably accurate forecasts. With a resolution of about 9 km and frequent updates, this data is of exceptional quality. The real challenge lies in interpreting changes in these forecasts. The issue is not the forecast itself, but deciding when a change in the data is truly significant.

The Impact of Chaos Theory

My interest in this problem does not stem from my experience in software engineering, but from a previous study of chaos theory at the Instituto Balseiro. There, I discovered a fascinating idea: a system can be entirely deterministic while remaining practically unpredictable. This notion has profoundly influenced my understanding of AI systems.

While building artificial intelligence systems, I found that many of them did not account for this inherent property of chaotic systems: small differences in initial conditions can grow into large differences in outcome. That is an essential lesson for anyone working with forecasting models.

The Errors of Intuitive Approaches

Observing how developers build weather agents, I noticed a tendency to simplify the process: retrieve forecast data, pass it to an LLM, and then ask whether the weather has changed significantly. While this seems logical, it is problematic for systems where decision thresholds are well-defined.

In a chaotic system, the significance of a change is not a matter of language, but of precise thresholds on variables such as temperature or precipitation. An LLM is a stochastic process: excellent for generating language, but not suited to imposing deterministic boundaries on physical systems.

The Failures of LLMs

LLMs exhibit several subtle modes of failure:

They may infer trends from wording rather than actual data.
They sometimes make inconsistent decisions for similar inputs.
Their outputs are often difficult to test or reproduce.

In sectors like agriculture or energy, where a temperature variation of 3°C can have major consequences, these failures are unacceptable. Decisions must be stable and explainable. For example, a drop in temperature can represent a phase transition for a crop or a peak in energy demand.

A Simple Rule for Using LLMs

I have established a simple rule: if a decision can be written as an explicit conditional statement, it should not be an LLM prompt. This rule stems from experience with systems where the precision and traceability of decisions are crucial.

My Professional Journey

My professional background is diverse, ranging from a Marie Curie PhD in climate dynamics to leading R&D at the national meteorological institute of Uruguay. I have worked on wildfire prevention, seasonal forecasting, and climate adaptation, before turning to machine learning at Microsoft and Mercado Libre.

This experience taught me the physics behind the data and what a significant change in a physical system truly means, beyond software abstractions. I learned to treat data as measurable deltas on variables with known uncertainty bounds.

The Architecture of Skygent

Skygent is structured in five distinct layers, with a single layer dedicated to calling the LLM.

The Deterministic Guardian

At the heart of Skygent is a Python evaluator that does not interpret but performs precise calculations. It compares validated forecasts, evaluates deltas against configurable thresholds, and integrates context and the forecast horizon.

Decisions are made on this basis, and each alert is traceable: it records which variable changed and which threshold was crossed. In a professional setting, this traceability is essential. An alert is triggered only if a threshold is crossed, which makes the condition binary and testable rather than a subjective judgment.
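As an illustration only, and not Skygent's actual code, a threshold evaluator of this kind could be sketched as follows. The names Alert, THRESHOLDS, and evaluate, the variable keys, and the threshold values are all hypothetical. The sketch simply records the forecast horizon on each alert, which is one assumed way of carrying that context.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Alert:
    """Traceable record of one threshold crossing."""
    variable: str
    previous: float
    current: float
    delta: float
    threshold: float
    horizon_hours: int


# Hypothetical thresholds on absolute deltas; real values would be configurable.
THRESHOLDS: Dict[str, float] = {
    "temperature_c": 3.0,
    "rain_probability_pct": 20.0,
}


def evaluate(
    previous: Dict[str, float],
    current: Dict[str, float],
    horizon_hours: int,
    thresholds: Optional[Dict[str, float]] = None,
) -> List[Alert]:
    """Compare two forecasts and return one Alert per crossed threshold."""
    active = THRESHOLDS if thresholds is None else thresholds
    alerts: List[Alert] = []
    for variable, threshold in active.items():
        # Skip variables missing from either forecast instead of guessing.
        if variable not in previous or variable not in current:
            continue
        delta = current[variable] - previous[variable]
        # Binary condition: the alert fires only when the delta exceeds the threshold.
        if abs(delta) > threshold:
            alerts.append(
                Alert(variable, previous[variable], current[variable],
                      delta, threshold, horizon_hours)
            )
    return alerts


alerts = evaluate(
    {"temperature_c": 18.0, "rain_probability_pct": 10.0},
    {"temperature_c": 19.5, "rain_probability_pct": 50.0},
    horizon_hours=24,
)
for alert in alerts:
    print(alert)
```

With these assumed values, only the rain probability produces an alert, and the record itself states the variable, the delta, and the threshold that was crossed.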

The Limited Role of the LLM

The LLM only comes into play after the decision-making process, to translate structured data into natural language. For example, an increase in the probability of rain from 10% to 50% is explained in clear terms, but the LLM makes no decisions. It merely transforms structured JSON data into an understandable narrative.
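A rough sketch of this explanation step is shown below, under the assumption that the alert arrives as a plain dictionary. The function names, the prompt wording, and the field names are hypothetical, and the model client is passed in as a generic callable so that no particular provider API is implied.

```python
import json
from typing import Any, Callable, Dict


def build_prompt(alert: Dict[str, Any]) -> str:
    """Serialize an already-decided alert into an explanation prompt."""
    payload = json.dumps(alert, sort_keys=True)
    return (
        "Explain the following weather alert in two plain sentences. "
        "Do not judge whether the change is significant; "
        "that decision has already been made.\n"
        f"ALERT: {payload}"
    )


def explain(alert: Dict[str, Any], generate: Callable[[str], str]) -> str:
    """Ask the injected text generator for a narrative; it decides nothing."""
    return generate(build_prompt(alert))


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call, used only for this illustration."""
    return "Rain probability rose from 10% to 50%."


alert = {
    "variable": "rain_probability_pct",
    "previous": 10.0,
    "current": 50.0,
    "delta": 40.0,
    "threshold": 20.0,
}
print(explain(alert, fake_generate))
```

Because the generator is injected, the decision path can run and be tested without any model at all.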

Why This Architecture is Testable

It is difficult to fully test a pure LLM agent, because its outputs are probabilistic. In Skygent's hybrid approach, the decision logic is written purely in Python, which allows comprehensive unit testing: the suite has 204 unit tests and no LLM dependency.
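To give a sense of what such tests can look like, here is a minimal sketch using the standard unittest module. The crosses_threshold function and the test cases are hypothetical illustrations, not taken from the Skygent test suite.

```python
import unittest


def crosses_threshold(previous: float, current: float, threshold: float) -> bool:
    """Return True only when the absolute delta strictly exceeds the threshold."""
    return abs(current - previous) > threshold


class ThresholdTests(unittest.TestCase):
    def test_large_increase_triggers(self):
        self.assertTrue(crosses_threshold(10.0, 50.0, 20.0))

    def test_large_decrease_triggers(self):
        self.assertTrue(crosses_threshold(50.0, 10.0, 20.0))

    def test_delta_equal_to_threshold_does_not_trigger(self):
        # Boundary case: a delta of exactly 20 points does not exceed 20.
        self.assertFalse(crosses_threshold(10.0, 30.0, 20.0))

    def test_same_input_gives_same_decision(self):
        # Deterministic logic returns identical results on repeated calls.
        results = {crosses_threshold(18.0, 14.5, 3.0) for _ in range(100)}
        self.assertEqual(results, {True})
```

Saved as a test module, this can be run with python -m unittest. The boundary and repeatability cases are exactly the kind of assertion that is hard to make about a probabilistic output.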

Event-Based LLM Invocation

Unlike a naive agent that calls the LLM at every cycle, Skygent evaluates data every six hours and only calls the model if a threshold is crossed. This significantly reduces the number of calls, making the system more efficient and cost-effective. At GPT-4o-mini pricing, the cost is negligible and proportional to the actual information delivered.
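The gating idea can be sketched in a few lines. This is an assumption-laden illustration: run_cycles is a hypothetical name, each list element stands for one six-hourly evaluation, and comparing each cycle with the one immediately before it is only one possible choice of baseline.

```python
from typing import Callable, List


def run_cycles(
    rain_probabilities: List[float],
    threshold: float,
    explain: Callable[[float, float], str],
) -> List[str]:
    """Call `explain` (the LLM step) only for cycles that cross the threshold."""
    narratives: List[str] = []
    # Pair each cycle with the previous one: (p0, p1), (p1, p2), ...
    for previous, current in zip(rain_probabilities, rain_probabilities[1:]):
        if abs(current - previous) > threshold:
            narratives.append(explain(previous, current))
    return narratives


def fake_explain(previous: float, current: float) -> str:
    """Stand-in for the model call, used only for this illustration."""
    return f"Rain probability changed from {previous:.0f}% to {current:.0f}%."


# Five evaluations give four comparisons; only two of them exceed 20 points.
for line in run_cycles([10.0, 12.0, 50.0, 55.0, 30.0], 20.0, fake_explain):
    print(line)
```

In this made-up series the model would be called twice in four comparisons, whereas a naive agent would call it at every cycle.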

A Concrete Example

Let's take an example. Suppose an alert is configured to trigger when the delta in rain probability exceeds 20 percentage points. If the probability of rain increases from 10% to 50%, the delta is 40 points, so the threshold is crossed: an alert is triggered, and the LLM generates a narrative explanation. This evaluation is repeated every six hours.

Limitations of This Approach

This method is not universal. It fails when inputs are ambiguous or when decision boundaries cannot be defined by clear thresholds. In such cases, LLM-based architectures, like ReAct, are more appropriate.

Conclusion

The most crucial decision in building this system was determining where not to use an LLM. While LLMs are increasingly used to solve various problems, some decisions require a clear and defined structure. In these cases, it is better to explicitly encode decisions rather than approach them through language.

The complete implementation of this system is available at: github.com/ferariz/skygent