660 AI agents reveal a well-known secret: Kaiming initialization

Key Takeaways

1. A project involving 660 AI agents conducted 27,000 experiments, highlighting Kaiming initialization.

2. This technique, published in 2015 and long available as a built-in initializer in PyTorch, is often taught within the first weeks of a deep learning course.

3. The infrastructure uses advanced techniques like DiLoCo, but does not constitute a true AGI.

💡 Why it matters: This underscores the current limitations of autonomous AI systems in producing truly novel discoveries.

A Surprising Discovery

In an ambitious project, 660 AI agents ran a total of 27,000 experiments within a self-directed research framework operating on a peer-to-peer network. The project's headline finding, however, turned out to be a technique that has been well known since 2015, which raises questions about how effective the system really is.

The Highlighted Discovery

The discovery this project featured most prominently is Kaiming initialization. Researcher Kaiming He published the original paper on the topic in 2015, eleven years ago. The method has long shipped as a built-in initializer in PyTorch and is often covered within the first weeks of a deep learning course. A master's student could have found it in a few hours, which puts the results obtained by the AI agents into perspective.
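For readers who have not met it, Kaiming initialization (in its normal-distribution form for ReLU layers) draws each weight from a zero-mean Gaussian with variance 2 / fan_in, where fan_in is the number of inputs to the layer. The following is a minimal standard-library sketch of that rule, not the project's code; the helper names `kaiming_normal` and `sample_std` and the layer sizes are illustrative assumptions.

```python
import math
import random


def kaiming_normal(fan_in, fan_out, seed=0):
    """Return a fan_out x fan_in weight matrix drawn from N(0, 2 / fan_in)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)  # Kaiming rule for ReLU layers
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def sample_std(matrix):
    """Empirical standard deviation over all entries of the matrix."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# Illustrative layer sizes (assumptions, not taken from the project).
weights = kaiming_normal(fan_in=512, fan_out=256)
target_std = math.sqrt(2.0 / 512)
print(sample_std(weights), target_std)  # the two values should be close
```

In PyTorch the same rule is available as `torch.nn.init.kaiming_normal_`, so in practice nobody needs to write it by hand.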

An Impressive Infrastructure

Although the discovery is unoriginal, the technical infrastructure built for this project is impressive. It employs advanced techniques such as DiLoCo-style low-communication training with compressed updates, libp2p gossip, and CRDT ranking tables. However, this does not make the system a true artificial general intelligence (AGI). What has been built more closely resembles a random parallel search engine, equipped with a shared scoreboard and excellent branding.

The Technology That Actually Works

Standard distributed training typically requires every GPU to synchronize gradients after each forward/backward pass, which is impractical over the Internet because of latency and variable bandwidth. DiLoCo addresses this by letting each node train independently for many steps before synchronizing. For example, nodes A, B, and C each train for 100 steps locally, then share their deltas (the difference between their updated weights and the weights they started from). The deltas are averaged, every node applies the averaged update to its parameters, and the process repeats.
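The round described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's implementation: each node minimizes a simple quadratic toward its own target, the function names `local_train` and `diloco_round` are hypothetical, and the deltas are plainly averaged as in the description here (the published DiLoCo method reportedly passes the averaged delta through an outer optimizer instead).

```python
def local_train(weights, target, steps, lr):
    """Run `steps` gradient-descent steps on 0.5 * (w - target)^2 per coordinate."""
    w = list(weights)
    for _ in range(steps):
        # The gradient of 0.5 * (w - t)^2 with respect to w is (w - t).
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def diloco_round(global_w, node_targets, steps=100, lr=0.1):
    """One round: local training on every node, then average and apply the deltas."""
    deltas = []
    for target in node_targets:
        local_w = local_train(global_w, target, steps, lr)
        deltas.append([lw - gw for lw, gw in zip(local_w, global_w)])
    avg_delta = [sum(column) / len(deltas) for column in zip(*deltas)]
    return [gw + d for gw, d in zip(global_w, avg_delta)]


# Three nodes (A, B, C), each with its own toy objective (assumed values).
node_targets = [[1.0, 0.0], [2.0, 3.0], [3.0, 6.0]]
global_w = [0.0, 0.0]
for _ in range(3):
    global_w = diloco_round(global_w, node_targets)
print(global_w)  # ends up near the mean of the three targets
```

The point of the design is that nodes communicate once per 100 local steps rather than once per step, which is what makes training over ordinary Internet links tolerable.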

Compression Techniques

To reduce the amount of data sent, the project employs two compression techniques: SparseLoCo and Parcae. SparseLoCo sends only the weight updates of largest magnitude, while Parcae groups adjacent transformer layers before selecting the top-k. Together, these techniques achieve a compression ratio of 195x, reducing each round's payload from around 1 GB to 5.5 MB (5.5 MB × 195 ≈ 1.07 GB).
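The core idea of sending only the largest-magnitude updates is top-k sparsification. Here is a minimal sketch of that general idea, assuming a flat list of update values; the function names `top_k_sparsify` and `densify` and the sample numbers are illustrative, and the sketch does not reproduce SparseLoCo's or Parcae's actual details.

```python
def top_k_sparsify(delta, k):
    """Keep the k entries of largest magnitude as (index, value) pairs."""
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = sorted(ranked[:k])  # restore index order for a stable wire format
    return [(i, delta[i]) for i in keep]


def densify(pairs, size):
    """Rebuild a dense update on the receiving side; dropped entries become 0."""
    dense = [0.0] * size
    for index, value in pairs:
        dense[index] = value
    return dense


# Toy update vector (assumed values): only 3 of 8 entries are transmitted.
delta = [0.01, -0.9, 0.05, 0.4, -0.02, 0.0, 0.7, -0.03]
sparse = top_k_sparsify(delta, k=3)
restored = densify(sparse, len(delta))
print(sparse)
print(restored)
```

The receiver gets an approximation of the update rather than the exact values, trading a small loss of precision for a much smaller payload.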

Architectural Problem

A fundamental issue with the system lies in the architecture of the agents. Their reasoning loop is simple and provides neither persistence nor causal understanding. Each agent reads the results of previous experiments, generates a hypothesis with a language model, runs an experiment, and records the result. When the session resets, however, everything the agent has learned is lost. The "discovery" of Kaiming initialization was therefore merely a retrieval of knowledge already stored in the pre-trained language model, presented as an original finding.
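The loop described above can be made concrete with a toy sketch. Everything in it is a hypothetical stand-in: `propose` replaces the language-model call with a random perturbation, `run_experiment` replaces a training run with a simple quadratic score, and `AgentSession` is an assumed name, not the project's class.

```python
import random


def propose(history, rng):
    """Stand-in for the language-model call: perturb the best guess so far."""
    if not history:
        return rng.uniform(-10.0, 10.0)
    best_guess, _ = max(history, key=lambda record: record[1])
    return best_guess + rng.gauss(0.0, 1.0)


def run_experiment(guess):
    """Toy objective standing in for a training run; it peaks at guess == 3."""
    return -(guess - 3.0) ** 2


class AgentSession:
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.history = []  # in-memory only: nothing survives a reset

    def step(self):
        guess = propose(self.history, self.rng)  # read results, form a hypothesis
        score = run_experiment(guess)            # execute the experiment
        self.history.append((guess, score))      # record the result
        return guess, score


session = AgentSession(seed=0)
for _ in range(20):
    session.step()
records_before_reset = len(session.history)

session = AgentSession(seed=1)  # a reset builds a fresh session with empty history
records_after_reset = len(session.history)
print(records_before_reset, records_after_reset)
```

Nothing in this loop builds a model of why one guess beats another, and nothing carries over between sessions, which is the limitation the section describes.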

Conclusion

This project, while technically impressive, demonstrates the current limitations of autonomous AI systems. That its headline result was Kaiming initialization shows how far these systems still have to go before they produce genuinely new and meaningful discoveries.