Talkie-1930: The Revolutionary Retro Language Model

Key Takeaways

1. talkie-1930-13b-base is a 13-billion-parameter language model trained on historical texts published before 1931.

2. The fine-tuned model, talkie-1930-13b-it, is trained on instruction-response data extracted from pre-1931 reference works.

3. Both models are licensed under Apache 2.0, and the base model's training data has fallen into the public domain, so it carries no copyright restrictions.

💡 Why it matters: Talkie-1930 tests whether a model can predict events and invent ideas that lie beyond its training data, which opens up new research questions in AI.

A Language Model Inspired by the Past

The base model, talkie-1930-13b-base, has 13 billion parameters and weighs in at 53.1 GB. It was trained on a corpus of 260 billion tokens consisting exclusively of historical English-language texts, all predating 1931. The aim is to explore what a language model can do when it is built only on data that has long since passed out of copyright.

A second model, talkie-1930-13b-it (26.6 GB), is a fine-tuned version of the base model. Its fine-tuning relies on a new dataset of instruction-response examples extracted from reference works published before 1931. This model is designed to power a chat interface.
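
The figures quoted above allow a quick sanity check. The following sketch divides each reported file size by the parameter count, and the token count by the parameter count. It assumes decimal gigabytes (1 GB = 1e9 bytes) and exactly 13e9 parameters, neither of which the source confirms.

```python
# Back-of-the-envelope ratios from the figures quoted in the article.
# Assumptions: 1 GB = 1e9 bytes, and "13 billion" means exactly 13e9.
PARAMS = 13e9
TOKENS = 260e9

def bytes_per_param(size_gb, params=PARAMS):
    """Checkpoint size in bytes divided by the parameter count."""
    return size_gb * 1e9 / params

base_bpp = bytes_per_param(53.1)   # talkie-1930-13b-base
chat_bpp = bytes_per_param(26.6)   # talkie-1930-13b-it
tokens_per_param = TOKENS / PARAMS

print(f"base: {base_bpp:.2f} bytes/param")
print(f"chat: {chat_bpp:.2f} bytes/param")
print(f"training tokens per parameter: {tokens_per_param:.0f}")
```

The ratios come out near 4 and 2 bytes per parameter, which would be consistent with 32-bit and 16-bit weights respectively. That reading is an inference from the file sizes, not something the source states.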

License and Data Access

Both models are available under the permissive Apache 2.0 license. The training data for the base model is entirely in the public domain: in the United States, the copyright cutoff currently sits at January 1, 1931, so works published before that date can be used without copyright restrictions. The training data itself has not been published, and it is hoped that the creators of talkie will consider releasing it to support further research.

Research Goals and Challenges

The report accompanying these models sets out several research goals. One is to test how well models can predict future events. For example, the researchers measured how surprised a 13-billion-parameter model, trained only on texts from before 1931, is by descriptions of historical events.
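
One common way to quantify a model's "surprise" is surprisal: the negative log-probability the model assigns to each token of a text. The sketch below is a minimal illustration of that idea, not the report's actual metric. The probabilities are made-up placeholders; in practice they would come from the model's output distribution.

```python
import math

def surprisal_bits(token_probs):
    """Total surprisal in bits: the sum of -log2(p) over the token probabilities."""
    return sum(-math.log2(p) for p in token_probs)

def mean_surprisal_bits(token_probs):
    """Average surprisal per token, so texts of different lengths are comparable."""
    return surprisal_bits(token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two event descriptions.
plausible_event = [0.5, 0.25, 0.5]     # the model finds each token fairly likely
surprising_event = [0.5, 0.001, 0.01]  # the model finds two tokens very unlikely

print(mean_surprisal_bits(plausible_event))
print(mean_surprisal_bits(surprising_event))
```

A higher mean surprisal means the text was less expected by the model.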

Another question is whether these models can invent concepts that go beyond what they were trained on. A question famously posed by Demis Hassabis asks whether a model trained on data up to 1911 could independently discover general relativity, as Einstein did in 1915.

Teaching and Programming

Can a model that has never seen modern code learn to program? The report explores this by testing whether models trained on texts from before 1931 can write new, correct Python programs after being shown a few examples. Figure 3 of the report shows an early example of this kind of test.
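
A few-shot programming test of this kind can be sketched in two parts: a prompt that shows the model some solved tasks followed by an unsolved one, and a checker that runs the model's answer against known cases. Everything below is a hypothetical illustration; the prompt layout, function names, and test cases are assumptions, not the report's actual harness.

```python
def build_few_shot_prompt(examples, task):
    """Concatenate solved (description, code) examples, then the new task."""
    parts = [f"# Task: {description}\n{code}\n" for description, code in examples]
    parts.append(f"# Task: {task}\n")
    return "\n".join(parts)

def passes_tests(source, func_name, cases):
    """Run candidate source and compare func_name's output with expected values.

    Returns False on any error, including syntax errors in the candidate.
    Note: exec on model output should only be done inside a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False

examples = [("add one to x", "def add_one(x):\n    return x + 1")]
prompt = build_few_shot_prompt(examples, "double x")

# Stand-in for a model completion; a real test would sample this from the model.
candidate = "def double(x):\n    return x * 2\n"
print(passes_tests(candidate, "double", [((2,), 4), ((0,), 0)]))
```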

Vegan Models and Data Ethics

"Vegan" models, meaning models trained entirely on licensed or public domain data, are a topic of ongoing interest and debate. The talkie base model appears to meet this standard, although the chat model does not fully qualify, because its fine-tuning relies on non-vegan models.

Data Generation and Optimization

To refine the model, instruction-response pairs were generated from structured historical texts, such as etiquette manuals, cookbooks, and encyclopedias. The base model was then fine-tuned on this data using a simple chat format.
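
As a rough illustration of this step, the sketch below turns entries from a structured reference work into instruction-response pairs and renders them in a plain-text chat layout. The "User:"/"Assistant:" template, the instruction wording, and the sample entry are all assumptions made for illustration; the source does not specify the actual chat format or extraction rules.

```python
def entry_to_pair(title, body):
    """Turn a (title, body) reference entry into an instruction-response pair."""
    return {"instruction": f"Tell me about {title.strip()}.", "response": body.strip()}

def format_chat(pair):
    """Render a pair in a hypothetical plain-text chat format."""
    return f"User: {pair['instruction']}\nAssistant: {pair['response']}\n"

# Illustrative placeholder entry, not taken from the real training data.
entries = [("the pelican", "  A large water bird with a pouched bill.  ")]

training_texts = [format_chat(entry_to_pair(t, b)) for t, b in entries]
print(training_texts[0])
```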

To improve the model's ability to follow instructions, the team created synthetic prompts covering a variety of tasks, such as summarizing documents or responding to information requests. They then ran online direct preference optimization (DPO), with Claude Sonnet 4.6 acting as the judge of the generated responses.
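
For reference, the standard DPO objective for a single preference pair can be written in a few lines. This sketch shows the general loss only; the value of beta and the way log-probabilities are obtained are assumptions, and the source does not describe the team's exact configuration. In the online setting, the chosen and rejected responses would be sampled from the current model and ranked by the judge.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    margin = ((policy_chosen_lp - ref_chosen_lp)
              - (policy_rejected_lp - ref_rejected_lp))
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# With no preference yet (policy equals reference) the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# The loss falls once the policy favors the chosen response more than the reference does.
print(dpo_loss(-1.0, -5.0, -2.0, -4.0))
```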

Fine-Tuning and Technical Challenges

A further round of supervised fine-tuning followed, this time on rejection-sampled, multi-turn synthetic chats between Claude Opus 4.6 and talkie. The goal was to smooth out remaining rough edges in the model's conversational abilities.
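
Rejection sampling, in its simplest form, means generating several candidates and keeping only those that clear a quality bar. The sketch below shows that pattern with placeholder functions. The generator, the scoring function, the candidate count, and the threshold are all hypothetical; the source does not say how the chats were scored or filtered.

```python
def rejection_sample(generate, score, n_candidates=4, threshold=0.7):
    """Draw n_candidates samples and return the best one, or None if even
    the best falls below the threshold."""
    candidates = [generate() for _ in range(n_candidates)]
    best = max(candidates, key=score)
    return best if score(best) >= threshold else None

# Toy stand-ins: a fixed pool of "chats" and a length-based quality score.
pool = iter(["short", "a somewhat longer chat", "mid-length"])

def toy_generate():
    return next(pool)

def toy_score(chat):
    return min(len(chat) / 20, 1.0)

kept = rejection_sample(toy_generate, toy_score, n_candidates=3)
print(kept)
```

Accepted samples would then be collected into the supervised fine-tuning set, while rejected draws are discarded.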

A major challenge in training talkie has been avoiding contamination, whether through texts from 1931 or later slipping into the corpus or through modern LLMs introducing anachronistic knowledge during fine-tuning.
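
To make the contamination problem concrete, here is a deliberately crude heuristic that flags documents mentioning a year after the cutoff. It is only an illustration and not the team's method, which the source does not describe. A filter like this produces false positives (an old text may speculate about a future year) and misses anachronisms that carry no date at all.

```python
import re

CUTOFF_YEAR = 1930  # last publication year allowed in the corpus
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def latest_year_mentioned(text):
    """Return the largest four-digit year found in the text, or None."""
    years = [int(y) for y in YEAR_PATTERN.findall(text)]
    return max(years) if years else None

def looks_anachronistic(text, cutoff=CUTOFF_YEAR):
    """Flag a document if it mentions any year later than the cutoff."""
    latest = latest_year_mentioned(text)
    return latest is not None and latest > cutoff

print(looks_anachronistic("Published in London, 1897."))
print(looks_anachronistic("Reprinted with a new foreword in 1954."))
```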

Towards Total Autonomy

The team behind talkie aspires to transcend these limitations. Although reinforcement learning with AI feedback inevitably influences the model in an anachronistic manner, they hope to use their vintage base models as judges for a fully autonomous and era-appropriate post-training pipeline.

Practical Test and Curiosity

As a test, the talkie demo was given a classic prompt: Generate an SVG of a pelican riding a bicycle. The model responded by describing an image supposedly dating back to 1860: a pelican perched on a saddle, its beak pointed forward and its feet on the handlebars, said to be inspired by observations of pelicans fishing while riding along the banks of the Rhine.