San Francisco, CA – A seemingly off-the-cuff remark by Jeremy Howard, the former president and chief scientist of Kaggle, has sparked a fascinating debate within the artificial intelligence research community. Howard, currently the founder of answer.ai and fast.ai, claimed to have created the first LLM, triggering a wave of cyber-archaeology to verify the assertion.
The incident began when someone questioned whether Howard's recent project, llms.txt, actually helps large language models (LLMs) ingest information from the web. Howard's response, "Dude, I created the first large language model," quickly drew attention and ignited a flurry of investigation.
The Case for ULMFiT
The ensuing research has revealed a surprisingly strong case for Howard's claim. In early 2018, Howard co-authored, with Sebastian Ruder, a paper titled "Universal Language Model Fine-tuning for Text Classification" (ULMFiT). The paper introduced an unsupervised pre-training and fine-tuning paradigm that achieved state-of-the-art (SOTA) results on a range of natural language processing (NLP) tasks at the time.
Crucially, even Alec Radford, the lead author of the groundbreaking GPT-1, publicly acknowledged ULMFiT as a source of inspiration. This acknowledgement adds significant weight to the argument that ULMFiT played a pivotal role in the development of subsequent LLMs.
Adding fuel to the fire, some researchers have pointed to academic survey papers suggesting that, viewed genealogically, ULMFiT is the last common ancestor of all modern large language models.
The Archaeological Dig: Defining the First LLM
Software engineer Jonathon Belotti took the debate a step further, publishing a comprehensive investigation titled "Who Was the First Large Language Model?" Belotti's research meticulously traces the lineage of LLMs backward from GPT-1, the earliest model that is generally accepted to be one.
To establish criteria for what constitutes a large language model, Belotti extracted key characteristics from GPT-1 and its successors, GPT-2 and GPT-3:
- Language Modeling: The model predicts components of human-written language (tokens) from the input it is given.
- Self-Supervised Training: The model is trained on vast amounts of unlabeled text, with its training targets derived from the text itself rather than from task-specific labeled datasets (the first two criteria are sketched in code after this list).
- Adaptability: The model can adapt to new tasks with few-shot or even one-shot learning, without requiring architectural modifications.
- Generality: The model demonstrates advanced performance across a range of text-based tasks, including classification, question answering, and parsing.
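To make the first two criteria concrete, here is a minimal toy sketch in PyTorch. It is hypothetical illustration code, not any model from the debate: the network predicts the next token at every position, and its training labels are simply the input text shifted by one, so no human annotation is involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1000, 64  # toy vocabulary and hidden sizes, chosen for illustration

class TinyLM(nn.Module):
    """A deliberately tiny next-token language model."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.LSTM(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):               # tokens: (batch, seq)
        hidden, _ = self.rnn(self.emb(tokens))
        return self.head(hidden)              # logits over the next token at each step

model = TinyLM()
tokens = torch.randint(0, VOCAB, (8, 33))    # stand-in for a batch of tokenized raw text
logits = model(tokens[:, :-1])               # predict token t+1 from tokens 1..t
loss = F.cross_entropy(                      # targets are the same text shifted by one:
    logits.reshape(-1, VOCAB),               # self-supervision, no labeled dataset needed
    tokens[:, 1:].reshape(-1))
loss.backward()
```

Every model in the lineage discussed here, from ULMFiT's LSTM to GPT-3's Transformer, trains on some variant of this shifted-text objective; the differences lie in architecture, scale, and how the pretrained model is then adapted.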
Belotti then analyzed the models that GPT-1 cited as influences: the original Transformer, CoVe, ELMo, and ULMFiT. While the Transformer architecture is foundational to modern LLMs, the original Transformer was applied chiefly to machine translation and lacked the necessary generality.
ULMFiT’s Key Contributions
The ULMFiT paper, accepted to ACL 2018, introduced an effective transfer learning method that can be applied to any task in NLP. It contributed key techniques for fine-tuning language models, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing, and it demonstrated significant improvements over the existing SOTA on six text classification tasks, reducing error rates by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, ULMFiT matched the performance of models trained from scratch on 100 times more data.
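To show how the ULMFiT stages fit together, below is a condensed sketch using the fastai library, which implements the method. The dataset (IMDB), epoch counts, and learning rates are illustrative choices for this sketch, not the paper's exact settings.

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Stage 1 (general-domain LM pretraining on WikiText-103) is already done:
# fastai ships pretrained AWD_LSTM weights, so we start from stage 2.

# Stage 2: fine-tune the pretrained language model on the target corpus.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
learn_lm.fit_one_cycle(1, 2e-2)         # one-cycle schedule, akin to the paper's
                                        # slanted triangular learning rates
learn_lm.save_encoder('ft_enc')         # keep the fine-tuned encoder

# Stage 3: fine-tune a classifier on top of the adapted language model.
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
learn = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn = learn.load_encoder('ft_enc')
learn.fit_one_cycle(1, 2e-2)                            # train the new head first
learn.freeze_to(-2)                                     # gradual unfreezing...
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))  # ...with discriminative
learn.unfreeze()                                        # learning rates per layer
learn.fit_one_cycle(1, slice(1e-3 / (2.6 ** 4), 1e-3))
```

The 2.6 divisor between layer groups and the stage-wise unfreezing come from the paper; they let earlier layers, which capture more general language knowledge, change more slowly than the task-specific head.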
The Verdict: A Complex Legacy
While the debate continues, the evidence suggests that ULMFiT was a crucial stepping stone in the evolution of large language models. Whether it qualifies as the first LLM depends on the specific criteria used. However, its influence on subsequent models, including GPT-1, is undeniable.
The discussion highlights the complex and often iterative nature of scientific progress. It also serves as a reminder that the history of AI is still being written, and that revisiting past innovations can provide valuable insights into the future of the field.
References:
- Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Belotti, J. (2024). Who Was the First Large Language Model? [Blog post].