Today’s reflection is based on a recent post by Andrej Karpathy, co-founder of OpenAI and former head of AI at Tesla. Karpathy puts forward a provocative idea: large language models (LLMs) may not have much to do with language per se. This piece develops his discussion of LLMs as universal tools and outlines the potential implications of this view for the future of AI.
It is remarkable, and perhaps a little puzzling, that large language models may not actually have much to do with language per se; it is historical development that has led to this association. Although they are called “language” models, their underlying principles and mechanisms of operation are far from limited to natural language processing.
“What I’ve seen though is that the word ‘language’ is misleading people to think LLMs are relegated to text applications.” Andrej Karpathy on X.com
Originally, these models were developed to work with text, which led to their name. However, their capabilities extend far beyond text, and they can be applied to many other types of data. At their core, these models are a highly versatile technology for the statistical modelling of token streams. Tokens are basic units of information that can represent words, characters, or other discrete elements.
In the context of natural language processing, tokens can be, for example, words or even single characters. You can find out what OpenAI considers a token in their language models on their website. However, in other domains, tokens can represent, for example, pixels in an image, audio frequencies in an audio recording, or movements within a robot’s action plan.
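As a small illustration, OpenAI’s open-source tiktoken library can show how a piece of text breaks into tokens. The exact splits and integer IDs depend on the encoding; “cl100k_base” is the one used by GPT-4-era models, and the snippet is only meant to make the idea of tokenization concrete.

```python
# Illustration of text tokenization with OpenAI's open-source tiktoken
# library (pip install tiktoken). Token boundaries and IDs depend on the
# chosen encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Large language models predict tokens."
token_ids = enc.encode(text)                 # integer IDs the model actually sees
tokens = [enc.decode([t]) for t in token_ids]  # the corresponding text fragments

print(token_ids)
print(tokens)
```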
A more accurate name, according to Karpathy, would be “autoregressive transformers”. This would describe their true nature and mechanism of operation much better. Autoregressive transformers are models that predict the next token based on the previous tokens in a sequence, which is the basic principle of how LLMs work.
“Autoregressive” means that the model builds on previous tokens to generate each new one. Take sentence generation as an example: when the model generates a new word, it takes into account all the words it has already generated in order to predict the next one. This is similar to how we humans compose a sentence, choosing each new word based on the context of the previous ones. The approach is “autoregressive” because the model feeds its own output back in to generate the next step.
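The loop itself is simple to sketch. The snippet below is a schematic illustration only: `next_token_distribution` is a hypothetical stand-in for a trained model, not a real API, and the tiny vocabulary and uniform probabilities exist purely to show the feedback loop.

```python
import random

def next_token_distribution(context):
    """Hypothetical stand-in for a trained model: given the tokens so far,
    return a probability for every token in the vocabulary. A real LLM
    computes this distribution with a transformer."""
    vocab = ["the", "cat", "sat", "on", "mat", "."]
    return {tok: 1.0 / len(vocab) for tok in vocab}  # dummy uniform distribution

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)   # condition on everything generated so far
        next_tok = random.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(next_tok)                   # feed the model's own output back in
    return tokens

print(generate(["the", "cat"]))
```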
And “transformer” has nothing to do with movie robots from the planet Cybertron transforming into cars and back, nor with the unsightly box of wires that steps voltage up or down. It is simply the name of a specific architecture that lets the model process long sequences of data efficiently and with minimal constraints. The transformer architecture underpins most modern LLMs and is what enables their high performance.
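For readers curious what sits underneath, the core operation of the architecture is scaled dot-product attention, which lets every position in a sequence look at every other position. The NumPy sketch below is a minimal, unoptimized illustration of that single operation, not a full transformer.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core transformer operation: each query position attends to all key
    positions and receives a weighted mix of the corresponding values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over key positions
    return weights @ V                                    # weighted sum of values

# Toy example: a sequence of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)               # self-attention: Q, K, V from the same sequence
print(out.shape)                                          # (4, 8) – one updated vector per token
```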
An LLM does not inherently care whether its tokens represent text fragments, image patches, audio segments, or action choices. These models can handle any data as long as it is converted into a sequence of discrete tokens, i.e. symbols drawn from a finite vocabulary (“discrete” in the mathematical sense). The key point is that once data can be represented as a sequence of tokens, a unified approach to processing becomes possible.
If we can reduce a problem to modelling such token streams, we can apply an LLM to it. This means LLMs can be used in a wide variety of domains, not just natural language processing.
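A simple way to see the principle: any signal can be turned into a token stream by discretizing it. The sketch below quantizes a tiny grayscale “image” into a flat sequence of integer tokens; this is only an illustration of the idea, not how production multimodal models tokenize images (those typically use learned codebooks or patch embeddings).

```python
import numpy as np

# Illustration only: turn a tiny 4x4 "grayscale image" into a flat sequence
# of discrete tokens by quantizing pixel intensities into 16 buckets.
image = np.array([
    [  0,  30,  60,  90],
    [120, 150, 180, 210],
    [240, 200, 160, 120],
    [ 80,  40,  20,  10],
], dtype=np.uint8)

n_buckets = 16
tokens = (image.astype(int) * n_buckets // 256).flatten().tolist()

print(tokens)  # [0, 1, 3, 5, 7, 9, 11, 13, 15, 12, 10, 7, 5, 2, 1, 0]
```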
This versatility suggests that the potential of these models extends far beyond language processing and can affect a wide range of disciplines and applications, opening up new possibilities in areas ranging from computer vision and audio processing to bioinformatics. In chemistry, where molecules can be represented as sequences of atoms and bonds, LLMs could be used to predict chemical properties or synthesis pathways.
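Chemistry already has a standard textual representation of molecules, the SMILES string, which is itself a sequence of discrete symbols. The snippet below splits one such string into tokens with a deliberately simplified regular expression; real chemistry language models use more careful tokenizers, so treat this purely as an illustration of “molecule in, token sequence out”.

```python
import re

# Simplified illustration: split a SMILES string (aspirin) into tokens.
# Real chemistry models keep multi-character atoms (Cl, Br), ring numbers
# and branch markers together in more principled ways.
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"   # aspirin

pattern = r"Cl|Br|[A-Za-z]|\d|\(|\)|=|#|\+|-|\[|\]|@"
tokens = re.findall(pattern, smiles)

print(tokens)
# ['C', 'C', '(', '=', 'O', ')', 'O', 'C', '1', '=', 'C', 'C', ...]
```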
In biology, particularly in protein research and the generation of new amino acid sequences, the discrete tokens can be the individual amino acids that make up a protein. Proteins are chains of amino acids, and each amino acid can be treated as a single “token” in that chain. Because a protein’s structure and function are determined by its amino acid sequence, LLM-style models can be used to explore new protein sequences and predict how they will fold, what properties they will have, or how they will bind to specific molecules, which is essential in the development of new drugs and treatments.
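Here the vocabulary is even smaller: the 20 standard amino acids, each written as a one-letter code. A minimal sketch of mapping a protein sequence onto token IDs follows; the ID assignment and the sample sequence are arbitrary and chosen only for illustration.

```python
# Illustration: a protein sequence as a list of token IDs.
# The 20 standard amino acids (one-letter codes) form the vocabulary;
# the ID assignment below is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
token_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

sequence = "MKVLAAGICSTW"   # an arbitrary example sequence, not a real protein
tokens = [token_id[aa] for aa in sequence]

print(tokens)
```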
With the continued development of LLM technology, we may witness the convergence of many problems into this unified modelling framework. Problems that were previously solved with task-specific models and techniques could increasingly be handled through LLMs.
The basic task boils down to predicting the next token, with the meaning and interpretation of those tokens varying by domain. Such unification could make complex problems easier to tackle, simplifying processes, reducing development costs, and accelerating innovation across domains.
If this trend does take hold, it would suggest that current deep learning frameworks are perhaps too general for most practical applications. These frameworks offer thousands of operations and layers for arbitrary configuration, providing tremendous flexibility. However, if the vast majority of problems could be solved using LLMs, much of this flexibility might be redundant. A Swiss Army knife is a great thing, but sometimes it’s better to just use a screwdriver. This could lead to the development of more specialized tools and frameworks optimized for LLM implementation and training, simplifying model development and deployment. Such specialized tools would be more efficient, more user-friendly, and better adapted to the specific needs of LLM-based applications.
One ring to rule them all? Not always!
To claim that this view fully reflects reality would be simplistic. It is likely to be only partially true. For example, real-time systems such as self-driving cars require immediate reactions to a changing environment. Here, models are needed that can process parallel sensor inputs and generate responses quickly, a task for which sequential models like LLMs may not be optimal.
Another aspect is the structure of the data. Some data have complex relationships that are not linear. For example, graph neural networks are designed to work with data that can be represented as nodes and edges. These structures cannot be easily converted to a sequence of tokens without losing important information.
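One way to see why graphs resist this treatment: the same graph can be written as many different token sequences depending on the order in which its edges are listed, while the graph itself has no inherent order. The toy snippet below is only an illustration of that mismatch, not a statement about any particular graph model.

```python
# Toy illustration: the same undirected graph written as two different
# token sequences. The graph is identical, but a purely sequential model
# sees two unrelated strings – the order-free structure of the graph is
# lost in the linearization.
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}

linearization_1 = ["A", "B", "|", "B", "C", "|", "A", "C", "|", "C", "D"]
linearization_2 = ["C", "D", "|", "A", "C", "|", "B", "C", "|", "A", "B"]

print(linearization_1 == linearization_2)  # False, although the graph is the same
```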
Although LLMs offer a powerful and versatile tool, there are areas where specific architectures and approaches are needed that cannot easily be recast as next-token prediction. Some tasks, such as modeling physical systems, simulating complex interactions, or solving problems with strong causal structure, require a deeper understanding of the structure of the data and the relationships within it. These tasks go beyond the capabilities of current LLMs, which are optimized for sequential data processing.
Sequential data processing means that the data is processed as a sequence, where the order of the elements (tokens) matters. For text, this is obvious, because the meaning of a sentence depends largely on the order of the words. In audio recordings, similarly, it is the sequence of sound frequencies that preserves the meaning of speech or music. Alternatives to sequential processing include, for example, processing data in a matrix or graph structure, where the relationships between elements are not linear. In computer vision, for instance, convolutional neural networks process an image as a two-dimensional array of pixels rather than as a sequence.
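As a contrast to sequential token processing, the sketch below runs a single convolutional layer over an image treated as a two-dimensional grid of pixels. It is a minimal PyTorch example with arbitrary layer sizes, meant only to show that the layer operates on local 2D neighbourhoods rather than a left-to-right sequence.

```python
import torch
import torch.nn as nn

# A convolutional layer operates on the image as a 2D grid, looking at
# local neighbourhoods of pixels rather than a sequence of tokens.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.randn(1, 3, 32, 32)   # batch of one RGB image, 32x32 pixels
features = conv(image)              # shape: (1, 16, 32, 32)

print(features.shape)
```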
Although it is theoretically possible to represent many types of data as tokens, in practice important information may be lost or training may become harder. Optimizing and adapting LLMs for specific tasks can be complex and may not always yield the best results. In some cases, specialized models may simply provide better performance and efficiency.
Large language models represent a huge step forward due to their ability to model a wide range of problems through a unified paradigm. This versatility opens up new possibilities and can accelerate development in many areas. At the same time, however, it is essential to maintain a critical perspective and be aware of their limitations.
The future of artificial intelligence will not be a one-size-fits-all approach, but a combination of different methods and tools that will work together and complement each other. Different problems may require different approaches. It is therefore important to have tools and frameworks that can handle these different situations.
Only in this way will we be able to effectively solve both general and highly specific problems, and fully exploit the potential that these remarkable technologies offer us!