A layman’s review of the scientific debate on what the future holds for the current artificial intelligence paradigm
A little over a year ago, OpenAI released ChatGPT, taking the world by storm. ChatGPT introduced a completely new way to interact with computers: in language that is less rigid and more natural than what we had gotten used to. Most importantly, it seemed that ChatGPT could do almost anything: it could beat most humans on the SAT and pass the bar exam. Within months it was found to play chess well and nearly pass a radiology exam, and some claimed that it had developed theory of mind.
These impressive abilities prompted many to declare that AGI (artificial general intelligence, with cognitive abilities on par with or exceeding humans) is around the corner. Yet others remained skeptical of the emerging technology, pointing out that simple memorization and pattern matching should not be conflated with true intelligence.
But how can we truly tell the difference? At the beginning of 2023, when these claims were made, there were relatively few scientific studies probing the question of intelligence in LLMs. However, 2023 has since seen several very clever scientific experiments aiming to differentiate between memorization of a corpus and the application of genuine intelligence.
The following article explores some of the most revealing studies in the field, making the scientific case for the skeptics. It is meant to be accessible to everyone, with no background required. By the end of it, you should have a pretty solid understanding of the skeptics' case.
But first, a primer on LLMs
In this section, I will explain a few basic concepts needed to understand LLMs, the technology behind GPT, without going into technical details. If you are already somewhat familiar with supervised learning and how LLMs operate, you can skip this part.
LLMs are a classic example of a machine learning paradigm called "supervised learning". To use supervised learning, we need a dataset consisting of inputs and desired outputs. These are fed to an algorithm (there are many possible models to choose from), which tries to find the relationships between the inputs and the outputs. For example, I may have real estate data: a spreadsheet with the number of rooms, the size, and the location of houses (the inputs), as well as the price at which they sold (the outputs). This data is fed to an algorithm that extracts the relationships between the inputs and the outputs: it will find how the size of the house or its location influences the price. Feeding the data to the algorithm so it can "learn" the input-output relationship is called "training".
After the training is done, we can use the model to make predictions on houses for which we do not have the price. The model will use the learned correlations from the training phase to output estimated prices. The level of accuracy of the estimates depends on many factors, most notably the data used in training.
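To make this concrete, here is a minimal sketch of the house-price example, assuming the scikit-learn library; the numbers and features are made up purely for illustration.

```python
# A minimal sketch of supervised learning on hypothetical real estate data,
# using scikit-learn's linear regression. All numbers are invented.
from sklearn.linear_model import LinearRegression

# Inputs: [number of rooms, size in square meters]; outputs: sale price
X_train = [
    [3, 70],
    [4, 90],
    [2, 50],
    [5, 120],
]
y_train = [250_000, 320_000, 180_000, 430_000]

model = LinearRegression()
model.fit(X_train, y_train)       # "training": learn the input-output relationship

# "Prediction": estimate the price of a house the model has never seen
print(model.predict([[4, 100]]))  # estimated price for a 4-room, 100 m² house
```

The same two-step pattern, training on known input-output pairs and then predicting outputs for new inputs, carries over to every example below.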
This “supervised learning” paradigm is extremely flexible to almost any scenario where we have a lot of data. Models can learn to:
- Recognize objects in an image (given a set of images and the correct label for each, e.g. “cat”, “dog” etc.)
- Classify an email as spam (given a dataset of emails that are already marked as spam/not spam)
- Predict the next word in a sentence.
LLMs fall into the last category: they are fed huge amounts of text (mostly found on the internet), where each chunk of text is broken into the first N words as the input and word N+1 as the desired output. Once their training is done, we can use them to auto-complete sentences.
In addition to large amounts of text from the internet, OpenAI also used well-crafted conversational texts in its training. Training the model on these question-answer texts is crucial to make it respond like an assistant.
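To make the "first N words in, word N+1 out" idea concrete, here is a toy sketch in plain Python, with a made-up sentence, of how such training pairs could be constructed; real LLMs operate on tokens rather than whole words and use far longer contexts.

```python
# A toy illustration of turning raw text into (input, target) pairs for
# next-word prediction. The sentence and context length are arbitrary.
text = "the cat sat on the mat"
words = text.split()

N = 3  # context length: use the first N words to predict word N+1
pairs = []
for i in range(len(words) - N):
    context = words[i : i + N]   # input: N consecutive words
    target = words[i + N]        # desired output: the word that follows
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# ['the', 'cat', 'sat'] -> on
# ['cat', 'sat', 'on'] -> the
# ['sat', 'on', 'the'] -> mat
```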
How exactly the prediction works depends on the specific algorithm used. LLMs use an architecture known as a "transformer", whose details are not important here. What is important is that LLMs have two "phases": training and prediction. During training they are given texts from which they extract correlations between words; during prediction they are given a text to complete. Do note that the entire supervised learning paradigm assumes that the data seen during training is similar to the data used for prediction. If you use the model on data from a completely different origin (e.g., real estate data from another country), the accuracy of its predictions will suffer.
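As an illustration of the prediction phase, here is a sketch that asks a small, openly available model (GPT-2, via the Hugging Face transformers library, which this example assumes is installed) to auto-complete a prompt.

```python
# A sketch of the "prediction" phase: a pretrained transformer repeatedly
# predicts the next token to auto-complete a prompt.
# Assumes the `transformers` package and the small GPT-2 model weights.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The house prices in this neighborhood are", max_new_tokens=20)
print(result[0]["generated_text"])
```

No further training happens here: the model only applies the word correlations it extracted during its training phase.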
Now back to intelligence
So did ChatGPT, by training to auto-complete sentences, develop intelligence? To answer this question, we must first define "intelligence". Here's one way to define it: "IN73LL1G3NC3 15 7H3 4B1L17Y 70 4D4P7 70 CH4NG3".
Did you get it? If you didn't, ChatGPT can explain: it readily deciphers the oddly spelled phrase as "intelligence is the ability to adapt to change".
It certainly appears as if ChatGPT developed intelligence, as it was flexible enough to adapt to the new "spelling". Or did it? You, the reader, may have adapted to a spelling you had never seen before, but ChatGPT was trained on huge amounts of data from the internet, and this very example can be found on many websites. When GPT explained this phrase, it simply used words similar to those found in its training data, and that does not demonstrate flexibility. Would it have been able to exhibit "IN73LL1G3NC3" if that phrase did not appear in its training data?
That is the crux of the LLM-AGI debate: has GPT (and LLMs in general) developed true, flexible intelligence, or is it only repeating variations on texts it has seen before?
How can we separate the two? Let’s turn to science to explore LLMs’ abilities and limitations.
Suppose I tell you that Olaf Scholz was the ninth Chancellor of Germany. Can you then tell me who the ninth Chancellor of Germany was? That may seem trivial to you, but it is far from obvious for LLMs.
In this brilliantly straightforward paper, researchers queried ChatGPT for the names of the parents of 1000 celebrities (for example: "Who is Tom Cruise's mother?"), which ChatGPT was able to answer correctly 79% of the time ("Mary Lee Pfeiffer" in this case). The researchers then took the questions GPT answered correctly and phrased the opposite question: "Who is Mary Lee Pfeiffer's son?". While the same knowledge is required to answer both, GPT succeeded on only 33% of these queries.
Why is that? Recall that GPT has no “memory” or “database” — all it can do is predict a word given a context. Since Mary Lee Pfeiffer is mentioned in articles as Tom Cruise’s mother more often than he is mentioned as her son — GPT can recall one direction and not the other.
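For the curious, here is a rough sketch of what querying both directions could look like using the OpenAI Python client; the model name and the single celebrity pair below are placeholders for illustration, and the actual study used a much larger automated setup with its own evaluation procedure.

```python
# A sketch of the two-direction query behind the "reversal curse" experiment.
# Assumes the `openai` package (v1+) and an API key in the environment;
# the model name and the example pair are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Forward direction: celebrity -> parent (answered correctly ~79% of the time)
print(ask("Who is Tom Cruise's mother?"))

# Reverse direction: parent -> celebrity (answered correctly only ~33% of the time)
print(ask("Who is Mary Lee Pfeiffer's son?"))
```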