What makes a language model understand your instruction?

Most marketing news about recent language models is limited to benchmark comparisons. Even research often adopts well-selling claims about the miraculous "emergence" of models' skills, attributing it broadly to the vastness of their size and training data. Can we do any better than that? What is it, specifically, that enables a language model to perform a completely new task merely from your instruction?

Michal Štefánik
The capabilities of language models depend on the data they consume in training. But what is it about the data that matters?

The core of the practical usability of general-purpose language models lies in their ability to understand a completely new task from the user's instruction. This ability is formalized in a sort of meta-task called in-context learning. Models capable of in-context learning are not trained for any specific task, and yet, when instructed to perform a new task through a natural-language instruction alone, they perform very well.

In-context learning was initially uncovered in GPT-3. Given its unusual size of 175 billion parameters, a common assumption at the time was that in-context learning is conditioned by scale. But newer in-context learners, like FLAN, were trained on massive mixtures of over 1,000 diverse tasks and instructions and performed well despite being over 100x smaller. These results attribute the in-context learning ability to data rather than model scale. So, what is it about the data?

Data features fostering in-context learning

A strange thing about in-context learning is that it was never observed in vision models, pointing researchers towards features specific to language:

  • Hahn & Goyal state that training in-context learners requires compositional training data, where, just like in language, the predictions depend on a hierarchical structure with compositional co-references.
  • Chan+ find that in-context learners paradoxically emerge from data which violates a rudimentary assumption of machine learning, namely that the data is IID: independent and identically distributed. Instead, in-context learning requires a skewed distribution of elements, such as the Zipfian distribution of tokens in language.
  • Xie+ show that in-context learning emerges with targets conditioned on latent concepts which the model needs to extract and apply. These concepts must not be substitutable by any simpler, less generalizable rules, such as mere co-occurrences of tokens.
Fig 1: Xie+ show that in-context learning emerges from data where the correct prediction is conditioned on underlying reasoning concepts recoverable from context, such as "the previous text discusses nationality".

Intriguingly, all these works create functional in-context learners with both small (synthetic) datasets and small models, disrupting our initial assumptions about scale as a necessity for models' "understanding".

So here’s an answer to our title question: to teach a model to follow an arbitrary instruction, its training data must exhibit specific features. These features, uncovered in existing work, might feel very abstract, but from a machine learning perspective, it makes sense that they are: the model has to learn to manipulate features that are general enough to be applicable to any possible task that the user comes up with.

Putting the theory into practice

A crucial limitation is that all the experiments in the work we mention were done in silico: the tasks were not the actual tasks that users care about, and hence, the resulting models could not be compared to any scale-driven models. How can we apply these findings in practice to create better in-context learners for real-world problems?

Another obstacle is that we don't know much about our training data. How can we pick more concept-dependent data if we don't know what latent concepts our texts depend upon? In our previous work, we found that some reasoning concepts can be recovered from the structured explanations available for some datasets. But scaling the annotation of reasoning concepts to the size of a practically usable pre-training corpus would be tremendously expensive.

Still, perhaps the concept-learning ability could transfer: it could be acquired on synthetic data and later applied with natural-language instructions.

Concept-aware Data Construction

In our ACL 2024 paper Concept-aware Data Construction Improves In-context Learning of Language Models, we propose a framework for constructing a training dataset where labels depend on latent reasoning concepts. We call this framework Concept-aware Training (CoAT).

The general idea of CoAT is to enable a language model to learn from analogy: whenever the model observes an analogical reasoning concept in its input, it should be able to benefit from it, regardless of how "deep" this pattern is.

More specifically, CoAT constructs training samples from a set of concatenated examples (i.e. demonstrations) composed of an input (x) and an expected output (y), followed by the input (x_pred) for which the model is expected to predict the corresponding label (y_pred). This format is sometimes also called an instruction-tuning format:

input text: "{x_1, y_1}, <sep>, …, {x_k, y_k}, <sep>, x_pred"
label text: "y_pred"

However, in CoAT, these examples are picked such that they all link their input x_i to output y_i through the same reasoning concept C (x_i — C → y_i) as the expected correct prediction (x_pred — C → y_pred). This way, we guarantee the desirable property identified by Xie+: that the correct prediction depends on a latent reasoning concept recoverable from context (Fig. 1).
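
To make the construction concrete, below is a minimal Python sketch of how a single CoAT training sample could be assembled. The field names (text, label, concept), the <sep> delimiter, and the toy examples are illustrative assumptions, not the exact implementation from the paper.

import random

def build_coat_sample(pool, pred_example, k=3, sep=" <sep> "):
    """Assemble one CoAT training sample: k demonstrations that share
    the same latent reasoning concept C as the example to be predicted."""
    # Keep only demonstrations linked to the same concept as the target,
    # excluding the target itself.
    same_concept = [ex for ex in pool
                    if ex["concept"] == pred_example["concept"] and ex is not pred_example]
    demonstrations = random.sample(same_concept, k)

    # "{x_1, y_1} <sep> ... <sep> {x_k, y_k} <sep> x_pred"  ->  "y_pred"
    demo_text = sep.join(f"{ex['text']} {ex['label']}" for ex in demonstrations)
    return {"input_text": f"{demo_text}{sep}{pred_example['text']}",
            "label_text": pred_example["label"]}

# Toy pool in which the shared concept is "the text discusses nationality".
pool = [
    {"text": "Where is Marie Curie from?", "label": "Poland", "concept": "nationality"},
    {"text": "Where is Nikola Tesla from?", "label": "Serbia", "concept": "nationality"},
    {"text": "Where is Alan Turing from?", "label": "England", "concept": "nationality"},
]
print(build_coat_sample(pool, pool[0], k=2))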

In this general form, it does not matter much which concept C we pick, as long as the prediction really depends heavily on it.

In our experiments, we first need to come up with a way to scale concept-dependent data in CoAT into a sizeable collection. We propose to recover a specific kind of reasoning concept from the scalable TeaBReAC dataset. TeaBReAC is a synthetically augmented question-answering dataset which, thanks to its programmatic augmentation, annotates the underlying reasoning chains, i.e. the sequences of operations that lead the model from the question to the correct answer. We use these chains as the shared reasoning concept and construct training examples with CoAT.
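
Conceptually, turning TeaBReAC into CoAT training data then amounts to grouping its examples by their annotated reasoning chain and sampling demonstrations within each group. The sketch below assumes every example carries its chain of operations under a hypothetical reasoning_chain field; the actual dataset schema may differ.

from collections import defaultdict

def group_by_reasoning_chain(examples):
    """Bucket QA examples by their annotated chain of operations,
    so that concept-sharing demonstrations can be drawn from one bucket."""
    groups = defaultdict(list)
    for ex in examples:
        # The chain of operations plays the role of the shared concept C.
        chain = tuple(ex["reasoning_chain"])  # e.g. ("filter", "count", "compare")
        groups[chain].append(ex)
    # Chains with a single example cannot provide any concept-sharing demonstrations.
    return {chain: exs for chain, exs in groups.items() if len(exs) > 1}

Each bucket can then be fed to build_coat_sample from the previous sketch to produce the final training samples.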

Schematic example of our training instruction: all demonstrations share the same concept (Reasoning chain) as the instruction we train the model to predict. Below is an example including TeaBReAC's synthetic context.

The resulting model cannot be used directly in real applications because it was trained to generate synthetic (rather than natural) texts. Therefore, to recover the model's ability to interact in natural language, we further fine-tune it on a natural-language QA dataset, AdversarialQA.
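
A minimal sketch of what this second training stage could look like with Hugging Face tooling; the checkpoint identifier, prompt format, and hyperparameters here are illustrative assumptions rather than the exact setup from the paper.

import torch
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; in practice, this would be the model from the CoAT stage.
model_name = "google/t5-v1_1-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("adversarial_qa", "adversarialQA", split="train")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for example in dataset.select(range(1000)):  # small subset, for illustration only
    prompt = f"question: {example['question']} context: {example['context']}"
    answer = example["answers"]["text"][0]

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(answer, return_tensors="pt", truncation=True, max_length=32).input_ids

    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()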

Does concept learning transfer from synthetic to natural concepts?

A fundamental question is whether the concept-learning ability obtained with synthetic datasets transfers to natural language. We assess this by evaluating the model's ability to benefit from in-context examples that we know are useful for prediction: useful examples apply a reasoning concept that the model can also use for the correct prediction. To disentangle the effect of the training data, we compare CoAT-trained models to a baseline trained on identical data but without concept-sharing demonstrations.

The ability of models to benefit from informative demonstrations, measured as the relative change of performance between providing them with useful and non-useful demonstrations. Tk-random is trained on the same data as Tk-CoAT, but without the concept-sharing demonstrations.
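
The metric from the figure is a simple relative change. A small sketch, where accuracy_fn stands for any (hypothetical) evaluation routine that scores the model on a task given a fixed set of in-context demonstrations:

def demo_sensitivity(accuracy_fn, useful_demos, random_demos):
    """Relative performance change between useful (concept-sharing)
    and non-useful (random) in-context demonstrations."""
    score_useful = accuracy_fn(useful_demos)   # accuracy_fn is a hypothetical helper
    score_random = accuracy_fn(random_demos)
    return (score_useful - score_random) / score_random

# Toy usage: a model scoring 0.62 with useful demos and 0.50 with random ones
# benefits from informative demonstrations by +24 %.
print(demo_sensitivity(lambda demos: 0.62 if demos == "useful" else 0.50,
                       "useful", "random"))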

We found that concept-learning models benefit from informative demonstrations much more than our baselines. This is great news because it means that the concept-learning ability transfers well between different training concepts, even in such an extreme case as transferring from synthetic to natural-language data!

Are concept-learning models more robust?

Previous instruction-tuned models achieve admirable evaluation results, but they often rely on features that make them easily breakable. For instance, Wei+ show that models rely on the meaning of the labels. This makes the models fragile: if the user comes up with an instruction asking for unseen or non-intuitive labels, the model inevitably breaks.

Concept-based in-context learning may improve this because it encourages models to focus on general, task-agnostic concepts. To evaluate this hypothesis, we look at the relative change in performance when we replace the labels with nonsensical ones ("foo", "bar", …) or switch them ("positive" → "negative").
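
A sketch of how such a label-robustness check can be scripted; the label mappings and the toy demonstrations are illustrative, not the exact evaluation protocol of the paper.

# Remappings applied to the labels shown in the in-context demonstrations.
NONSENSICAL = {"positive": "foo", "negative": "bar"}
FLIPPED = {"positive": "negative", "negative": "positive"}

def perturb_demonstrations(demonstrations, label_map):
    """Rewrite the demonstrations' labels so that the model can only succeed
    by inferring the input-output function from the context itself."""
    return [{"text": demo["text"], "label": label_map[demo["label"]]}
            for demo in demonstrations]

demos = [
    {"text": "The movie was fantastic.", "label": "positive"},
    {"text": "I want my money back.", "label": "negative"},
]
print(perturb_demonstrations(demos, NONSENSICAL))
print(perturb_demonstrations(demos, FLIPPED))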

Evaluation of the model's ability to learn a new function, measured as the change of performance when evaluated with nonsensical or flipped labels in the instruction. CoAT models (green) perform more consistently than previous models (grey) or models trained on identical data without concept sharing (blue and red).

We find that concept-aware in-context learners really are much more agnostic to the semantics of the labels, suggesting that they rely more on the functional relations between inputs and outputs, which are necessary for robust comprehension of genuinely new user instructions.

Finally, can concept-learning models perform better on real tasks?

In our final test, we compare the performance of CoAT models (trained on only two QA datasets) to two baselines: (1) models trained on the same data without concept sharing, and (2) previous instruction-tuned models trained on huge collections of data from over 1,000 tasks. We evaluate all models on the previously unseen tasks of two task collections (SuperGLUE & Natural Instructions), 70 tasks in total.

First comparison: win rates of CoAT models against models trained without concept sharing (Tk-random) show a clear dominance of the models trained on concept-sharing data (Tk-CoAT): CoAT models win on 41 and 45 tasks by a statistically significant margin. The difference is particularly visible in reasoning tasks, where learning functional relationships applies best.
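
The per-task win counts can be tallied along the following lines; the paired bootstrap used here is a common choice for per-task significance testing, though not necessarily the exact test from the paper.

import random

def significantly_better(scores_a, scores_b, n_resamples=1000, alpha=0.05):
    """Paired bootstrap over per-example scores: does model A beat model B
    on this task by a statistically significant margin?"""
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples >= 1 - alpha

def count_task_wins(per_task_scores):
    """per_task_scores: task name -> (per-example scores of Tk-CoAT,
    per-example scores of Tk-random) on the same evaluation examples."""
    return sum(significantly_better(coat, baseline)
               for coat, baseline in per_task_scores.values())

# Toy example: one task where Tk-CoAT is clearly better, one where it is not.
tasks = {
    "copa": ([1, 1, 1, 0, 1, 1, 1, 1], [0, 1, 0, 0, 1, 0, 0, 1]),
    "rte":  ([1, 0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 0, 1, 1, 0]),
}
print(count_task_wins(tasks))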

A comparison to previous models shows that CoAT models perform better than even the larger T0 model(s) trained on 35 tasks, but on the full collection, they perform comparably or worse than Tk-Instruct and FLAN models (trained on over 1,600 tasks). However, when we look at the tasks with previously unseen labels, CoAT models fare much better. This further supports that concept-based in-context learners are especially good at learning new functional patterns, which is particularly useful for handling more complex prompts.

Wrap up

A better understanding of what matters about training data will eventually empower us to train better language models faster. To make progress towards this goal, we need the ambition to look beyond scale. We must keep pushing to decompose the "emergent" abilities into smaller pieces.

You might rightfully object that directly applying concept-dependent training on a larger scale would be difficult. However, there are other ways to put our findings into practice. For instance, many recent language models incorporate programming code into their pre-training mixes, including the original ChatGPT, more recently Llama-3, which uses four times more code than Llama-2, and Microsoft's compact Phi models, trained on textbooks combining code with its natural-language descriptions. Utilising mixtures of code and natural language makes perfect sense once you acknowledge the importance of concepts, because code is underpinned by latent concepts in much larger proportions than natural language.

Finally, note that with CoAT, we merely show the importance of a single feature of data. Our aim is not to convince you of some infinite opportunities that will open up with enough concept annotations. The takeaway is to keep your eyes open for what's happening on the less-spotlighted, but no less exciting, side of research into data theory. There's much more we have yet to understand about the role of data in models' capabilities.

Link to the paper:

Concept-aware Data Construction Improves In-context Learning of Language Models
