Programmable Transformers

We propose to hardcode the parameters of a Transformer network in a new human-interpretable notation.

1 Introduction

Deep learning is very effective at creating networks that can perform complex tasks, but we generally have little idea how those networks perform them. We here propose some new notation that makes it far more feasible to hardcode neural networks that can perform the same sorts of tasks as their learned counterparts. This has several benefits, which we take up in the Discussion (Section 3).

Our goal in this article is to hardcode the weights of a Transformer that can perform classification (is this sentence grammatical or not?) and translation (English to French). We present these networks piece by piece with editable code blocks so that the reader can follow along. While we intend to make this article self-contained, readers may find it helpful to consult the Illustrated Transformer and the Annotated Transformer to refresh their memories of how the Transformer works.

2 Notation

Our first bit of notation is for sparse vectors whose nonzero entries are small numbers. We first pick an ordered set of natural-language words (e.g., English words), one per dimension of our vector space. We will call these words semes. (The term comes from semiotics, where it denotes the smallest unit of semantic meaning.) Each seme represents a different basis vector, and we represent vectors as sums of these basis vectors. If our semes are "pig", "peregrine", and "wombat", we would represent the vector $$\langle 1, 0, -1\rangle$$ as $$\langle\langle +pig - wombat \rangle\rangle$$, while the vector $$\langle -2.1, 3.3, 0\rangle$$ would be represented as $$\langle\langle -2.1\,pig + 3.3\,peregrine \rangle\rangle$$.

When writing code, we omit the $$\langle\langle\, \rangle\rangle$$:

```
vec1: +pig -wombat
vec2: -2.1pig +3.3peregrine
```

The next bit of notation is for sparse matrices whose nonzero entries are small numbers. We represent the $$i,j$$ entry of a matrix being $$x\ne 0$$ by $$\{\{ x \, seme_i \rightarrow seme_j\}\}$$. If our three semes are again "pig", "peregrine", and "wombat", then we have

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 3 \\ 0 & 0 & 4 \end{bmatrix} = \{\{ pig\rightarrow pig,\; 2\,peregrine\rightarrow peregrine,\; 3\,peregrine\rightarrow wombat,\; 4\,wombat\rightarrow wombat \}\}$$

$$\begin{bmatrix} 0 & 0 & 3 \\ -1 & 2 & -4 \\ 0 & 0 & 0 \end{bmatrix} = \{\{ 3\,pig\rightarrow wombat,\; -peregrine\rightarrow pig,\; 2\,peregrine\rightarrow peregrine,\; -4\,peregrine\rightarrow wombat \}\}$$

and so on.

When writing code, we omit the $$\{\{ \}\}$$ and use $$>$$ in place of $$\rightarrow$$:

```
mat1: pig>pig, 2peregrine>peregrine, 3peregrine>wombat, 4wombat>wombat
mat2: 3pig>wombat, -peregrine>pig, 2peregrine>peregrine, -4peregrine>wombat
```
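To make the notation concrete, here is a minimal Python sketch (our own illustration, not part of the article's toolchain; the seme list and helper names are hypothetical) that parses these vector and matrix expressions into numpy arrays:

```python
import numpy as np
import re

SEMES = ["pig", "peregrine", "wombat"]   # one seme per dimension
IDX = {s: i for i, s in enumerate(SEMES)}

def parse_vec(expr):
    """Parse '+pig -wombat' or '-2.1pig +3.3peregrine' into a dense vector."""
    v = np.zeros(len(SEMES))
    for sign, coef, seme in re.findall(r"([+-])\s*(\d*\.?\d*)\s*([a-z]+)", expr):
        c = float(coef) if coef else 1.0
        v[IDX[seme]] += c if sign == "+" else -c
    return v

def parse_mat(expr):
    """Parse 'pig>pig, 2peregrine>wombat': entry (i, j) holds x from 'x seme_i > seme_j'."""
    m = np.zeros((len(SEMES), len(SEMES)))
    for term in expr.split(","):
        lhs, rhs = term.strip().split(">")
        sign, coef, src = re.match(r"([+-]?)(\d*\.?\d*)([a-z]+)", lhs.strip()).groups()
        c = float(coef) if coef else 1.0
        m[IDX[src], IDX[rhs.strip()]] = -c if sign == "-" else c
    return m

print(parse_vec("-2.1pig +3.3peregrine"))        # [-2.1  3.3  0. ]
print(parse_mat("3pig>wombat, -peregrine>pig"))  # (pig, wombat) = 3, (peregrine, pig) = -1
```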

3 Discussion

3.1 Interpretability

It is generally accepted that deep neural networks are uninterpretable. We show in this article that they can be made quite interpretable through better notation, with little or no change to the architecture. Unfortunately, building a network that can perform a nontrivial task is quite complex, and requires either a lot of preliminaries or intimate familiarity with the architectures in question; we will gradually build up to such networks. To whet the reader's appetite, we show here a simple (vanilla) RNN designed to do sentiment analysis, based on the VADER algorithm. (VADER itself is already extremely interpretable; this is only meant to demonstrate that a fairly simple vanilla RNN can be made just as interpretable.)

The following figure can be edited interactively. We have provided a bare minimum set of word embeddings (below called "lexicon"), so adding additional examples will probably require adding additional items to the lexicon.
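Since the interactive figure cannot run here, the following Python sketch illustrates the kind of recurrence such a hand-coded RNN computes. The lexicon, valences, and the 0.3 intensifier constant are made-up stand-ins (not VADER's actual values), and the Python branches stand in for the ReLU-based gating that the actual RNN weights implement:

```python
# Hypothetical mini-lexicon: [valence, negator?, intensifier?] per word.
LEXICON = {
    "good": [ 1.9, 0, 0],
    "bad":  [-2.5, 0, 0],
    "not":  [ 0.0, 1, 0],
    "very": [ 0.0, 0, 1],
}

def sentiment(sentence):
    """Recurrent pass: the state carries (running score, pending negation, pending boost)."""
    score, negate, boost = 0.0, 0, 0
    for word in sentence.lower().split():
        valence, is_neg, is_boost = LEXICON.get(word, [0.0, 0, 0])
        valence *= 1.0 + 0.3 * boost        # a pending intensifier scales this word
        if negate:
            valence = -valence              # a pending negation flips this word
        score += valence
        negate, boost = is_neg, is_boost    # flags apply to the following word
    return score

print(sentiment("the movie is not good"))   # -1.9
print(sentiment("the movie is very bad"))   # -3.25
```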

3.2 As a Tool to Build Intuition

The process of hardcoding a network provides valuable insight into which kinds of computation are possible at all with the various components of a particular architecture, as well as which kinds are natural (and thus likely easy to learn) and which are unnatural (and thus at least potentially harder to learn). For an example, consider the first layer of a Transformer decoder. This layer takes in an embedding of the previously emitted word and performs various manipulations on it. However, there is a residual connection, so the manipulations merely add things to the embedding of the previous word. This makes it somewhat difficult to erase the previous word's embedding, which is necessary to avoid simply repeating the previous word ad infinitum. This is relatively easy to address with a self-attention layer, but it suggests that a Transformer network would likely be well served by a pointwise dense layer (without a residual connection) between the embedding of the previous word and the first decoder layer. It also suggests an explanation for the repetitiveness of Transformer output early in training: the network has not yet learned to erase the embedding of the previous word sufficiently well. Hopefully further exploration will yield many more such insights and interventions.
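To make the erasure point concrete, here is a small numpy sketch (the weight construction is our own illustration) showing that a ReLU feed-forward block can compute $$f(x) = -x$$ exactly, so the residual stream $$x + f(x)$$ is wiped to zero. The weights are exact and specific, which is precisely why this behavior is possible but not especially natural to learn:

```python
import numpy as np

d = 3
x = np.array([0.0, 1.0, -2.0])   # embedding of the previously emitted word

# ReLU(x.W1).W2 with W1 = [I | -I] and W2 = [-I; I] computes
# -(ReLU(x) - ReLU(-x)) = -x, since ReLU(x) - ReLU(-x) = x elementwise.
W1 = np.concatenate([np.eye(d), -np.eye(d)], axis=1)   # d x 2d ("filter size" 2d)
W2 = np.concatenate([-np.eye(d), np.eye(d)], axis=0)   # 2d x d

erase = np.maximum(x @ W1, 0) @ W2
print(x + erase)   # the residual stream is wiped clean: [0. 0. 0.]
```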

3.3 Usefulness for Linguistics

Hardcoding neural networks for natural language processing lets us encode linguistic knowledge in a format that is usable for linguistic competence (e.g., grammaticality judgments, entailment judgments, translation, dialogue), thus allowing us to rigorously test whether a linguistic theory suffices to explain a particular phenomenon. Such work is not in any sense new in linguistics, but we hope that high-performing architectures from machine learning, such as the Transformer, become a valuable addition to the toolkit now that we can write interpretable Transformers.

3.4 Combining Learning with Programming

There are several ways to combine hardcoded components with learned components to achieve some of the advantages of both. We list several ideas in this vein:

4 Ingredients of Transformer Networks

In this section, we lay out the workings of a Transformer piece by piece, indicating how each piece can be programmed. In the next section, we will use these pieces to build a network that classifies sentences as grammatical or not.

4.1 Pointwise Dense Layers

We begin with the pointwise dense layer, which requires one matrix parameter (the weights) and one vector parameter (the biases). Pointwise dense layers are used to compute the inputs to dot-product attention, as well as in the feed-forward layers of the Transformer.

As a basic example, consider semes $$apple, banana, cherry, durian$$, and the dense layer with weights

$$\{\{ apple\rightarrow apple,\; -apple\rightarrow banana,\; -apple\rightarrow cherry,\; -banana\rightarrow apple,\; banana\rightarrow banana,\; -banana\rightarrow cherry,\; -cherry\rightarrow apple,\; -cherry\rightarrow banana,\; cherry\rightarrow cherry \}\}$$

and bias $$\langle\langle -durian\rangle\rangle$$. The following figure allows you to change both the parameters of the dense layer and the vectors that get run through it.
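As a non-interactive stand-in for the figure, here is the same dense layer written out in numpy (the matrix layout follows the convention above: entry $$(i, j)$$ holds the coefficient of $$seme_i \rightarrow seme_j$$):

```python
import numpy as np

SEMES = ["apple", "banana", "cherry", "durian"]

W = np.array([
    [ 1, -1, -1, 0],   # apple  -> +apple -banana -cherry
    [-1,  1, -1, 0],   # banana -> -apple +banana -cherry
    [-1, -1,  1, 0],   # cherry -> -apple -banana +cherry
    [ 0,  0,  0, 0],   # durian -> (nothing)
])
b = np.array([0, 0, 0, -1])  # bias: <<-durian>>

def dense(v):
    """Pointwise dense layer: applied independently at each position."""
    return v @ W + b

v = np.array([1, 0, 0, 0])            # <<+apple>>
print(dict(zip(SEMES, dense(v))))     # apple: 1, banana: -1, cherry: -1, durian: -1
```

Running $$\langle\langle +apple\rangle\rangle$$ through the layer yields $$\langle\langle +apple -banana -cherry -durian\rangle\rangle$$: each fruit seme asserts itself and inhibits the other two, while the bias subtracts $$durian$$ unconditionally.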

4.2 Word Embeddings

Our ultimate goal is to write down all the weights for a Transformer model that can perform a natural language task. We next discuss word embeddings. For simplicity, we omit the intermediate step of associating each word to an integer identifier, and simply map each word directly to a vector.

| Word | Embedding |
| --- | --- |
| I | $$\langle\langle +nom +sg +first +pro\rangle\rangle$$ |
| you | $$\langle\langle +nom +sg +second +pro\rangle\rangle$$ |
| he | $$\langle\langle +masc +nom +sg +third +pro\rangle\rangle$$ |
| she | $$\langle\langle +fem +nom +sg +third +pro\rangle\rangle$$ |
| it | $$\langle\langle +neut +sg +third +pro +expletive\rangle\rangle$$ |
| me | $$\langle\langle +acc +sg +first +pro\rangle\rangle$$ |
| you | $$\langle\langle +acc +sg +pl +second +pro\rangle\rangle$$ |
| him | $$\langle\langle +masc +acc +sg +third +pro\rangle\rangle$$ |
| her | $$\langle\langle +fem +acc +sg +third +pro\rangle\rangle$$ |
| we | $$\langle\langle +nom +pl +first +pro\rangle\rangle$$ |
| they | $$\langle\langle +nom +pl +third +pro\rangle\rangle$$ |
| us | $$\langle\langle +acc +pl +first +pro\rangle\rangle$$ |
| them | $$\langle\langle +acc +pl +third +pro\rangle\rangle$$ |
| my | $$\langle\langle +gen +sg +first +pro\rangle\rangle$$ |
| our | $$\langle\langle +gen +pl +first +pro\rangle\rangle$$ |
| his | $$\langle\langle +masc +gen +sg +third +pro\rangle\rangle$$ |
| her | $$\langle\langle +fem +gen +acc +sg +third +pro\rangle\rangle$$ |
| its | $$\langle\langle +neut +gen +sg +third +pro\rangle\rangle$$ |
| their | $$\langle\langle +gen +pl +third +pro\rangle\rangle$$ |
| meet | $$\langle\langle +meet +verb +plain +agentlack +patientlack\rangle\rangle$$ |
| meets | $$\langle\langle +meet +verb +thirdsg +agentlack +patientlack\rangle\rangle$$ |
| met | $$\langle\langle +meet +verb +preterite +agentlack +patientlack\rangle\rangle$$ |
| pig | $$\langle\langle +pig +noun +sg\rangle\rangle$$ |
| pigs | $$\langle\langle +pig +noun +pl\rangle\rangle$$ |

There are several things to note in these word embeddings. Firstly, a "content" word like "meet" or "pig" will generally have itself, or some other form of itself, as one of its components. (This happens primarily because we do not know of any completely interpretable encoding of semantic meaning, and we need some notion of semantics to perform certain tasks, e.g. translation. Pretrained word embeddings are at least somewhat interpretable; we will revisit this point later when we talk about hybrid hand-coded/learned approaches.) Pronouns, by contrast, are fully specifiable in terms of various axes corresponding to classic grammatical notions like case, gender, person, and number. Secondly, content words come with a few extra syntactic components describing how the particular form of the word expresses grammatical notions like person (for verbs) and number (for both verbs and nouns). For instance, "pig" is singular ($$\langle\langle +sg\rangle\rangle$$), while "pigs" is plural ($$\langle\langle +pl \rangle\rangle$$).
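In code, the embedding table is just a dictionary from words to seme vectors. A hypothetical fragment (the seme inventory shown is a small subset of what the full network uses):

```python
import numpy as np

SEMES = ["nom", "acc", "gen", "sg", "pl", "first", "second", "third",
         "masc", "fem", "neut", "pro", "pig", "noun"]  # fragment of the full inventory
IDX = {s: i for i, s in enumerate(SEMES)}

def vec(expr):
    """Turn '+nom +sg +first +pro' into a sum of basis vectors."""
    v = np.zeros(len(SEMES))
    for seme in expr.replace("+", " ").split():
        v[IDX[seme]] = 1.0
    return v

EMBEDDINGS = {
    "I":    vec("+nom +sg +first +pro"),
    "she":  vec("+fem +nom +sg +third +pro"),
    "pig":  vec("+pig +noun +sg"),
    "pigs": vec("+pig +noun +pl"),
}

def embed(sentence):
    """Map each word directly to its vector, skipping integer identifiers."""
    return np.stack([EMBEDDINGS[w] for w in sentence.split()])

print(embed("she pigs").shape)   # (2, 14): one seme vector per word
```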

We describe the intended use of some of the semes we use here:

| Type | Seme | Meaning | Example |
| --- | --- | --- | --- |
| Weirdness | $$weird$$ | Some linguistic expectation has been violated | I are |
| Number | $$sg$$ | Singular | dog |
| | $$pl$$ | Plural | dogs |
| Case | $$nom$$ | Nominative case | I/they/she/he |
| | $$acc$$ | Accusative case | me/them/her/him |
| | $$gen$$ | Genitive case | my/their/her/his |
| Person | $$first$$ | First person | I/me/mine/myself |
| | $$second$$ | Second person | you/your/yourself |
| | $$third$$ | Third person | she/her/he/him/his/they/them/their |
| Role | $$agent$$ | One who performs an action | She threw the ball |
| | $$experiencer$$ | One who experiences some perception | He saw a dog |
| | $$percept$$ | Something that is perceived | He saw a dog |
| Role requirement | $$agentlack$$ | Used for verbs that require an agent | She threw the ball |
| | $$experiencerlack$$ | Used for verbs that require an experiencer | He saw a dog |
| | $$perceptposs$$ | Used for verbs that can take, but do not require, a percept | He saw a dog / He saw |

4.3 Transformer Feed-Forward Layers

Next we consider the feed-forward layers of the Transformer. These are typically two dense layers with a ReLU nonlinearity between them; the intermediate dimension (referred to as the "filter size") is usually larger than the hidden size of the network, typically by a factor of four.

One of the primary uses we have found for the Transformer feed-forward layer is to reason about logical conjunctions ($$a$$ AND $$b$$) and logical disjunctions ($$a$$ OR $$b$$). A single dense layer does not have the representational capacity to represent either of these notions in a satisfactory manner. In a Transformer feed-forward layer, we can represent $$a$$ AND $$b$$ as

$$f_{a\, \mathrm{AND}\, b}(v) = \mathrm{ReLU}(v\cdot \{\{a\rightarrow x,\; b\rightarrow x\}\} - \langle\langle x \rangle\rangle),$$

reading the value of $$a$$ AND $$b$$ off the coefficient of $$x$$ in $$f(v)$$. Similarly, we can represent $$a$$ OR $$b$$ as

$$f_{a\, \mathrm{OR}\, b}(v) = v\cdot \{\{ a\rightarrow x,\; b\rightarrow x\}\} - f_{a\, \mathrm{AND}\, b}(v) = v \cdot \{\{ a\rightarrow x,\; b\rightarrow x\}\} - \mathrm{ReLU}(v\cdot \{\{a\rightarrow x,\; b\rightarrow x\}\} - \langle\langle x \rangle\rangle).$$

We now present editable code. The provided code maps $$apple$$ OR $$banana$$ to $$yum$$ and $$cherry$$ AND $$durian$$ to $$yuck$$. (No offense to lovers of either fruit.) As a challenge, consider how you would represent $$apple$$ OR $$banana$$ OR $$cherry$$. (You may find you need auxiliary semes!)
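Here is that construction in numpy (a sketch with three semes $$a$$, $$b$$, $$x$$; the AND computation is exactly one dense–ReLU step, and OR adds the linear passthrough term from the formula above):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)

# Semes a, b live in dims 0 and 1; the output seme x lives in dim 2.
M = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])          # {{a>x, b>x}}
x_bias = np.array([0, 0, 1])       # <<x>>

def f_and(v):
    return relu(v @ M - x_bias)    # coefficient of x: 1 iff both a and b are present

def f_or(v):
    return v @ M - f_and(v)        # a + b - (a AND b) = a OR b, for a, b in {0, 1}

for a in (0, 1):
    for b in (0, 1):
        v = np.array([a, b, 0])
        print(f"a={a} b={b}  AND={f_and(v)[2]:.0f}  OR={f_or(v)[2]:.0f}")
```

For the challenge, one natural route is to compute $$apple$$ OR $$banana$$ into an auxiliary seme first, then OR that auxiliary seme with $$cherry$$.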

4.4 Transformer Attention Layers

Finally, we come to the most iconic layer of the Transformer, the attention layer. This layer is the only layer in which the representations of different words interact.

For now, we deal only with a single attention head for simplicity. Later, we will consider multi-head attention. For the time being, we are not using positional embeddings, which greatly restricts the ability of the network to associate nouns with the correct modifiers (and you will see such errors in the output of the next figure). We will address this shortcoming later, once we have some slightly better notation.
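The following numpy sketch shows a single attention head in this notation (the weight choices are our own toy example, not the figure's): nouns emit a query in the $$adj$$ direction, adjectives advertise an $$adj$$ key, and the value matrix copies the $$red$$ seme, so the noun soaks up the adjective's color. Note the position-blindness: with two nouns and two adjectives, every noun would soak up both colors equally.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head dot-product self-attention over the rows of X (one row per word)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

SEMES = ["noun", "adj", "red"]
X = np.array([[0., 1., 1.],    # "red": +adj +red
              [1., 0., 0.]])   # "pig": +noun

BIG = 10.0                     # a large coefficient gives nearly hard attention
Wq = BIG * np.array([[0., 1., 0.], [0., 0., 0.], [0., 0., 0.]])  # {{10noun>adj}}
Wk =       np.array([[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]])  # {{adj>adj}}
Wv =       np.array([[0., 0., 0.], [0., 0., 0.], [0., 0., 1.]])  # {{red>red}}

print(attention(X, Wq, Wk, Wv))   # the "pig" row comes out close to <<+red>>
                                  # (the adjective row, with a zero query, attends uniformly)
```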

4.5 Positional Encoding, Explained

Here we explain the positional encoding of the Transformer. The positional encoding is made up of sines and cosines of different frequencies, which makes the embedding a concatenation of a collection of clock hands. This interpretation is shown in the figure below, which contains seven clocks, one for each of the frequencies we are considering. Click on any of the words (or their corresponding numbers) to see the positional embedding of that word: each word receives the 14-dimensional concatenation of the seven clock hands, each expressed as a two-dimensional vector.
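In code, the clock-hand picture looks like this (a sketch assuming the standard Transformer base of 10000 for the frequency spacing; the figure's exact frequencies may differ):

```python
import numpy as np

NUM_CLOCKS = 7
FREQS = 10000.0 ** (-np.arange(NUM_CLOCKS) / NUM_CLOCKS)   # one frequency per clock

def positional_encoding(pos):
    """Concatenate the seven clock hands: one (cos, sin) pair per frequency -> 14 dims."""
    angles = pos * FREQS
    return np.stack([np.cos(angles), np.sin(angles)], axis=-1).reshape(-1)

print(positional_encoding(0))   # at position 0 every hand points the same way: (1, 0) pairs
```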

Why is this useful? It allows us to write weights that interact with the positional embeddings (which is necessary to make position-dependent inferences) without losing interpretability. We do this as follows: first, we note that we can identify rotation matrices that move each of the clock hands forward by the same amount that moving forward one word moves them. We express this like so:

```
querypos: +0
keypos: +1
```

Here every word will pay attention to the word before it, because the key is being advanced one word while the query is not. We will put this positional embedding to use in the next section.
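Continuing the sketch above, the rotation matrices in question are block-diagonal, with one 2×2 rotation per clock, and advancing the hands by one step lands exactly on the next position's embedding:

```python
import numpy as np

# Builds on the previous sketch (NUM_CLOCKS, FREQS, positional_encoding).
def advance(steps):
    """Block-diagonal rotation moving every clock hand forward by `steps` words."""
    R = np.zeros((2 * NUM_CLOCKS, 2 * NUM_CLOCKS))
    for i, theta in enumerate(steps * FREQS):
        R[2*i:2*i+2, 2*i:2*i+2] = [[np.cos(theta), -np.sin(theta)],
                                   [np.sin(theta),  np.cos(theta)]]
    return R

pe = positional_encoding
print(np.allclose(pe(4) @ advance(1).T, pe(5)))   # True: this is what "keypos: +1" does
```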

4.6 Transformer Attention Layers with Positional Encoding

Here we demonstrate the use of the positional embedding in a self-attention layer by fixing the example from Section 4.4 to associate the correct noun with the correct adjective.
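A numpy rendering of the positional half of the idea (reusing positional_encoding, advance, and softmax from the sketches above; the sharpening constant is our own choice): queries use their own position (querypos: +0) while keys are rotated forward one word (keypos: +1), so the attention matrix concentrates on the previous position.

```python
import numpy as np

n = 5
P = np.stack([positional_encoding(i) for i in range(n)])   # n x 14 clock embeddings

BIG = 20.0                    # sharpen the softmax toward hard attention
Q = BIG * P                   # querypos: +0
K = BIG * P @ advance(1).T    # keypos:  +1
A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
print(np.round(A, 2))         # row i puts ~all its weight on column i-1
                              # (row 0, having no predecessor, falls back on itself)
```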

5 Classification Transformer for Grammaticality

In this section, we show a Transformer programmed to make grammaticality judgments in English. This example is still in the process of being translated from one description language to another, and should not be expected to work. The full network that worked on a decent number of examples is shown as a comment in the code below.

6 Seq2Seq Transformer for Translation

In this section, we show a Transformer programmed to translate English to French. This network is solving a toy version of the problem (it can handle 300 or so sentences); the author has very little knowledge of French, unfortunately, which makes writing the decoder difficult.
