N-gram Language Models

Prediction is difficult, especially about the future. But how about predicting the next few words in a sentence? We will explore a statistical model that has been a precursor to many advanced algorithms in the field of Natural Language Processing.

fujinews
12 min read · Dec 22, 2020


In the following sections, we will explore how to assign a probability score to the next word in a sentence.
For example, what word is likely to follow ‘Please turn your notebook…’? The most likely word is ‘over’, and probably not ‘food’.

A natural question is why we would want to assign a probability score to a sentence at all. Because probabilities are essential for identifying words in noisy input, as in speech recognition. They also help in tasks like spelling correction and grammatical error correction, and they are used extensively in machine translation.

Models that assign probabilities to sequences of words are called Language Models (LMs).

N-Grams

Let’s begin with the task of computing P(w|H), the probability of a word w given some history H.
Suppose H is ‘its water is so transparent that’, and we want to know the probability that the next word is ‘the’: P(the|its water is so transparent that).
One way to estimate this probability is with relative frequency counts: take a large corpus, count the number of times we see ‘its water is so transparent that’, and count the number of times it is followed by ‘the’.

Picture-I: Counting Conditional Probability with Relative Frequency Counts
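As a minimal sketch of this idea (assuming a hypothetical pre-tokenized corpus given as a list of word strings), the relative frequency estimate could be computed like this:

```python
def relative_frequency(tokens, history, word):
    """Estimate P(word | history) by relative frequency counts.

    tokens:  list of word strings (a hypothetical pre-tokenized corpus)
    history: sequence of words, e.g. ("its", "water", "is", "so", "transparent", "that")
    word:    candidate next word, e.g. "the"
    """
    history = tuple(history)
    h = len(history)
    history_count = 0   # C(history)
    followed_count = 0  # C(history followed by word)
    for i in range(len(tokens) - h):
        if tuple(tokens[i:i + h]) == history:
            history_count += 1
            if tokens[i + h] == word:
                followed_count += 1
    return followed_count / history_count if history_count else 0.0

# Example (with some hypothetical `corpus` list of tokens):
# relative_frequency(corpus, ("its", "water", "is", "so", "transparent", "that"), "the")
```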

While this method of estimating probabilities directly from counts works fine in many cases, it turns out that even the web isn’t big enough to give us good estimates in most cases. Why? Because language is creative: new sentences are created every day, and we won’t be able to count them all.

For this reason, we need cleverer ways to estimate the probability of a word w given a history H.
To represent the probability of a particular random variable Xi taking on the value “the”, or P(Xi = “the”), we will use the simplification P(the). We’ll represent a sequence of n words either as w1 … wn or w_1^n (so the expression w_1^{n−1} means the string w1, w2, …, wn−1). For the joint probability of each word in a sequence having a particular value, P(X = w1, Y = w2, Z = w3, …, W = wn), we’ll use P(w1, w2, …, wn).
How do we compute the probability of an entire sequence? Using the chain rule of probability.

Picture-II: Using Chain rule of probability to calculate joint probability distribution
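In the standard notation, the chain rule factorizes the joint probability of the sequence as:

P(w_1^n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})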

The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given the previous words. But there is a caveat: we don’t know any way to compute the exact probability of a word given a long sequence of preceding words.

The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

The bigram model approximates the probability of a word given all the previous words by using only the conditional probability given the immediately preceding word:

Picture-III: Approximation used in Bigram Model
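In symbols, this bigram approximation is:

P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})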

The assumption that the probability of a word depends only on the previous word is called a Markov assumption.

Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks n − 1 words into the past).

Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting the bigram approximation into the chain rule:

Picture-IV: Bigram Model
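Written out, the bigram estimate of the whole sequence is (with w_0 understood as a start-of-sentence marker <s>):

P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})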

How do we estimate bigram probabilities (or n-gram probabilities in general)? An intuitive way is maximum likelihood estimation, or MLE. We get the MLE estimate for the parameters of an n-gram model by taking counts from a corpus and normalizing them so that they lie between 0 and 1.

For example, to compute a particular bigram probability of a word y given a previous word x, we’ll compute the count of the bigram C(xy) and normalize by the sum of all the bigrams that share the same first word x:

Picture-V: Estimation of bigram probability using MLE

We can simplify this equation, since the sum of all bigram counts that start with a given word wn−1 must be equal to the unigram count for that word wn−1:

Picture-VI: Using MLE to estimate bi-gram probability
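A minimal sketch of this MLE estimation (assuming a hypothetical training set given as a list of sentences, each a list of tokens, with <s> and </s> boundary markers added for illustration):

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Collect the counts needed for P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]          # sentence-boundary markers
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts

def bigram_prob(prev, word, unigram_counts, bigram_counts):
    """MLE bigram probability; 0.0 if the history was never seen."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# Example on a classic toy corpus:
# uni, bi = train_bigram_mle([["i", "am", "sam"],
#                             ["sam", "i", "am"],
#                             ["i", "do", "not", "like", "green", "eggs", "and", "ham"]])
# bigram_prob("i", "am", uni, bi)   # -> 0.666... (C(i am)=2, C(i)=3)
```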

This use of relative frequencies as a way to estimate probabilities is an example of maximum likelihood estimation, or MLE. Now that we have computed the bigram statistics, what linguistic phenomena do they capture?
Some of these bigram probabilities encode facts that we think of as strictly syntactic in nature.

For pedagogical purposes we have used bigram models, but in practice trigram or 4-gram models are common. In language modeling, we also compute probabilities in log format, as log probabilities: since each probability is less than 1, multiplying many of them together yields an ever-smaller product, which can result in numerical underflow for a 4-gram or 5-gram model. Adding in log space is equivalent to multiplying in linear space, and we can exponentiate at the end if we need the raw probability.
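A small sketch of this log-space trick (the per-word probabilities here are made-up illustrative values):

```python
import math

# Illustrative per-word probabilities along a sentence (made-up values)
probs = [0.1, 0.05, 0.2, 0.003, 0.01]

# Multiplying many small probabilities risks numerical underflow...
product = 1.0
for p in probs:
    product *= p

# ...so we add log probabilities instead, and exponentiate only if needed.
log_product = sum(math.log(p) for p in probs)

assert abs(product - math.exp(log_product)) < 1e-12
```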

Evaluating Language Models

The best way to evaluate the performance of a language model is to embed it in an application and measure how much the application improves. Such end-to-end evaluation is called extrinsic evaluation. Unfortunately, running big NLP systems end-to-end is often very expensive. Instead, we use a metric that can quickly evaluate potential improvements in a language model. An intrinsic evaluation metric is one that measures the quality of a model independent of any application.

For an intrinsic evaluation of a language model we need a test set. As with many of the statistical models in our field, the probabilities of an n-gram model come from the corpus it is trained on, the training set or training corpus. We can then measure the quality of an n-gram model by its performance on some unseen data called the test set or test corpus.

So if we are given a corpus of text and want to compare two different n-gram models, we divide the data into training and test sets, train the parameters of both models on the training set, and then compare how well the two trained models fit the test set.

Whichever model assigns a higher probability to the test set — meaning it more accurately predicts the test set — is a better model.
In practice, we often just divide our data into 80% training, 10% development, and 10% test.

Perplexity: The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words.

Picture-VII: General Formula of Perplexity
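For a test set W = w_1 w_2 … w_N, the general formula is:

PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}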

We can use the chain rule to expand the joint probability distribution:

Picture-VIII: Expanding the probability distribution using chain rule

Thus, if we are computing the perplexity of W with a bigram model:

Picture-IX: Perplexity for Bigram Model (LM)

Note that because of the inverse, the higher the conditional probability of the word sequence, the lower the perplexity. Thus, minimizing perplexity is equivalent to maximizing the test set probability according to the language model.
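A minimal sketch of computing bigram perplexity in log space (reusing the hypothetical bigram_prob helper and count dictionaries from the MLE sketch above; unseen bigrams would need smoothing first, otherwise the log of zero is undefined):

```python
import math

def bigram_perplexity(test_sentences, unigram_counts, bigram_counts):
    """Perplexity of a bigram model on a test set, computed in log space."""
    log_prob_sum = 0.0
    num_words = 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = bigram_prob(prev, word, unigram_counts, bigram_counts)
            log_prob_sum += math.log(p)   # assumes p > 0, i.e. a smoothed model
            num_words += 1
    return math.exp(-log_prob_sum / num_words)
```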

Smoothing

What do we do with words that are in our vocabulary (they are not unknown words) but appear in a test set in an unseen context? To keep a language model from assigning zero probability to these unseen events, we’ll have to shave off a bit of probability mass from some more frequent events and give it to the events we’ve never seen. This modification is called smoothing or discounting.

1. Laplace Smoothing

The simplest way to do smoothing is to add one to all the bigram counts, before we normalize them into probabilities. All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing.

Laplace smoothing does not perform well enough to be used in modern n-gram models, but it usefully introduces many of the concepts that we see in other smoothing algorithms.
The unsmoothed maximum likelihood estimate of the unigram probability of the word wi is its count ci normalized by the total number of word tokens N:

Picture-X: Unigram probability of a word

Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations:

Picture-XI: Laplace Smoothing of Unigram Probability
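Putting both adjustments together, the Laplace-smoothed unigram estimate is:

P_{\mathrm{Laplace}}(w_i) = \frac{c_i + 1}{N + V}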

It is convenient to describe how a smoothing algorithm affects the numerator by defining an adjusted count c*. Since we are only changing the numerator, in addition to adding 1 we’ll also need to multiply by a normalization factor N/(N + V):

Picture-XII: Adjusted Count for Laplace Smoothing

This adjusted count can then be normalized by N, turning it back into a probability.

Picture-XIII: Formula for Laplace Smoothing on Bigram Model
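A small sketch of the Laplace-smoothed bigram estimate (reusing the hypothetical count dictionaries from the MLE sketch above):

```python
def laplace_bigram_prob(prev, word, unigram_counts, bigram_counts, vocab_size):
    """P_Laplace(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# Example:
# V = len(uni)   # vocabulary size (here including boundary markers, for illustration)
# laplace_bigram_prob("i", "am", uni, bi, V)
```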

A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts. Instead of referring to the discounted counts c∗, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:

Picture-XIV: Discounting factor

2. Add-k smoothing
One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (0.5? 0.05? 0.01?). This algorithm is therefore called add-k smoothing.

Picture-XV: Add-k Smoothing Algorithm
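In the same notation as the Laplace case, the add-k bigram estimate is:

P^{*}_{\mathrm{Add\text{-}}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}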

Add-k smoothing requires that we have a method for choosing k; this can be done, for example, by optimizing on a devset. Although add-k is useful for some tasks (including text classification), it turns out that it still doesn’t work well for language modeling.

3. Backoff and Interpolation
Suppose we are trying to compute P(wn|wn−2,wn−1) but have no examples of the particular trigram wn−2 wn−1 wn. We can instead estimate its probability using the bigram probability P(wn|wn−1). Similarly, if we don’t have counts to compute P(wn|wn−1), we can look to the unigram P(wn).

Sometimes using less context is a good thing, helping to generalize more for contexts that the model hasn’t learned much about.
There are two ways to use this n-gram “hierarchy”.

i. In backoff, we use the trigram if the evidence is sufficient; otherwise we back off to the bigram, and otherwise to the unigram.
ii. In interpolation, we always mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram, and unigram counts.

In simple linear interpolation, we combine different-order n-grams by linearly interpolating all the models. Thus, we estimate the trigram probability by mixing together the unigram, bigram, and trigram probabilities, each weighted by a λ:

Picture-XVI: Linear Interpolation of Trigram Proabability

The λ weights are chosen such that they sum to 1.
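A minimal sketch of simple linear interpolation (the λ values here are illustrative defaults; in practice they are tuned on a held-out set):

```python
def interpolated_trigram_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram estimates; the lambdas must sum to 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Example with made-up component probabilities:
# interpolated_trigram_prob(p_uni=0.0001, p_bi=0.01, p_tri=0.2)
```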

In this article, we have explored one of the foundational concepts of natural language processing: n-gram language models, which are used in many applications, and how to evaluate these models intrinsically using the perplexity score. To conclude, I have written a brief summary of the smoothing techniques used in language modeling. Fin.
