Dive into Deep Learning
Table Of Contents
Dive into Deep Learning
Table Of Contents

13.5. Subword Embedding (fastText)

English words usually have internal structures and formation methods. For example, we can deduce the relationship between “dog”, “dogs”, and “dogcatcher” by their spelling. All these words have the same root, “dog”, but they use different suffixes to change the meaning of the word. Moreover, this association can be extended to other words. For example, the relationship between “dog” and “dogs” is just like the relationship between “cat” and “cats”. The relationship between “boy” and “boyfriend” is just like the relationship between “girl” and “girlfriend”. This characteristic is not unique to English. In French and Spanish, a lot of verbs can have more than 40 different forms depending on the context. In Finnish, a noun may have more than 15 forms. In fact, morphology, which is an important branch of linguistics, studies the internal structure and formation of words.

In word2vec, we did not directly use morphology information. In both the skip-gram model and continuous bag-of-words model, we use different vectors to represent words with different forms. For example, “dog” and “dogs” are represented by two different vectors, while the relationship between these two vectors is not directly represented in the model. In view of this, fastText [4] proposes the method of subword embedding, thereby attempting to introduce morphological information in the skip-gram model in word2vec.

In fastText, each central word is represented as a collection of subwords. Below we use the word “where” as an example to understand how subwords are formed. First, we add the special characters “<” and “>” at the beginning and end of the word to distinguish the subwords used as prefixes and suffixes. Then, we treat the word as a sequence of characters to extract the \(n\)-grams. For example, when \(n=3\), we can get all subwords with a length of 3:

(13.5.1)\[\textrm{"<wh"}, \ \textrm{"whe"}, \ \textrm{"her"}, \ \textrm{"ere"}, \ \textrm{"re>"},\]

and the special subword \(\textrm{"<where>"}\).

In fastText, for a word \(w\), we record the union of all its subwords with length of 3 to 6 and special subwords as \(\mathcal{G}_w\). Thus, the dictionary is the union of the collection of subwords of all words. Assume the vector of the subword \(g\) in the dictionary is \(\mathbf{z}_g\). Then, the central word vector \(\mathbf{u}_w\) for the word \(w\) in the skip-gram model can be expressed as

(13.5.2)\[\mathbf{u}_w = \sum_{g\in\mathcal{G}_w} \mathbf{z}_g.\]

The rest of the fastText process is consistent with the skip-gram model, so it is not repeated here. As we can see, compared with the skip-gram model, the dictionary in fastText is larger, resulting in more model parameters. Also, the vector of one word requires the summation of all subword vectors, which results in higher computation complexity. However, we can obtain better vectors for more uncommon complex words, even words not existing in the dictionary, by looking at other words with similar structures.

13.5.1. Summary

  • FastText proposes a subword embedding method. Based on the skip-gram model in word2vec, it represents the central word vector as the sum of the subword vectors of the word.
  • Subword embedding utilizes the principles of morphology, which usually improves the quality of representations of uncommon words.

13.5.2. Exercises

  • When there are too many subwords (for example, 6 words in English result in about \(3\times 10^8\) combinations), what problems arise? Can you think of any methods to solve them? Hint: Refer to the end of section 3.2 of the fastText paper[1].
  • How can you design a subword embedding model based on the continuous bag-of-words model?

13.5.3. Scan the QR Code to Discuss