Build a Language Model on Your WhatsApp Chats | by Bernhard Pfann, CFA | Nov, 2023

To train a language model, we need to break language into pieces (so-called tokens) and feed them to the model incrementally. Tokenization can be performed on multiple levels.

  • Character-level: Text is perceived as a sequence of individual characters (including white spaces). This granular approach allows every possible word to be formed from a sequence of characters. However, it is more difficult to capture semantic relationships between words.
  • Word-level: Text is represented as a sequence of words. However, the model’s vocabulary is limited by the existing words in the training data.
  • Sub-word-level: Text is broken down into sub-word units, which are smaller than words but larger than characters.
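
To make the difference concrete, here is a minimal sketch (plain Python, no libraries) contrasting the first two granularities on a short message:

```python
text = "Hi how are you"

# Character-level: every character, including spaces, is a token.
char_tokens = list(text)
print(char_tokens[:5])  # ['H', 'i', ' ', 'h', 'o']

# Word-level: split on whitespace; the vocabulary is limited to words seen in training.
word_tokens = text.split()
print(word_tokens)  # ['Hi', 'how', 'are', 'you']
```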

While I started off with a character-level tokenizer, I felt that training time was wasted learning the character sequences of repetitive words, rather than the semantic relationships between words across a sentence.

For the sake of conceptual simplicity, I decided to switch to a word-level tokenizer, setting aside the available libraries for more sophisticated tokenization strategies.

from typing import List

from nltk.tokenize import RegexpTokenizer


def custom_tokenizer(txt: str, spec_tokens: List[str], pattern: str = r"|\d|\w+|[^\s]") -> List[str]:
    """
    Tokenize text into words or characters using NLTK's RegexpTokenizer, treating
    given special combinations as single tokens.

    :param txt: The corpus as a single string element.
    :param spec_tokens: A list of special tokens (e.g. ending, out-of-vocab).
    :param pattern: By default the corpus is tokenized on a word level (split by spaces).
        Numbers are considered single tokens.
    :return: List of tokens.

    >> note: The pattern for character-level tokenization is '|.'
    """
    # Special tokens take precedence by being placed first in the alternation.
    pattern = "|".join(spec_tokens) + pattern
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(txt)
    return tokens

["Alice:", "Hi", "how", "are", "you", "guys", "?", "<END>", "Tom:", ... ]
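
As a rough sanity check, the same alternation can be run with Python's built-in `re` module. The "<END>" special token below is taken from the sample output above; the full special-token list used in the article is not shown:

```python
import re

spec_tokens = ["<END>"]  # illustrative; the article's full list is not shown
pattern = "|".join(spec_tokens) + r"|\d|\w+|[^\s]"

sample = "Hi how are you guys? <END>"
print(re.findall(sample_pattern := pattern, sample))
# → ['Hi', 'how', 'are', 'you', 'guys', '?', '<END>']
```

Because "<END>" appears first in the alternation, it is matched as a single token before the fallback word and punctuation patterns apply.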

It turned out that my training data has a vocabulary of ~70,000 unique words. However, since many words appear only once or twice, I decided to replace such rare words with a “&lt;UNK&gt;” special token. This reduced the vocabulary to ~25,000 words, which results in a smaller model to train later.

from collections import Counter
from typing import List, Set, Union


def get_infrequent_tokens(tokens: Union[List[str], str], min_count: int) -> Set[str]:
    """
    Identify tokens that appear no more than a minimum count.

    :param tokens: When it is the raw text in a string, frequencies are counted on character level.
        When it is the tokenized corpus as a list, frequencies are counted on token level.
    :param min_count: Threshold of occurrence to flag a token.
    :return: Set of tokens that appear infrequently.
    """
    counts = Counter(tokens)
    infreq_tokens = {k for k, v in counts.items() if v <= min_count}
    return infreq_tokens

def mask_tokens(tokens: List[str], mask: Set[str], unknown_token: str = "<UNK>") -> List[str]:
    """
    Iterate through all tokens. Any token that is part of the set is replaced by the unknown token.

    :param tokens: The tokenized corpus.
    :param mask: Set of tokens that shall be masked in the corpus.
    :param unknown_token: Replacement for masked tokens.
    :return: List of tokenized corpus after the masking operation.
    """
    return [unknown_token if t in mask else t for t in tokens]

infreq_tokens = get_infrequent_tokens(tokens, min_count=2)
tokens = mask_tokens(tokens, infreq_tokens)

["Alice:", "Hi", "how", "are", "you", "<UNK>", "?", "<END>", "Tom:", ... ]
