In this step, we preprocess the text. First, we create a Tokenizer object and fit it on our training data.
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(training_data)
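The snippet assumes that training_data is already a plain Python list of strings, one line of text per element. As a purely hypothetical toy corpus, it might look like this:
# hypothetical toy corpus standing in for the real training_data
training_data = [
    "the cat sat on the mat",
    "the dog chased the cat",
]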
Once the tokenizer has been fitted, we can inspect its word index and use it to compute the total number of words in the vocabulary.
total_words = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(total_words)
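For a toy corpus like the hypothetical one above, the printed output would look roughly like this: Keras assigns indices by word frequency starting at 1, and the +1 accounts for index 0, which is reserved for padding.
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'chased': 7}
# 8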
One of the most important tasks in this preprocessing step is splitting each sentence into n-grams. Before doing that, we need to transform each sentence into a sequence of integers: neural networks work with numbers and cannot handle strings directly, so we convert every line of the training data into a list of token IDs.
We then split each of these integer sequences into n-grams and collect them in the input_sequences list, as shown below.
input_sequences = []
for single_line in training_data:
    # transform each sentence into a sequence of integers
    token_list = tokenizer.texts_to_sequences([single_line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
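To make the splitting concrete, this is how a single line of the hypothetical corpus above would be expanded, using the token IDs from the toy word index:
# "the dog chased the cat"  ->  token_list = [1, 6, 7, 1, 2]
# n-grams appended to input_sequences:
#   [1, 6]
#   [1, 6, 7]
#   [1, 6, 7, 1]
#   [1, 6, 7, 1, 2]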
Afterwards, we choose a maximum sequence length (here, the 75th percentile of the n-gram lengths) and pad the shorter sequences with zeros up to that length; longer sequences are truncated.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# use the 75th percentile of the n-gram lengths as the target length
max_sequence_len = np.int32(np.percentile([len(x) for x in input_sequences], 75))
# pad (and truncate) the input_sequences to max_sequence_len
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
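Because max_sequence_len is the 75th percentile rather than the true maximum, roughly a quarter of the sequences are longer than it; pad_sequences truncates those (from the front, by default). As a quick, hypothetical sanity check on the result:
# the result is a 2-D integer array of shape (number_of_n_grams, max_sequence_len)
print(input_sequences.shape)
# short n-grams are left-padded with zeros, e.g. [0, 0, ..., 1, 6]
print(input_sequences[0])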