Text Classification: usage of bag-of-words or embedding layer | by Practicing DatScy | Feb, 2024


Recently I ran into a ‘difficult’ text classification dataset on Kaggle, the news-category-dataset from the Huffington Post [1]: I could not accurately classify its sentences using an Embedding layer. This text classification dataset contains 10 classes: ‘politics’, ‘food & drink’, ‘travel’, ‘business’, ‘sports’, ‘style & beauty’, ‘world news’, ‘entertainment’, ‘parenting’, ‘wellness’.

Since the rise of Transformers, an Embedding layer has generally been considered more useful than bag-of-words (i.e., word counts) for text classification, because it allows text to be clustered in a feature space based on similar lexical meaning. Sentences on different topics can be classified because each topic tends to use specific, unique words that cluster together in embedding space. Conversely, if the same words are used across several classes, it becomes harder to distinguish the classes because their embeddings occupy the same region of the space. One way to separate each class in embedding space would be to intelligently clean or pad the data for specific classes, with or without class-specific keywords. The toy example below illustrates this point.
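The following is only a minimal, hypothetical illustration (not part of the original notebook) of why shared vocabulary blurs class boundaries: the made-up 2-dimensional word vectors stand in for a learned Embedding layer, and averaging them mimics what GlobalAveragePooling1D does later in this post.

# Hypothetical toy example: shared generic words push sentence embeddings together.
import numpy as np

word_vecs = {
    'election': np.array([1.0, 0.0]),   # "politics"-flavored word
    'minister': np.array([0.9, 0.1]),
    'recipe':   np.array([0.0, 1.0]),   # "food"-flavored word
    'dinner':   np.array([0.1, 0.9]),
    'new':      np.array([0.5, 0.5]),   # generic word shared by both classes
    'best':     np.array([0.5, 0.5]),
}

def sentence_embedding(words):
    # Average the word vectors, similar to GlobalAveragePooling1D over an Embedding output
    return np.mean([word_vecs[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

politics_sentence = sentence_embedding(['new', 'best', 'election', 'minister'])
food_sentence     = sentence_embedding(['new', 'best', 'recipe', 'dinner'])
print('cosine similarity between the two sentence embeddings:', round(cosine(politics_sentence, food_sentence), 3))
# The more generic shared words a sentence contains, the closer the two class regions become.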

In this post, I classify sentences using two datasets: the news-category-dataset mentioned above and a ‘less challenging’ text classification dataset (the bbc-new-dataset), which was a Kaggle competition [2].

import numpy as np
import pandas as pd

import tensorflow as tf

from collections import Counter

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

import regex

import csv

def clean_dataset(df, list_of_columns_2_drop):

    # Remove columns that are not needed
    for i in list_of_columns_2_drop:
        df = df.drop(i, axis=1)

    # Rename X and Y: the longer text column is the input X, the other column is the label Y
    if len(df.iloc[0, 0]) > len(df.iloc[0, 1]):
        df.columns = ['X', 'Y']
    else:
        df.columns = ['Y', 'X']

    # Make every column lowercase
    df = df.applymap(str.lower)

    print("class count:", df['Y'].value_counts())

    n_classes = len(df['Y'].value_counts())
    print('number of classes: ', n_classes)

    return df

def count_the_number_of_times_a_char_appears(text, char2find):
    c = 0
    for char in text:
        if char == char2find:
            c = c + 1
    return c
def remove_text_from_start_end_marker_for_a_string(sentence):

    start_marker = ['(', '{', '[']
    end_marker = [')', '}', ']']

    clean_sen = sentence

    for ind in range(len(start_marker)):

        # Count the number of times the marker appears
        loops = count_the_number_of_times_a_char_appears(clean_sen, start_marker[ind])

        for x in range(loops):
            if start_marker[ind] in clean_sen and end_marker[ind] in clean_sen:
                start_ind = clean_sen.find(start_marker[ind])
                end_ind = clean_sen.find(end_marker[ind])

                # Cut out the marker pair and the text between them
                if start_ind == 0:
                    clean_sen = clean_sen[end_ind+1::]
                else:
                    clean_sen = clean_sen[0:start_ind-1] + clean_sen[end_ind+1::]

    return clean_sen

def remove_characters_from_string(sen_str):

    # Remove undesirable long character patterns by repeatedly matching substrings in each word.
    # These patterns should be unique, so that parts of normal words are not modified.
    to_replace = ['</p>', '<a', 'id=', "href=", 'title=', 'class=', '</a>', '</sup>', '<p>', '</b>', '<sup',
                  '?']
    replace_with = ''

    word_array = sen_str.split()

    word_array_new = []
    for wind, word in enumerate(word_array):

        # Like a regex sub: search across the word and strip every unwanted pattern.
        # If there is nothing to replace, out just stays the same.
        out = word  # initialization

        for ind, to_replace_val in enumerate(to_replace):
            # Chain the replacements so earlier removals are preserved
            out = out.replace(to_replace_val, replace_with)

        # Store the cleaned word
        word_array_new.append(out)

    sen_str_clean = ' '.join(word_array_new)

    return sen_str_clean

def clean_procedure_per_string(sen_str):

    # [Step 0] Make sentences lowercase
    sen_str = sen_str.lower()

    # [Step 1] Remove parentheses/brackets and the text in between, so that phrases stay grammatically correct
    sen_str = remove_text_from_start_end_marker_for_a_string(sen_str)

    # [Step 3] Remove long undesirable character patterns by repeatedly matching substrings in the string
    sen_str = remove_characters_from_string(sen_str)

    return sen_str

def sequential_padder(X, Y):

    # Count the samples per class
    c = Counter(Y)

    class_key = list(c.keys())
    print('class_key: ', class_key)

    class_value = list(c.values())
    print('class_value: ', class_value)

    max_class = np.argmax(class_value)
    print('max_class: ', max_class)

    samples_to_add_per_class = [class_value[max_class] - i for i in class_value]
    print('samples_to_add_per_class: ', samples_to_add_per_class)

    # --------------------------------------------

    # Copy the inputs so classes that do not need padding are already included
    X_pad = list(X)
    Y_pad = list(Y)

    for ind, samples2pad in enumerate(samples_to_add_per_class):

        print('Number of values to pad:', samples2pad)

        # Identify the class label
        class_num = class_key[ind]
        print('Class that needs padding:', class_num)

        # Find the indices of X/Y that belong to class_num
        class_num_index = [index for index, val in enumerate(Y) if val == class_num]

        # Select X and Y for the specified class
        X_class_num_select = [X[i] for i in class_num_index]
        Y_class_num_select = [Y[i] for i in class_num_index]

        # Find the number of samples available for the specified class
        curSamples = len(X_class_num_select)
        print('Number of samples available to repeat for this class:', curSamples)

        if curSamples > samples2pad:
            # If there are more class samples than the amount to pad, reuse class samples in sequential order
            for i in range(samples2pad):
                X_pad.append(X_class_num_select[i])
                Y_pad.append(Y_class_num_select[i])
        else:
            # If there are fewer class samples than the amount to pad, repeat the full class an even number of times
            num_of_full_loops = int(samples2pad/curSamples)
            print('num_of_full_loops:', num_of_full_loops)

            for i in range(num_of_full_loops):
                for j in range(curSamples):
                    X_pad.append(X_class_num_select[j])
                    Y_pad.append(Y_class_num_select[j])

            remaining_vals = samples2pad - num_of_full_loops*curSamples
            print('remaining_vals:', remaining_vals)
            for i in range(remaining_vals):
                X_pad.append(X_class_num_select[i])
                Y_pad.append(Y_class_num_select[i])

    # --------------------------------------------

    print('Length of Y matrix after padding:', len(Y_pad))

    # --------------------------------------------

    # Confirm that classes are balanced after padding
    c = Counter(Y_pad)

    class_key = list(c.keys())
    print('class_key: ', class_key)

    class_value = list(c.values())
    print('class_value: ', class_value)

    max_class = np.argmax(class_value)
    print('max_class: ', max_class)

    samples_to_add_per_class = [class_value[max_class] - i for i in class_value]
    print('samples_to_add_per_class: ', samples_to_add_per_class)

    return X_pad, Y_pad


def create_tokenizer0(sentences):

    vocabulary = list(set(' '.join(list(set(sentences))).split(' ')))
    NUM_WORDS = len(vocabulary)
    # NUM_WORDS = 2000

    # Instantiate the Tokenizer class, passing in the desired num_words and oov_token
    tokenizer = Tokenizer(num_words=NUM_WORDS, oov_token="<OOV>")

    # Fit the tokenizer to the training sentences
    tokenizer.fit_on_texts(sentences)

    return tokenizer

def encode_labels(labels):

    unq_labels = list(set(labels))
    NUM_OF_CLASSES = len(unq_labels)

    # Assign a number to each unique label
    y_assignment = dict(zip(unq_labels, np.arange(NUM_OF_CLASSES)))
    print('y_assignment ', y_assignment)

    label_sequences = [y_assignment[i] for i in labels]

    return label_sequences, y_assignment


# Dataset 0: Original data in Coursera Natural Language Processing Tensorflow (DeepLearning_AI_TensorFlow_Developer_Specialization)
# https://www.kaggle.com/competitions/learn-ai-bbc
df = pd.read_csv('/kaggle/input/bbc-new-dataset/BBC News Train.csv')

# Clean the columns of the Dataframe
list_of_columns_2_drop = ['ArticleId']
df = clean_dataset(df, list_of_columns_2_drop)
df

# Dataset 1: Another news dataset (Huffington Post)
df = pd.read_csv('/kaggle/input/news-category-dataset/NewsCategorizer.csv')

# Clean the columns of the Dataframe
list_of_columns_2_drop = ['headline', 'links', 'keywords']
df = clean_dataset(df, list_of_columns_2_drop)
df

X = []
Y = []

for i in range(len(df)):

    sen_str = df["X"].iloc[i]  # each row of the DataFrame is a string
    y_str = df["Y"].iloc[i]

    # ----------------------------
    # Clean/pre-process the sentence
    # ----------------------------
    # [0] Perform two types of string cleaning procedure: handmade and regex
    sen_str = clean_procedure_per_string(sen_str)

    # ----------------------------

    # [Step 0] Make sentences lowercase
    # Performed by the FASTER sen_str = clean_procedure_per_string(sen_str)

    # [Step 1] Remove parentheses and the text in between, so that phrases stay grammatically correct
    # Performed by the FASTER sen_str = clean_procedure_per_string(sen_str)

    # [Step 2] Remove single undesirable characters
    patterns_to_remove = r'[—"\.\€\$\£\%\d,\[\]\(\)\{\}\!-><\n]'
    sen_str = regex.sub(patterns_to_remove, "", sen_str)

    # [Step 3] Remove long undesirable character patterns by repeatedly matching substrings in the string
    # Performed by the FASTER sen_str = clean_procedure_per_string(sen_str)

    # [Step 4] Remove exact stopwords that are separated by spaces or [space and newline character]
    stopwords = ["a", "about", "above", "after", "again", "against", "and", "anything", "are", "amongst", "almost",
                 "always", "also",
                 "because", "become", "becomes", "been", "before", "being", "below", "between", "both", "but",
                 "called", "could",
                 "did", "didnt", "does", "doing", "during",
                 "each",
                 "few", "from", "further",
                 "having", "he'd", "he'll", "he's", "here", "here's", "her", "hers", "heres",
                 "herself", "him", "himself", "his", "how", "how's",
                 "i", "i'd", "i'll", "i'm", "i've", "into", "it", "it's", "its", "itself", "including", "if",
                 "let's",
                 "myself", "means",
                 "once", "only", "other", "ought", "ourselves",
                 "part", "parts", "probably",
                 "she'd", "she'll", "she's", "should", "such", "seems", "something",
                 "the", "than", "that", "that's", "thats", "their", "theirs", "themselves", "then", "there",
                 "there's", "theres", "these", "they", "they'd", "they'll", "they're", "they've", "theyre", "this",
                 "those", "through", "things", "thing", "truly",
                 "until", "up",
                 "very",
                 "we'd", "we'll", "we're", "we've", "were", "what", "what's", "whats", "when",
                 "when's", "where", "where's", "which", "while", "whoever", "who's", "whom", "why", "why's",
                 "would", "whatever",
                 "you'd", "you'll", "you're", "you've", "youve", "yourself", "yourselves"]
    for word in stopwords:
        sen_str = regex.sub(r'(?<=^)' + word + r'(?=\s)', '', sen_str)
        sen_str = regex.sub(r'(?<=\s)' + word + r'(?=\s)', '', sen_str)
        sen_str = regex.sub(r'(?<=\s)' + word + r'(?=\n)', '', sen_str)

    # [Step 5] Remove words that are 1 to 3 characters long - removes leftover single letters and short fragments
    sen_str = regex.sub(r'(?<=^)[A-Za-z]{1,3}(?=\s)', '', sen_str)
    sen_str = regex.sub(r'(?<=\s)[A-Za-z]{1,3}(?=\s)', '', sen_str)
    sen_str = regex.sub(r'(?<=\s)[A-Za-z]{1,3}(?=\n)', '', sen_str)

    # [Step 6] Finally replace all runs of multiple spaces with a single space
    sen_str = regex.sub(r'\s+', ' ', sen_str)

    # ----------------------------

    # [1] Keep only sentences with more than 10 words, to narrow the data down to realistic sentences
    if (len(sen_str.split()) > 10) and (len(y_str) > 0):
        X.append(sen_str)
        Y.append(y_str)

# General information about the dataset
print('There are ', len(X), 'sentences provided.')
unq_labels = list(set(Y))
NUM_OF_CLASSES = len(unq_labels)
print('There are ', NUM_OF_CLASSES, 'prediction categories.')
print('The prediction categories include: ', unq_labels)

# Dataset0
# There are 1490 sentences provided.
# There are 5 prediction categories.
# The prediction categories include: ['entertainment', 'politics', 'sport', 'business', 'tech']

# Dataset1 : good cleaning
# There are 16222 sentences provided.
# There are 10 prediction categories.
# The prediction categories include: ['politics', 'food & drink', 'travel', 'business', 'sports', 'style & beauty', 'world news', 'entertainment', 'parenting', 'wellness']

Another pre-processing step, used for the Huffington Post dataset: concatenate sentences from the same class label so that each sample has more descriptive features per category (a larger MAXLEN), instead of giving the data more samples per class (padding samples per category).

# Verify that Y classes have similar count values
c = Counter(Y)
c
class_label = list(c.keys())

num_of_sen2cat = 2  # concatenate 2 sentences

X_longer = []
Y_longer = []
for i in class_label:

    # Make an index for X and Y, for each category
    index = []
    for ind, val in enumerate(Y):
        if val == i:
            index.append(ind)

    tot_len = len(index)
    new_tot_len = int(tot_len/num_of_sen2cat)

    for k in range(new_tot_len):

        # For each k, index the class samples [start_ind:end_ind]
        start_ind = k*num_of_sen2cat

        if (start_ind + num_of_sen2cat) > tot_len:
            # At the end there are not enough sentences left, so stop at the last one
            end_ind = tot_len
        else:
            end_ind = start_ind + num_of_sen2cat

        indices = index[start_ind:end_ind]

        # Concatenate the selected sentences into one longer sample
        X_cat = [X[j] for j in indices]
        X_cat = np.ravel(X_cat)
        X_cat = ' '.join(X_cat)

        X_longer.append(X_cat)
        Y_longer.append(i)

which_one = 0

if which_one == 0:
    X_padded, Y_padded = sequential_padder(X, Y)
else:
    X_padded, Y_padded = sequential_padder(X_longer, Y_longer)

# Tokenizer 0: tensorflow Tokenizer
tokenizer0 = create_tokenizer0(X_padded)

word_index = tokenizer0.word_index # is dict[word] = index

NUM_WORDS = len(tokenizer0.word_index) + 1 # add 1 because the word index starts at 1 (index 0 is reserved for padding)
print('NUM_WORDS: ', NUM_WORDS)

word_index

# Convert X to sequences
sequences = tokenizer0.texts_to_sequences(X_padded)

# Calculate the maximum sentence length
sen_len = [len(i.split(' ')) for i in X_padded]
max_sen_len = np.max(sen_len)
print('Maximum sentence length: ', max_sen_len)
print('Minimum sentence length: ', np.min(sen_len))

MAXLEN = max_sen_len # make each sequence this length # accuracy 0.17
# MAXLEN = int(max_sen_len/2) # accuracy 0.15
# MAXLEN = int(max_sen_len/4) # accuracy 0.2
print('MAXLEN: ', MAXLEN)

# Pad the sequences using the correct padding and maxlen
sequences = pad_sequences(sequences, maxlen=MAXLEN, padding='post', truncating='post')

# Using scikit functions : 
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer

# Get term-frequency matrix
vectorizer = CountVectorizer()

# Learn the vocabulary dictionary and return
# document-term matrix.
X = vectorizer.fit_transform(X_longer)

# Print keywords
# after version 1.0 = get_feature_names_out()
keywords = vectorizer.get_feature_names_out()
# print('keywords: ', keywords)
print('length of keywords: ', len(keywords))

# Term-frequency matrix OR matrix of counts
tf_mat = X.toarray()

print('size of tf_mat (sentences, keywords) : ', tf_mat.shape)
X_freq_count = tf_mat

# Encode labels 
label_sequences, y_assignment = encode_labels(Y_longer)
# Train-test split on X and Y
TRAINING_SPLIT = 0.7

which_one = 0

if which_one == 0:
    train_size = int(TRAINING_SPLIT*len(label_sequences))
    X_train = [sequences[i] for i in range(train_size)]
    Y_train = [label_sequences[i] for i in range(train_size)]

    X_test = [sequences[i] for i in range(train_size, len(label_sequences))]
    Y_test = [label_sequences[i] for i in range(train_size, len(label_sequences))]

else:
    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(X_freq_count, label_sequences,
                                                        train_size=TRAINING_SPLIT,
                                                        random_state=0)

X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = np.array(Y_train)
Y_test = np.array(Y_test)

print('X_train.shape: ', X_train.shape)
print('Y_train.shape: ', Y_train.shape)
print('X_test.shape: ', X_test.shape)
print('Y_test.shape: ', Y_test.shape)

BATCH_SIZE = 1
ds_train = tf.data.Dataset.from_tensor_slices((X_train, Y_train)).batch(BATCH_SIZE)
ds_test = tf.data.Dataset.from_tensor_slices((X_test, Y_test)).batch(BATCH_SIZE)
EMBEDDING_DIM = 64

which_one = 0

if which_one == 0:

    # Sort of like clustering: learning which words are grouped together
    # with respect to the label [classification by token]
    kernel_regularizer = tf.keras.regularizers.l2(0.1)
    initializer = tf.keras.initializers.HeUniform()
    num_of_cols = len(X_train[1])
    print('num_of_cols: ', num_of_cols)
    inputs = tf.keras.Input(shape=(num_of_cols,))

    # Basic model for text classification and embeddings
    x = tf.keras.layers.Embedding(input_dim=NUM_WORDS, output_dim=EMBEDDING_DIM, input_length=MAXLEN)(inputs)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(NUM_OF_CLASSES,
                                    activation='softmax',
                                    kernel_regularizer=kernel_regularizer,
                                    kernel_initializer=initializer)(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    base_learning_rate = 0.01
    # from_logits=False says NOT to apply softmax inside the loss, because softmax is already used in the last Dense layer
    optimizer = tf.keras.optimizers.Adam(learning_rate=base_learning_rate)  # OR optimizer='adam'
    loss = tf.losses.SparseCategoricalCrossentropy(from_logits=False)  # OR loss='sparse_categorical_crossentropy'
    metrics = ['accuracy']  # OR metrics=['acc']
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

elif which_one == 1:

    # Bidirectional RNN: learning which words are grouped together with respect to the label and then
    # learning the sequential ordering of likely grouped words [classification by token]
    kernel_regularizer = tf.keras.regularizers.l2(0.1)
    initializer = tf.keras.initializers.HeUniform()
    num_of_cols = len(X_train[1])
    print('num_of_cols: ', num_of_cols)
    inputs = tf.keras.Input(shape=(num_of_cols,))

    # In notes: Text_classification_example14.ipynb for overfitting
    x = tf.keras.layers.Embedding(input_dim=NUM_WORDS, output_dim=EMBEDDING_DIM, input_length=MAXLEN)(inputs)
    # ndim=3 after the Embedding layer
    x = tf.keras.layers.Dropout(0.3)(x)
    # Bidirectional LSTM requires ndim=3
    n_a = 64
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(n_a))(x)
    x = tf.keras.layers.Dense(n_a, activation='relu', kernel_regularizer=kernel_regularizer)(x)

    outputs = tf.keras.layers.Dense(NUM_OF_CLASSES,
                                    activation='softmax',
                                    kernel_regularizer=kernel_regularizer,
                                    kernel_initializer=initializer)(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    base_learning_rate = 0.0001
    # from_logits=False says NOT to apply softmax inside the loss, because softmax is already used in the last Dense layer
    optimizer = tf.keras.optimizers.Adam(learning_rate=base_learning_rate)  # OR optimizer='adam'
    loss = tf.losses.SparseCategoricalCrossentropy(from_logits=False)  # OR loss='sparse_categorical_crossentropy'
    metrics = ['accuracy']  # OR metrics=['acc']
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

elif which_one == 2:

    # Deep neural network for classifying frequency counts [bag-of-words]
    # Learning which words repeat (or 'are important') with respect to the label,
    # but it does not consider the sequential ordering of the words in a 'projected word dimensional space'

    kernel_regularizer = tf.keras.regularizers.l2(0.1)
    initializer = tf.keras.initializers.HeUniform()

    num_of_rows, num_of_cols = X_freq_count.shape
    print('num_of_cols: ', num_of_cols)
    inputs = tf.keras.Input(shape=(num_of_cols,))

    # Architecture that helps with overfitting
    x = tf.keras.layers.Dense(128, activation='relu')(inputs)
    x = tf.keras.layers.Dropout(0.4)(x)
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.2)(x)

    outputs = tf.keras.layers.Dense(NUM_OF_CLASSES,
                                    activation='softmax',
                                    kernel_regularizer=kernel_regularizer,
                                    kernel_initializer=initializer)(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    base_learning_rate = 0.0001
    # from_logits=False says NOT to apply softmax inside the loss, because softmax is already used in the last Dense layer
    optimizer = tf.keras.optimizers.Adam(learning_rate=base_learning_rate)  # OR optimizer='adam'
    loss = tf.losses.SparseCategoricalCrossentropy(from_logits=False)  # OR loss='sparse_categorical_crossentropy'
    metrics = ['accuracy']  # OR metrics=['acc']
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

model.summary()
# Embedding layer: bbc
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=60, mode='min')

EPOCHS = 30
STEPS_PER_EPOCH = 200

history = model.fit(X_train, Y_train,
                    validation_data=(X_test, Y_test),
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH,
                    callbacks=[early_stopping])

Using the Embedding layer for the BBC dataset I reached roughly 0.8 accuracy after 30 epochs; it could be trained for 50–100 epochs for more stable accuracy results, which can be checked from the training curves as sketched below.
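As a quick way to inspect that claim, here is a minimal sketch (my addition, assuming matplotlib is available and that the `history` object returned by model.fit above is still in memory):

# Hypothetical helper, not in the original notebook: plot the accuracy curves from the Keras History object
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='train accuracy')
if 'val_accuracy' in history.history:
    plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()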
# Frequency term: huffington post
EPOCHS = 10

history = model.fit(ds_train,
                    validation_data=ds_test,
                    epochs=EPOCHS)  # batch_size is omitted because ds_train/ds_test are already batched

Using the frequency term matrix [bag-of-words] for the Huffington Post dataset, I could reach 0.9 accuracy in 10 epochs using the which_one = 2 model selection; a quick evaluation check is sketched below.
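A short, hedged sanity check of that number (my addition, assuming the bag-of-words model and the `ds_test` dataset built above are still in memory):

# Hypothetical evaluation call, not in the original notebook
test_loss, test_acc = model.evaluate(ds_test, verbose=0)
print('Huffington Post test accuracy (bag-of-words model):', round(float(test_acc), 3))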
# Embedding layer: huffington post
EPOCHS = 10
STEPS_PER_EPOCH = 20

# validation_data did not work AND the train accuracy differed between using (X_train, Y_train) and ds_train
history = model.fit(X_train, Y_train,
                    validation_data=(X_test, Y_test),
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH)

Using the Embedding layer for the Huffington Post dataset I reached roughly 0.2 accuracy after 10 epochs, which is not a reliable accuracy trend like the one seen on the BBC dataset.

In the model test below, I only evaluate the Huffington Post dataset because it had difficulty training with the Embedding layer architecture.

# ---------------------------
# Obtain a sentence
# ---------------------------
i = np.random.permutation(np.arange(len(X_train)))[0]
print('i:', i)

which_way = 'input_seq'  # OR 'input_sen'

if which_way == 'input_seq':
    seq_example = X_train[i]

    print('sentence:', X_longer[i])

    # Print sentence: decode the sequence using the Tokenizer class
    # https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
    # sen_example = tokenizer0.sequences_to_texts([X_train[i]])
    # print('sen_example: ', sen_example)

else:
    # Sentence example: sen_example should be a raw sentence string
    # ---------------------------
    # Transform the sentence into a sequence
    # ---------------------------
    # Prepare seed_text
    seq_example = tokenizer0.texts_to_sequences([sen_example])[0]

    # Pad the sequence
    seq_example = pad_sequences([seq_example], maxlen=MAXLEN, padding='post', truncating='post')

seq_example = tf.constant(seq_example, dtype=tf.float32)
seq_example = tf.reshape(seq_example, [len(seq_example), 1])
seq_example = tf.expand_dims(seq_example, axis=0)

# ---------------------------
# Predict with pre-trained model
# ---------------------------
probabilities = model.predict(seq_example, verbose=0)
predicted_index = np.argmax(probabilities)
print('predicted_index: ', predicted_index)

# ---------------------------
# Print result
# ---------------------------
y_assignment_reverse = dict((v, k) for k, v in y_assignment.items())
print('y_assignment_reverse: ', y_assignment_reverse)
print('predicted_label:', y_assignment_reverse[predicted_index])
print('true_label:', y_assignment_reverse[Y_train[i]])

Below we can see that the model predicts extremely well with the frequency term matrix; however, we can also see that the sentences do not always literally correspond to the class label topic. In example 1 the sentence words are related to the topic of travel, but in example 2 the sentence words could correspond to either wellness or business.

Example 1. A Huffington Post dataset sentence that corresponds to the class label travel.
Example 2. A Huffington Post dataset sentence that sort of corresponds to the class label wellness; the sentence could also relate to the topic of business.

It is likely that the Embedding layer failed to capture differences between classes because many of the sentences contained keywords that are shared across different class labels; a quick check of this idea is sketched below.
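To make that claim concrete, here is a small illustrative check (my addition, not in the original notebook) of how much the most frequent vocabulary overlaps between classes, assuming the cleaned lists `X` and `Y` from the pre-processing loop above:

# Hypothetical sketch: top-50 word overlap between each pair of classes
from collections import Counter

top_words_per_class = {}
for label in set(Y):
    words = ' '.join([x for x, y in zip(X, Y) if y == label]).split()
    top_words_per_class[label] = set(w for w, _ in Counter(words).most_common(50))

labels = sorted(top_words_per_class.keys())
for a in labels:
    for b in labels:
        if a < b:
            shared = len(top_words_per_class[a] & top_words_per_class[b])
            print(f'{a} vs {b}: {shared}/50 shared top words')

A high shared count between two classes would suggest their sentences occupy overlapping regions of the embedding space.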

I evaluated the word embedding vectors for both datasets. The BBC sentence embedding results per class are shown below because they produced the most contrast between classes.

Get the Embedding weights

def normalize_nestedarrs_by_max(arr):
    # Normalize arr = [[1, 2, 3], [4, -5, 6]] so values range from -1 to 1
    rows_of_dist = len(arr)

    # Find the maximum absolute value
    max_val = np.max([np.max(np.abs(arr[i])) for i in range(rows_of_dist)])

    # Normalize
    arr_nor = [arr[i]/max_val for i in range(rows_of_dist)]

    return arr_nor

# Get the embedding layer from the model (i.e. the first layer after the Input)
embedding_layer = model.layers[1]  # the Embedding layer is layer 1

# Get the weights of the embedding layer
embedding_weights = embedding_layer.get_weights()[0]
print(embedding_weights.shape)  # (vocab_size, embedding_dim)

# Get the index-word dictionary
reverse_word_index = tokenizer0.index_word  # is dict[index] = word
reverse_word_index

We can see the assignment of words per token.
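For example, a single word's embedding row can be looked up through the tokenizer. This is only an illustrative sketch (my addition); the word 'election' is a hypothetical example, and `tokenizer0` and `embedding_weights` come from the cells above:

# Hypothetical lookup, assuming the word exists in the fitted vocabulary
word = 'election'
token_index = tokenizer0.word_index.get(word)
if token_index is not None and token_index < embedding_weights.shape[0]:
    print('token index for', word, ':', token_index)
    print('embedding vector shape:', embedding_weights[token_index].shape)  # (EMBEDDING_DIM,)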
# Get the max embedding_weights value
max_emb = np.max(abs(embedding_weights))
print('max_emb: ', max_emb)

# Create a [sentence embedding] from word embeddings

# Loop over each sequence
num_of_seq, num_of_words = X_train.shape
avg_seqemb_per_sentence = []
for i in range(num_of_seq):

    seq_emb = np.zeros((len(embedding_weights[0]),))
    sequence = X_train[i]
    sequence_nozeros = [tok for tok in sequence if tok > 0]

    # Per sequence, loop over each word and add up all the word embeddings to get a sentence embedding
    for word_num in sequence_nozeros:
        # Get the embedding weights associated with the current token, scaled so values range from -1 to 1
        word_embedding = embedding_weights[word_num]/max_emb

        # Without evaluating the word embedding direction
        seq_emb = seq_emb + word_embedding

    # Sentence embedding: the average vector represents the entire [sequence] in vector space
    avg_seqemb_per_sentence.append(seq_emb/len(sequence_nozeros))

# Find the maximum absolute value across the sentence embeddings
max_val = np.max([np.max(abs(avg_seqemb_per_sentence[i])) for i in range(num_of_seq)])
print('max_val: ', max_val)

# Normalize the average sentence embeddings so values range from -1 to 1
seq_emb_per_sentence_nor = [avg_seqemb_per_sentence[i]/max_val for i in range(num_of_seq)]

# Sum the normalized [sentence embeddings] per class
class_count = Counter(Y_train)
class_num = sorted(list(class_count.keys()))
print('class_num: ', class_num)

# Initialize the [sentence embedding class dictionary]
seq_emb_avg_dict = {}
for i in class_num:
    seq_emb_avg_dict[i] = np.zeros((len(embedding_weights[0]),))

# Loop over each sequence
for i in range(num_of_seq):

    class_number = Y_train[i]

    # Sum up each sentence embedding per class
    seq_emb_avg_dict[class_number] = seq_emb_avg_dict[class_number] + seq_emb_per_sentence_nor[i]

# Divide by the class count to find the [average sentence embedding per class]
for i in class_num:
    seq_emb_avg_dict[i] = seq_emb_avg_dict[i]/class_count[i]

# Verify that the embeddings are within -1 to 1
for cn in class_num:
    max_val = np.max(seq_emb_avg_dict[cn])
    print(f'max_val class {cn}: ', max_val)

    min_val = np.min(seq_emb_avg_dict[cn])
    print(f'min_val class {cn}: ', min_val)

Indeed, the average sentence embedding vectors per class are each normalized to the range -1 to 1.
# Measure the distance between average class embeddings
dist = []
for i in class_num:
    temp = []
    for j in class_num:
        temp.append(float(tf.norm(tf.subtract(seq_emb_avg_dict[i], seq_emb_avg_dict[j]),
                                  ord='euclidean', axis=None, keepdims=None, name=None)))  # L2 (Euclidean) distance
    dist.append(temp)

dist_nor = normalize_nestedarrs_by_max(dist)

import seaborn as sns
sns.heatmap(dist_nor, annot=True, linewidths=.5)

L2 distance between average sentence embeddings per class. The x and y axes are the classes {‘entertainment’: 0, ‘politics’: 1, ‘sport’: 2, ‘business’: 3, ‘tech’: 4}.

One can see that the entertainment, politics, and tech categories share similar words because their average sentence embedding vectors point in a similar direction, whereas the sport and business classes appear to have distinguishing words that cause their average sentence embedding vectors to point in different directions.

I could accurately train a model using an Embedding layer on the bbc-new-dataset because the words in each sentence often literally corresponded to the class topic label; therefore, the embedding space was more clustered per class. However, the news-category-dataset was less organized in the sense that:

  1. the sentences did not always logically correspond to the class label topic
  2. the sentences were shorter (around 100 words, compared to 200–1000)
  3. the labels had 10 classes instead of 5; the more classes there are, the more difficult it is to classify sentences.

Thus, for these reasons the news-category-dataset could not be accurately classified using an Embedding layer; the embedding space per class was too mixed. Using a simple term-frequency X matrix allowed for accurate classification with a deep neural network. Transforming the sentences into a term-frequency matrix is simpler than finding outlier embeddings per class. However, I think another viable algorithmic solution would be to: calculate word embeddings, build sentence embeddings from only the mutually similar word embeddings while identifying the words behind the non-similar (outlier) embeddings, calculate average sentence embeddings per class, and compile a list of the outlier words per class. A rough sketch of that idea follows.
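The following is only a minimal sketch of that proposed idea, under assumed inputs: `embedding_weights` from a trained Embedding layer, a tokenized `sequence`, the `reverse_word_index` dictionary from above, cosine similarity as the notion of “similar”, and an arbitrary, hypothetical threshold of 0.3.

# Hypothetical sketch of the proposed outlier-word filtering (not from the original notebook).
# For each sequence: keep word embeddings similar to the sentence centroid, record the rest as outlier words.
import numpy as np

def split_similar_and_outlier_words(sequence, embedding_weights, reverse_word_index, threshold=0.3):
    tokens = [t for t in sequence if t > 0]
    if len(tokens) == 0:
        return np.zeros(embedding_weights.shape[1]), []
    vecs = np.array([embedding_weights[t] for t in tokens])

    # Cosine similarity of each word embedding to the sentence centroid
    centroid = vecs.mean(axis=0)
    sims = vecs @ centroid / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid) + 1e-9)

    similar_vecs = vecs[sims >= threshold]
    outlier_words = [reverse_word_index.get(t, '<OOV>') for t, s in zip(tokens, sims) if s < threshold]

    # Sentence embedding built only from the mutually similar word embeddings
    sent_emb = similar_vecs.mean(axis=0) if len(similar_vecs) else centroid
    return sent_emb, outlier_words

Averaging sent_emb per class and compiling outlier_words per class would then give the per-class embedding plus the keyword list described above.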

Happy Practicing! 👋


  1. Huffington Post dataset with 10 categories: https://www.kaggle.com/datasets/rmisra/news-category-dataset
  2. Original data in Coursera Natural Language Processing Tensorflow (DeepLearning_AI_TensorFlow_Developer_Specialization). BBC news dataset Kaggle competition: https://www.kaggle.com/competitions/learn-ai-bbc
  3. Kaggle notebook: https://www.kaggle.com/jamilahfoucher/standard-text-classification


