A Simple Guide to Sentiment Analysis with TensorFlow

by Sri Ram Prakhya


Introduction

In this article, we’ll walk through the process of building a sentiment analysis model using TensorFlow. This model can help determine if a movie review is positive or negative. We’ll use the IMDb movie review dataset and cover each step, from data preparation to running predictions on sample reviews of both normal and Oscar-winning films.

Before training our model, we need to prepare the data. The IMDb dataset comes preprocessed: each review is already encoded as a sequence of integers, where each integer stands for one word.
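To see what that encoding looks like, here’s a quick peek at the raw data (a minimal sketch; the exact integers shown in the comments are from a typical run and may differ across TensorFlow versions):

import tensorflow as tf

# Each review is a list of word indices; each label is 0 (negative) or 1 (positive)
(X_train, y_train), _ = tf.keras.datasets.imdb.load_data(num_words=10000)
print(X_train[0][:10])  # e.g. [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
print(y_train[0])       # e.g. 1 (positive)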

  1. Loading the Dataset: We start by loading the IMDb dataset, which is already divided into training and testing sets.
  2. Tokenization: We convert text into numbers (tokens) because computers understand numbers better than words.
  3. Padding: We make sure that all reviews are the same length by adding zeros (padding) to shorter reviews.

Here’s the code for this step:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the IMDb dataset
imdb = tf.keras.datasets.imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

# Convert the data back to text (just to demonstrate tokenization and padding; usually not needed)
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# Indices are offset by 3 because 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown tokens
raw_reviews_train = [" ".join([reverse_word_index.get(i - 3, "?") for i in review]) for review in X_train]
raw_reviews_test = [" ".join([reverse_word_index.get(i - 3, "?") for i in review]) for review in X_test]

# Step 1: Tokenization (already handled during dataset loading, but let's assume we're starting from raw text)
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(raw_reviews_train)  # Fit only on the training data to avoid leakage
sequences_train = tokenizer.texts_to_sequences(raw_reviews_train)
sequences_test = tokenizer.texts_to_sequences(raw_reviews_test)

# Save the tokenizer to JSON so the same word-to-token mapping can be reused at prediction time
with open('tokenizer.json', 'w') as f:
    f.write(tokenizer.to_json())

# Step 2: Padding
max_length = 100
padded_sequences_train = pad_sequences(sequences_train, maxlen=max_length, padding='post')
padded_sequences_test = pad_sequences(sequences_test, maxlen=max_length, padding='post')

# No train/test split needed here: the IMDb dataset already ships pre-split into training and testing sets

# Now, padded_sequences_train, padded_sequences_test, y_train, and y_test are ready for training and testing the model.

  • Tokenizer: A Tokenizer converts words into numbers. The fit_on_texts() method learns all the unique words in our training data and assigns each word a number (token).
  • Saving the Tokenizer: We save the tokenizer as a JSON file so we can use it later to tokenize new reviews.
  • Padding: All reviews are padded to the same length to ensure the model can handle them consistently.
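If the word-to-token mapping still feels abstract, here is a tiny self-contained illustration on made-up sentences (the exact index numbers are simply what fit_on_texts happens to assign, most frequent words first):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

toy_tokenizer = Tokenizer(num_words=100)
toy_tokenizer.fit_on_texts(["the movie was great", "the movie was terrible"])
print(toy_tokenizer.word_index)
# {'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'terrible': 5}

toy_sequences = toy_tokenizer.texts_to_sequences(["the movie was great"])
print(pad_sequences(toy_sequences, maxlen=6, padding='post'))
# [[1 2 3 4 0 0]]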

Now that our data is ready, we’ll build a simple neural network for sentiment analysis. This model will learn to understand if a review is positive or negative.

from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=16, input_length=100),
    layers.GlobalAveragePooling1D(),
    layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

  • Embedding Layer: This layer helps the model understand the meaning of words by turning them into vectors of numbers.
  • GlobalAveragePooling1D Layer: This layer averages all the word vectors from the Embedding layer into a single fixed-size vector.
  • Dense Layer: This is where the model starts to learn patterns in the data.
  • Dropout Layer: This layer helps prevent the model from overfitting (learning too much from the training data and not generalizing well to new data).
  • Sigmoid Activation: The final layer uses a sigmoid activation function to output a value between 0 and 1, representing the probability of the review being positive.
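As a quick sanity check on the architecture, model.summary() prints each layer with its output shape and parameter count (in newer Keras versions, call it after the model has been built, for example after training). The embedding layer dominates: 10,000 words × 16 dimensions = 160,000 parameters.

model.summary()
# Embedding:              output (None, 100, 16), 160,000 params (10,000 x 16)
# GlobalAveragePooling1D: output (None, 16),            0 params
# Dense (relu):           output (None, 16),          272 params (16*16 + 16)
# Dropout:                output (None, 16),            0 params
# Dense (sigmoid):        output (None, 1),            17 params (16 + 1)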

We’ll train our model using the training data. We’ll also implement early stopping to prevent overfitting, which means stopping the training when the model stops improving.

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(padded_sequences_train, y_train, epochs=10, validation_data=(padded_sequences_test, y_test), callbacks=[early_stopping])

  • EarlyStopping: This technique monitors the validation loss (how well the model performs on unseen data) and stops the training if the model stops improving for a set number of epochs.
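Because model.fit() returns a History object (captured as history above), you can print the per-epoch losses afterwards to see where early stopping kicked in; the exact numbers will vary from run to run:

# history.history maps each metric name to its list of per-epoch values
for epoch, (loss, val_loss) in enumerate(zip(history.history['loss'],
                                             history.history['val_loss']), start=1):
    print(f"Epoch {epoch}: loss={loss:.4f}, val_loss={val_loss:.4f}")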

After training, we’ll evaluate the model’s accuracy on the test data to see how well it learned.

loss, accuracy = model.evaluate(padded_sequences_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

Finally, we’ll save the trained model so that it can be reused later for making predictions.

model.save('rush_model_v4.keras')

  • Saving the Model: We save the model to a file, allowing us to load and use it later without retraining.

Now, let’s use the trained model to predict the sentiment of reviews from both normal and Oscar-winning movies.

import json

# Load the saved model and tokenizer
model = tf.keras.models.load_model('rush_model_v4.keras')
with open('tokenizer.json') as f:
    data = json.load(f)
tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(json.dumps(data))

# Prepare reviews for normal and Oscar-winning movies
normal_movie_reviews = [
    "It was an average movie.",
    "Not bad, but not great either.",
    "The story was weak and the acting was poor."
]

oscar_movie_reviews = [
    "An absolute masterpiece!",
    "The acting was phenomenal, truly deserving of an Oscar.",
    "A brilliant film with a gripping storyline."
]

# Tokenize and pad the reviews
all_reviews = normal_movie_reviews + oscar_movie_reviews
sequences = tokenizer.texts_to_sequences(all_reviews)
padded = pad_sequences(sequences, maxlen=100, padding='post')

# Make predictions
predictions = model.predict(padded)
predicted_classes = (predictions > 0.5).astype("int32")

# Print results
print("Normal Movie Sentiments:")
for i, sentiment in enumerate(predicted_classes[:len(normal_movie_reviews)]):
print(f"Review: {normal_movie_reviews[i]} -> Sentiment: {'Positive' if sentiment == 1 else 'Negative'}")

print("\nOscar-Winning Movie Sentiments:")
for i, sentiment in enumerate(predicted_classes[len(normal_movie_reviews):]):
print(f"Review: {oscar_movie_reviews[i]} -> Sentiment: {'Positive' if sentiment == 1 else 'Negative'}")

  • Loading the Model and Tokenizer: We load the previously saved model and tokenizer to make predictions.
  • Tokenizing and Padding New Reviews: The new reviews are converted into sequences of numbers and padded to the same length as the training data.
  • Making Predictions: The model predicts whether each review is positive or negative.
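If you plan to reuse this pipeline, it helps to wrap the tokenize-pad-predict steps in one small helper. Here’s a minimal sketch (the name predict_sentiment is our own, not a library function; it assumes the imports from earlier in the article):

def predict_sentiment(texts, model, tokenizer, max_length=100):
    """Return a 'Positive'/'Negative' label for each raw review string."""
    sequences = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(sequences, maxlen=max_length, padding='post')
    probabilities = model.predict(padded).ravel()
    return ['Positive' if p > 0.5 else 'Negative' for p in probabilities]

# Usage:
print(predict_sentiment(["An absolute masterpiece!"], model, tokenizer))
# ['Positive']  (likely output for a clearly glowing review; depends on the trained model)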

Here are the results of our sentiment analysis on reviews from both normal and Oscar-winning movies:

Normal Movie Sentiments:

"It was an average movie." → Negative
"Not bad, but not great either." → Negative
"The story was weak and the acting was poor." → Negative

Oscar-Winning Movie Sentiments:

"An absolute masterpiece!" → Positive
"The acting was phenomenal, truly deserving of an Oscar." → Positive
"A brilliant film with a gripping storyline." → Positive

These results show how the model classifies the sentiment of different reviews: the glowing reviews typical of Oscar-winning films are predicted as positive, while the lukewarm and critical reviews of the normal films are predicted as negative.

And that’s it! You’ve just built a simple sentiment analysis model using TensorFlow. This model can now tell whether a movie review is positive or negative. The best part? You can use these same steps to analyze any text data you want!

Feel free to try it out with your own data, and remember, you can name your model anything you like; it doesn’t have to be “Rush.” In my case, Rush stands for “Rapid Understanding and Sentiment Handling,” and it’s part of a series of models I’m developing to explore advanced AI capabilities.

Happy coding!


