Find Text Similarity Using TensorFlow.js

by Kevin Hermawan | April 2024


Finding similar pieces of text is important for applications such as search engines, chatbots, and recommendation systems. It helps provide users with more relevant information. In this article, we’ll learn how to use TensorFlow.js and the Universal Sentence Encoder model to find the similarity between different texts.

TensorFlow.js is a JavaScript library that enables the training and deployment of machine learning models in the browser or on the server side using Node.js.

The Universal Sentence Encoder (Cer et al., 2018) is a model designed to encode text into 512-dimensional embeddings. These embeddings can be used in various natural language processing tasks, including sentiment classification and textual similarity analysis.
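To make these embeddings concrete, here is a minimal sketch (runnable in Node.js once the packages from the next section are installed) that embeds two sentences and inspects the result:

import "@tensorflow/tfjs-node"; // registers the Node.js backend
import * as use from "@tensorflow-models/universal-sentence-encoder";

async function demo() {
  const model = await use.load();
  const embeddings = await model.embed(["Hello world", "Greetings, planet"]);
  console.log(embeddings.shape); // [2, 512]: one 512-dimensional row per sentence
}

demo();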

First things first, we need to install the necessary TensorFlow.js packages. The installation process varies depending on your environment:

Node.js

For server-side applications using Node.js:

npm install @tensorflow/tfjs-node @tensorflow-models/universal-sentence-encoder

Web Browser

For use directly in web browsers:

npm install @tensorflow/tfjs @tensorflow-models/universal-sentence-encoder

GPU Acceleration

For GPU-accelerated performance in Node.js (requires a CUDA-enabled GPU):

npm install @tensorflow/tfjs-node-gpu @tensorflow-models/universal-sentence-encoder

To find text similarity, we need to set up the environment, load the model, and define the calculation functions. Here's how to do it step by step:

Import Libraries

First, we need to configure TensorFlow.js based on our environment. We import the required libraries using the following code:

// Adjust based on environment: "@tensorflow/tfjs-node" for Node.js,
// "@tensorflow/tfjs" for the browser, "@tensorflow/tfjs-node-gpu" for CUDA
import * as tf from "@tensorflow/tfjs-node";
import * as use from "@tensorflow-models/universal-sentence-encoder";

Set Up the Similarity Calculation

Next, we define the main function calculateSimilarity that will handle the text similarity calculation.

async function calculateSimilarity(inputText: string, compareTexts: string[]) {
  // Load the model and generate embeddings for all texts in a single batch
  const model = await use.load();
  const embeddings = await model.embed([inputText, ...compareTexts]);

  // The first row of the embeddings tensor belongs to the input text
  const baseEmbedding = embeddings.slice([0, 0], [1]);

  const results = [];

  // Compare each text's embedding with the input text's embedding
  for (let i = 0; i < compareTexts.length; i++) {
    const compareEmbedding = embeddings.slice([i + 1, 0], [1]);
    const similarity = cosineSimilarity(baseEmbedding, compareEmbedding);
    const similarityScore = similarity.dataSync()[0].toFixed(4);

    results.push({
      "Input Text": inputText,
      "Comparison Text": compareTexts[i],
      "Similarity Score": similarityScore,
    });
  }

  // Sort results by similarity score in descending order
  return results.sort(
    (a, b) =>
      parseFloat(b["Similarity Score"]) - parseFloat(a["Similarity Score"])
  );
}
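TensorFlow.js does not automatically free tensor memory on most backends, so long-running applications should dispose of intermediate tensors. Below is a hypothetical variant of the per-comparison scoring step (the helper name scoreAgainstInput is ours, not part of the code above) that wraps the work in tf.tidy(), which disposes every tensor created inside its callback:

// Hypothetical variant: tf.tidy() frees all intermediate tensors
// created inside the callback once it returns.
function scoreAgainstInput(embeddings: tf.Tensor2D, index: number): number {
  return tf.tidy(() => {
    const baseEmbedding = embeddings.slice([0, 0], [1]);
    const compareEmbedding = embeddings.slice([index + 1, 0], [1]);
    const similarity = cosineSimilarity(baseEmbedding, compareEmbedding);
    return similarity.dataSync()[0];
  });
}

Note that the embeddings tensor itself is created outside the tidy scope, so it should still be released explicitly with embeddings.dispose() once all comparisons are done.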

Cosine Similarity Function

The cosineSimilarity function calculates the cosine similarity between two vectors (in this case, the text embeddings). Cosine similarity measures how similar two vectors are based on the cosine of the angle between them.

function cosineSimilarity(a: tf.Tensor, b: tf.Tensor) {
  const normalizedA = a.div(tf.norm(a, "euclidean"));
  const normalizedB = b.div(tf.norm(b, "euclidean"));

  return tf.sum(tf.mul(normalizedA, normalizedB));
}

Inside the function above, we first normalize the input vectors a and b using TensorFlow.js's div and norm methods. Normalization scales each vector to unit length, so their dot product equals the cosine of the angle between them.
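For reference, the same value can be computed without the explicit normalization step, following the textbook definition cos(a, b) = (a · b) / (‖a‖ ‖b‖). A sketch of this equivalent variant (the name cosineSimilarityAlt is ours):

// Equivalent formulation: divide the dot product by the product of the norms.
function cosineSimilarityAlt(a: tf.Tensor, b: tf.Tensor) {
  const dotProduct = tf.sum(tf.mul(a, b));
  return dotProduct.div(tf.norm(a, "euclidean").mul(tf.norm(b, "euclidean")));
}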

Test the Text Similarity Model

Let’s test the functionality using a specific example. We’ll compare the input text “Secure technology” against a set of different comparison texts.

const inputText = "Secure technology";

const compareTexts = [
  "Geometry's elegant shapes define the space around us.",
  "Socratic questioning uncovers truth beneath societal norms.",
  "Blockchain technology revolutionizes security in digital transactions.",
  "Calculus captures the essence of change through derivatives and integrals.",
  "Utilitarian ethics seek the greatest good for the greatest number.",
];

calculateSimilarity(inputText, compareTexts).then((results) => {
  console.table(results);
});

The function output presents the similarity scores as follows:

┌───┬───────────────────┬────────────────────────────────────────────────────────────────────────────┬──────────────────┐
│ │ Input Text │ Comparison Text │ Similarity Score │
├───┼───────────────────┼────────────────────────────────────────────────────────────────────────────┼──────────────────┤
│ 0 │ Secure technology │ Blockchain technology revolutionizes security in digital transactions. │ 0.5221 │
│ 1 │ Secure technology │ Socratic questioning uncovers truth beneath societal norms. │ 0.3258 │
│ 2 │ Secure technology │ Calculus captures the essence of change through derivatives and integrals. │ 0.2328 │
│ 3 │ Secure technology │ Utilitarian ethics seek the greatest good for the greatest number. │ 0.2156 │
│ 4 │ Secure technology │ Geometry's elegant shapes define the space around us. │ 0.1840 │
└───┴───────────────────┴────────────────────────────────────────────────────────────────────────────┴──────────────────┘

Based on the output, the text “Blockchain technology revolutionizes security in digital transactions.” has the highest similarity score of 0.5221 with the input text “Secure technology”, while the text “Geometry’s elegant shapes define the space around us.” has the lowest similarity score of 0.1840.
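One practical note: calculateSimilarity calls use.load() on every invocation, and loading the model weights is by far the slowest step. If you score texts repeatedly, a reasonable refactor (sketched below with a hypothetical embedTexts helper) is to load the model once and reuse it:

// Hypothetical optimization: cache the loaded model in a module-level promise.
const modelPromise = use.load();

async function embedTexts(texts: string[]) {
  const model = await modelPromise; // resolves instantly after the first call
  return model.embed(texts);
}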

In this article, we learned how to utilize TensorFlow.js and the Universal Sentence Encoder model to effectively calculate text similarity. The step-by-step guide covered setting up the environment, importing required libraries, and defining core functions for similarity computation using cosine similarity on text embeddings. While powerful, it’s important to recognize the potential limitations of machine learning models, especially in complex scenarios.


