- The normalization function was slightly biased: it weighted text search higher, giving it more significance in the final results.
Distance-based algorithms such as k-Nearest Neighbors (k-NN) calculate distances between data points, whereas BM25 is based on the frequencies of keyword occurrences. The two therefore return scores on completely different scales, which can lead to biased results and inaccurate rankings. Our normalization procedure always produced a perfect score of 1 for at least one document in the lexical search result set, so in our situation the results were biased in favor of the lexical search.
To tackle this problem, let’s explore two commonly used functions: Min-Max normalization and Z-Score normalization. The Z-Score method scales the data to achieve a zero mean and unit variance, while Min-Max normalization rescales the data to fit within a specific range (typically [0, 1]).
The key idea is that by computing the parameters used in these normalization functions beforehand, over a set of similar queries, I can get a basic picture of how scores are distributed for each search type, and then apply those parameters during the normalization stage. The two functions’ formulas are:
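$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad \text{(Min-Max)}$$

$$z = \frac{x - \mu}{\sigma} \qquad \text{(Z-Score)}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum observed scores, and $\mu$ and $\sigma$ are their mean and standard deviation.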
Each function has advantages of its own, and the structure of your index can help you decide between them. If your documents are more similar to one another, and the top-k results of a typical query are very similar and clustered together within the index, as seen in the graph below, Min-Max may be the better option.
However, Z-Score is better suited if the results are more evenly distributed and show some characteristics of a normal distribution, as shown in the example below.
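If you’re not sure which case describes your index, one quick check is to plot a histogram of the top-k scores returned for a few representative queries. A minimal sketch, assuming matplotlib and a `scores` list collected the same way as in the code further below:

import matplotlib.pyplot as plt

# Plot the distribution of one query's top-k scores to see whether they are
# tightly clustered (favoring Min-Max) or roughly bell-shaped (favoring Z-Score).
plt.hist(scores, bins=50)
plt.xlabel("score")
plt.ylabel("number of documents")
plt.title("Top-k score distribution")
plt.show()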
Both approaches require certain parameters to be determined beforehand: the mean, standard deviation, minimum score, and maximum score. Because lexical and vector results have different scoring systems, we must determine these values separately for each search type. To do this, let’s run 1,000 random queries through both a lexical search and a vector (semantic) search. If you don’t have queries, you can use OpenSearch’s scroll API to extract text fields from different parts of your index and use them as queries. Set k to a large value, for example k=1000, so the collected scores cover more of each score range. Be careful not to set k too high, though: the very low scores deep in the result list can skew the Min-Max parameters. After collecting all of these scores, simply compute the necessary parameters.
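For example, here is a minimal sketch of pulling query texts with the scroll API; the field name `caption` matches the code below, but the batch size and scroll window are just assumptions to adapt:

# Sample query texts from the index via the scroll API (opensearch-py client).
response = client.search(
    index=INDEX_NAME,
    body={"query": {"match_all": {}}, "_source": ["caption"], "size": 500},
    scroll="2m"
)
queries = [hit["_source"]["caption"] for hit in response["hits"]["hits"]]
scroll_id = response["_scroll_id"]
while len(queries) < 1000:
    response = client.scroll(scroll_id=scroll_id, scroll="2m")
    hits = response["hits"]["hits"]
    if not hits:
        break
    queries.extend(hit["_source"]["caption"] for hit in hits)
    scroll_id = response["_scroll_id"]
client.clear_scroll(scroll_id=scroll_id)  # Release the scroll context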
# Assumed to be defined earlier: an opensearch-py `client`, a SentenceTransformer
# `model`, the `INDEX_NAME`, and the list of sample `queries`.
import numpy as np

# Lexical Search
text_scores = []
for query in queries:
    response = client.search(
        index=INDEX_NAME,
        body={
            "query": {
                "match": {
                    "caption": query
                }
            },
            "size": 1000
        }
    )
    scores = [hit['_score'] for hit in response['hits']['hits']]
    # Flatten all scores into one list so the statistics cover every score
    text_scores.extend(scores)

# Vector search
vector_scores = []
# Vectorize queries using SentenceTransformer
query_embeddings = model.encode(queries)
# Perform vector search
for query_embedding in query_embeddings:
    request_body = {
        "size": 1000,
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "caption_embedding",
                        "query_value": query_embedding.tolist(),
                        "space_type": "cosinesimil"
                    }
                }
            }
        }
    }
    response = client.search(
        index=INDEX_NAME,
        body=request_body
    )
    scores = [hit['_score'] for hit in response['hits']['hits']]
    vector_scores.extend(scores)

# Compute the normalization parameters for each search type separately
vector_score_mean = np.mean(vector_scores)        # Mean
vector_score_std = np.std(vector_scores, ddof=1)  # Sample standard deviation
vector_score_min = np.min(vector_scores)          # Minimum score
vector_score_max = np.max(vector_scores)          # Maximum score
text_score_mean = np.mean(text_scores)            # Mean
text_score_std = np.std(text_scores, ddof=1)      # Sample standard deviation
text_score_min = np.min(text_scores)              # Minimum score
text_score_max = np.max(text_scores)              # Maximum score
The process is shown in the diagram below:
Set aside the parameters you’ve extracted for the lexical and vector results. This needs to be done once, separately for each index. Finally, in the normalization step, we’ll use these parameters in the following way:
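As a sketch, under the assumption that we Min-Max-normalize both result sets (the Z-Score variant is analogous), the application step might look like this; the helper names, the clipping to [0, 1], and the `text_hits`/`vector_hits` per-query result lists are my own illustrative choices:

def min_max_normalize(score, min_score, max_score):
    # Rescale using the precomputed corpus-level parameters; clip because a
    # single query's scores can fall outside the precomputed range.
    normalized = (score - min_score) / (max_score - min_score)
    return max(0.0, min(1.0, normalized))

def z_score_normalize(score, mean, std):
    # Center on the precomputed mean and scale by the standard deviation.
    return (score - mean) / std

# Normalize each result set with its own search type's parameters before merging.
normalized_text = [
    min_max_normalize(hit['_score'], text_score_min, text_score_max)
    for hit in text_hits
]
normalized_vector = [
    min_max_normalize(hit['_score'], vector_score_min, vector_score_max)
    for hit in vector_hits
]

Because the parameters come from 1,000 queries rather than from the current result set alone, neither search type is forced to produce a perfect score of 1 on every query, which removes the bias described at the top of this section.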