Large Language Model Performance in Time Series Analysis

by Aparna Dhinakaran, May 2024


Image created by author using Dall-E 3

How do major LLMs stack up at detecting anomalies or movements in the data when given a large set of time series data within the context window?

While LLMs clearly excel in natural language processing tasks, their ability to analyze patterns in non-textual data, such as time series, remains less explored. As more teams rush to deploy LLM-powered solutions without thoroughly testing their capabilities, evaluating how well these models handle basic pattern analysis takes on elevated importance.

In this research, we set out to investigate the following question: given a large set of time series data within the context window, how well can LLMs detect anomalies or movements in the data? In other words, should you trust your money to a stock-picking OpenAI GPT-4 or Anthropic Claude 3 agent? To answer this question, we conducted a series of experiments comparing how well these LLMs detect anomalous time series patterns.

All code needed to reproduce these results can be found in this GitHub repository.

Figure 1: A rough sketch of the time series data (image by author)

We tasked GPT-4 and Claude 3 with analyzing changes in data points across time. The data represented specific metrics for different world cities over time and was formatted as JSON before being passed to the models. To simulate real-world conditions, we added random noise amounting to 20–30% of the data range. The LLMs were then asked to detect movements above a specified percentage threshold and to identify the city and date where each anomaly occurred. The data was included in this prompt template:

basic_template = '''You are an AI assistant for a data scientist. You have been given a time series dataset to analyze.
The dataset contains a series of measurements taken at regular intervals over a period of time.
There is one time series for each city in the dataset. Your task is to identify anomalies in the data.
The dataset is in the form of a JSON object, with the date as the key and the measurement as the value.

The dataset is as follows:
{timeseries_data}

Please use the following directions to analyze the data:
{directions}

...

Figure 2: The basic prompt template used in our tests
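
To make this setup concrete, here is a rough sketch of how such a JSON payload might be built. The city names, noise level, and helper code are our own illustration, not the repository's exact generation code.

# Hypothetical sketch of the data setup described above, not the repository's exact code.
import json
import numpy as np

rng = np.random.default_rng(42)
dates = [f"2024-01-{day:02d}" for day in range(1, 31)]
cities = {"Paris": 100.0, "Tokyo": 80.0, "Lima": 60.0}

timeseries_data = {}
for city, base in cities.items():
    # Baseline metric plus random noise at roughly 20-30% of the data range.
    noise = rng.uniform(-0.25, 0.25, size=len(dates)) * base
    values = np.round(base + noise, 2)
    timeseries_data[city] = dict(zip(dates, values.tolist()))

# JSON object substituted into the {timeseries_data} slot of the prompt template.
prompt_payload = json.dumps(timeseries_data, indent=2)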

Analyzing patterns throughout the context window, detecting anomalies across a large set of time series simultaneously, synthesizing the results, and grouping them by date is no simple task for an LLM; we really wanted to push the limits of these models in this test. Additionally, the models were required to perform mathematical calculations on the time series, a task that language models generally struggle with.
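
The underlying calculation the models have to approximate is a day-over-day percentage change compared against a threshold. As a point of reference, here is what that check looks like in ordinary Python (an illustrative helper of our own, not part of the published repository):

# Illustrative ground-truth check (hypothetical helper, not the repository's code):
# flag any day-over-day movement whose magnitude exceeds a percentage threshold.
def find_movements(series: dict[str, float], threshold_pct: float) -> list[tuple[str, float]]:
    dates = sorted(series)  # ISO-formatted dates sort chronologically
    flagged = []
    for prev, curr in zip(dates, dates[1:]):
        change = (series[curr] - series[prev]) / abs(series[prev]) * 100
        if abs(change) >= threshold_pct:
            flagged.append((curr, round(change, 1)))
    return flagged

# Example: a 50% jump on 2024-01-03 is flagged at a 30% threshold.
print(find_movements({"2024-01-01": 100.0, "2024-01-02": 102.0, "2024-01-03": 153.0}, 30.0))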

We also evaluated the models' performance under different conditions, such as extending the duration of the anomaly, increasing the percentage size of the anomaly, and varying the number of anomaly events within the dataset. One caveat from our initial tests: synchronizing the anomalies (having them all occur on the same date) allowed the LLMs to perform better by recognizing the shared date rather than the data movement itself. When evaluating LLMs, careful test setup is extremely important to prevent the models from picking up on unintended patterns that could skew results.
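
One way to build that kind of setup is to parameterize the anomaly injection and draw each series' anomaly dates independently, roughly as in the sketch below (again our own illustration; inject_anomalies is a hypothetical helper, not the repository's code).

# Hypothetical sketch: vary anomaly duration, magnitude, and event count,
# drawing start positions independently per series so a shared date cannot leak the answer.
import numpy as np

def inject_anomalies(series: np.ndarray, n_events: int, duration: int,
                     spike_pct: float, rng: np.random.Generator) -> np.ndarray:
    out = series.copy()
    # Independent start positions for this series only.
    starts = rng.choice(len(series) - duration, size=n_events, replace=False)
    for start in starts:
        out[start:start + duration] *= 1 + spike_pct
    return out

rng = np.random.default_rng(0)
base = np.full(30, 100.0)
spiky = inject_anomalies(base, n_events=2, duration=3, spike_pct=0.5, rng=rng)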

Figure 3: Claude 3 significantly outperforms GPT-4 in time series analysis (image by author)

In testing, Claude 3 Opus significantly outperformed GPT-4 in detecting time series anomalies. Given the nature of the testing, it’s unlikely that this specific evaluation was included in the training set of Claude 3 — making its strong performance even more impressive.

Results with 50% Spike

Our first set of results is based on data where each anomaly was a 50% spike.


