Creating a Llama 2 Agent Empowered with Wikipedia Knowledge
by Gabriele Sgroi, September 2023


In this blog post, I will explain how to create a simple agent capable of basing its answers on content retrieved from Wikipedia to demonstrate the ability of LLMs to seek and use external information. Given a prompt by the user, the model will search for appropriate pages on Wikipedia and base its answers on their content. I made the full code available in this GitHub repo.

In this section, I will describe the steps needed to create a simple Llama 2 agent that answers questions based on information retrieved from Wikipedia. In particular, the agent will…

  • Create appropriate queries to search pages on Wikipedia that are relevant to the user’s question.
  • Retrieve, from the pages found on Wikipedia, the one with the content most relevant to the user’s question.
  • Extract, from the retrieved page, the most relevant passages to the user’s prompt.
  • Answer the user’s question based on the extracts from the page.

Notice that, more generally, the model could receive a prompt augmented with the full content of the most relevant page or with multiple extracts coming from different top pages ranked by relevance to the user’s prompt. While this could improve the quality of the response from the model, it will increase the required memory as it will inevitably lead to longer prompts. For simplicity, and in order to make a minimal example running on free-tier Google Colab GPUs, I have restricted the agent to use only a few extracts from the most relevant article.

Let us now delve into the various steps in more detail. The first step the agent needs to perform is to create a suitable search query to retrieve content from Wikipedia containing the information needed to answer the user's prompt. To do that, we will prompt a Llama 2 chat model, asking it to return keywords that represent the user's prompt. Before going into the specific prompt used, I will briefly recall the general prompt format for Llama 2 chat models.

The template that was used during the training procedure for Llama 2 chat models has the following structure:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

The {{ system_prompt }} specifies the behavior of the chat model in the subsequent turns and can be useful to adapt the model's responses to different tasks. The {{ user_message }} is the user's prompt the model needs to answer.

Going back to the problem of obtaining search queries for Wikipedia, our agent will use a Llama 2 model with the following prompt:

<s>[INST] <<SYS>>
You are an assistant returning text queries to search Wikipedia articles containing relevant information about the prompt. Write the queries and nothing else.
Example: [prompt] Tell me about the heatwave in Europe in summer 2023 [query] heatwave, weather, temperatures, europe, summer 2023.
<</SYS>>

[prompt] {prompt} [/INST] [query]

{prompt} will be replaced, before generation, by the user's input. The example provided as part of the system prompt aims to leverage the in-context learning capabilities of the model. In-context learning refers to the model's ability to solve new tasks based on a few demonstration examples provided as part of the prompt. In this way, the model learns that we expect it to provide keywords relevant to the given prompt, separated by commas, after the text [query]. The latter is used as a delimiter to distinguish the prompt from the answer in the example, and it is also useful for extracting the queries from the model output. It is already provided as part of the input so that the model only has to generate what comes after it.
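As an illustration, here is a minimal sketch of how this query-generation step could be implemented with the Hugging Face transformers library. The model id, the generation settings, and the generate_queries helper are illustrative assumptions of mine, not necessarily what the accompanying repository does:

import torch
import transformers

# Illustrative model id: the 7B chat version of Llama 2 used later in the post.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

QUERY_PROMPT = (
    "<s>[INST] <<SYS>>\n"
    "You are an assistant returning text queries to search Wikipedia articles "
    "containing relevant information about the prompt. Write the queries and nothing else.\n"
    "Example: [prompt] Tell me about the heatwave in Europe in summer 2023 "
    "[query] heatwave, weather, temperatures, europe, summer 2023.\n"
    "<</SYS>>\n\n"
    "[prompt] {prompt} [/INST] [query]"
)

def generate_queries(user_prompt: str) -> list[str]:
    """Ask the chat model for comma-separated Wikipedia search keywords."""
    inputs = tokenizer(QUERY_PROMPT.format(prompt=user_prompt), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Everything after the last "[query]" marker is the generated answer.
    return [q.strip() for q in text.split("[query]")[-1].split(",") if q.strip()]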

Once the queries are obtained from the model output, they are used to search Wikipedia and retrieve the metadata and text of the returned pages. In the code accompanying the post, I used the wikipedia package, which is a simple Python package that wraps the MediaWiki API, to search and retrieve the data from Wikipedia.
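As a sketch, the search and retrieval could look like the following; the search_wikipedia helper and the number of results per query are illustrative assumptions:

import wikipedia

def search_wikipedia(queries: list[str], results_per_query: int = 3) -> list[wikipedia.WikipediaPage]:
    """Search Wikipedia with each query and collect the unique pages found."""
    pages = {}
    for query in queries:
        for title in wikipedia.search(query, results=results_per_query):
            if title in pages:
                continue
            try:
                # auto_suggest=False avoids silently landing on a different page.
                pages[title] = wikipedia.page(title, auto_suggest=False)
            except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
                continue
    return list(pages.values())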

After extracting the text from the search results, the page most relevant to the original user prompt is selected. This re-aligns the retrieved information with the original user's prompt, potentially eliminating divergences introduced by the search queries generated by the model. To do so, both the user's prompt and the summaries of the pages from the search results are embedded and stored in a vector database, and the article whose embedding is closest to the user's prompt is retrieved. I used the all-MiniLM-L6-v2 sentence-transformers model as the embedding model and a FAISS vector database, with the integration provided by the langchain package.
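A minimal sketch of this page-selection step, operating on the pages returned by the previous snippet, could look like this; the most_relevant_page helper is an illustrative assumption, while the embedding model and the FAISS integration are the ones mentioned above:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# The sentence-transformers model mentioned above, wrapped through langchain.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

def most_relevant_page(user_prompt: str, pages: list[wikipedia.WikipediaPage]) -> wikipedia.WikipediaPage:
    """Embed the page summaries and return the page closest to the user's prompt."""
    summaries = [page.summary for page in pages]
    metadata = [{"title": page.title} for page in pages]
    db = FAISS.from_texts(summaries, embeddings, metadatas=metadata)
    best = db.similarity_search(user_prompt, k=1)[0]
    return next(page for page in pages if page.title == best.metadata["title"])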

Having found a relevant Wikipedia page, since adding its whole text to the prompt could require a lot of memory (or exceed the model's context length limit), our agent will find the most relevant extracts with which to augment the prompt. This is done by first splitting the page's text into chunks and then, as before, embedding them into a vector space and retrieving the ones closest to the prompt embedding. I again used the all-MiniLM-L6-v2 model to embed the chunks and a FAISS vector database to store and retrieve them.
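This step could be sketched as follows, reusing the embeddings object defined above; the chunk size, the overlap, and the relevant_extracts helper are illustrative assumptions:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def relevant_extracts(user_prompt: str, page_text: str, k: int = 3) -> list[str]:
    """Split the page into chunks and return the k chunks closest to the prompt."""
    # Chunk size and overlap are illustrative; tune them to the available memory.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(page_text)
    db = FAISS.from_texts(chunks, embeddings)
    return [doc.page_content for doc in db.similarity_search(user_prompt, k=k)]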

Now that we have obtained the relevant passages from the article, we can combine them with the user's prompt and feed them to the Llama 2 model to get an answer. The template used for the input is the following:

<s>[INST] <<SYS>>
You are a helpful and honest assistant. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
You have retrieved the following extracts from the Wikipedia page {title}:
{extracts}.
You are expected to give truthful answers based on the previous extracts. If it doesn't include relevant information for the request just say so and don't make up false information.
<</SYS>>

{prompt} [/INST]

Before generation, {prompt} is replaced by the user prompt, {title} by the title of the Wikipedia page, and {extracts} by the extracted passages. One could also provide a few examples to again leverage the in-context learning capabilities of the model, but that would make the prompt significantly longer, increasing the memory requirements.
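Putting the pieces together, a minimal sketch of the final generation step, reusing the tokenizer and model loaded in the first snippet, could look like the following; the answer helper, the extract formatting, and the generation settings are illustrative assumptions:

ANSWER_PROMPT = (
    "<s>[INST] <<SYS>>\n"
    "You are a helpful and honest assistant. Your answers should not include any harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content.\n"
    "You have retrieved the following extracts from the Wikipedia page {title}:\n"
    "{extracts}.\n"
    "You are expected to give truthful answers based on the previous extracts. If it doesn't "
    "include relevant information for the request just say so and don't make up false information.\n"
    "<</SYS>>\n\n"
    "{prompt} [/INST]"
)

def answer(user_prompt: str, title: str, extracts: list[str]) -> str:
    """Fill the answer template with the retrieved extracts and generate the response."""
    full_prompt = ANSWER_PROMPT.format(
        title=title,
        extracts="\n".join(f"Extract_{i}:{e}" for i, e in enumerate(extracts)),
        prompt=user_prompt,
    )
    inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens, i.e. what comes after the prompt.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)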

Let us now check whether the model is able to answer using up-to-date information that was not present during its training, by extracting it from Wikipedia. The following example has been obtained using the smallest chat version of Llama 2, with 7 billion parameters, running on Google Colab free-tier GPUs.

I have given the model the following prompt:

prompt: "Tell me about the earthquake in Marrakesh"

The model created the following query to search on Wikipedia:

"earthquake, Marrakesh, Morocco, recent earthquakes, seismic activity."

After extracting the passages from the most relevant page, the model returned the following answer:

The earthquake that occurred in Marrakesh on September 8, 2023, was a
devastating event that caused widespread destruction and loss of life in
the region. According to the information provided on Wikipedia, the
earthquake had a moment magnitude of 6.8-6.9 and a maximum Mercalli
intensity of VIII (Severe), making it the strongest instrumentally recorded
earthquake in Morocco and the deadliest since the 1960 Agadir earthquake.
The earthquake's epicentre was located approximately 45 miles (73.4 km)
southwest of Marrakesh, near the town of Ighil and the Oukaïmeden ski
resort in the Atlas Mountains. The earthquake occurred as a result of
shallow oblique-thrust faulting beneath the mountain range. According to
the World Health Organization, around 300,000 people from Marrakesh and the
surrounding areas were affected by the earthquake, including 100,000
children. The earthquake resulted in at least 2,946 deaths, with most
occurring outside of Marrakesh. It is important to note that the
information provided on Wikipedia is based on available data and may not be
exhaustive or up-to-date. If you have any specific questions

The agent also returns the metadata and the extracts of the page it has used for its answer, allowing the user to check its correctness and go into more detail by reading the original page. Here are the metadata and extracts for the previous answer:

RETRIEVED WIKIPEDIA PAGE: 
title: 2023 Marrakesh–Safi earthquake
url: https://en.wikipedia.org/wiki/2023_Marrakesh%E2%80%93Safi_earthquake
Retrieved extracts:
Extract_0:Earthquake aftermath in Marrakesh and Moulay Brahim Extract_1:.
Damage was widespread, and historic landmarks in Marrakesh were destroyed.
The earthquake was also felt in Spain, Portugal, and Algeria.It is the
strongest instrumentally recorded earthquake in Morocco, the deadliest in
the country since the 1960 Agadir earthquake and the second-deadliest
earthquake of 2023 after the Turkey–Syria earthquake. The World Health
Organization estimated about 300,000 people from Marrakesh and the
surrounding areas were affected, including 100,000 children Extract_2:On 8
September 2023 at 23:11 DST (22:11 UTC), an earthquake with a moment
magnitude of 6.8–6.9 and maximum Mercalli intensity of VIII (Severe) struck
Morocco's Marrakesh–Safi region. The earthquake's epicentre was located
73.4 km (45.6 mi) southwest of Marrakesh, near the town of Ighil and the
Oukaïmeden ski resort in the Atlas Mountains. It occurred as a result of
shallow oblique-thrust faulting beneath the mountain range. At least 2,946
deaths were reported, with most occurring outside Marrakesh

In this post, I explained how to create a simple agent that can respond to a user's prompt by searching Wikipedia and basing its answer on the retrieved page. Despite its simplicity, the agent is able to provide up-to-date and accurate answers even with the smallest Llama 2 7B model. The agent also returns the extracts from the page it used to generate its answer, allowing the user to check the correctness of the information provided by the model and to go into more detail by reading the full original page.

Wikipedia is an interesting playground to demonstrate the ability of an LLM agent to seek and use external information that was not present in the training data, but the same approach can be applied in other settings where external knowledge is needed. This is the case, for example, for applications that require up-to-date answers, for fields that need specific knowledge not present in the training data, or for the extraction of information from private documents. This approach also highlights the potential of collaboration between LLMs and humans: the model can quickly return a meaningful answer by searching a very large external knowledge base for relevant information, while the human user can check the validity of the model's answer and delve deeper into the matter by inspecting the original source.

A straightforward improvement of the agent described in this post can be obtained by combining multiple extracts from different pages in order to provide a larger amount of information to the model. In fact, for complex prompts it could be useful to extract information from more than one Wikipedia page. The resulting increase in memory requirements due to the longer contexts can be partially offset by quantization techniques such as GPTQ. The results could be further improved by giving the model the possibility to reason over the search results and the retrieved content before giving its final answer to the user, following, for example, the ReAct framework described in the paper ReAct: Synergizing Reasoning and Acting in Language Models. That way, it is possible to build a model that iteratively collects the most relevant passages from different pages, discarding the ones that are not needed and combining information from different topics.

Thank you for reading!


