Recently, I was searching for an open-source recipe dataset for a personal project, but I could not find any except for this GitHub repository containing the recipes displayed on publicdomainrecipes.com.
Unfortunately, I needed a dataset that was more exploitable, i.e. something closer to tabular data or to a NoSQL document. So I looked for a way to transform the raw data into something more suitable to my needs, without spending hours, days, and weeks doing it manually.
Let me show you how I used the power of Large Language Models to automate the process of converting the raw text into structured documents.
Dataset
The original dataset is a collection of markdown files, each file representing a recipe.
As you can see, this is not completely unstructured: there is nice tabular metadata at the top of each file, followed by four distinct sections:
- An introduction
- The list of ingredients
- The directions
- Some tips
Based on this observation, Sebastian Bahr developed a parser to transform the markdown files into JSON, available here.
The output of the parser is already more exploitable; indeed, Sebastian used it to build a recipe recommender chatbot. However, there are still some drawbacks: the ingredients and directions keys contain raw text that could be better structured.
As-is, some useful information is hidden, for example the quantities of the ingredients and the preparation or cooking time of each step.
Code
In the remainder of this article, I’ll show the steps that I undertook to get to JSON documents that look like the one below.
{
  "name": "Crêpes",
  "serving_size": 4,
  "ingredients": [
    {
      "id": 1,
      "name": "white flour",
      "quantity": 300.0,
      "unit": "g"
    },
    {
      "id": 2,
      "name": "eggs",
      "quantity": 3.0,
      "unit": "unit"
    },
    {
      "id": 3,
      "name": "milk",
      "quantity": 60.0,
      "unit": "cl"
    },
    {
      "id": 4,
      "name": "beer",
      "quantity": 20.0,
      "unit": "cl"
    },
    {
      "id": 5,
      "name": "butter",
      "quantity": 30.0,
      "unit": "g"
    }
  ],
  "steps": [
    {
      "number": 1,
      "description": "Mix flour, eggs, and melted butter in a bowl.",
      "preparation_time": null,
      "cooking_time": null,
      "used_ingredients": [1, 2, 5]
    },
    {
      "number": 2,
      "description": "Slowly add milk and beer until the dough becomes fluid enough.",
      "preparation_time": 5,
      "cooking_time": null,
      "used_ingredients": [3, 4]
    },
    {
      "number": 3,
      "description": "Let the dough rest for one hour.",
      "preparation_time": 60,
      "cooking_time": null,
      "used_ingredients": []
    },
    {
      "number": 4,
      "description": "Cook the crêpe in a flat pan, one ladle at a time.",
      "preparation_time": 10,
      "cooking_time": null,
      "used_ingredients": []
    }
  ]
}
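One useful property of this target format is that it can be checked mechanically. As a minimal sketch, here is a hypothetical helper (check_ingredient_refs is not part of the original code) that verifies, on a trimmed version of the document above, that every id in used_ingredients refers to a declared ingredient:

```python
# Trimmed version of the Crêpes document above.
recipe = {
    "ingredients": [
        {"id": 1, "name": "white flour"},
        {"id": 2, "name": "eggs"},
        {"id": 5, "name": "butter"},
    ],
    "steps": [
        {"number": 1, "used_ingredients": [1, 2, 5]},
        {"number": 3, "used_ingredients": []},
    ],
}

def check_ingredient_refs(recipe: dict) -> bool:
    """True if every id in used_ingredients refers to a declared ingredient."""
    declared = {ing["id"] for ing in recipe["ingredients"]}
    return all(
        set(step["used_ingredients"]) <= declared
        for step in recipe["steps"]
    )
```

A check like this is a cheap way to catch LLM outputs that reference ingredient ids that were never declared.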
The code to reproduce the tutorial is on GitHub here.
I relied on two powerful libraries: langchain for communicating with LLM providers, and pydantic for formatting the output of the LLMs.
First, I defined the two main components of a recipe with the Ingredient and Step classes.
In each class, I defined the relevant attributes and provided a description of the field along with examples. Those are then fed to the LLM by langchain, leading to better results.
"""`schemas.py`"""from pydantic import BaseModel, Field, field_validator
class Ingredient(BaseModel):
"""Ingredient schema"""
id: int = Field(
description="Randomly generated unique identifier of the ingredient",
examples=[1, 2, 3, 4, 5, 6],
)
name: str = Field(
description="The name of the ingredient",
examples=["flour", "sugar", "salt"]
)
quantity: float | None = Field(
None,
description="The quantity of the ingredient",
examples=[200, 4, 0.5, 1, 1, 1],
)
unit: str | None = Field(
None,
description="The unit in which the quantity is specified",
examples=["ml", "unit", "l", "unit", "teaspoon", "tablespoon"],
)
@field_validator("quantity", mode="before")
def parse_quantity(cls, value: float | int | str | None):
"""Converts the quantity to a float if it is not already one"""
if isinstance(value, str):
try:
value = float(value)
except ValueError:
try:
value = eval(value)
except Exception as e:
print(e)
pass
return value
class Step(BaseModel):
number: int | None = Field(
None,
description="The position of the step in the recipe",
examples=[1, 2, 3, 4, 5, 6],
)
description: str = Field(
description="The action that needs to be performed during that step",
examples=[
"Preheat the oven to 180°C",
"Mix the flour and sugar in a bowl",
"Add the eggs and mix well",
"Pour the batter into a greased cake tin",
"Bake for 30 minutes",
"Let the cake cool down before serving",
],
)
preparation_time: int | None = Field(
None,
description="The preparation time mentioned in the step description if any.",
examples=[5, 10, 15, 20, 25, 30],
)
cooking_time: int | None = Field(
None,
description="The cooking time mentioned in the step description if any.",
examples=[5, 10, 15, 20, 25, 30],
)
used_ingredients: list[int] = Field(
[],
description="The list of ingredient ids used in the step",
examples=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]],
)
class Recipe(BaseModel):
"""Recipe schema"""
name: str = Field(
description="The name of the recipe",
examples=[
"Chocolate Cake",
"Apple Pie",
"Pasta Carbonara",
"Pumpkin Soup",
"Chili con Carne",
],
)
serving_size: int | None = Field(
None,
description="The number of servings the recipe makes",
examples=[1, 2, 4, 6, 8, 10],
)
ingredients: list[Ingredient] = []
steps: list[Step] = []
total_preparation_time: int | None = Field(
None,
description="The total preparation time for the recipe",
examples=[5, 10, 15, 20, 25, 30],
)
total_cooking_time: int | None = Field(
None,
description="The total cooking time for the recipe",
examples=[5, 10, 15, 20, 25, 30],
)
comments: list[str] = []
Technical Details
- It is important not to have a model that is too strict here; otherwise, the pydantic validation of the JSON outputted by the LLM will fail. A good way to give some flexibility is to provide default values such as None or empty lists [], depending on the targeted output type.
- Note the field_validator on the quantity attribute of Ingredient; it is there to help the engine parse quantities. It was not there initially, but through a few trials I found out that the LLM often provided quantities as strings such as 1/3 or 1/2.
- The used_ingredients field formally links the ingredients to the relevant steps of the recipe.
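To see concretely what the quantity validator buys us, here is the same parsing idea as a standalone function, a sketch using fractions.Fraction rather than the class-bound validator shown above:

```python
from fractions import Fraction

def parse_quantity(value):
    """Coerce LLM string outputs like "1/2" or "300" to a float.

    Non-string values pass through unchanged; unparseable strings become None.
    """
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            try:
                # Handles fraction strings such as "1/3" or "1/2"
                return float(Fraction(value))
            except (ValueError, ZeroDivisionError):
                return None
    return value
```

Returning None for unparseable strings plays well with the lenient schema: quantity is declared as float | None, so validation still succeeds when the LLM outputs something like "a pinch".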
With the output model defined, the rest of the process is pretty smooth.
In a prompt.py file, I defined a create_prompt function to easily generate prompts. A "new" prompt is generated for every recipe: all prompts share the same base, and the recipe itself is passed as a variable to the base prompt to create a new one.
""" `prompt.py`The import statements and the create_prompt function have not been included
in this snippet.
"""
# Note : Extra spaces have been included here for readability.
DEFAULT_BASE_PROMPT = """
What are the ingredients and their associated quantities
as well as the steps to make the recipe described
by the following {ingredients} and {steps} provided as raw text?
In particular, please provide the following information:
- The name of the recipe
- The serving size
- The ingredients and their associated quantities
- The steps to make the recipe and in particular, the duration of each step
- The total duration of the recipe broken
down into preparation, cooking and waiting time.
The totals must be consistent with the sum of the durations of the steps.
- Any additional comments
{format_instructions}
Make sure to provide a valid and well-formatted JSON.
"""
The communication logic with the LLM is defined in the run function of the core.py file, which I won't show here for brevity.
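For context, here is a hypothetical sketch of what run could look like, with stubs standing in for the chat model and the output parser (the real core.py calls the actual LLM, which is why it is awaited):

```python
import asyncio
import json

class StubMessage:
    """Minimal message object mimicking a chat model reply."""
    def __init__(self, content: str):
        self.content = content

class StubLLM:
    """Stand-in for ChatMistralAI: returns a canned JSON reply."""
    async def ainvoke(self, prompt: str) -> StubMessage:
        return StubMessage(json.dumps({"name": "Crêpes", "serving_size": 4}))

class StubParser:
    """Stand-in for PydanticOutputParser: parses the model's JSON text."""
    def parse(self, text: str) -> dict:
        return json.loads(text)

async def run(llm, prompt, parser):
    # One LLM call per recipe, then parse the reply into a structured document.
    message = await llm.ainvoke(prompt)
    return parser.parse(message.content)

recipe = asyncio.run(run(StubLLM(), "prompt text", StubParser()))
```

With the real objects, the parser step is also where pydantic validation happens, so a malformed reply surfaces as an exception rather than silently producing a broken document.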
Finally, I combined all those components in my demo.ipynb notebook, whose content is shown below.
# demo.ipynb
import os
from pathlib import Path

import pandas as pd
from langchain.output_parsers import PydanticOutputParser
from langchain_mistralai.chat_models import ChatMistralAI
from dotenv import load_dotenv
from core import run
from prompt import DEFAULT_BASE_PROMPT, create_prompt
from schemas import Recipe
# End of first cell
# Setup environment
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY") #1
# End of second cell
# Load the data
path_to_data = Path(os.getcwd()) / "data" / "input" #2
df = pd.read_json(path_to_data / "recipes_v1.json")
df.head()
# End of third cell
# Preparing the components of the system
llm = ChatMistralAI(api_key=MISTRAL_API_KEY, model_name="open-mixtral-8x7b")
parser = PydanticOutputParser(pydantic_object=Recipe)
prompt = create_prompt(
DEFAULT_BASE_PROMPT,
parser,
df["ingredients"][0],
df["direction"][0]
)
#prompt
# End of fourth cell
# Combining the components
example = await run(llm, prompt, parser)
#example
# End of fifth cell
I used MistralAI as an LLM provider, with their open-mixtral-8x7b model, which is a very good open-source alternative to OpenAI. langchain lets you easily switch providers, provided you have created an account on the provider's platform.
If you are trying to reproduce the results:
- (#1) Make sure you have a MISTRAL_API_KEY in a .env file or in your OS environment.
- (#2) Pay attention to the path to the data. If you clone my repo, this won't be an issue.
Running the code on the entire dataset cost less than 2€.
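Scaling from one recipe to the whole dataset is mostly a matter of looping with bounded concurrency. A hypothetical sketch, where run_stub stands in for the real per-recipe LLM call:

```python
import asyncio

async def run_stub(recipe_text: str) -> dict:
    """Stand-in for the real per-recipe LLM call."""
    return {"name": recipe_text.splitlines()[0]}

async def process_all(raw_recipes: list[str], max_concurrency: int = 5) -> list:
    # Bound concurrency to stay under the provider's rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(text: str):
        async with sem:
            try:
                return await run_stub(text)
            except Exception:
                return None  # one failed parse should not abort the batch

    return await asyncio.gather(*(one(t) for t in raw_recipes))

results = asyncio.run(process_all(["Crêpes\n300g flour...", "Apple Pie\n..."]))
```

Swallowing per-recipe failures and returning None keeps one bad LLM reply from costing you the rest of the batch; the None entries can be retried afterwards.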
The structured dataset resulting from this code can be found here in my repository.
I am happy with the results, but I could still iterate on the prompt, my field descriptions, or the model used to improve them. I might try MistralAI's newer model, open-mixtral-8x22b, or try another LLM provider by simply changing two or three lines of code thanks to langchain.
When I am ready, I can get back to my original project. Stay tuned if you want to know what it was. In the meantime, let me know in the comments what you would do with the final dataset!
Large Language Models (LLMs) offer a powerful tool for structuring unstructured data. Their ability to understand and interpret the nuances of human language, automate laborious tasks, and adapt to evolving data makes them an invaluable resource in data analysis. By unlocking the hidden potential within unstructured textual data, businesses can transform it into valuable insights, driving better decision-making and business outcomes. The example provided, transforming raw recipe data into a structured format, is just one of the countless possibilities that LLMs offer.
As we continue to explore and develop these models, we can expect to see many more innovative applications in the future. The journey of harnessing the full potential of LLMs is just beginning, and the road ahead promises to be an exciting one.