My First Billion (of Rows) in DuckDB | by João Pedro | May, 2024


DuckDB is an Open Source project [OSD]; the author has no affiliation with DuckDB/DuckDB Labs. The data used is available under the ODbL License. This project is completely free to carry out: it does not require payment for any services, data access, or other expenses.

The problem consists of processing records from electronic ballot boxes' logs to obtain statistical metrics about the voting time of Brazilian voters. For example, the average time citizens take to vote, the time spent on fingerprint identification, and so on. These metrics should be aggregated at several levels of granularity: country, state, electoral zone, and electoral section.

In case you don't know, Brazil has a 100% electronic voting system, where all of its 100+ million voters vote on a single day and the election's results are computed and released in near real-time. Votes are collected by thousands of electronic ballot boxes spread all over the country.

Electronic ballot box. Image from the Brazilian Superior Electoral Court.

An electronic ballot box is a microcomputer designed specifically for elections, with the following characteristics: resistant, small, light, energy-autonomous, and equipped with security features [4]. Each can serve up to 500 voters, a number chosen to avoid long queues at voting locations.

The system is administered by the TSE (Superior Electoral Court), which shares data about the process on its open data portal [ODbL License]. The logs are text files with an exhaustive list of all events in the ballot box.

And that's where the challenge begins. As the logs register absolutely every single event, it's possible to calculate an enormous number of metrics from them; they are a rich fountain of information. But what makes them rich also makes them extremely hard to handle, as the totality of the country's records reaches 450 GB of TSV files with more than 4 billion lines.

Besides the volume, another thing that makes this work a good benchmark, in my opinion, is that the transformations needed to reach our final goal span all levels of complexity, from simple operations (WHERE, GROUP BY, ORDER BY) to complex SQL operations (like window functions).

With this relatively high volume of data, one might be tempted to reach for traditional Big Data tools, like Apache Spark, and process it on a cluster with many workers, several gigabytes of RAM, and a dozen CPUs.

DuckDB was created to challenge this status quo.

As its creator argues (in this video), it is a database designed to empower single machines with the ability to process large volumes of data.

I.e., instead of reaching for complex industry solutions, like PySpark, or cloud-based solutions, like Google BigQuery, one uses a local in-process database with standard SQL to carry out the needed transformations.

So, in a nutshell, DuckDB is an in-process database (it runs inside the program itself, with no independent server process, resembling SQLite), built for OLAP (analytical workloads), that handles data in traditional formats (CSV, Parquet) and is optimized to process large volumes of data using the power of a single machine (which doesn't need to be very powerful).

A ballot box's log is a single TSV file with a standardized name, XXXXXYYYYZZZZ.csv, encoding the box's location metadata: the first 5 digits are the city code, the next 4 the electoral zone (a geographical subdivision of a state), and the last 4 the electoral section (the ballot box itself).
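Just to make the naming convention concrete, here is a minimal Python sketch that slices a hypothetical file name into these three codes; the real files in the dump may carry extra prefixes or suffixes not shown here.

filename = "1234500670099.csv"   # hypothetical name following the convention above
digits = filename[:-4]           # drop the ".csv" extension

city_code = digits[0:5]      # first 5 digits: city code
zone_code = digits[5:9]      # next 4 digits: electoral zone
section_code = digits[9:13]  # last 4 digits: electoral section

print(city_code, zone_code, section_code)  # 12345 0067 0099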

There are almost 500,000 ballot boxes in Brazil, so, almost 500,000 files. A file's size depends on the number of voters in the section, which varies from 1 to 500. This is what the logs look like:

2022-10-02 09:35:17 INFO 67305985 VOTA Voter was enabled
2022-10-02 09:43:55 INFO 67305985 VOTA Vote confirmed for [Federal Deputy]
2022-10-02 09:48:39 INFO 67305985 VOTA Vote confirmed for [State Deputy]
2022-10-02 09:49:10 INFO 67305985 VOTA Vote confirmed for [Senator]
2022-10-02 09:49:47 INFO 67305985 VOTA Vote confirmed for [Governor]
2022-10-02 09:50:08 INFO 67305985 VOTA Vote confirmed for [President]
2022-10-02 09:50:09 INFO 67305985 VOTA The voter's vote was computed
# Events that represent a vote (descriptions literally translated to English)

We're interested in transforming this raw information into statistical metrics about voting time (How much time does each voter take to vote? How many votes are computed each minute?) at several levels of granularity (country, state, city) and, to achieve that, we're going to build an OLAP cube like this:

| State         | City              | Mean Voting Time (seconds) | Max Votes Computed in 5 Min |
|---------------|-------------------|----------------------------|-----------------------------|
| Null | Null | 50 | 260 |
| São Paulo | São Paulo | 30 | 300 |
| São Paulo | Campinas | 35 | 260 |
| São Paulo | Null | 20 | 260 |
| Rio de Janeiro| Rio de Janeiro | 25 | 360 |
| Minas Gerais | Belo Horizonte | 40 | 180 |
| Bahia | Salvador | 28 | 320 |
| Rio Grande ...| Porto Alegre | 30 | 300 |
| ... | ... | ... | ... |

Setting up the environment

All that’s needed to run this project is a Python environment with the DuckDB package installed.

pip install duckdb
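As a quick sanity check (just a hedged example; the exact version printed will vary), the package can be imported and queried right away:

import duckdb

print(duckdb.__version__)                 # version will vary
duckdb.sql("SELECT 42 AS answer").show()  # runs on a default in-memory connection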

Transforming the data

In the following sections, I’ll describe each transformation, its objectives, how DuckDB can perform each one, the advantages, challenges, results, and conclusions.

The processing is divided into 4 steps: Convert TSV files to Parquet; Filter and Clean; Isolate votes and their attributes; and Compute metrics for the OLAP Cube.

Processing Steps. Image by Author.

Unfortunately, to avoid making this post enormous, I’ll not explain each transformation in detail. But all the code is available on the GitHub repository.

Converting TSV files to Parquet

A simple and indispensable step for anyone who wants to work with large volumes of data. Doing this on DuckDB is straightforward.

First, create a DuckDB session:

import duckdb

cursor = duckdb.connect("")

In this example, we instantiate the database connector with an empty string. This indicates that DuckDB should not create its own database file; rather, it should only interact with files on disk. As mentioned earlier, DuckDB is a database, so it has the functionality to create tables, views, and so on, which we won't explore here. We'll focus solely on using it as a transformation engine.
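For contrast, a minimal sketch of the two connection modes (the file name below is hypothetical):

import duckdb

# Ephemeral connection: no database file is created; we only use DuckDB
# as a query engine over external files (our case in this project).
engine = duckdb.connect("")

# Persistent connection: passing a path creates/reuses a .duckdb file where
# tables and views would be stored. We don't need this mode here.
db = duckdb.connect("my_database.duckdb")  # hypothetical file name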

And define the following query:

query = f"""
    COPY (
        SELECT
            *
        FROM read_csv('/data/logs/2_{state}/*.csv', filename=True)
    ) TO '{state}.parquet' (FORMAT 'parquet');
"""
cursor.execute(query)

And that’s all!

Let’s detail the query:

The inner expression is just a standard SELECT * FROM table query; the only difference is that, instead of referencing a table, DuckDB can reference files directly.
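A few hedged examples of this (paths are hypothetical; read_csv with a glob and filename=True is exactly what the query above uses):

cursor.sql("SELECT COUNT(*) FROM 'logs/one_file.csv'")              # format inferred from the extension
cursor.sql("SELECT * FROM read_csv('logs/*.csv', filename=True)")   # explicit reader + glob pattern
cursor.sql("SELECT * FROM read_parquet('output/*.parquet')")        # same idea for Parquet files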

The result of this query can also be imported into a pandas dataframe for further processing, just like this:

my_df = cursor.execute(query).df()

This allows seamless integration between DuckDB and pandas.
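The integration also works in the other direction: the Python client can query a local pandas DataFrame by its variable name (a so-called replacement scan). A minimal sketch with made-up data and names:

import duckdb
import pandas as pd

summary_df = pd.DataFrame({"uf": ["SP", "SP", "RJ"], "votes": [10, 20, 5]})  # made-up data

# The DataFrame variable can be referenced by name directly in the SQL text
result = duckdb.sql("SELECT uf, SUM(votes) AS votes FROM summary_df GROUP BY uf").df()
print(result)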

The outer expression is a simple COPY … TO …, which writes the inner query's result to a file.

In this first transformation, we can already see one of DuckDB's strengths: the ability to interact with files using plain old SQL, without needing to configure anything else. The query above is no different from the day-to-day operations we run in standard DBMSs, like PostgreSQL and MySQL; the only difference is that, instead of manipulating tables, we're interacting with files.

Originally, we had 450 GB of TSV files and, after ~30 min, we ended up with 97 GB of Parquet.

Filter and Clean

As mentioned earlier, the logs store every event that happens on a ballot box. This step aims to keep only vote-related events, like 'The voter voted for PRESIDENT', 'The voter had fingerprints collected', and 'The vote was computed', that happened on the election days (that's important, as the logs also record training sessions and other administrative procedures).

A simple query, but with a lot of text and date manipulations:


VOTES_DESCRIPTIONS = [
    # VOTES
    "event_description = 'Aguardando digitação do título'",
    # Awaiting voter's title (Voter Registration ID) input
    "event_description = 'Título digitado pelo mesário'",
    # Voter's title entered by the poll worker
    "event_description = 'Eleitor foi habilitado'",
    # Voter has been enabled
    "event_description ILIKE 'Voto confirmado par%'",
    # Vote confirmed for ... could be [PRESIDENT, SENATOR, DEPUTY, ...]
    "event_description = 'O voto do eleitor foi computado'",
    # Voter's vote has been computed
]

ACCEPTED_DATES = [
    '2022-10-02', '2022-10-30',  # Constitutional dates of the election (filter)
    '2022-10-03', '2022-10-31',
]

query = f"""
SELECT
    *
FROM (
    SELECT
        event_timestamp,
        event_timestamp::date AS event_date,
        event_type,
        some_id,
        event_system,
        event_description,
        event_id,

        REPLACE(SPLIT_PART(filename, '/', 5), '_new.csv', '') AS filename,

        -- Metadata from filename
        SUBSTRING( SPLIT_PART(SPLIT_PART(filename, '/', 5), '-', 2), 1, 5 ) AS city_code,
        SUBSTRING( SPLIT_PART(SPLIT_PART(filename, '/', 5), '-', 2), 6, 4 ) AS zone_code,
        SUBSTRING( SPLIT_PART(SPLIT_PART(filename, '/', 5), '-', 2), 10, 4 ) AS section_code,
        REPLACE(SPLIT_PART(filename, '/', 4), '2_', '') AS uf
    FROM
        {DATASET}
    WHERE 1=1
        AND ( {' OR '.join(VOTES_DESCRIPTIONS)} )
) _
WHERE 1=1
    AND event_date IN ({', '.join([f"'{date}'" for date in ACCEPTED_DATES])})
"""

This query highlights another advantage of DuckDB: the ability to read and write partitioned data. Table partitioning is very relevant in the context of Big Data, but it's even more significant in the single-machine paradigm, given that the same disk serves both input and output, i.e., it suffers twice, so every optimization is welcome.
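For reference, a minimal hedged sketch (not the project's actual query; the paths and columns are illustrative) of writing and then reading Hive-partitioned Parquet with DuckDB:

# Write one folder per (uf, event_date) combination, e.g. uf=SP/event_date=2022-10-02/
cursor.execute("""
    COPY (SELECT * FROM read_parquet('logs/*.parquet'))
    TO 'logs_filtered'
    (FORMAT 'parquet', PARTITION_BY (uf, event_date));
""")

# hive_partitioning exposes uf and event_date as columns and lets DuckDB
# skip partitions that don't match the filter
cursor.execute("""
    SELECT COUNT(*)
    FROM read_parquet('logs_filtered/*/*/*.parquet', hive_partitioning = true)
    WHERE uf = 'SP';
""")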

Originally, we had 97 GB but, after ~30 min, we were left with 63 GB of Parquet.

Isolate votes and their attributes

As each vote is composed of several lines, we need to condense all the information into a single record to ease the calculations. Here is where things get complicated: the query gets complex and, unfortunately, DuckDB could not process all the data in one go.

To overcome this issue, I used a loop to process the data incrementally, in slices:

for state in states:
    for date in ACCEPTED_DATES:
        for zone_group in ZONE_GROUPS:
            query = f"""
                COPY
                    {
                        complex_query_goes_here
                        .replace('<uf>', state)
                        .replace('<event_date>', date)
                        .replace('<zone_id_min>', str(zone_group[0]))
                        .replace('<zone_id_max>', str(zone_group[1]))
                    }
                TO 'VOTES.parquet'
                (FORMAT 'parquet', PARTITION_BY (event_date, uf, zone_group), OVERWRITE_OR_IGNORE 1);
            """
            cursor.execute(query)  # process one (state, date, zone range) slice at a time

The implementation details don't matter; the interesting part is that we didn't need to change the code much to build this final table incrementally. As each processed 'slice' represents a partition, setting the parameter OVERWRITE_OR_IGNORE to 1 tells DuckDB to overwrite any existing data for that partition or ignore it if it already exists.

Originally, we had 63 GB and, after ~1 hour and 20 minutes, we ended up with 15 GB of Parquet.

Compute metrics and build the OLAP Cube

This is a simple step. Now, with each vote represented by a single record, all that's needed is to compute the metrics.

query_metrics = f"""
    SELECT
        turno,
        state,
        zone_code,
        section_code,

        COUNT(*) AS total_votes,
        COUNT( DISTINCT state || zone_code || section_code ) AS total_sections,

        SUM( vote_time ) AS voting_time_sum,
        AVG( vote_time ) AS average_voting_time,

        MAX( nr_of_votes ) AS total_ballot_items_voted,
        SUM( nr_of_keys_pressed ) AS total_keys_pressed
    FROM
        source
    GROUP BY ROLLUP(turno, state, zone_code, section_code)
"""

As we need to compute the metrics at many levels of granularity, the ideal way to do this is with GROUP BY + ROLLUP.
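To make ROLLUP concrete, here is a small self-contained sketch with made-up data. It produces one row per (state, city) pair, plus a subtotal per state and a grand total, with NULLs marking the rolled-up levels, exactly the shape of the OLAP cube shown earlier:

import duckdb

duckdb.sql("""
    WITH votes(state, city, vote_time) AS (
        VALUES ('SP', 'São Paulo', 30), ('SP', 'Campinas', 40), ('RJ', 'Rio de Janeiro', 25)
    )
    SELECT state, city, AVG(vote_time) AS mean_voting_time
    FROM votes
    GROUP BY ROLLUP (state, city)
    ORDER BY state NULLS FIRST, city NULLS FIRST
""").show()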

In this case, DuckDB stood out significantly: we started with 15 GB and, after 36 seconds, the file was reduced to 88 MB!

This is blazing-fast performance: it grouped more than 200 million rows at 4 different levels of granularity, where the highest level has a cardinality of 2 and the lowest ~200,000, in less than a minute!

The table below summarizes the results:

The total pipeline's execution time was ~2h30min, executed on WSL with the following specs: ~16 GB of DDR4 RAM, an Intel 12th-generation Core i7 processor, and a 1 TB NVMe SSD.

During the process, I noticed that memory usage was a bottleneck, as DuckDB constantly created temporary files on disk in a .temp/ directory. I also had plenty of problems running queries with window functions: not only did they take more time than expected to execute, but the program also randomly crashed several times.
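DuckDB does expose a few settings that can help in this situation; I didn't tune them for this project, so the sketch below is only a hedged illustration with arbitrary values:

import duckdb

cursor = duckdb.connect("")

cursor.execute("SET memory_limit = '12GB';")         # cap RAM usage; excess spills to disk
cursor.execute("SET temp_directory = '/data/tmp';")  # where the temporary spill files go
cursor.execute("SET threads = 8;")                   # fewer threads can reduce memory pressure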

Despite that, I believe the performance reached was satisfactory. After all, we're talking about nearly half a terabyte of data being processed with complex queries by a single machine (which is not that powerful compared with clusters of computers).

The fact is that processing data is, sometimes, like refining uranium. We start with an enormous mass of raw material and, through a hard, time-consuming, and costly process (that, sometimes, puts lives at risk), we extract a small portion of the relevant refined information.

Jokes aside, in my posts, I've explored many ways to perform data processing, talking about tools, techniques, data architectures… always looking for the best way of doing things. This kind of knowledge is important, as it helps us choose the right tool for the right job. The goal of this post was exactly to find out what kind of job DuckDB solves and what kind of experience it provides.

And, in general terms, it was a good experience.

Working with this database was very smooth: I didn't have to configure practically anything, I just imported the data and manipulated it with plain old SQL statements. In other words, the tool has an almost zero entry barrier for those who already know SQL and a little bit of Python. In my opinion, this was DuckDB's big victory: it not only empowered my machine with the ability to process 450 GB of data, but it did so with a low adaptation cost for the environment (and the programmer).

In terms of processing speed, considering the complexity of the project, the 450 GB volume, and the fact that I didn't optimize the database parameters, 2h30min was a good result. Especially considering that, without this tool, it would have been impossible, or extremely complex, to accomplish this task on my computer.

DuckDB sits somewhere between pandas and Spark. For small volumes of data, pandas can be more attractive in terms of usability, especially for folks with some background in programming, as the package has many built-in transformations that could be tricky to implement in SQL. It also has seamless integration with many other Python packages, including DuckDB. For enormous volumes of data, Spark will probably be a better alternative, with its parallelism, clusters, and all that stuff. So, DuckDB fills a blind spot for medium-to-not-so-large projects, where using pandas would be impossible and Spark, overkill.

DuckDB extends the limits that a single machine can reach and expands the projects that can be developed locally, bringing speed to the analysis/manipulation of large volumes of data. Without a doubt, it is a powerful tool that I will proudly add to my toolbox.

Furthermore, I hope this post helped you get a better view of DuckDB. As always, I'm not an expert in any of the subjects addressed in this post, and I strongly recommend further reading; my references are listed below and the code is available on GitHub.

Thank you for reading! 😉

All the code is available in this GitHub repository.
Interested in more works like this one? Visit my posts repository.

[1] 2022 Results — Files transmitted for totalization — TSE Open Data Portal. [ODbL]
[2] Databricks. (2023, June 29). Data + AI Summit Keynote, Thursday Part 5 — DuckDB. YouTube.
[3] DuckDB Official Documentation. DuckDB.
[4] The electronic ballot box. Superior Electoral Court.
[5] Wikipedia contributors. (2023, July 25). OLAP cube. Wikipedia.
[6] DuckDB — GitHub. Window performance · Issue #7809 · duckdb/duckdb.
[7] Gunnarmorling. GitHub — gunnarmorling/1brc: The One Billion Row Challenge — A fun exploration of how quickly 1B rows from a text file can be aggregated with Java.


