Collecting Data with Apache Airflow on a Raspberry Pi | by Dmitrii Eliuseev | Oct, 2023


A Raspberry Pi is All You Need

Raspberry Pi Zero (2021 model). Image source: Wikipedia

Often, we need to collect data over a certain period of time. It can be data from an IoT sensor, statistical data from social networks, or something else. For example, the YouTube Data API lets us get the current number of views and subscribers for any channel, but the analytics and historical data are available only to the channel owner. Thus, if we want weekly or monthly summaries for these channels, we need to collect this data ourselves. In the case of an IoT sensor, there may be no API at all, and we again need to collect and save the data on our own. In this article, I will show how to configure Apache Airflow on a Raspberry Pi, which allows running tasks for a long period of time without involving any cloud provider.
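To make the "collect it ourselves" idea concrete, here is a minimal sketch of what one collection step could look like: each run appends a timestamped snapshot to a CSV file, and a summary is later computed as the difference between the last and the first snapshot. The names `fetch_channel_stats`, `append_snapshot`, `growth`, and the CSV layout are my assumptions for illustration, not code from the article; `fetch_channel_stats` is a hypothetical placeholder that returns fixed numbers and would have to be replaced by a real API or sensor call.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Assumed CSV layout: one row per snapshot.
FIELDS = ["timestamp", "channel_id", "views", "subscribers"]


def fetch_channel_stats(channel_id):
    """Hypothetical placeholder for a real API or sensor call.

    Returns fixed demo values so the sketch runs without network access.
    """
    return {"views": 1000, "subscribers": 50}


def append_snapshot(csv_path, channel_id, stats, now=None):
    """Append one timestamped snapshot row, writing the header for a new file."""
    now = now or datetime.now(timezone.utc)
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": now.isoformat(),
            "channel_id": channel_id,
            "views": stats["views"],
            "subscribers": stats["subscribers"],
        })


def growth(csv_path, channel_id):
    """Return last-minus-first counters for a channel, or None if fewer than 2 rows."""
    with open(csv_path, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r["channel_id"] == channel_id]
    if len(rows) < 2:
        return None
    first, last = rows[0], rows[-1]
    return {
        "views": int(last["views"]) - int(first["views"]),
        "subscribers": int(last["subscribers"]) - int(first["subscribers"]),
    }
```

A scheduler would then only need to call something like `append_snapshot("stats.csv", channel_id, fetch_channel_stats(channel_id))` on each run. A flat CSV file is chosen here only because it is easy to inspect; a database would work just as well.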

Obviously, if you’re working for a large company, you will probably not need a Raspberry Pi. In that case, if you need an extra cloud instance, just create a Jira ticket for your MLOps department 😉 But for a pet project or a low-budget startup, it can be an interesting solution.

Let’s see how it works.

Raspberry Pi

What actually is a Raspberry Pi? For readers who have not followed hardware over the last 10 years (the first Raspberry Pi model was introduced in 2012), I can briefly explain that it is a single-board computer running full-fledged Linux. Usually, a Raspberry Pi has a 1GHz, 2–4-core ARM CPU and 1–8 GB of RAM (512 MB on the Zero models). It is small, cheap, and silent; it has no fans and no disk drive (the OS runs from a microSD card). A Raspberry Pi needs only a standard USB power supply; it can be connected to a network via Wi-Fi or Ethernet and run different tasks for months or even years.

For my data science pet project, I wanted to collect YouTube channel statistics over 2 weeks. For a task that requires only 30–60 seconds twice per day, a serverless architecture can be a perfect solution, and we can use something like Google Cloud Functions for that. But every tutorial from Google started with the phrase “enable billing for your project”. Google provides an initial free credit and free quotas, but I did not want to have another headache of monitoring how much money I…


