2 Tasks to Boost Your Python Data Wrangling Skills | by Soner Yıldırım | Nov, 2023

How to convert raw data into a more usable and structured format.

(image created by author with Midjourney)

When learning a new tool, we usually go over the docs, watch tutorials, read articles, and solve examples. This is a good-enough approach and will help you learn the tool to a certain extent.

However, when we start using the tool in real-life settings or for solving real issues, we need to go a little beyond what’s covered in most tutorials.

In this article, I’ll explain step by step how I used Python to handle two different data cleaning and preprocessing tasks at my job. For each task, I’ll show you the raw data and the desired format. Then, I’ll explain the code that gets the data into that format.

We’ll dive deep into Python’s built-in data structures and the pandas library, so expect to pick up some practical data wrangling techniques along the way.

I have a DataFrame with a list of issues and their summaries. I’m not using or sharing the original data here. Instead, I generated mock data in the same format as the original. If you want to follow along by executing the code, download the “mock_issues.csv” file from my datasets repository.

What we do in terms of data wrangling depends on the format rather than the content, so the functions and methods covered in this article apply equally to the original data. In fact, the process is exactly the same as the one I followed at my job.

Suppose we have a DataFrame of several rows with the following columns:

(image by author)

Each row in the raw issues column contains a list of issues in the following format:

"[1-The find_duplicates method is inefficiently using the data structures leading to high time complexity.,
2- Built-in data structures are not used efficiently in the generate_meta method.,
3- In the ExerciseGenerator class, excessive use of global variables may slow down the program.,
4- The get_all_contributors_for_repo method is not using built-in…
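The desired output format is not visible in this excerpt, so the following is only a minimal sketch of one way to break such a string into individual issues. It assumes each issue starts with a number followed by a hyphen and that issues are separated by a comma; the `split_issues` helper and the shortened sample string are hypothetical and not taken from the original code.

```python
import re

# Hypothetical, shortened sample in the same shape as the raw string above.
raw = (
    "[1-The find_duplicates method is inefficiently using the data structures "
    "leading to high time complexity.,\n"
    "2- Built-in data structures are not used efficiently in the generate_meta method.,\n"
    "3- In the ExerciseGenerator class, excessive use of global variables "
    "may slow down the program.]"
)


def split_issues(text):
    """Split a bracketed, numbered issue string into a list of issue texts.

    Assumption: every issue begins with "<number>-" and issues are
    separated by a comma that directly precedes the next number.
    """
    # Drop surrounding whitespace and the enclosing square brackets.
    body = text.strip().strip("[]")
    # Split only on commas that are followed by the next "<number>-" marker,
    # so commas inside an issue's own sentence are left alone.
    items = re.split(r",\s*(?=\d+\s*-)", body)
    # Remove the leading "<number>-" marker from each item.
    return [re.sub(r"^\d+\s*-\s*", "", item).strip() for item in items if item.strip()]


issues = split_issues(raw)
for issue in issues:
    print(issue)
```

With a DataFrame, a helper like this could be applied to the raw issues column with the pandas `Series.apply` method, for example `df["raw_issues"].apply(split_issues)`, where the column name `raw_issues` is an assumption.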
