Radical Simplicity in Data Engineering | by Cai Parry-Jones

Learn from Software Engineers and Discover the Joy of ‘Worse is Better’ Thinking

Recently, I have had the fortune of speaking to a number of data engineers and data architects about the problems they face with data in their businesses. The main pain points I heard time and time again were:

Not knowing why something broke
Getting burnt with high cloud compute costs
Taking too long to build data solutions/complete data projects
Needing expertise on many tools and technologies

These problems aren’t new. I’ve experienced them, you’ve probably experienced them. Yet, we can’t seem to find a solution that solves all of these issues in the long run. You might think to yourself, ‘well point one can be solved with {insert data observability tool}’, or ‘point two just needs a stricter data governance plan in place’. The problem with these style of solutions is they add additional layers of complexity, which cause the final two pain points to increase in seriousness. The aggregate sum of pain remains the same, just a different distribution between the four points.

created by the author using Google Sheets

This article aims to present a contrary style of problem solving: radical simplicity.

TL;DR

Software engineers have found massive success in embracing simplicity.
Over-engineering and pursuing perfection can result in bloated, slow-to-develop data systems, with sky high costs to the business.
Data teams should consider sacrificing some functionality for the sake of simplicity and speed.

A Lesson From Those Software Guys

In 1989, the computer scientist Richard P. Gabriel wrote a relatively famous essay on computer systems paradoxically called ‘Worse Is Better’. I won’t go into the details, you can read the essay here if you like, but the underlying message was that software quality does not necessarily improve as functionality increases. In other words, on occasions, you can sacrifice completeness for simplicity and end up with an inherently ‘better’ product because of it.

This was a strange idea to the pioneers of computing during the 1950/60s. The philosophy of the day was: a computer system needs to be pure, and it can only be pure if it accounts for all possible scenarios. This was likely due to the fact that most leading computer scientists at the time were academics, who very much wanted to treat computer science as a hard science.

Academics at MIT, the leading institution in computing at the time, started working on the operating system for the next generation of computers, called Multics. After nearly a decade of development and millions of dollars of investment, the MIT guys released their new system. It was unquestionably the most advanced operating system of the time, however it was a pain to install due to the computing requirements, and feature updates were slow due to the size of the code base. As a result, it never caught on beyond a few select universities and industries.

While Multics was being built, a small group supporting Multics’s development became frustrated with the growing requirements required for the system. They eventually decided to break away from the project. Armed with this experience they set their sights on creating their own operating system, one with a fundamental philosophy shift:

The design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.

— Richard P. Gabriel

Five years after Multics’s release, the breakaway group released their operating system, Unix. Slowly but steadily it caught traction, and by the 1990s Unix became the go-to choice for computers, with over 90% of the world’s top 500 fastest supercomputers using it. To this day, Unix is still widely used, most notably as the system underlying macOS.

There were obviously other factors beyond its simplicity that led to Unix’s success. But its lightweight design was, and still is, a highly valuable asset of the system. That could only come about because the designers were willing to sacrifice functionality. The data industry should not be afraid to to think the same way.

Back to Data in the 21st Century

Thinking back at my own experiences, the philosophy of most big data engineering projects I’ve worked on was similar to that of Multics. For example, there was a project where we needed to automate standardising the raw data coming in from all our clients. The decision was made to do this in the data warehouse via dbt, since we could then have a full view of data lineage from the very raw files right through to the standardised single table version and beyond. The problem was that the first stage of transformation was very manual, it required loading each individual raw client file into the warehouse, then dbt creates a model for cleaning each client’s file. This led to 100s of dbt models needing to be generated, all using essentially the same logic. Dbt became so bloated it took minutes for the data lineage chart to load in the dbt docs website, and our GitHub Actions for CI (continuous integration) took over an hour to complete for each pull request.

This could have been resolved fairly simply if leadership had allowed us to make the first layer of transformations outside of the data warehouse, using AWS Lambda and Python. But no, that would have meant the data lineage produced by dbt wouldn’t be 100% complete. That was it. That was the whole reason not to massively simplify the project. Similar to the group who broke away from the Multics project, I left this project mid-build, it was simply too frustrating to work on something that so clearly could have been much simpler. As I write this, I discovered they are still working on the project.

So, What the Heck is Radical Simplicity?

Radical simplicity in data engineering isn’t a framework or data-stack toolkit, it is simply a frame of mind. A philosophy that prioritises simple, straightforward solutions over complex, all-encompassing systems.

Key principles of this philosophy include:

Minimalism: Focusing on core functionalities that deliver the most value, rather than trying to accommodate every possible scenario or requirement.
Accepting trade-offs: Willingly sacrificing some degree of completeness or perfection in favour of simplicity, speed, and ease of maintenance.
Pragmatism over idealism: Prioritising practical, workable solutions that solve real business problems efficiently, rather than pursuing theoretically perfect but overly complex systems.
Reduced cognitive load: Designing systems and processes that are easier to understand, implement, and maintain, thus reducing the expertise required across multiple tools and technologies.
Cost-effectiveness: Embracing simpler solutions that often require less computational resources and human capital, leading to lower overall costs.
Agility and adaptability: Creating systems that are easier to modify and evolve as business needs change, rather than rigid, over-engineered solutions.
Focus on outcomes: Emphasising the end results and business value rather than getting caught up in the intricacies of the data processes themselves.

This mindset can be in direct contradiction to modern data engineering solutions of adding more tools, processes, and layers. As a result, be expected to fight your corner. Before suggesting an alternative, simpler, solution, come prepared with a deep understanding of the problem at hand. I am reminded of the quote:

It takes a lot of hard work to make something simple, to truly understand the underlying challenges and come up with elegant solutions. […] It’s not just minimalism or the absence of clutter. It involves digging through the depth of complexity. To be truly simple, you have to go really deep. […] You have to deeply understand the essence of a product in order to be able to get rid of the parts that are not essential.

— Steve Jobs

Side note: Be aware that adopting radical simplicity doesn’t mean ignoring new tools and advanced technologies. In fact one of my favourite solutions for a data warehouse at the moment is using a new open-source database called duckDB. Check it out, it’s pretty cool.

Conclusion

The lessons from software engineering history offer valuable insights for today’s data landscape. By embracing radical simplicity, data teams can address many of the pain points plaguing modern data solutions.

Don’t be afraid to champion radical simplicity in your data team. Be the catalyst for change if you see opportunities to streamline and simplify. The path to simplicity isn’t easy, but the potential rewards can be substantial.

Source link

Radical Simplicity in Data Engineering | by Cai Parry-Jones | Jul, 2024