Reading for Growing Data Engineers - 2017

This blog is going away soon! :( Check out my new site where you can read the latest and subscribe for updates!

Books which have shaped my path in the last six months

It’s safe to say that I invest way too much in books. When I was in college, I got my hands on as many mathematics books as I could (after going through what was available, and understandable at my level, in the library). Thankfully, that habit has carried over into my data science and data engineering career.

I tend to buy books about any technology I have an interest in learning. While there is ample material online for learning Hadoop, Spark, Kubernetes, and others, it’s very easy to gloss over finer details in the interest of putting the tech into practice. Books tend to cover more in-depth knowledge, such as best practices for production, design nuances for long-term maintainability and scalability, and security (let’s face it, none of us wants to be the next Equifax).

These are the books which have most heavily influenced my data engineering practices over the last six months.

The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise (2nd Edition)

For data engineers, scalability is the defining issue. Most of us come out of a data science background where there is a need to scale analyses and processes so the business can actually benefit from big data.

Abbott and Fisher cover in depth different design architectures which may already be familiar to some data engineers. But they also tackle something we tend to have less familiarity with: organizational scale. This is usually the most overlooked part of a big data strategy: allowing big data to shape your business means that your organization must be at least as scalable as your data technology.

Executives tend to think of big data as being “magic,” and will not necessarily seek out organizational scale around big data. This puts much of the burden of explaining this limitation on the data engineers tasked with using big data. Personally, this book provided a lot of insight on how to bear that burden.

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark

One reason Apache Spark has exploded in popularity over the last two years is that it is one of the easiest big data platforms to start using. Data engineers tend to already have practical experience in Scala, Python, SQL, or R, and can jump right in.

While the Spark community has done a tremendous job of abstracting away optimization (so that it “just works”), Holden Karau and Rachel Warren do a great job of walking through how to get the most from Spark. This includes SQL strategies, operations on raw RDDs (still necessary in some cases), and building robust machine learning pipelines.

Personally, my favorite section is Chapter 8, strategies for testing and validating your Spark programs. It’s a less-glamorous piece of data engineering which often gets glossed over in favor of Spark’s more impressive aspects, but is still important for building robust systems.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

I’m going to be honest here - most of the material in this book is not very helpful unless you want to understand challenges which are mostly solved by platforms like Spark or services like Google Cloud Platform.

Why did I include it on the list? Chapters 3 (“Storage and Retrieval”) and 4 (“Encoding and Evolution”) challenged me to put much more long-term thought into how I choose data formats.

The Effective Engineer: How to Leverage Your Efforts In Software Engineering to Make a Disproportionate and Meaningful Impact

There is no shortage of books on personal productivity, efficiency, and personal/professional development; however, only parts of those books are applicable to the software/data engineering profession. Edmund Lau does an amazing job of distilling them down to what does and doesn’t work for engineers (individual results may vary, but that actually is part of the point).

Lau also discusses how certain technical aspects of the job (e.g., using CI/CD) lend themselves to individuals and teams being more efficient at building their products. It’s a great horizontal/team-level view of organizational scale. Where The Art of Scalability tackles it at a more abstract level, this book gives concrete direction for teams wanting to take scale into their own hands.


Feel free to connect with me!

 