By Dan Lamyman
Co-Founder & Director
Business Intelligence & Advanced Analytics
Whilst data scientists are getting all the attention in countless articles about the rise of big data, no data scientist could do their job without the technology that enables it. This is where a Data Engineer comes in. A good data engineer delivers clean, queued, and timely data to data scientists on demand. In many ways, a data engineer needs to understand the worlds of software engineering and data science in order to stitch them together into a cohesive production-ready system.
As such, finding a good data engineer with a wide range of skills in software development, production pipelines, cloud DevOps, data transformation, data science intuition, and understanding of common machine learning techniques can be a challenge. The skill set required of a great data engineer is quite broad. This article covers the core capabilities that companies should look for in a data engineer. That said, not every candidate will possess all of these skills, nor will every company’s data pipeline require them. Use this article as a baseline for what to look for when hiring a data engineer.
Background of a Data Engineer
Typically, data engineers write more code and perform less analysis than a data scientist. They’ll often come from roles in software engineering before entering the field of big data. In contrast, data scientists usually have formal training in statistics. They often come from academia and work on applying various models and analysing the results. That said, there is significant overlap between positions and both need to understand the work of the other in order to produce an effective data pipeline.
The best data engineers come from a software engineering, business intelligence, or data warehousing background. They’re comfortable working with large datasets, but more importantly they’re skilled at writing scripts to wrangle and transform data effectively. Additionally, they have experience testing, deploying, and maintaining software on highly-available cloud infrastructure.
Selecting the Right Language
Data engineers inevitably write a lot of code. They also spend quite a bit of time working on getting modules and libraries to play nicely with each other. The ultimate goal is using code to pass along and transform data. Whilst this may not sound difficult, it becomes a major challenge when your dataset has billions of records or you need to instantaneously transform and analyse real-time data streams.
A large proportion of data engineers write code in Python. Its clean syntax and extensive data science ecosystem, including NumPy, Pandas, and Jupyter Notebooks, make it an attractive choice for writing data pipeline code. Some have pointed out that Python is relatively slow compared to languages like Java or C++. However, libraries such as NumPy implement their performance-critical routines in C, and alternative runtimes exist as well: PyPy uses just-in-time compilation to speed up pure-Python code, while Jython and IronPython target the JVM and .NET respectively.
Still, Java and functional programming languages like Scala are also popular for data pipelines for their performance and security. In addition, data engineers need a solid command of shell scripting in Bash, enterprise Linux server architecture and UNIX in general, and the SQL language for writing queries. Statistical languages like R are also useful for a data engineer to know. However, solid programming skills are very much the baseline requirements for a data engineer, as everything builds atop them.
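These baseline skills often combine even in small prototypes. As a rough illustration (the table and column names below are invented, not from any real schema), a data engineer might use Python's built-in sqlite3 module to sketch a SQL aggregation before porting it to a production warehouse:

```python
import sqlite3

# Prototype a SQL aggregation against an in-memory SQLite database.
# Table and column names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.5), (2, 3.0)],
)

# Total spend per user, highest first.
rows = conn.execute(
    "SELECT user_id, SUM(amount) AS total "
    "FROM events GROUP BY user_id ORDER BY total DESC"
).fetchall()

print(rows)  # [(1, 15.5), (2, 3.0)]
```

The same query text would run largely unchanged against a production database, which is why SQL fluency transfers so well across stacks.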
The path to using big data involves acquiring, cleaning, standardizing, storing, and then queuing the data for analysis. This general pipeline is the entire work of a data engineer. However, the process is complicated by the many sources of data and possible endpoints for that data’s delivery. Data engineers may create connections to legacy software, relational and non-relational databases, cloud-based SaaS providers, or any number of other data sources. In fact, much of the work of a data engineer might involve extracting data from legacy systems to power new services in the cloud.
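The acquire-clean-standardise-store-queue pipeline described above can be sketched as a chain of small functions. This is a toy, in-memory illustration with invented record fields, not a production design:

```python
# Toy end-to-end pipeline: acquire -> clean -> standardise -> store -> queue.
from collections import deque

def acquire():
    # Stand-in for pulling records from an API, file, or legacy system.
    return [{"name": " Alice ", "age": "34"}, {"name": "bob", "age": None}]

def clean(records):
    # Drop records with missing fields.
    return [r for r in records if all(v is not None for v in r.values())]

def standardise(records):
    # Normalise types and formatting.
    return [{"name": r["name"].strip().title(), "age": int(r["age"])}
            for r in records]

def store_and_queue(records):
    # Persist the records (here: a plain list) and queue them for analysis.
    storage = list(records)
    return storage, deque(storage)

storage, work_queue = store_and_queue(standardise(clean(acquire())))
print(work_queue.popleft())  # {'name': 'Alice', 'age': 34}
```

Each stage stays small and testable on its own, which is exactly the property that makes a real pipeline maintainable as sources and endpoints multiply.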
ETL, Pipelines, & Migration
As a data engineer is compiling and relating data from many sources, a core competency is the process of extracting data from a source, transforming it to a new format, and loading it into a new database for later use. This extract, transform, load (ETL) workflow has been the core of data engineering since its inception. However, increasingly companies need data to be available immediately for analysis. Whereas batch ETL was the main process a few years ago, streaming ETL is now the gold standard.
Furthermore, this cleaned and transformed data stream likely needs to be available as a cloud microservice, exposing the data to other services that will consume it in different ways. Productionising such streaming pipelines comes with its own challenges in scaling, reliability, and security, as popular applications can easily generate billions of data points per day and attackers will try anything to get their hands on that data.
MapReduce, Hadoop, & Spark
There are a whole host of tools designed to make processing large datasets easier. The challenge in processing billions of records is twofold: you need to distribute the work across multiple machines to do it quickly, but you also need to maintain relationships within the data and ensure that two computers don’t overwrite each other’s changes or orphan data points along the way.
Apache Hadoop has become the go-to solution for distributed processing of large datasets across clusters of computers. Its MapReduce model achieves this by splitting a job into map and reduce tasks that run in parallel across the cluster. Apache Spark builds on Hadoop’s success, offering faster in-memory batch processing and support for new workloads such as streaming, interactive queries, and machine learning. Both Hadoop and Spark are critical technologies for a data engineer to know.
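The map-shuffle-reduce idea can be illustrated without a cluster. This single-process word-count sketch mirrors, in miniature, what Hadoop distributes across machines (the phases here run sequentially; the framework's job is to run them in parallel and fault-tolerantly):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine each key's values into a final count.
    return {key: sum(values) for key, values in grouped.items()}

splits = ["big data big pipelines", "big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, splits))))
print(counts)  # {'big': 3, 'data': 2, 'pipelines': 1}
```

Because the map and reduce functions are pure and operate on independent chunks, the framework is free to schedule them on any node, which is the key to MapReduce's scalability.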
Another Apache tool, Kafka, is also important in the data engineering field. Kafka solves issues with data streaming, making it possible to analyse or transform real-time data as it arrives.
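Kafka's real client API is far richer than this, but the underlying producer/consumer pattern it builds on can be sketched with the standard library alone. This is only an illustration of the pattern, not of Kafka's API:

```python
import queue
import threading

# A stdlib stand-in for a message-broker topic. Kafka adds partitioning,
# persistence, and replication on top of this basic pattern.
topic = queue.Queue()
results = []

def producer():
    for i in range(5):
        topic.put({"event_id": i})   # publish a message
    topic.put(None)                  # sentinel: end of stream

def consumer():
    while True:
        msg = topic.get()
        if msg is None:
            break
        results.append(msg["event_id"] * 10)  # transform as data arrives

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 10, 20, 30, 40]
```

The consumer processes each event the moment it is published, which is precisely the real-time property that makes Kafka valuable for streaming pipelines.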
Productionising Data Science Code & Machine Learning
Whilst data engineers typically don’t perform the analysis on the data, they do need to understand how the analysis works and what common pitfalls to avoid in the data they provide to data scientists. A good data engineer will also be studying the latest developments in machine learning and playing around with algorithms themselves as well as investigating common statistical analyses and visualizations used by data scientists.
With this knowledge in hand, the goal of any production data pipeline is to serve highly-available, consistent, real-time data via an API for consumption by a machine learning model or statistical analysis package.
This pipeline should be linearly scalable, running in parallel across multiple machines, cloud VMs, or Docker containers. It should also be fault tolerant when nodes fail, which is where knowledge of containers, orchestration, and cloud service provider (AWS, GCP, Azure) infrastructure becomes critical. Apache Spark and its libraries like MLlib make it possible to build such pipelines, but the architecture requires significant know-how and experience.
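Two of the properties mentioned above, parallel scaling and tolerating transient failures, can be sketched in miniature with the standard library. This is an illustration of the concerns a real scheduler like Spark handles, not an implementation of one; the transformation and retry count are invented:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of a parallelisable pipeline step with a simple retry loop,
# illustrating the fault-tolerance a real cluster scheduler provides.
def process(record, attempts=3):
    for attempt in range(attempts):
        try:
            return record * record   # stand-in for a real transformation
        except Exception:
            if attempt == attempts - 1:
                raise                # give up after the last retry

records = range(8)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, records))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because `process` is independent per record, adding workers (or machines) increases throughput roughly linearly, which is the scalability property the paragraph describes.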
Junior data engineers may understand some of these concepts. However, building a data pipeline from scratch for a complex application is a task best left to a senior data engineer. With such a key role in the success of your pipeline, an experienced data engineer could be your most important hire in your data science efforts.
Cloud vs. On-Premise
While some companies may run their data science efforts from on-premise servers, the vast majority of organizations are taking advantage of cloud service providers like Amazon Web Services, Google Cloud Platform, and Microsoft Azure. As such, knowledge of cloud computing concepts, containers, and cluster management is essential for a data engineer.
Some companies will insist that the data engineer they hire be experienced in their specific technology stack, be it AWS, GCP, or Azure. However, many of the same concepts are transferable across providers. While the name and syntax of various services might differ, cloud deployment skills are generally transferable.
Since moving to the cloud allows your data pipeline to scale seamlessly in both computing and storage resources, cloud deployment has significant advantages over on-premise storage, especially in the case of data science. As such, knowledge of cloud deployment and best practices like CI/CD for data pipelines is a must-have skill in a good data engineer.
We’ve covered a ton of technical skills, but without some soft skills your data engineer hire could still fall short. Teamwork, above all, is critical to a data engineer’s job: they’ll need to work alongside other engineers and data scientists to collaboratively produce insights. Communication skills are equally key, as both written and oral articulation of findings, challenges, and opportunities are an important part of a data engineer’s work.
Additionally, you’ll likely want a data engineer who is strong-willed. Since the data engineer’s role is support and infrastructure, it can be difficult for management and executives to prioritise investing in the data pipeline. A good data engineer argues passionately for the importance of infrastructure. They also will have an eye for potential insights, and will proactively bring solutions to the team to push the analysis into new arenas.
Hiring a data engineer is an important decision at any company, as they’ll be building the infrastructure that powers insights from your data. This wide-ranging role can be difficult to hire for because so many skills are required. However, finding a candidate that matches these criteria will be well worth the effort.
Looking to add a talented data engineer to your team? At Logikk, we work with top companies to recruit great data professionals, and we have deep connections in the space. Get in touch today to see how we can help!