Data Foundations: Engineering


The purpose of my current blog series is to lower the barrier to entry to the field of analytics. I previously discussed the smattering of reasons the growing field of analytics is open to newcomers as well as some of the key terms you might want to know about data access and storage. Now, we're here to talk about the next layer in the data stack... data engineering! 


Data engineering is also referred to by a few other names, each having a slightly different flavor but meaning almost the same thing:
  • Data preparation
  • Feature engineering
  • Data cleansing

All these terms simply mean that we want to get the data into a state where it might be able to tell us something interesting.

You might be surprised to know that most data needs to be poked and prodded into a suitable format for analysis. That's why so many data analysts and data scientists learn this skill. You have folks whose job it is to find insights in data working with messy data! So how do we get started? Well, I think it's helpful to understand a bit more about the data you'll be working with.

  • Is the data structured or unstructured? Structured data has a clear, predictable set of features. Unstructured data, on the other hand, is wild and a bit more random.
  • Is the data in an OLTP or OLAP system? OLTP is transactional and real-time, with data coming in frequently. OLAP is a long-term data storage system.
  • What's the data's relationship model? Typically data is normalized with stored in a database to reduce redundancies. To make the data useful to an analyst, the pieces of data need to be brought back together and to do that, we need a data model.

Let's assume we're working with pretty standard, enterprise data that is structured and stored in an OLAP, relational, star-schema SQL database. 

The first thing you'll want to do is look at you data and determine what makes a unique row. 

This information will come in handy later on when you're trying to do a calculation or join this data to other data. You'll also want to look at what columns of data you have and what the data types are:

  • Categorical (also called Dimension or Discrete)
    • Boolean: True or False
    • String: “Hello” “Womxn”
    • Date: 1/31/2020
  • Numerical (also called Measure, Fact, or Continuous)
    • Integer: 33
    • Float: 33.01912
    • Geographic: 43.038902 and -87.906471

      Once you have this information you can begin to put the pieces of this data puzzle together. 

      Let's take a look at some data you might receive from an online retailer. It's is not normalized because when dealing with transactional data (OLTP), they are concerned with speed. Getting the data, storing it, and moving on because they're processing a lot of data in real-time. But this creates duplication information about customers, an order's ship mode, and more.


      Data normalization occurs in the data integration and storage layer of the data pyramid, so you don't need to know how to do this, but it's helpful to know what happens to the data before it gets to you. What you will see might look more like this. You may have a table in the database with order information - specifically, which products are part of an order, how many were purchased, and how much they cost. This order has an order date and a customer. That customer information is in a different table because we hope a customer will have more than one order and only need to store that information once.

      Hopefully you can imagine all the possible questions you could answer with this data! You could use the customer data to know who receives your emails or newsletters. Or you could build a dashboard to show which orders didn't go out on time. Maybe you're more interested in which products are your most popular. All great questions!

      Some questions may require this data be brought together in order to tell a good story. If an order is late, for example, I may want to know what customer is going to be affected. But based on this normalized data, I just have the Customer ID.

      How do we bring these tables of data together? Joins!

      If you're familiar with the vlookup function in Excel, then you're well on your way to understanding joins. Because a left join is basically a vlookup! 

      Imagine you have a table of data about orders and you want know which customer these orders belong to. You take your orders and look up the Customer ID associated with each order in the Customer table. You could have multiple orders that were made by the same customer, so they would have the same Customer ID associated with them. If we have a record in our Customer table that hasn't placed an order, we don't want to include that person in our prepared data set about orders. And that's a left join.

      A right join would be if we wanted all the information about all of our customers and the order information only if the customer placed any orders. A full join would return all the information from both tables and an inner join would return only the information that is in both tables.


      Scratching the surface

      There is so much more to preparing data and while we just reviewed a few things, these few things are crucial! 
      • There is no way you can be prepared for every situation. Practice, be thoughtful, and validate your results.
      • Be careful to join on all the necessary fields. If you don't, you could end up with an enormous amount of duplicate records. This is why knowing what makes a unique records is so important. Identify all relationships between tables of data!
      • There can be different types of relationships between tables of data: 1-to-1, 1-to-many, and many-to-many.
      • All of this information and more is available in my full presentation!