Data Foundations: Integration & Storage Overview

In my last blog post I mentioned how open and approachable the field of data analytics is. Job openings for data analysts are growing at an insane pace because the amount of data businesses are collecting is growing at an insane pace and they need people with the skills to find valuable insights. After all, if you don't do anything with your data, why collect it in the first place?

I have this wild idea that in 15 years data skills are going to be as mundane as Microsoft office skills are now.

But before we can begin to analyze the data, it needs to be retrieved (data integration) and saved somewhere (data storage). These are the first two foundational layers of my data pyramid. My goal in this post is to give you a sense of the tools used in these areas. That said, I am by no means an expert in this space. My jam is really at the top of the pyramid where data engineering, data analysis, and data science takes place. I joke that I'm just a taker of data because a lot of people put a lot of time and hard work into getting the data in a usable format for me. So, if it's not clear enough yet, this will be an overview of some of the terminology and technologies used in data integration and storage, by someone (me) who only dabbles.

Even though this isn't my bread-and-butter, I've found it helpful to understand the terminology so that I may be a better communicator with my colleagues and be a better user of data.

When it comes to data integration, we're really talking about getting data from it's source. Mechanisms for getting data might include:

  • an API (it's like a URL you can send a query to and get data back)
  • an FTP (similar to a folder on your computer where data gets dropped)
  • flat files (almost like FTP, except files get delivered via email or download)
  • web scraping (programmatically copying data from a webpage)

Next, we need to get the data into a format that's helpful for our needs. This step may require transformation and/or quality checks. It's for this reason many data integration teams may be called "data quality", "enterprise data management", or "ETL". ETL stands for extract, transform, and load... very fitting, don't you think? In my Lift Up MKE presentation, I use Gartner's magic quadrant to show the most popular tools in the data integration space.

Source: Informatica

A new term on the horizon these days is ELT. Can you guess what is stands for? Yup, extract, load, transform. 

The idea is that we're getting data so quickly, let's focus first on getting it and saving it somewhere and worry about how to transform it for our needs later. This will become a bit more clear in my next post about data engineering. Now, you may be wondering why data integration and data engineering are two separate layers in my pyramid, but you'll have to wait for my next post to learn my reasons. (Shameless plug: you can subscribe to this blog using the Subscribe section at the top of this post.)

Once we have the data in the format we want, it's time to store it! There are a variety of storage solutions, but the biggest ones you'll want to know are:

Source: Teradata

It's worth noting that some data may be created and stored by SaaS (software as a service) providers. These providers do not necessarily require that you retrieve the data to store it yourself. That's one their many benefits. You can generate data that is kept in the application to be used as needed on the fly. Think Salesforce or Google Analytics. I don't need to download data from Google Analytics on a monthly basis because whenever I need data about how my website is doing, I can go to Google Analytics and get the information I need whenever I need it. That's not always the case with data, but in some cases it is.

I think it's pretty clear that there's not one best solution. 

  • Each company (and sometimes every project) employs a different approach. Don't get overwhelmed. 
  • You're not going to need to know everything. Most organizations choose a few items to be the pillars of their data architecture. 
  • The idea is to become familiar with the terms and a general understanding of what happens to data before it gets to you
  • Data is retrieved from a data vendor, via a mechanism, using data integration tools.
  • It is then stored in some solution using data storage tools or SaaS.