Data Foundations: Data Science

This series of blog posts, which began back in April, attempts to summarize the key terms, technologies, and skill sets necessary for someone entering the field of analytics. You can find previous posts here, here, here, and here. In this fifth and final post, I will cover a few basic data science concepts. You can find the slides from my original presentation here.

Let's start with what data science is!

Several terms out there get used almost interchangeably: artificial intelligence, machine learning, and data science. While they are related, and there is some overlap, they are not the same thing. I particularly like these definitions and visuals to describe the relationships between these terms.

For me, the phrase "practical application" is what makes data science different from the others. The goal of data science is to combine machine learning algorithms with business knowledge to solve a real-world problem. There are lots of ways you might leverage data science, and there is no shortage of blog posts and articles that describe the various options. Instead of discussing the possibilities, I'm going to focus on how this field fits into the rest of the data world.

The necessary skills

There is a lot of overlap between the skills of a data scientist and those of an analyst, data engineer, and even a visualization expert. The demands placed on a data scientist may vary depending on the needs of the team they're on, so don't take these skills as a firm requirement. Instead, one of the best skills that a data scientist could have is the ability to learn new techniques quickly.

A data scientist should probably know or dabble with the following:
  • SQL (to get the data from wherever it is stored)
  • APIs (to pull data from web services and other applications)
  • Data engineering (to clean, enrich, or further prepare the data)
  • Python (a programming language)
  • R (a programming language)
  • Mathematics (to understand what is happening in algorithms)
  • Visualization (to communicate the results of an analysis)
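
To make the first items on that list a little more concrete, here is a minimal sketch of using SQL from Python to get data out of storage and into an analysis. It relies only on Python's built-in sqlite3 module. The orders table, its rows, and the load_order_totals helper are all made up for illustration; a real project would connect to whatever database your team actually uses.

```python
import sqlite3
from statistics import mean


def load_order_totals(conn, min_total):
    """Return (customer, total) rows whose total is at least min_total."""
    query = (
        "SELECT customer, total FROM orders "
        "WHERE total >= ? ORDER BY total DESC"
    )
    return conn.execute(query, (min_total,)).fetchall()


# Build a tiny in-memory database. The table and its rows are invented
# purely for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 120.0), ("ben", 35.5), ("cy", 80.0), ("dee", 15.0)],
)
conn.commit()

# SQL does the filtering and sorting; Python takes over for the math.
rows = load_order_totals(conn, min_total=50)
average_total = mean(total for _, total in rows)

print(rows)
print(average_total)
```

The split shown here is a common one: let the database narrow the data down, then do the analysis in Python (or R) on the smaller result.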

Where it fits into the data stack

You'll notice that a data scientist needs to know some of the basics from the other roles we've covered in the data tech stack. Many times a data scientist needs data that isn't readily available (because they're working on a new or niche project), so they need to be able to get and manipulate data themselves. Additionally, in the lifecycle of a data science project, there is a step called Exploratory Data Analysis (EDA), which is essentially a special kind of data visualization.
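
As a rough idea of what a first pass at EDA can look like, here is a small sketch that summarizes a column of numbers and draws a text histogram. It uses only Python's standard library. The ages list and the summarize and text_histogram helpers are hypothetical; in practice you would more likely reach for a dedicated data or plotting library.

```python
from collections import Counter
from statistics import mean, median, stdev


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def text_histogram(values, bin_width):
    """Group values into fixed-width bins and draw one '#' per value."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [
        f"{start:>3}-{start + bin_width - 1:<3} {'#' * bins[start]}"
        for start in sorted(bins)
    ]


# Made-up sample data, standing in for one column of a real dataset.
ages = [23, 25, 31, 35, 35, 41, 44, 52, 58, 67]

summary = summarize(ages)
histogram = text_histogram(ages, bin_width=10)

for name, value in summary.items():
    print(name, value)
for line in histogram:
    print(line)
```

Even a summary this small answers useful questions before any modeling starts: how much data there is, what range it covers, and whether it clusters anywhere.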

Many of the first steps in a data science project are a lightweight version of the data pyramid. That can be confusing: why have all these separate roles if a data scientist does it all anyway? That's a great question! For starters, a data scientist is likely not a specialist in getting and storing data, or even in designing reports and dashboards. They're most likely working on a very specific problem with specific data needs, so they need just enough of those skills to get what they need. By contrast, people who play roles in each of the other areas of the data pyramid do so for a much wider audience and purpose.

Time to get started, then!

Not so fast. It can be easy to see and hear about data science and think that's where you need to be, but in order to be successful in this space, you need a good foundation in all the other layers. A good reference when you are considering your data path is the analytical maturity curve. The goal of an organization is to move up the curve, but you can't get there by skipping over the first steps. Data visualization covers the descriptive and diagnostic sections. Your data needs to be explored and deeply understood. Once that is complete, you begin to move on to the predictive and prescriptive sections, which is where data science comes into play.
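
To illustrate the jump from the descriptive end of the curve to the predictive end, here is a small sketch using made-up monthly sales. The descriptive step reports what already happened (an average), while the predictive step fits a straight line to the same numbers and extends it one month ahead. The data and the fit_line helper are assumptions for illustration, and a simple least-squares line is only one of many possible models.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Invented data: six months of sales.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

# Descriptive: what happened?
average_sales = mean(sales)

# Predictive: what is likely to happen next, if the trend holds?
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept

print(average_sales)
print(forecast_month_7)
```

Notice that the prediction is only as trustworthy as your understanding of the data behind it, which is exactly why the earlier steps on the curve can't be skipped.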

Recommendations for getting started with data science

  • Learning about data science can feel like an uphill battle. With every new thing learned, there are three more things you need to research. You will never know everything. Even in the data science space, people have specialties: natural language processing, deep learning, time series forecasting, etc. Do not get discouraged. Everything you learn is valuable!
  • This book does a great job and gives some examples you can do within Excel. You'll begin to understand some of the mathematical concepts necessary to continue.
  • It can be confusing for non-technical people to install a programming language like R or Python, especially if you just want to dabble. These applications can be a huge barrier to entry. That's why I found DataCamp so valuable! I got to learn more about R and then Python through them. You can learn it all right from the browser. Sign up using this link for $20 off.
  • Speaking of languages, it can be easy to get overwhelmed and try to do too much right away. My advice? Pick a single language and get comfortable with that first. I used to be a fan of R (and it's still great, especially for time series analysis), but these days I'm a Python gal.
  • I've been really into deep learning lately and Python makes it so easy. This book made the concept easy to understand.
  • Oh, and if you need any other book references, please reach out. I'm happy to provide my thoughts on books, websites, and tools that made my life easier.
  • Finally, if you're really serious, you could always pursue a master's degree. This is the program I am finishing this fall!