
Big Data Fundamentals



Myles Young

BigData Training Architect II in Content

I am a father and husband with a passion for tech. I have large-scale enterprise IT experience in network security, agile development, middleware, QA, system reliability engineering, and data infrastructure engineering. I have worked in DevOps for most of my IT career with a focus on using automation and big data technologies for operational analytics and log aggregation to further support CI/CD pipelines. I have a great appreciation for distributed systems and finding non-obvious answers in mountains of data. I am excited to be working at Linux Academy where I get to share what I've learned with our awesome students!







Course Details

If you're completely new to big data and aren't quite sure what it is, why it's necessary, and how it works, then this is the course for you! We are going to clarify what big data is (and isn't), while also defining some related terms around data characterization and analysis methods. Then, we will talk about some architectural problems with big data and how we solve them with cluster computing, distributed storage, and cluster management. Lastly, we will cover some of the popular technologies and illustrate how big data is used in the real world, showing how it is already impacting your daily life, whether you realize it or not. Let's get started!


Getting Started

Course Introduction


Lesson Description:

Welcome to the course! Let's quickly go over what we will cover in this course, who it is for, and what you should already know.

About the Training Architect


Lesson Description:

Get to know a little bit about me, the training architect. I look forward to sharing what I know about big data with you!

What Is Big Data?


Lesson Description:

Let's clarify one very important thing: What is big data? We hear the term used a lot, but what exactly is it — and what is it not?


Types of Data


Lesson Description:

Understanding the three main types of data will help you understand the complexity aspect of big data. Essentially, big data isn't just a lot of data — it's also a lot of complex data that cannot otherwise be stored in traditional databases. So, what makes data complex?

The V's


Lesson Description:

The original three V's are a nice way to characterize data as they apply to any data set. Over the years, the "V's" have become 4, 5, 6, 7, and now 10+. Let's discuss the three V's and also a few of the less silly ones added over the years to help us understand the various characteristics of big data.

Machine Learning


Lesson Description:

A big topic in the big data industry is machine learning. Being under the artificial intelligence umbrella, machine learning is often talked about as if it's magic. So, let's talk about what machine learning actually is so we can better understand it and how it works at a high level.

Data Science and Analytics


Lesson Description:

It's common to hear data science and analytics referred to as being nearly the same thing or completely different practices with no relation. Let's clarify what data science and data analytics are so we can better compare and contrast these two practices, which could be considered two sides of the same coin.


Data Ingestion


Lesson Description:

Sometimes the hardest part of building a big data solution is collecting and processing the data in a way that is reliable and backpressure tolerant. Here, we offer a high-level overview of what most big data pipelines should look like.
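
To make "backpressure tolerant" a little more concrete, here is a minimal sketch of one common approach: a bounded buffer between a producer and a consumer. The function name `run_pipeline`, the event names, and the buffer size are all made up for illustration, and a real pipeline would use a dedicated messaging system rather than an in-process queue.

```python
import queue
import threading

def run_pipeline(records, buffer_size=2):
    """Move records from a producer to a consumer through a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)  # bounded: put() blocks when full
    processed = []
    done = object()  # sentinel marking the end of the stream

    def producer():
        for record in records:
            # Blocks while the buffer is full, so a slow consumer
            # automatically slows the producer down (backpressure).
            buffer.put(record)
        buffer.put(done)

    def consumer():
        while True:
            record = buffer.get()
            if record is done:
                break
            processed.append(record.upper())  # stand-in for real processing

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

result = run_pipeline(["login", "click", "purchase", "logout"])
```

Because the buffer holds at most two records here, the producer can never run far ahead of the consumer, which is the property the lesson description is pointing at.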

Parallel Computing


Lesson Description:

How do we ask a mountain of data a complex question? We break up the mountain! Let's take a look at how we can horizontally scale computing power to ask big data questions no matter how large the data set is.
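
As a rough sketch of "breaking up the mountain", the example below splits a small data set into chunks, asks each worker the same question about its chunk, and then combines the partial answers. The helper names and the sample log lines are hypothetical, and threads on one machine stand in for what would be separate nodes in a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Split a list into at most `parts` chunks of roughly equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]

def count_errors(chunk):
    """The 'question' each worker answers for its own chunk."""
    return sum(1 for line in chunk if "ERROR" in line)

def parallel_count(lines, workers=4):
    chunks = split(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_errors, chunks))
    return sum(partials)  # combine the partial answers

logs = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]
total = parallel_count(logs, workers=2)
```

Adding more workers changes how the data is split, not the final answer, which is what lets this pattern scale horizontally.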

Distributed Storage


Lesson Description:

How do you cost-effectively scale data storage for an ever-growing data set? How do you store data sets that don't fit on a single server? By distributing storage horizontally, we can create a storage solution that is easy and cheap to scale, stores as much as we need it to, and has increased performance over monolithic storage solutions.
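
One way to picture distributed storage is as a rule that decides which nodes hold each block of a file. The sketch below is a simplified, hypothetical placement scheme (hash the block ID, then pick consecutive nodes for replicas); it is not the placement policy of any particular product, and the node and block names are invented.

```python
import hashlib

def place_block(block_id, nodes, replicas=2):
    """Choose `replicas` distinct nodes for a block using a stable hash."""
    digest = hashlib.sha256(block_id.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    # Walk around the node list so each replica lands on a different node.
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

nodes = ["node-a", "node-b", "node-c"]
blocks = ["file1-block0", "file1-block1", "file2-block0"]
placement = {block: place_block(block, nodes) for block in blocks}
```

Keeping more than one copy of each block means the data can survive the loss of a node, and spreading blocks across nodes lets several machines serve reads at once.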

Cluster Management


Lesson Description:

In the same way a computer has to manage all the processes and activities on a single machine, a distributed system needs to manage all the nodes and activities of a cluster of machines. Let's talk more about how cluster management works and why management nodes are the most important nodes in any distributed system.




Lesson Description:

Hadoop has been around for a long time and was popularized very early on. This means it grabbed a large chunk of the big data market and is now a ubiquitous big data storage solution. So, let's take a look at this technology and get a better idea of exactly what it is, how it's used, and how it works.



Lesson Description:

MapReduce was popularized by Google, which used it to rebuild its search index, and Hadoop's implementation was built to work right on top of HDFS. Let's take a look at this popular data analytics technique to better understand how it works.
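
The classic way to illustrate the MapReduce technique is a word count. The sketch below runs the map, shuffle, and reduce phases in a single Python process; it does not use Hadoop's API, and the function names are chosen only for this illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data is everywhere"])
```

In a real cluster, each document (or block of a file) would be mapped on the node that stores it, and the shuffle would move the grouped pairs across the network to the reducers.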



Lesson Description:

Moving on to a newer big data technology, Spark is a very popular and easy-to-use data analytics solution that works with just about any other big data technology you can imagine. Its ease of use, wide compatibility, and impressive performance have made it the modern-day go-to for data analytics. Let's take a closer look at what it is and how it is used today.



Lesson Description:

Elasticsearch is an impressively flexible technology that is quickly growing in popularity. Its speed, wide compatibility, ease of use, and ability to adapt to a large variety of use cases make it a popular choice in the big data industry. Let's take a closer look at how it is so flexible and get a better understanding of the various ways it is used today.



Lesson Description:

If you need somewhere to store your data for analysis and search, then HDFS or Elasticsearch are great candidates. However, what do you use when you want to store data in the order in which it was generated, so it can be consumed in that same order? Kafka, among other things, is a great data streaming application that solves a lot of big data pipeline problems with its ability to function as a fully distributed multi-tenant data streaming platform.
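
The idea of storing data in the order it was generated, so it can be consumed in that same order, can be sketched as an append-only log where every record gets an offset and each consumer tracks its own position. The `OrderedLog` class below is a toy, in-memory stand-in invented for this example; it is not Kafka's client API.

```python
class OrderedLog:
    """A toy append-only log: records keep the order in which they arrived."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to `max_records` records starting at `offset`."""
        return self._records[offset:offset + max_records]

log = OrderedLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# Each consumer remembers its own offset, so several independent
# readers can share one log without interfering with each other.
offsets = {"billing": 0, "analytics": 0}
billing_batch = log.read(offsets["billing"], max_records=2)
offsets["billing"] += len(billing_batch)
analytics_batch = log.read(offsets["analytics"])
```

Reading never removes anything from the log, which is why a slower consumer can catch up later and still see every record in its original order.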



Lesson Description:

NoSQL databases are quickly becoming a popular alternative to traditional databases due to their ability to handle semi-structured data through the use of JSON-like documents. This gives developers much more flexibility when designing their applications' data back-ends. Let's take a look at this modern database solution to get a better idea of how it solves problems that traditional databases can't.
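
To show what "semi-structured" JSON-like documents look like in practice, the sketch below keeps a few documents with different fields in one collection and queries them. Plain Python dictionaries stand in for a document database here, and the product data and helper names are invented for the example.

```python
import json

# Documents in the same collection do not have to share a schema.
products = [
    {"_id": 1, "name": "laptop", "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "t-shirt", "sizes": ["S", "M", "L"]},
    {"_id": 3, "name": "phone", "specs": {"ram_gb": 8}, "colors": ["black"]},
]

def find(collection, field, value):
    """Return documents whose top-level `field` equals `value`."""
    return [doc for doc in collection if doc.get(field) == value]

def find_with_field(collection, field):
    """Return documents that contain `field` at all."""
    return [doc for doc in collection if field in doc]

with_specs = find_with_field(products, "specs")
serialized = json.dumps(products[1])  # each document maps directly to JSON
```

A relational table would need every row to fit the same columns, whereas here the t-shirt can carry `sizes` and the laptop can carry `specs` without any schema change.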

Use Cases

Internet of Things (IoT)


Lesson Description:

IoT devices are probably the largest producers of big data today. So let's talk a bit more about what these devices entail.



Lesson Description:

Whether we realize it or not, big data plays a major role in our online shopping experience and also in how online retailers manage their business. Let's take a closer look at how big data is used with e-commerce.



Lesson Description:

Whether it's for collecting your health information, monitoring your vitals, or finding better treatments, the use of big data in healthcare is giving a notable boost to how we fight illnesses and provide better patient care. Let's take a look at a few examples.



Lesson Description:

Big data is starting to make its way into education in a variety of ways. Whether it's helping people decide on a career path, strengthening student success, or finding new ways to prevent and detect cheating, big data is improving the education industry.



Lesson Description:

The finance industry's use of big data is growing rapidly. Whether it's protecting against unauthorized purchases, calculating financial risk, or trading stocks faster and more effectively than any human, big data is making waves in finance.

Business Intelligence


Lesson Description:

Informing critical decisions is at the center of business intelligence. Companies have to spend huge amounts of money in the most cost-effective way possible with the goal of attaining the largest ROI. All of this is a tall order without being able to use your company's data to help you make the right decisions.

Information Technology


Lesson Description:

IT departments generate huge amounts of data every second. From logs to resource utilization metrics or network events, all this data can seem useless at first glance — but when collected, mined, and analyzed, this data can be used to troubleshoot issues and provide state-of-the-art monitoring.

Final Steps

What's Next?


Lesson Description:

Congratulations on completing the course! Now let's talk about some next steps to continue your data learning journey.
