Big Data Fundamentals
BigData Training Architect II in Content
Welcome to the course! Let's quickly go over what we will cover in this course, who it is for, and what you should already know.
About the Training Architect
Get to know a little bit about me, the training architect. I look forward to sharing what I know about big data with you!
What Is Big Data?
Let's clarify one very important thing: What is big data? We hear the term used a lot, but what exactly is it — and what is it _not_?
Types of Data
Understanding the three main types of data will help you understand the complexity aspect of big data. Essentially, big data isn't just _a lot_ of data — it's also a lot of _complex_ data that cannot otherwise be stored in traditional databases. So, what makes data complex?
The original three V's are a nice way to characterize data as they apply to any data set. Over the years, the "V's" have become 4, 5, 6, 7, and now 10+. Let's discuss the three V's and also a few of the less silly ones added over the years to help us understand the various characteristics of big data.
A big topic in the big data industry is machine learning. Being under the artificial intelligence umbrella, machine learning is often talked about as if it's magic. So, let's talk about what machine learning _actually_ is so we can better understand it and how it works at a high level.
Data Science and Analytics
It's common to hear data science and analytics referred to as being nearly the same thing or completely different practices with no relation. Let's clarify what data science and data analytics are so we can better compare and contrast these two practices, which could be considered two sides of the same coin.
Sometimes the hardest part of building a big data solution is collecting and processing the data in a way that is reliable and backpressure-tolerant. Here, we offer a high-level overview of what most big data pipelines should look like.
How do we ask a mountain of data a complex question? We break up the mountain! Let's take a look at how we can horizontally scale computing power to ask big data questions no matter how large the data set is.
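As a rough illustration of "breaking up the mountain," here is a minimal sketch that splits a data set into chunks, answers the question on each chunk separately, and then combines the partial answers. The function names are hypothetical, and local threads merely stand in for the separate machines a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Split a list into roughly equal contiguous chunks, one per worker."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def partial_sum(chunk):
    """The 'question' each worker answers about its own slice of the data."""
    return sum(chunk)


def distributed_sum(records, num_workers=4):
    """Fan the chunks out to workers, then combine their partial answers."""
    chunks = split_into_chunks(records, num_workers)
    # Assumption: threads stand in for cluster nodes in this toy example.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


total = distributed_sum(list(range(1, 1001)))
```

Adding more workers shortens each chunk rather than requiring a bigger machine, which is the essence of scaling horizontally.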
How do you cost-effectively scale data storage for an ever-growing data set? How do you store data sets that don't fit on a single server? By distributing storage horizontally, we can create a storage solution that is easy and cheap to scale, stores as much as we need it to, and has increased performance over monolithic storage solutions.
In the same way a computer has to manage all the processes and activities on a single machine, a distributed system needs to manage all the nodes and activities of a cluster of machines. Let's talk more about how cluster management works and why management nodes are the most important nodes in any distributed system.
Hadoop has been around for a long time and was popularized very early on. This means it grabbed a large chunk of the big data market and is now a ubiquitous big data storage solution. So, let's take a look at this technology and get a better idea of exactly what it is, how it's used, and how it works.
MapReduce was popularized by Google, which used it to rebuild its search index, and Hadoop's implementation was built to work right on top of HDFS. Let's take a look at this popular data analytics technique to better understand how it works.
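To make the technique a little more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using the classic word-count example. It does not use Hadoop or any of its APIs. The function names are hypothetical, and a real job would run each phase in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return (key, sum(values))


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across as many machines as the data requires.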
Moving on to a newer big data technology, Spark is a very popular and easy-to-use data analytics solution that works with just about any other big data technology you can imagine. Its ease of use, wide compatibility, and impressive performance have made it the modern-day go-to for data analytics. Let's take a closer look at what it is and how it is used today.
Elasticsearch is an impressively flexible technology that is quickly growing in popularity. Its speed, wide compatibility, ease of use, and ability to adapt to a large variety of use cases make it a popular choice in the big data industry. Let's take a closer look at how it is so flexible and get a better understanding of the various ways it is used today.
If you need somewhere to store your data for analysis and search, then HDFS or Elasticsearch are great candidates. However, what do you use when you want to store data in the order in which it was generated, so it can be consumed in that same order? Kafka, among other things, is a great data streaming application that solves a lot of big data pipeline problems with its ability to function as a fully distributed multi-tenant data streaming platform.
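The idea of consuming data in the order it was generated can be sketched with a toy append-only log in which each consumer tracks its own read position. This is not Kafka's API. The class and method names are hypothetical, and the model ignores partitions, replication, and persistence.

```python
class OrderedLog:
    """A toy in-memory, append-only log with per-consumer offsets."""

    def __init__(self):
        self._records = []   # records are kept in arrival order
        self._offsets = {}   # consumer name -> index of next record to read

    def append(self, record):
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return the next batch for this consumer, in append order."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = OrderedLog()
for event in ["login", "click", "purchase"]:
    log.append(event)

first_batch = log.poll("analytics", max_records=2)
second_batch = log.poll("analytics", max_records=2)
billing_batch = log.poll("billing")
```

Because each consumer keeps its own offset, several independent readers can share the same log without interfering with one another, which loosely mirrors the multi-tenant behavior described above.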
NoSQL databases are quickly becoming a popular alternative to traditional databases due to their ability to handle semi-structured data through the use of JSON-like documents. This gives developers much more flexibility when designing their applications' data back-ends. Let's take a look at this modern database solution to get a better idea of how it solves problems that traditional databases can't.
Internet of Things (IoT)
IoT devices are probably the largest producers of big data today. So let's talk a bit more about what these devices entail.
Whether we realize it or not, big data plays a major role in our online shopping experience and also in how online retailers manage their business. Let's take a closer look at how big data is used with e-commerce.
Whether it's for collecting your health information, monitoring your vitals, or finding better treatments, the use of big data in healthcare is giving a notable boost to how we fight illnesses and provide better patient care. Let's take a look at a few examples.
Big data is starting to make its way into education in a variety of ways. Whether it's helping people decide on a career path, strengthening student success, or finding new ways to prevent and detect cheating, big data is improving the education industry.
The finance industry's use of big data is growing rapidly. Whether it's preventing unauthorized purchases, calculating financial risk, or trading stocks faster and more effectively than any human, big data is making waves in finance.
Informing critical decisions is at the center of business intelligence. Companies have to spend huge amounts of money in the most cost-effective way possible with the goal of attaining the largest ROI. All of this is a tall order without being able to use your company's data to help you make the right decisions.
IT departments generate huge amounts of data every second. From logs to resource utilization metrics or network events, all this data can seem useless at first glance — but when collected, mined, and analyzed, this data can be used to troubleshoot issues and provide state-of-the-art monitoring.
Congratulations on completing the course! Now let's talk about some next steps to continue your data learning journey.