So, What is All the Fuss about this Hadoop thingy?!

--

Apache Hadoop can efficiently store and process large datasets, ranging in size from gigabytes to petabytes.

Introduction

It’s open-source (which is more than trendy nowadays), it has an enthusiastic and supportive community (as have other key open-source projects), and it’s both reliable and scalable (big deal, yawn…)

So what exactly is it about Apache Hadoop that makes it more than a practical framework that can aggregate, manipulate, store, analyse and process Big Data sets? (And I do mean big data sets here, as in petabytes of data.)

Quite simply:

  • Hadoop uses a highly efficient distributed processing model (MapReduce) that was originally devised by Google in the early 2000s,
  • This distributed processing model can achieve exceptional processing capability,
  • It does so at impressive processing speeds, and
  • All at a comparatively low cost against competing market offerings.

Ok, before we move any further, let’s get the key definitions out of the way…

Definition

  • The essence of Apache Hadoop is a collection of open-source software modules and utilities that manage a network of many computers to solve problems involving tremendous amounts of data and computation.
  • It delivers a software framework for distributed storage and processing of big data using the MapReduce programming model.
  • All the modules within Hadoop are designed with high availability in mind: the framework assumes that hardware failures can and will occur, and recovers from them automatically.

Composition of the Hadoop Framework:

  • Hadoop Common: The common utilities that support the other Hadoop framework modules,
  • Hadoop Distributed File System (HDFS): A distributed file system providing high-throughput access to application data,
  • Hadoop YARN: A framework for job scheduling and cluster resource management,
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
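The MapReduce model behind that last module can be illustrated with the classic word-count example. Below is a minimal, pure-Python sketch of the map → shuffle → reduce phases, purely to show the idea; it is not real Hadoop code (a production job would be written against the Hadoop Java API or submitted via Hadoop Streaming, and the map and reduce tasks would run in parallel across the cluster):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key (Hadoop does this across the cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for each word."""
    return (key, sum(values))

# Each string stands in for one input split on one data node.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # "the" appears once in each of the three documents: 3
```

The power of the model is that the map and reduce functions are independent per key, so Hadoop can scatter them across thousands of machines and move the computation to where the data lives.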

Who Out There is Currently Using Apache Hadoop?

When researching this article, I was amazed by the number of companies that currently use Hadoop for either educational or production purposes. Large enterprises such as Adobe, eBay, Meta (Facebook), Google, IBM, The New York Times, Twitter, and Yahoo are just a few that are current users of the Apache Hadoop framework.

There is even a separate listing of those commercial companies that offer services on, or based around, Apache Hadoop.

The complete list of current users is available here, and in the true open-source spirit, this list is edited and maintained by the actual end-users themselves (with a little oversight and help from the Apache Hadoop project as well…)

Some commentators out in the ether have made the (valid) point that Hadoop HDFS is slowly being superseded by AWS S3 and Azure Storage, as most companies no longer want to host their own data lakes due to the considerable costs involved.

Not only can Hadoop be deployed in a traditional on-prem datacentre, it can also run in the cloud. The cloud option allows organizations to deploy Hadoop without acquiring hardware or specific setup expertise, enabling them to build their own projects in a cloud-based open-source ecosystem.

Nov 2021: Azure HDInsight is a managed, full-spectrum service that can use the Hadoop framework.

Is Hadoop a Skill Worth Learning in 2022?

In one word — Absolutely.

And this is not just due to the number, and stature, of the companies that comprise its current userbase.

Prudent insight here will depend on precisely what you want to do with your data career. If you are going to move into ‘Data Science’ and use Machine Learning, Artificial Intelligence, and Deep Neural Network techniques, then knowing the basics is more than worthwhile. However, if you plan to become a data engineer working on the infrastructure that supports large data pipelines and data modelling, then Hadoop is well worth studying in depth.

Couple all of what has been said in this article with the “exponential growth of the big data market” that prime industry analysts have been banging on about of late and, well, you get the picture…

Final Thoughts

The advent of Spark has enhanced the Hadoop ecosystem, and with the Java language at its core, Hadoop is continually evolving. As companies realize the benefits big data can deliver to their business, adoption will grow, and the associated data processing skills will be in even greater demand in the resource marketplace.

I hope you all enjoyed reading this article as much as I enjoyed writing it. Let me know if you have any thoughts or questions, and I would be more than happy to get back to you. Be good and be safe…

Image courtesy of Celpax on Unsplash

--


Paul Chambiras https://freelance-writer.online

I am a freelance writer on all things Business, DIY, Sport, Technology, IT and Management.