What is Hadoop?
Apache™ Hadoop® is an open-source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
History of Hadoop:
It all started with the World Wide Web. As the web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results really were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting and Mike Cafarella. They wanted to invent a way to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided. The web crawler portion remained as Nutch. The distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
Features of Hadoop:
1.Open-source software. Open-source software is created and maintained by a network of developers from around the globe. It’s free to download, use and contribute to, though more and more commercial versions of Hadoop are becoming available.
2.Framework. In this case, it means that everything you need to develop and run software applications is provided – programs, connections, etc.
3.Massive storage. The Hadoop framework breaks big data into blocks, which are stored on clusters of commodity hardware.
4.Processing power. Hadoop concurrently processes large amounts of data using multiple low-cost computers for fast results.
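The block-based storage described in item 3 can be illustrated with a small Python sketch. This is not HDFS code: the 128 MB block size is an assumed value (a common HDFS default, but configurable), and the function names and the round-robin placement are hypothetical simplifications.

```python
# Assumed block size: 128 MB is a common HDFS default, but it is configurable.
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_round_robin(blocks, nodes):
    """Hypothetical placement: spread blocks evenly across the given node names."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement


if __name__ == "__main__":
    one_gb = 1024 ** 3
    blocks = split_into_blocks(one_gb)
    print(len(blocks))  # 8
    for node, held in assign_round_robin(blocks, ["node-a", "node-b", "node-c"]).items():
        print(node, len(held))
```

Because each node holds only some of the blocks, the processing described in item 4 can run on many blocks at once, one task per block.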
Benefits of Hadoop:
1.Computing power. Its distributed computing model quickly processes big data. The more computing nodes you use, the more processing power you have.
2.Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
3.Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And it automatically stores multiple copies of all data.
4.Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
5.Scalability. You can easily grow your system simply by adding more nodes. Little administration is required.
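The fault-tolerance benefit (item 3 above) rests on keeping several copies of each block on different nodes. The following is a minimal sketch of that idea, assuming three copies per block (a common HDFS default). The placement rule and function names are hypothetical; real HDFS placement is rack-aware and more involved.

```python
REPLICATION = 3  # assumed; configurable in a real cluster


def place_replicas(block_ids, nodes, replication=REPLICATION):
    """Hypothetical placement: put each block on `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive nodes, wrapping around, so the holders are always distinct.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def redirect_reads(placement, failed_nodes):
    """Pick, for every block, the first replica holder that is still alive."""
    chosen = {}
    for block, holders in placement.items():
        survivors = [node for node in holders if node not in failed_nodes]
        if not survivors:
            raise RuntimeError(f"block {block} lost: every replica holder failed")
        chosen[block] = survivors[0]
    return chosen


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_replicas(["b0", "b1", "b2", "b3"], nodes)
    # With n1 down, every block is still readable from another node.
    print(redirect_reads(placement, failed_nodes={"n1"}))
```

With three copies, any single node (or any two nodes) can fail without losing a block; data is lost only if every holder of some block fails at once.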
Components of Hadoop:
Currently, four core modules are included in the basic framework from the Apache Foundation:
- Hadoop Common – the libraries and utilities used by other Hadoop modules.
- Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.
- MapReduce – a software programming model for processing large sets of data in parallel.
- YARN – a resource management framework for scheduling and handling resource requests from distributed applications. (YARN is an acronym for Yet Another Resource Negotiator.)
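To make the MapReduce programming model concrete, here is a word count written as three plain Python functions. It runs in a single process and only imitates the map, shuffle and reduce steps. The function names are illustrative and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    # On a cluster, map tasks would run on many nodes at the same time;
    # here they simply run one after another.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
    # {'big': 2, 'clusters': 1, 'data': 1}
```

The map and reduce functions never share state, which is what lets a framework run them in parallel on separate machines.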
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include:
Pig – a platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.
Hive – a data warehouse system with a SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming. (It was initially developed by Facebook.)
HBase – a nonrelational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.
HCatalog – a table and storage management layer that helps users share and access data.
Ambari – a web interface for managing, configuring and testing Hadoop services and components.
Cassandra – a distributed database system.
Chukwa – a data collection system for monitoring large distributed systems.
Flume – software that collects, aggregates and moves large amounts of streaming data into HDFS.
Oozie – a Hadoop job scheduler.
Sqoop – a connection and transfer mechanism that moves data between Hadoop and relational databases.
Spark – an open-source cluster computing framework with in-memory analytics.
Solr – a scalable search tool that includes indexing, reliability, central configuration, failover and recovery.
ZooKeeper – an application that coordinates distributed processes.
What is Hadoop used for?
Hadoop has gone beyond its original goal of searching millions (or billions) of web pages and returning relevant results, and many organizations now look to it as their next big data platform. Popular uses today include:
Low-cost storage and active data archive. The modest cost of commodity hardware makes Hadoop useful for storing and combining data such as transactional, social media, sensor, machine, scientific, click streams, etc. The low-cost storage lets you keep information that is not deemed currently critical but that you might want to analyze later.
Staging area for a data warehouse and analytics store. One of the most prevalent uses is to stage large amounts of raw data for loading into an enterprise data warehouse (EDW) or an analytical store for activities such as advanced analytics, query and reporting, etc. Organizations are looking at Hadoop to handle new types of data (e.g., unstructured), as well as to offload some historical data from their enterprise data warehouses.
Data lake. Hadoop is often used to store large amounts of data without the constraints introduced by schemas commonly found in the SQL-based world. It is used as a low-cost compute-cycle platform that supports processing ETL and data quality jobs in parallel using hand-coded or commercial data management technologies. Refined results can then be passed to other systems (e.g., EDWs, analytic marts) as needed.
Sandbox for discovery and analysis. Because Hadoop was designed to deal with volumes of data in a variety of shapes and forms, it can run analytical algorithms. Big data analytics on Hadoop can help your organization operate more efficiently, uncover new opportunities and derive next-level competitive advantage. The sandbox approach provides an opportunity to innovate with minimal investment.
Recommendation systems. One of the most popular analytical uses by some of Hadoop’s largest adopters is for web-based recommendation systems. Facebook – people you may know. LinkedIn – jobs you may be interested in. Netflix, eBay, Hulu – items you may be interested in. These systems analyze huge amounts of data in real time to quickly predict preferences before customers leave the web page.
Getting data into Hadoop
Here are just a few ways to get your data into Hadoop.
Load files to the system using simple Java commands. HDFS takes care of making multiple copies of data blocks and distributing them across multiple nodes.
If you have a large number of files, a shell script that runs multiple “put” commands in parallel will speed up the process. You don’t have to write MapReduce code.
Create a cron job to scan a directory for new files and “put” them in HDFS as they show up. This is useful for things like downloading email at regular intervals.
Mount HDFS as a file system and copy or write files there.
Use Sqoop to import structured data from a relational database to HDFS, Hive and HBase. It can also extract data from Hadoop and export it to relational databases and data warehouses.
Use Flume to continuously load data from logs into Hadoop.
Use third-party vendor connectors (like SAS/ACCESS® or SAS Data Loader for Hadoop).
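The parallel "put" and directory-scan approaches above can be sketched in Python instead of a shell script. The sketch assumes the `hdfs` command-line client is installed and on the PATH. The function names and the `/data/incoming` target directory are hypothetical. The `run` parameter exists so the logic can be exercised without a cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_put_command(local_path, hdfs_dir):
    """Build the argument list for `hdfs dfs -put <local file> <HDFS directory>`."""
    return ["hdfs", "dfs", "-put", str(local_path), hdfs_dir]


def find_new_files(directory, already_loaded):
    """Return regular files in `directory` whose names are not in `already_loaded`."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.name not in already_loaded
    )


def put_in_parallel(paths, hdfs_dir, workers=4, run=subprocess.run):
    """Run one put per file, several at a time; `run` can be swapped for a dry run."""
    commands = [build_put_command(path, hdfs_dir) for path in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # check=True makes a failed upload raise instead of passing silently.
        return list(pool.map(lambda cmd: run(cmd, check=True), commands))


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    new_files = find_new_files(".", already_loaded=set())
    put_in_parallel(new_files, "/data/incoming", run=lambda cmd, check: print(" ".join(cmd)))
```

A scheduler such as cron could call a script like this at regular intervals, passing in the names of files it has already uploaded so that only new arrivals are sent.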
What are the challenges of using Hadoop?
MapReduce programming is not a good match for all problems. It’s good for simple information requests and problems that can be divided into independent units, but it’s not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing.
There’s a widely acknowledged talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That’s one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills. And, Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.
Data security. Another challenge centers on fragmented data security, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step forward for making Hadoop environments secure.
Full-fledged data management and governance. Hadoop does not have easy-to-use, full-feature tools for data management, data cleansing, governance and metadata. Especially lacking are tools for data quality and standardization.
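To illustrate the first challenge in this list, that iterative algorithms need many map-shuffle/sort-reduce passes, here is a single-process Python sketch. It labels the connected nodes of a small graph by repeatedly passing each node's smallest known label to its neighbours. The algorithm choice and function names are illustrative assumptions. On a real cluster, each pass would be a separate job that writes its output to files for the next job to read.

```python
from collections import defaultdict


def run_job(labels, edges):
    """One full map -> shuffle -> reduce pass over the whole graph."""
    # Map: every node sends its current label to itself and to its neighbours.
    emitted = []
    for node, label in labels.items():
        emitted.append((node, label))
        for neighbour in edges.get(node, []):
            emitted.append((neighbour, label))
    # Shuffle: group the emitted labels by receiving node.
    grouped = defaultdict(list)
    for node, label in emitted:
        grouped[node].append(label)
    # Reduce: each node keeps the smallest label it received.
    return {node: min(received) for node, received in grouped.items()}


def label_components(edges):
    """Repeat the job until labels stop changing; return (labels, jobs_run)."""
    labels = {node: node for node in edges}
    jobs_run = 0
    while True:
        new_labels = run_job(labels, edges)
        jobs_run += 1
        if new_labels == labels:
            return labels, jobs_run
        labels = new_labels


if __name__ == "__main__":
    chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(label_components(chain))  # ({1: 1, 2: 1, 3: 1, 4: 1}, 4)
```

Even this four-node chain needs four complete passes, because information moves only one hop per pass. That per-pass cost is what makes plain MapReduce a poor fit for iterative analytics.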