Recently, I read an interesting research article titled “MapReduce: Simplified Data Processing on Large Clusters” written by Google employees Jeffrey Dean and Sanjay Ghemawat.
After reading the article, I summarized its key points, including the background, processes, and extension to be Apache Hadoop.
This article is taken from the original one on Medium. The Thai Article is available here.
First, what is MapReduce?
Table of Contents
MapReduce
MapReduce [1] is a programming paradigm used to process and manage big data that requires hundreds or thousands of servers.
The idea of proposing MapReduce originated from the need to process vast amounts of data such as searched documents and requested web data in Google.
In traditional programming, we can simply write a loop code to process all the data. However, when dealing with enormous amounts of data, writing a loop code is not practical as it would take a considerable amount of time to complete. In such cases, we need to consider parallel processing, data distribution, and failure management.
To address this problem, the developers of MapReduce were inspired by the Map and reduce functions from Lisp [2], a functional programming language. It has been observed that a majority of big data processing involves two key operations: Map and Reduce.
The Map operation processes each record to create intermediate key/value pairs, while the Reduce operation takes these intermediate pairs and combines them to generate the final output.
These processes have been designed in a way that enables developers to perform parallel processing and efficiently handle errors. This allows other developers to seamlessly integrate these operations into their own projects for streamlined and effective data processing.
Process
MapReduce employs a divide-and-conquer approach to processing massive datasets. It divides the data into smaller chunks and distributes them across multiple machines in a cluster.
Map
The Map function is an important step in MapReduce for processing large datasets. It takes the input key/value pairs and divides them into smaller chunks, typically ranging from 16 to 64MB, in order to distribute the data evenly among the computers in a cluster.
Once the data is distributed, the Map function (M) processes it in parallel on each node, generating intermediate key-value pairs that are then passed on to the next step in MapReduce.
We can summarize the Map function below
map (k1, v1) -> list(k2, v2)
where (k1, v1) is the input key/value pair and (k2, v2) is the intermediate key/value pair.
Shuffle
In the context of MapReduce, Shuffle represents a crucial intermediate step between the Map and Reduce functions.
During this phase, the data is partitioned into smaller chunks, sorted, and grouped based on their keys, before being passed on to the Reduce function for further processing.
The primary objective of the Shuffle phase is to minimize network bandwidth, data transfer, and input/output, which can significantly enhance the overall performance of MapReduce [3].
By optimizing data transfer and reducing the amount of unnecessary data exchanged between nodes, Shuffle plays a critical role in making MapReduce more efficient and scalable.
Reduce
Finally, the Reduce function (R) aggregates the intermediate data for each key and produces the final result.
We can summarize this process below
reduce (k2, list(v2)) -> list(v2)
where (k2, list(v2)) is the shuffled intermediate key/value pairs, and list(v2) is the final output.
In a MapReduce process, data goes through three phases: Map, Shuffle, and Reduce. Once this process is complete, the final output is saved in a single file per function.
Usually, there is no need to combine the output files into one because multiple files can be processed via another MapReduce function or a distributed processing framework that supports processing separated files. This provides flexibility in processing large amounts of data efficiently.
Example
The author of this research article provides one example is to count the number of words in one file by providing the below pseudocode.
map(String key, String value):
// key: document name
// value: document contents for each word w in value:
EmitIntermediate(w, "1");
reduce(String key, Iterator values):
// key: a word
// values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(AsString(result));
The Map function processes the input data and produces the intermediate key/value data which provides a word and a word count that the example provides only 1. The Reduce function then takes these intermediate results and aggregates them to produce the final output.
To better understand the concept of MapReduce, we can relate it to the population census conducted during the Roman Empire, which counted the population in each town and sent the results to the capital. This results in faster population counting than sending only one person [4].
Extension
Dug Cutting and Mike Cafarella extended MapReduce to build a robust data processing framework, which they named Apache Hadoop [5].
Hadoop is an open-source tool that enables efficient management and processing of large-scale data sets by distributing data across a cluster of interconnected computers.
By breaking down large data sets into smaller chunks and processing them in parallel, Hadoop dramatically enhances the speed and scalability of data processing tasks.
The invention of Hadoop marked a significant milestone in the history of big data processing. However, this tool has four limitations that make it less appealing to users [6,7].
- Firstly, Hadoop restricts the data processing to only one input per worker, which can increase the time needed to complete the process.
- Secondly, Hadoop is slower than other big data processing frameworks, such as Apache Spark, since it reads and writes data from the Harddisk as HDFS (Hadoop Distributed File System).
- Thirdly, Hadoop is not suitable for real-time data processing.
- Lastly, Hadoop has a steep learning curve as it only supports Java, while many users write SQL for data analysis.
As a result, many users have switched to Apache Spark and Apache Flink, which support both Stream and Batch data processing.
In fact, many companies prefer these tools over Hadoop. However, both tools still use HDFS as their data storage.
Summary
MapReduce is a widely-used data processing framework that enables the swift and efficient processing of voluminous datasets across a distributed cluster of nodes.
By dividing large datasets into smaller chunks and processing them in parallel, MapReduce allows the processing of massive datasets that would otherwise be impossible to handle using traditional programming techniques.
The above features of MapReduce. This leads to develop Apache Hadoop, which integrates MapReduce as an integral component. This serves as a popular entry point for big data processing.
However, it has several downsides, including slow data processing, a high learning curve, and a lack of support for real-time data processing. Due to these limitations, Apache Spark and Apache Flink are more commonly used in the industry compared to Hadoop.
If you found this content useful, press Clap, and share it to various social media platforms. Moreover, readers can follow me on Linkedin or X or Twitter.
Reference
- MapReduce: Simplified Data Processing on Large Clusters
- MapReduce: The programming model and practice — Google Research
- MAPREDUCE คืออะไร เข้าใจที่มา HADOOP และในด้าน DISTRIBUTED COMPUTING — Burasakorn Sabyeying (Mils)
- What is Apache MapReduce? — IBM)
- Hadoop MapReduce deep diving and tuning — Today Software Magazine
- The Good and the Bad of Hadoop Big Data Framework
- Road to Data Engineer 2.0 — DataTH School