A basic version of the MapReduce algorithm consists of the following steps:
1. Use a mapper function to turn each item into zero or more key-value pairs.
(Often this is called the map function, but there is already a Python function
called map and we don’t need to confuse the two.)
2. Collect together all the pairs with identical keys.
3. Use a reducer function on each collection of grouped values to produce output
values for the corresponding key.
The most widely used MapReduce system is Hadoop, which itself merits many
books. There are various commercial and noncommercial distributions and a
huge ecosystem of Hadoop-related tools.
In order to use it, you have to set up your own cluster (or find someone to let you
use theirs), which is not necessarily a task for the faint-hearted. Hadoop mappers
and reducers are commonly written in Java, although there is a facility known as
“Hadoop streaming” that allows you to write them in other languages (including
Python).
1. Use a mapper function to turn each item into zero or more key-value pairs.
(Often this is called the map function, but there is already a Python function
called map and we don’t need to confuse the two.)
2. Collect together all the pairs with identical keys.
3. Use a reducer function on each collection of grouped values to produce output
values for the corresponding key.
The most widely used MapReduce system is Hadoop, which itself merits many
books. There are various commercial and noncommercial distributions and a
huge ecosystem of Hadoop-related tools.
In order to use it, you have to set up your own cluster (or find someone to let you
use theirs), which is not necessarily a task for the faint-hearted. Hadoop mappers
and reducers are commonly written in Java, although there is a facility known as
“Hadoop streaming” that allows you to write them in other languages (including
Python).