MapReduce design patterns — Part 1

Shashank Baravani
3 min read · Aug 26, 2018

If you are even remotely familiar with the big data ecosystem, you will know that the simple paradigm of map and reduce solves a surprisingly large variety of problems. Here are a few classes of patterns that can be applied to MapReduce problems.

Summarization Patterns

  1. Counting with counters: This pattern applies when the intent is to summarize a very large data set through counters, e.g. word count, record count, counting a small number of unique instances, summations, etc. Using the framework's counters obviates the need for reducers altogether (a minimal sketch follows the performance note below).
  2. Numerical summarizations: Finding statistical details such as min/max, average, median, standard deviation, group aggregates, etc. satisfies this pattern. Implementation: the mapper emits each record keyed by its grouping field, and the reducer calculates the aggregate per group to get a top-level view of the large data set (second sketch below).
  3. Inverted Index: If the data set represents a forward index and the intent is to create an inverted index out of it, then this pattern applies. Implementation: simply iterate over the <Key,Value> pairs in the mapper and convert each into a series of <Value,Key> pairs. What you collect in the reducer will be an inverted index that can be readily written to a file (third sketch below).

Performance: Combiners can be used to gain performance benefits by reducing the volume of data transferred between mappers and reducers.
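Here is a minimal sketch of the counting-with-counters pattern, assuming plain text input where each line is one record; the class name RecordCountMapper and the counter names are hypothetical. The driver would set the number of reduce tasks to zero and read the counters from the completed job.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: nothing is emitted, only counters are incremented.
public class RecordCountMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    // Hypothetical counter names, for illustration only
    public enum Stats { TOTAL_RECORDS, EMPTY_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        context.getCounter(Stats.TOTAL_RECORDS).increment(1);
        if (value.toString().trim().isEmpty()) {
            context.getCounter(Stats.EMPTY_RECORDS).increment(1);
        }
        // No context.write(): the driver calls job.setNumReduceTasks(0) and, after
        // completion, job.getCounters().findCounter(Stats.TOTAL_RECORDS).getValue().
    }
}
```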
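Next, a sketch of the numerical-summarization pattern, assuming tab-separated input lines of the form groupKey<TAB>numericValue; the field layout and class names are assumptions. The mapper keys each record by its group and the reducer computes min, max and average per group.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NumericalSummarization {

    public static class GroupMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Text group = new Text();
        private final DoubleWritable measure = new DoubleWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) return;                 // skip malformed records
            group.set(fields[0]);
            measure.set(Double.parseDouble(fields[1]));
            context.write(group, measure);                 // key by the grouping field
        }
    }

    public static class StatsReducer extends Reducer<Text, DoubleWritable, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE, sum = 0;
            long count = 0;
            for (DoubleWritable v : values) {
                min = Math.min(min, v.get());
                max = Math.max(max, v.get());
                sum += v.get();
                count++;
            }
            // Output: min <TAB> max <TAB> average for each group
            context.write(key, new Text(min + "\t" + max + "\t" + (sum / count)));
        }
    }
}
```

Note that the reducer cannot simply be reused as a combiner here: min and max combine safely, but the average requires carrying the sum and count through separately.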
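Finally, a sketch of the inverted-index pattern, assuming input lines of the form docId<TAB>text (the layout is an assumption). The mapper inverts each <docId, term> pair into <term, docId>, and the reducer concatenates document ids into a postings list.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    public static class InvertMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text term = new Text();
        private final Text docId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            docId.set(parts[0]);
            for (String token : parts[1].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                term.set(token);
                context.write(term, docId);                // emit <Value, Key>
            }
        }
    }

    public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text id : docIds) {
                if (postings.length() > 0) postings.append(',');
                postings.append(id.toString());
            }
            context.write(key, new Text(postings.toString()));   // term -> postings list
        }
    }
}
```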

Filtering Patterns

  1. Simple filter: When the intent is to filter unwanted records out of the data set, this pattern applies. Examples: distributed grep, data cleansing, simple random sampling, removing low-scoring data, etc. (a sketch follows the performance note below).
  2. Distinct records: When the intent is to remove all duplicates, this pattern applies. Quite self-explanatory (second sketch below).
  3. Bloom Filter: When the intent is to retain records based on set membership, this pattern applies. Implementation: load the Bloom filter in each mapper (this can be done using the distributed cache), simply iterate over the <Key,Value> pairs, check for set membership and filter out records for which the test returns false (third sketch below).

Performance: For most filtering cases we can run map-only jobs, since the mappers do all of the heavy lifting. With zero reducers, both the shuffle/sort phase and the reduce phase are cut out, resulting in large performance gains.
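Here is a minimal sketch of the simple-filter pattern as a distributed grep, run as a map-only job; the configuration key grep.pattern is an assumption, set by the driver before submission.

```java
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private Pattern pattern;

    @Override
    protected void setup(Context context) {
        // "grep.pattern" is an assumed configuration key set in the driver
        pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", ".*"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (pattern.matcher(value.toString()).find()) {
            context.write(NullWritable.get(), value);   // keep only matching records
        }
    }
}
```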
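A sketch of the distinct-records pattern follows: the whole record becomes the map output key and the reducer emits each key exactly once. This is one filtering case that does need a reduce phase, and the same reducer class can be set as the combiner to cut shuffle volume.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Distinct {

    public static class RecordMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());   // the record itself is the key
        }
    }

    public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());     // one output line per distinct record
        }
    }
}
```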
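And a sketch of the Bloom filter pattern using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter. The serialized filter is assumed to have been added to the distributed cache by the driver (job.addCacheFile(...)); the file name bloomfilter.bin and the choice of the first tab-separated column as the membership field are assumptions.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class BloomFilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private final BloomFilter filter = new BloomFilter();

    @Override
    protected void setup(Context context) throws IOException {
        // Files placed in the distributed cache are available in the task's
        // working directory; "bloomfilter.bin" is an assumed file name.
        try (DataInputStream in = new DataInputStream(new FileInputStream("bloomfilter.bin"))) {
            filter.readFields(in);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assume the membership field is the first tab-separated column.
        String member = value.toString().split("\t", 2)[0];
        if (filter.membershipTest(new Key(member.getBytes()))) {
            context.write(NullWritable.get(), value);   // retain records that pass the test
        }
    }
}
```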

Data Organization Patterns

  1. Top M records: When the intent is to select a list of the top M records based on a ranking scheme, this pattern applies. Implementation: each mapper identifies its local top M records using an in-memory min/max heap; a single reducer then works through the K * M candidates (K being the number of mappers) to produce the final top M (a sketch follows the performance note below). This pattern is very useful for outlier analysis, selecting the most valuable data, building dashboards, etc. Performance: the sort phase will be very slow if M is in the millions; this pattern works best if M is of the order of tens or hundreds.
  2. Data structure conversion: This pattern applies when the intent is to convert a structured data set in the form of rows into a different form, say hierarchical, wherein multiple values for a key collapse into a single JSON payload. This is useful when a data set from one source needs to be converted into a format compatible with another store, e.g. HDFS to MongoDB. Another use case is pre-joining data from large data sets before the next stage in the pipeline runs (second sketch below).
  3. Partitioning: When the intent is to partition the data set (based on some strategy) into multiple smaller data sets to be fed into an external store, or to group data into relevant namespaces, then this pattern applies (see the custom partitioner sketch below). Implementation:
InputSplit --> IdentityMapper --> Partitioner --> Shuffle&Sort --> IdentityReducer --> PartA
InputSplit --> IdentityMapper --> Partitioner --> Shuffle&Sort --> IdentityReducer --> PartB
  • Partition pruning based on a continuous value: when the intent is to consume only a smaller subset, especially if the data is pivoted on a continuous variable such as date, temperature, etc.
  • Partition pruning based on a categorical value: when the intent is to categorize the data based on a super key such as country, company, etc.
  • Sharding: when data needs to be split to be ingested into specific shards of the underlying data store, such as Elasticsearch.

Performance: This pattern works well if the outcome is a small number of large files, and it is recommended to use a block-compressed SequenceFile format. Also, inverted indexes are particularly susceptible to hot spots in the index keys, since index keys are rarely evenly distributed; a custom partitioner can be used to avoid the resulting performance penalties.
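Below is a minimal sketch of the Top M pattern for M = 10, assuming input lines of the form recordId<TAB>score; the layout, the value of M and the class names are assumptions. Each mapper keeps its local top M in a TreeMap and emits it in cleanup(); a single reducer merges the K * M candidates.

```java
import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TopM {
    private static final int M = 10;   // assumed size of the top list

    public static class TopMMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private final TreeMap<Double, Text> topM = new TreeMap<>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) return;
            // Simplification: records with identical scores overwrite each other.
            topM.put(Double.parseDouble(fields[1]), new Text(value));
            if (topM.size() > M) topM.remove(topM.firstKey());   // evict the smallest
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Text record : topM.values()) {
                context.write(NullWritable.get(), record);        // emit local top M only
            }
        }
    }

    public static class TopMReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            TreeMap<Double, Text> topM = new TreeMap<>();
            for (Text value : values) {
                topM.put(Double.parseDouble(value.toString().split("\t")[1]), new Text(value));
                if (topM.size() > M) topM.remove(topM.firstKey());
            }
            for (Text record : topM.descendingMap().values()) {   // highest score first
                context.write(NullWritable.get(), record);
            }
        }
    }
}
```

The driver must configure exactly one reducer (job.setNumReduceTasks(1)) so that all K * M candidates meet in a single place.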
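Next, a sketch of the data-structure-conversion pattern: the reducer collapses all values for a key into a single JSON payload. The payload is assembled by hand to avoid a JSON library dependency (string escaping is omitted), and the userId<TAB>comment input layout is an assumption.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RowsToJson {

    public static class KeyValueMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            context.write(new Text(parts[0]), new Text(parts[1]));   // userId -> comment
        }
    }

    public static class JsonReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Collapse all values for this key into one hierarchical JSON document.
            StringBuilder json = new StringBuilder("{\"id\":\"")
                    .append(key.toString()).append("\",\"values\":[");
            boolean first = true;
            for (Text v : values) {
                if (!first) json.append(',');
                json.append('"').append(v.toString()).append('"');
                first = false;
            }
            json.append("]}");
            context.write(key, new Text(json.toString()));
        }
    }
}
```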
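Finally, a sketch of the partitioning pattern using a custom Partitioner that routes records to reduce partitions by year, so that each part file holds one year's data; the base year and the IntWritable year key are assumptions.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner extends Partitioner<IntWritable, Text> {
    private static final int BASE_YEAR = 2010;   // assumed earliest year in the data set

    @Override
    public int getPartition(IntWritable year, Text record, int numPartitions) {
        // One partition (and hence one output file) per year, wrapping around if
        // there are more years than configured reduce tasks.
        return Math.floorMod(year.get() - BASE_YEAR, numPartitions);
    }
}
```

In the driver, job.setPartitionerClass(YearPartitioner.class) together with job.setNumReduceTasks(...) wires this in; the mapper and reducer can both be identity implementations, matching the pipeline sketched above.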

We will cover the remaining patterns in part 2 of this blog. Thank you.
