Some time back a colleague and I were discussing our respective MapReduce frameworks. Although very different, they both aimed to simplify the mundane task of configuring Jobs through the standard Hadoop boilerplate. “I’ve always thought it would be great to use annotations for developing MapReduce jobs,” he said. I must admit the thought had never occurred to me - but it certainly piqued my interest.
Could it be done? And more importantly, would it actually make things easier, faster, and more fun?
The primary design goal is simple: eliminate as many of the boilerplate configuration method calls as possible, and replace them with a combination of explicit annotations and logical defaults. We should do as much as possible by convention or through reflection to keep things uncluttered and clear.
Secondary goals include lessening or eliminating the need to explicitly subclass framework components, and removal of the main method in drivers through the use of a standard container. It would also be useful to simplify the distributed cache mechanism, if possible leveraging it to pass objects into MapReduce components.
Replacing the Boilerplate
The majority of the annotations are replacements for the job configuration methods used by nearly all MapReduce jobs to specify the various components, inputs, outputs, etc. The canonical “WordCount” job includes several of these boilerplate calls. We also call static methods on the FileInputFormat and FileOutputFormat classes for configuring our input and output paths.
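For reference, that boilerplate typically looks something like the following (a standard new-API Hadoop WordCount driver; WordCountMapper and WordCountReducer stand in for your own Mapper and Reducer subclasses):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Every setting is an explicit method call on the Job.
        Job job = Job.getInstance();
        job.setJarByClass(WordCountDriver.class);
        job.setJobName("wordcount");
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Static helper calls for the input and output paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```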
Using Mara, however, the simplest MapReduce driver would look something like
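A sketch of what that minimal driver might look like (the @Driver and @JobInfo annotation names here are my approximations of the pattern; check the project wiki for the exact API):

```java
import org.apache.hadoop.mapreduce.Job;

// No mapper, reducer, or context is declared: the container falls back to the
// default Hadoop Mapper and a default context for the command line options.
@Driver("minimal-job")
public class MinimalDriver {
    @JobInfo("Minimal Job")
    Job job;
}
```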
Admittedly this wouldn’t do much useful work - it would use the default Hadoop Mapper class to write your input into separate part files, one for each mapper. However, you would have a fully deployable, debuggable, production-ready job with a default context requiring input and output cli arguments. Add in a simple Mapper and Reducer implementation, and you’ll have a production-ready wordcount job…
(I’m not including the mapper/reducer code, for clarity. See the examples for the full source.)
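As a rough illustration, an annotated wordcount driver might look like the following. The @Driver, @JobInfo, @MapperInfo, @CombinerInfo, and @ReducerInfo names are placeholders based on the pattern described above, not a verified API; see the examples in the repository for the real thing:

```java
import org.apache.hadoop.mapreduce.Job;

@Driver("annotated-wordcount")
public class AnnotatedWordCount {
    // The annotations replace the setMapperClass/setReducerClass/etc. calls.
    @JobInfo("Annotated Word Count")
    @MapperInfo(WordCountMapper.class)
    @CombinerInfo(WordCountReducer.class)
    @ReducerInfo(WordCountReducer.class)
    Job job;
}
```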
A Standard Driver Container
The annotations are enabled by a standard runtime container, ‘RunJob’, that bootstraps and submits jobs while providing base features common to all jobs. The container searches the packages on the classpath specified in the manifest for drivers and, if no driver is specified on the command line, produces a list of all available jobs.
By default it supports the --dryRun flag, which will initialize the driver but skip submission of the job to the scheduler. Use it with the --dump option to see the full list of configuration options and their values.
The Context Bean
Command line options are specified using a context bean. If none is specified in the driver, the container will create a default instance including --archive options. When you require additional, or different, command line options, you can provide your own custom context: create a bean with the properties required and mark up the driver field containing this context with the appropriate annotation.
Any properties on the context bean annotated with @Option will be initialized by the framework as command line options. The container will populate these properties with the values provided on the command line, handling type conversion for primitives and wrappers where necessary. Options without arguments (argCount=0) are treated as boolean flags.
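To illustrate the mechanism, here is a small, self-contained re-creation of option binding in plain Java. The annotation, context bean, and bind logic below are mine, not Mara’s source; they only demonstrate how reflection-driven option population with type conversion can work:

```java
import java.lang.annotation.*;
import java.lang.reflect.Field;
import java.util.*;

// Illustrative stand-in for the framework's option annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Option {
    int argCount() default 1;   // 0 => boolean flag, no argument expected
}

class MyContext {
    @Option int batchSize;                   // primitive: converted from its string value
    @Option String inputEncoding;            // plain string option
    @Option(argCount = 0) boolean verbose;   // flag: present or absent
}

public class OptionDemo {
    // Populate @Option fields from "--name value" style arguments.
    static void bind(Object ctx, String[] args) {
        Map<String, String> opts = new HashMap<>();
        Set<String> flags = new HashSet<>();
        for (int i = 0; i < args.length; i++) {
            if (!args[i].startsWith("--")) continue;
            String name = args[i].substring(2);
            if (i + 1 < args.length && !args[i + 1].startsWith("--")) {
                opts.put(name, args[++i]);   // option with an argument
            } else {
                flags.add(name);             // bare flag
            }
        }
        try {
            for (Field f : ctx.getClass().getDeclaredFields()) {
                Option o = f.getAnnotation(Option.class);
                if (o == null) continue;
                f.setAccessible(true);
                if (o.argCount() == 0) {
                    f.setBoolean(ctx, flags.contains(f.getName()));
                } else if (opts.containsKey(f.getName())) {
                    String v = opts.get(f.getName());
                    if (f.getType() == int.class) f.setInt(ctx, Integer.parseInt(v));
                    else f.set(ctx, v);
                }
            }
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        MyContext ctx = new MyContext();
        bind(ctx, new String[] {"--batchSize", "500", "--inputEncoding", "UTF-8", "--verbose"});
        System.out.println(ctx.batchSize + " " + ctx.inputEncoding + " " + ctx.verbose);
        // prints: 500 UTF-8 true
    }
}
```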
Mara greatly simplifies the distribution of resources, hiding the complexities of the distributed cache behind the @Distribute and @Resource annotations. This pair of annotations is used to distribute values from the Tool class to MapReduce components. Both static and dynamically-built objects and values may be passed into your component classes.
In your driver, you simply mark the field or method you wish to distribute with the @Distribute tag. Mara will then inject these resources into your component class fields marked with the matching @Resource annotation. This mechanism presently works for primitives, wrapper classes, and serializable objects.
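The wiring can be pictured with a small, self-contained sketch. The annotations and injection code below are re-created in plain Java for illustration and are not Mara’s internals; the idea is simply that a @Distribute member in the driver is matched to a @Resource field of the same name in the component:

```java
import java.lang.annotation.*;
import java.lang.reflect.Field;
import java.util.*;

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface Distribute {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface Resource {}

class DemoDriver {
    @Distribute Map<String, String> countryCodes = new HashMap<>();
    DemoDriver() { countryCodes.put("US", "United States"); }
}

class JoinMapper {
    @Resource Map<String, String> countryCodes;  // populated by the framework
}

public class ResourceDemo {
    // Collect @Distribute values from the driver, then inject them into
    // matching @Resource fields on the component, keyed by field name.
    static void inject(Object driver, Object component) {
        try {
            Map<String, Object> distributed = new HashMap<>();
            for (Field f : driver.getClass().getDeclaredFields()) {
                if (f.isAnnotationPresent(Distribute.class)) {
                    f.setAccessible(true);
                    distributed.put(f.getName(), f.get(driver));
                }
            }
            for (Field f : component.getClass().getDeclaredFields()) {
                if (f.isAnnotationPresent(Resource.class)) {
                    f.setAccessible(true);
                    f.set(component, distributed.get(f.getName()));
                }
            }
        } catch (IllegalAccessException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        JoinMapper mapper = new JoinMapper();
        inject(new DemoDriver(), mapper);
        System.out.println(mapper.countryCodes.get("US"));  // prints: United States
    }
}
```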
This mechanism is especially helpful in unit testing your jobs.
Mara supports unit testing and TDD for MapReduce through the Apache MRUnit project. Although MRUnit makes unit testing MapReduce feasible, there are still numerous headaches with writing and maintaining these tests. First, the driver class isn’t exercised in standard MRUnit. This often means copying code and manually porting to the MRUnit equivalent. It also leaves no coverage for your base configuration which can result in problems that are only exposed upon deployment to a full Hadoop environment.
Because you must essentially translate the driver configuration into MRUnit, future changes to your driver configuration must also be duplicated in the unit tests. This can become rather tedious and error-prone. Furthermore, effective testing of certain jobs - such as those employing Avro or MultipleOutputs - requires the use of mocks to work properly.
Also, for testing with MultipleOutputs you typically need to create a method in mappers or reducers for directly injecting a mock version. Not a huge problem, but it’s bad practice to add code to a production class solely for the purpose of unit testing.
Mara eliminates these complexities and, in the majority of cases, removes the need to write separate bootstrap code for annotated drivers. For MultipleOutputs this means using the @NamedOutput tags in your mapper or reducer classes and allowing Mara to manage them. In production these are properly configured using the standard classes, while under MRUnit they’ll be mocked out. Methods in the BaseMRUnitTest class allow you to easily verify the writes to any of the multiple outputs.
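A hedged sketch of what a reducer using a named output might look like. The “singletons” output name and the attribute form of @NamedOutput shown here are illustrative, not confirmed against Mara’s API:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Configured by Mara in production; mocked out under MRUnit.
    @NamedOutput("singletons")
    private MultipleOutputs<Text, IntWritable> multiOut;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        if (sum == 1) {
            multiOut.write("singletons", key, new IntWritable(sum)); // named output
        } else {
            ctx.write(key, new IntWritable(sum));                    // main output
        }
    }
}
```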
Unit testing with distributed resources
Mara allows you to set your @Resource values in unit tests through a simple naming convention. Assume you have a resource in your driver, perhaps a Map<String,String> named ‘myMap’ needed for a join in your mapper class. In production, the @Distribute annotation is applied to the driver class member (or method). In your unit test, you would write a method Map<String,String> getMyMap() returning the map - or mock - you want to use for test purposes. If you need to use different resources for different test cases, simply append the name of the test case method to your method name. For example…
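The lookup convention can be sketched in plain Java. Everything below is an illustrative re-creation, not Mara’s code, including the assumption that the appended test-case name has its first letter capitalized (getMyMap + testMissingKey -> getMyMapTestMissingKey):

```java
import java.lang.reflect.Method;
import java.util.*;

public class NamingDemo {
    // Resolve a resource for a test case: prefer the test-case-specific
    // supplier method, fall back to the default one.
    static Object resolve(Object test, String resource, String testCase) {
        String base = "get" + Character.toUpperCase(resource.charAt(0)) + resource.substring(1);
        String specific = base + Character.toUpperCase(testCase.charAt(0)) + testCase.substring(1);
        try {
            try {
                return test.getClass().getMethod(specific).invoke(test);
            } catch (NoSuchMethodException e) {
                return test.getClass().getMethod(base).invoke(test);
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

class MyMapperTest {
    public Map<String, String> getMyMap() {                 // default for all test cases
        return Collections.singletonMap("k", "default");
    }
    public Map<String, String> getMyMapTestMissingKey() {   // used only by testMissingKey
        return Collections.emptyMap();
    }
}
```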
For more information
Conversant has released Mara to the open source community! There are many other options and a few ‘extras’ in the library for handling things like HBase or Avro. For the full documentation and source code, please visit the project page on GitHub:
- Documentation: https://github.com/conversant/mara/wiki
- Source code: https://github.com/conversant/mara