java - Hadoop options are not having any effect (mapreduce.input.lineinputformat.linespermap, mapred.max.map.failures.percent)


Question: 

I am trying to implement a MapReduce job where each mapper takes 150 lines of the text file and all the mappers run simultaneously; also, the job should not fail, no matter how many map tasks fail.

Here's the configuration part:

        JobConf conf = new JobConf(Main.class);
        conf.setJobName("My mapreduce");

        conf.set("mapreduce.input.lineinputformat.linespermap", "150");
        conf.set("mapred.max.map.failures.percent","100");

        conf.setInputFormat(NLineInputFormat.class);

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

The problem is that Hadoop creates a mapper for every single line of text, the mappers seem to run sequentially, and if a single one fails, the whole job fails.

From this I deduce that the settings I've applied have no effect.

What did I do wrong?




4 Answers: 

I assume you are using Hadoop 0.20. In 0.20 the configuration parameter is "mapred.line.input.format.linespermap", but you are setting "mapreduce.input.lineinputformat.linespermap". If the parameter is not set, it defaults to 1, which is why you are seeing the behavior described in the question.

Here is the relevant snippet from the 0.20 NLineInputFormat:

        public void configure(JobConf conf) {
            N = conf.getInt("mapred.line.input.format.linespermap", 1);
        }

Hadoop configuration is sometimes a real pain: it is not documented properly, and I have observed that parameter names sometimes change between releases. When you are uncertain about a configuration parameter, the best bet is to read the source code.
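Assuming Hadoop 0.20 and its old (mapred) API, the asker's configuration block might be corrected along these lines — an untested sketch that only swaps in the 0.20 parameter name; `Main` and the input/output paths are carried over from the question:

```java
// Sketch for Hadoop 0.20 (old mapred API).
JobConf conf = new JobConf(Main.class);
conf.setJobName("My mapreduce");

// 0.20 reads this key, not "mapreduce.input.lineinputformat.linespermap"
conf.set("mapred.line.input.format.linespermap", "150");
// Tolerate any fraction of failed map tasks
conf.set("mapred.max.map.failures.percent", "100");

conf.setInputFormat(NLineInputFormat.class);

FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
```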

 

To start with, "mapred." is the old API and "mapreduce." is the new API, so you'd better not mix them. Check which version you are using and stick with it. Also recheck your imports, since there are two NLineInputFormat classes as well (one in mapred and one in mapreduce).
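For reference, the two classes live in different packages (package paths as of the 0.20/0.21-era releases; pick the one matching the API you use):

```java
// Old API (JobConf-based), matches the code in the question:
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// New API (Job-based) equivalent:
// import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
```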

Secondly, you can check this link (the important part is quoted below):

NLineInputFormat will split N lines of input as one split. So, each map gets N lines.

But the RecordReader is still LineRecordReader, which reads one line at a time, so the key is the offset in the file and the value is the line. If you want N lines as the key, you may need to override LineRecordReader.
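Outside of Hadoop, the line-grouping that such an overridden record reader would perform can be sketched in plain Java; the class and the `groupLines` helper are made up for illustration and are not part of any Hadoop API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineGrouper {
    // Groups input lines into chunks of n, joined by '\n' — the kind of
    // record a custom reader would hand each mapper instead of one line.
    static List<String> groupLines(BufferedReader reader, int n) throws IOException {
        List<String> chunks = new ArrayList<>();
        StringBuilder chunk = new StringBuilder();
        String line;
        int count = 0;
        while ((line = reader.readLine()) != null) {
            if (count > 0) chunk.append('\n');
            chunk.append(line);
            if (++count == n) {          // chunk is full: emit and reset
                chunks.add(chunk.toString());
                chunk.setLength(0);
                count = 0;
            }
        }
        if (count > 0) chunks.add(chunk.toString()); // trailing partial chunk
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader r = new BufferedReader(new StringReader("a\nb\nc\nd\ne"));
        List<String> chunks = groupLines(r, 2);
        System.out.println(chunks.size()); // 3 chunks: "a\nb", "c\nd", "e"
    }
}
```

A real implementation would instead wrap LineRecordReader and advance it N times per `next()` call, but the chunking logic is the same.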

 

If you want to quickly find the correct option names for Hadoop's new API, use this link: http://pydoop.sourceforge.net/docs/examples/intro.html#hadoop-0-21-0-notes

 

The new API's options are mostly undocumented.

 
