Friday, December 23, 2011

Hadoop Facts: pros and cons

Hadoop Facts (for 0.20.2):

These are facts I picked up while learning to use Hadoop. Some items are obvious, but they may still be useful to Hadoop learners.


  1. Want to know which configuration options you have? Look at http://hadoop.apache.org/common/docs/current/core-default.html, http://hadoop.apache.org/common/docs/current/hdfs-default.html and http://hadoop.apache.org/common/docs/current/mapred-default.html. They correspond to core-site.xml, hdfs-site.xml and mapred-site.xml.
  2. Don't try to run multiple datanodes/tasktrackers on one machine. Hadoop already runs multiple tasks simultaneously. Look for mapred.tasktracker.{map|reduce}.tasks.maximum in conf/mapred-site.xml. That's the place to increase the number of parallel tasks if you think your machine is powerful enough.
  3. If you don't want to try out HDFS and would rather use the local file system, set fs.default.name to file:/// in core-site.xml.
  4. HDFS and MapReduce are two separate components of Hadoop; you can try them out separately.
  5. Be aware of leftover ("zombie") processes when your namenode/jobtracker gets killed by accident. The datanodes/tasktrackers are left running, and there is no simple way to kill them all (unless you write an in-house script).
  6. Hadoop log files are huge. Well, not that huge, but they are definitely not meant for human reading. To change where they are written, set HADOOP_LOG_DIR in hadoop-env.sh.
  7. Which machine is the jobtracker/namenode? Usually it's the machine where you run bin/start-all.sh.
  8. conf/masters is NOT the place to list the namenode/jobtracker. Instead, it's where you tell Hadoop on which machines to start the secondarynamenode.
  9. No matter how the configuration files are organized in conf/, Hadoop always reads all the XML files to get the configuration. (I haven't tested this yet.)
  10. Wondering about the number of map tasks? "The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job."
  11. Cannot start a datanode? Some datanodes start and some don't? Fewer and fewer datanodes start? Check whether the namenode has been reformatted. Check whether java.io.IOException: Incompatible namespaceIDs appears in the datanodes' log files. If it does, you have the well-known problem of formatting only the namenode. There are two solutions: either delete the datanode's data storage path, or update the namespaceID on the datanode to the current value from the namenode.
  12. The documentation for the Hadoop API and the HDFS API is separate.
  13. Datanodes need time to start up, and the namenode can only accept service requests after its own initialization. So wait a short time until all HDFS nodes are up.
  14. People often run a Hadoop job with the following command: hadoop jar job.jar job-parameter. What if you have dependency jar files? There are several ways to handle this. You can use the -libjars option. Or you can copy all your jars to $HADOOP_HOME/lib. If only the tasktracker nodes use the jars, you can also use HADOOP_TASKTRACKER_OPTS="-classpath <jars-separated-by-colon>".
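
The second fix in item 11 (updating the namespaceID) can be sketched as a small script. This is only a minimal sketch, assuming that each storage directory keeps a properties-style file at current/VERSION with a namespaceID=... line, as I believe 0.20.x does. The function names and the paths in the comments are hypothetical, and the datanode should be stopped before its file is edited.

```python
import re


def read_namespace_id(version_text):
    """Return the namespaceID value found in the text of a VERSION file."""
    match = re.search(r"^namespaceID=(-?\d+)\s*$", version_text, re.MULTILINE)
    if match is None:
        raise ValueError("no namespaceID line found")
    return match.group(1)


def sync_namespace_id(namenode_text, datanode_text):
    """Return the datanode VERSION text carrying the namenode's namespaceID."""
    current_id = read_namespace_id(namenode_text)
    read_namespace_id(datanode_text)  # fail early if the line is missing
    return re.sub(r"^namespaceID=.*$", "namespaceID=" + current_id,
                  datanode_text, flags=re.MULTILINE)


# Hypothetical usage; substitute your own dfs.name.dir and dfs.data.dir:
#   nn = open("/path/to/dfs/name/current/VERSION").read()
#   dn = open("/path/to/dfs/data/current/VERSION").read()
#   open("/path/to/dfs/data/current/VERSION", "w").write(sync_namespace_id(nn, dn))
```

Working on the file text rather than on the files themselves makes it easy to check the result before writing anything back. Deleting the datanode storage path is simpler, but it throws away the blocks stored on that node.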

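Item 10 says the number of map tasks equals the number of input splits. For file-based input, that is roughly the file size divided by the split size. The sketch below is a simplified approximation of how a file-based InputFormat might pick splits, not the actual Hadoop code. The function names are hypothetical, and the max/min formula and the 10% slack factor are written from memory of the 0.20 source, so treat them as assumptions.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed slack: the last split may be up to 10% oversized


def split_size(goal_size, block_size, min_size=1):
    """Pick a split size: at most one block, but never below min_size."""
    return max(min_size, min(goal_size, block_size))


def count_splits(file_size, size):
    """Count the splits (and hence map tasks) for one non-empty file."""
    splits = 0
    remaining = file_size
    # Cut full-size splits while more than ~1.1 splits' worth of data is left.
    while remaining / size > SPLIT_SLOP:
        splits += 1
        remaining -= size
    # Whatever is left (up to 1.1 * size) becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    size = split_size(goal_size=100 * MB, block_size=64 * MB)
    print(count_splits(200 * MB, size))  # 4
    print(count_splits(70 * MB, size))   # 1
```

Under these assumptions a 200 MB file with 64 MB blocks gives four map tasks, while a 70 MB file still gives one, because its small tail is folded into the single split.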