Tuesday, January 03, 2012

machine learning tricks


  1. Never use coordinate descent, use something like BFGS.

Wednesday, December 28, 2011

Mahout from zero

Mahout is a set of machine learning tools built on Hadoop.

--SequenceFile
Hadoop processes data as key/value pairs, and SequenceFile is the persistent file format for storing those pairs. As we might expect, Hadoop uses SequenceFile to store intermediate results. We rarely touch the file contents directly; instead we go through the reader, writer and sorter classes that SequenceFile provides. We also never construct a writer ourselves: we call createWriter to get the proper one.

--Mahout Vector
In real applications a vector is often used to represent a record, and Vector is the input format for most Mahout algorithms. There are two kinds of vector implementation: dense and sparse (the sparse kind comes in random-access and sequential-access flavors). So no matter where our data comes from or what format it is in, we have to convert it into Mahout vectors. Note that the components of a Mahout vector are always floating-point numbers. To store vectors in a SequenceFile for Mahout, the key is the record id (row number, text id, you name it) and the value is the Mahout vector.
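To make the dense/sparse distinction and the key/value layout concrete, here is a minimal sketch in plain Python. It does not use the Mahout API at all; the helper name to_sparse and the sample records are made up for illustration.

```python
# Plain-Python illustration only; none of these names come from Mahout.

def to_sparse(dense):
    """Keep only the non-zero entries of a dense vector as {index: value}."""
    return {i: float(v) for i, v in enumerate(dense) if v != 0}

# A dense vector stores every component, zero or not.
dense = [0.0, 3.0, 0.0, 0.0, 1.5]

# A sparse vector stores only the non-zero components.
sparse = to_sparse(dense)

# Random access: look up any index directly (a missing index means 0.0).
value_at_1 = sparse.get(1, 0.0)

# Sequential access: walk the non-zero entries in index order.
in_order = sorted(sparse.items())

# Key/value layout: record id -> vector, as described above.
records = {"doc-0": sparse, "doc-1": to_sparse([2.0, 0.0, 0.0, 0.0, 0.0])}
```

The sparse form pays off when most components are zero, which is typical for text data.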

ssh tricks

1. To log in without typing a password, append the client's id_rsa.pub to the server's authorized_keys. Both files live under ~/.ssh/.
2. To avoid ssh timeouts, add two lines to ~/.ssh/config on the client: ServerAliveInterval 20 and ServerAliveCountMax 100.

Sunday, December 25, 2011

Java Programming in one shot

This post covers what we need to know about Java programming besides the language itself: how to compile and run a program, how to unit test it, and how to write a build file for a Java project.

----How to compile and run your Java program?
Every programming language tutorial starts with a HelloWorld example. As far as I know, the tradition started with "The C Programming Language". Here is our version:

HelloWorld.java:
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}


On the command line, type:
bash$ javac HelloWorld.java
to compile the Java program, and type:
bash$ java HelloWorld
to see the result.


There are alternatives to the commands we just saw. For example, with the GCC toolchain you can compile with gcj -C, and you can use java -jar to run a jar file.


----How to write build file for Java project?
As a Java project grows larger, we will want to put the source code into a directory tree and keep the build output in a separate place. Ant is the Java counterpart to the make/Makefile build system. Note that Ant is a separate Apache project and is not part of the JDK, although many IDEs bundle it. Instead of a Makefile we usually have a build.xml: Ant build files are written in XML, and Ant reads build.xml by default. If you want to use another build file name, pass it with ant -buildfile (or the shorter forms -file and -f). And like make, you invoke a target by name: ant build runs the target named build.


Every build.xml has exactly one project element, so we can start our build.xml like this:
<project basedir="." default="clean" name="HelloWorld">
     <target name="clean">
         <delete dir="build"/>
     </target>
</project>
The project element has 3 attributes here; all of them are optional and their meanings are self-explanatory.
This short build file also gives us a clue about the structure of a build file.


  --What's the structure of Ant build file?
Since an Ant build file is XML, we expect a tree structure. The top level is the project element, the second level is the target elements, and under each target there are task elements. You can write your own tasks or use the built-in ones. Besides these, people usually use property elements to name directories; property elements usually sit directly under the project element, alongside the targets.

Using properties to define the directories in use is a common practice when writing Ant build files. For example, we can add the following lines to the build file we already have:
<project ...>
     ...
     <property name="src.dir" value="src"/>
     <property name="build.dir" value="build"/>
     <property name="classes.dir" value="${build.dir}/classes"/>
     <property name="jar.dir" value="${build.dir}/jar"/>
     ...
</project>
The reason is that these directories may be used several times and you don't want to hard-code them. A problem often gets easier if we add another layer of indirection, so we set the properties once and use them later in the targets. You can search the Web to see how to use properties and to find the list of built-in properties. Instead of a general property, you can also use a path.
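How the ${...} references resolve can be sketched in a few lines of Python. This only illustrates the idea and is not Ant's implementation; the function name expand is made up, and the sketch assumes properties are expanded in the order they are declared.

```python
import re

def expand(value, props):
    """Replace each ${name} in value with props[name]; leave unknown names as-is."""
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: props.get(m.group(1), m.group(0)),
                  value)

# Properties in the order they appear in the build file above.
declared = [
    ("src.dir", "src"),
    ("build.dir", "build"),
    ("classes.dir", "${build.dir}/classes"),
    ("jar.dir", "${build.dir}/jar"),
]

props = {}
for name, raw in declared:
    # Each value is expanded using the properties defined so far.
    props[name] = expand(raw, props)
```

With this layering, changing build.dir in one place moves classes.dir and jar.dir along with it.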

  --<path> or <pathelement>?
A path element usually sits at the same level as the properties and can be composed of pathelement entries, while a pathelement is usually a component of something else, such as a classpath nested inside a task.

To compile our Java project, we need to include some jar files in the classpath. We usually don't change the CLASSPATH environment variable, since we prefer to have a different classpath for each project. For compilation, the javac task has a classpath attribute, which takes the directories or jar files to use, and a classpathref attribute, which takes the id of a path structure defined elsewhere. For execution, the java task usually has a nested classpath element composed of path structures.

  --Wildcard in Ant?
There are 3 wildcards in Ant: ?, * and **. ? matches any single character and is used least often; * matches zero or more characters and is typically used for file names; ** matches zero or more directories.
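The three wildcards can be illustrated by translating a pattern into a regular expression. This is a simplified Python sketch, not Ant's matcher: the function names are made up, paths are assumed to use / as the separator, and Ant's special cases (such as a pattern ending in /) are ignored.

```python
import re

def ant_pattern_to_regex(pattern):
    """Translate ?, * and ** into a regular expression (simplified sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")   # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")            # anything, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")         # zero or more characters, no '/'
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")          # exactly one character, no '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

def matches(pattern, path):
    """Return True if the whole path matches the pattern."""
    return ant_pattern_to_regex(pattern).match(path) is not None
```

For example, matches("**/*.java", "src/a/Hello.java") is true, while matches("*.java", "src/Hello.java") is false because * does not cross a directory boundary.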

----How to build unit test for Java project?
JUnit is the de facto standard unit test framework for Java.

----Maven, more than Ant
Maven is a project management tool; it covers the build functions that Ant provides and more.

Friday, December 23, 2011

Hadoop Facts: pros and cons

Hadoop Facts (for 0.20.2):

These are facts I picked up while learning to use Hadoop. Some items are obvious, but they are still good for Hadoop learners to know.


  1. Want to find out what configuration options you have? Look at the default files:
     http://hadoop.apache.org/common/docs/current/core-default.html
     http://hadoop.apache.org/common/docs/current/hdfs-default.html
     http://hadoop.apache.org/common/docs/current/mapred-default.html
     They correspond to core-site.xml, hdfs-site.xml and mapred-site.xml.
  2. Don't try to run multiple datanodes/tasktrackers on one machine. A single tasktracker already runs multiple tasks simultaneously. Look for mapred.tasktracker.{map|reduce}.tasks.maximum in conf/mapred-site.xml; that's the place to increase the number of parallel tasks if you think your machine is powerful enough.
  3. If you don't want to try out HDFS and would rather use the local file system, use file:/// for fs.default.name in core-site.xml.
  4. HDFS and MapReduce are two separate components of Hadoop; you can try them out separately.
  5. Beware of leftover processes when your namenode/jobtracker gets killed by accident. All the datanodes/tasktrackers keep running as "zombies", and there is no built-in way to kill them all (unless you write an in-house script).
  6. Hadoop log files are huge. Well, not that huge, but they are definitely not meant for human reading. To change where they are written, set HADOOP_LOG_DIR in hadoop-env.sh.
  7. Which machine is the jobtracker/namenode? Usually it's the machine where you run bin/start-all.sh.
  8. conf/masters is NOT the place to list the namenode/jobtracker. Instead, it tells Hadoop where to start the secondarynamenode.
  9. No matter how the configuration files are organized in conf/, Hadoop always reads all the XML files there to get its configuration. (I haven't tested this yet.)
  10. Wondering about the number of map tasks? "The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job."
  11. Cannot start a datanode? Some datanodes start and some don't? Fewer and fewer datanodes start each time? Check whether the namenode has been formatted. Then check whether there is a java.io.IOException: Incompatible namespaceIDs in the datanodes' log files. If there is, we have the well-known problem of reformatting only the namenode. There are two solutions: either delete the datanode's data storage path, or update the namespaceID on the datanode to the current one from the namenode.
  12. The documentation for the Hadoop API and the HDFS API is separate.
  13. Datanodes need time to set up, and the namenode accepts service requests as soon as its own initialization is done. So we should wait a short time until all HDFS nodes are up.
  14. People often run a Hadoop job with the command: hadoop jar job.jar job-parameter. What if you have dependency jar files? There are several ways to handle this. We can use the -libjars option, or copy all the jars to $HADOOP_HOME/lib. If only the task nodes use the jars, we can also set HADOOP_TASKTRACKER_OPTS="-classpath <jars-separated-by-colon>".
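
For item 14, here is a small Python sketch of how a command line with -libjars could be assembled. The function name, the class name com.example.MyJob and the jar paths are made up for illustration. As far as I know, -libjars is one of Hadoop's generic options, so it is only honored when the job's main class goes through ToolRunner/GenericOptionsParser, and it has to come before the job's own parameters.

```python
def hadoop_jar_command(job_jar, main_class, libjars, job_args):
    """Build the argument list for 'hadoop jar' with extra dependency jars."""
    cmd = ["hadoop", "jar", job_jar, main_class]
    if libjars:
        # -libjars takes one comma-separated list of jar paths.
        cmd += ["-libjars", ",".join(libjars)]
    return cmd + list(job_args)

# Hypothetical job: the class name and jar paths are placeholders.
cmd = hadoop_jar_command("job.jar", "com.example.MyJob",
                         ["lib/a.jar", "lib/b.jar"], ["in", "out"])
```

The resulting list can be handed to subprocess.call(cmd) from a submission script.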

Thursday, December 22, 2011

run Hadoop under Lava Workload Scheduler

Other possible titles:
run Hadoop in constrained environment
run Hadoop in shared environment with only a user account
run Hadoop with LSF

The opportunity.neu.edu cluster at Northeastern University doesn't have Hadoop yet, and I can't see it getting it in the near future. So I wondered whether I could run Hadoop as a job under Lava. The plan is:
Step 1: run Hadoop in single-node (standalone) mode as a job.
Step 2: run Hadoop as a parallel job after starting tasktrackers on the assigned nodes (MapReduce).
Step 3: run Hadoop with the full setup, adding datanodes on the assigned nodes (HDFS).

Step 1: This goal is simple enough. Just follow the Getting Started link at http://hadoop.apache.org/common/docs/current/#Getting+Started and you should be OK. I tried the pi example on the front node and it works.

Step 2: I want to try MapReduce first, for several reasons. First, most algorithms I commonly use are compute-oriented. Second, MapReduce doesn't have to use HDFS; I can run it on the local file system. If something goes wrong, it's easier to kill processes than to clean up files. Third, getting HDFS running should be easy once we understand how to make MapReduce work.

After spending a day searching the Web and trying out different configurations, I can run MapReduce on opportunity. First of all, here is the Python script I submit using bsub; the Hadoop configuration is shown after it:

bash$ bsub -J hadoop_test -o /home/yguan/ExpJobs/Jobs/exp103/log/hadoop_test.log -n 12 /home/yguan/ExpJobs/Jobs/exp103/template/hadoop_test.py

hadoop_test.py:

#!/usr/bin/python

import os
import platform
import subprocess
import sys


def run(command):
    """Run a shell command and print its combined stdout/stderr."""
    # communicate() drains the pipe for us, so a chatty command cannot block.
    proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    output, _ = proc.communicate()
    print(output)


templateDir = sys.path[0]
print(templateDir)
confDir = os.path.join(templateDir, 'hadoop_conf')
machineFileDir = os.path.join(confDir, 'slaves')

hostList = os.environ["LSB_HOSTS"].split()
hostList = list(set(hostList)) # remove duplicate hosts

# write to slaves file
fh = open(machineFileDir, 'w')
for host in hostList:
    fh.write(host + '\n')
fh.close()

# get the network address for the jobtracker.
nodeName = platform.node()
mapredFileName = os.path.join(confDir, "mapred-site.xml")
mapredFileNameBack = os.path.join(confDir, "mapred-site.xml.back")
fhr = open(mapredFileName, 'r')
fhw = open(mapredFileNameBack, 'w')
# here we assume the line right after the mapred.job.tracker name line
# is the value line.
mapredJobTrackerLine = False
for line in fhr:
    if "mapred.job.tracker<" in line:
        mapredJobTrackerLine = True
    elif mapredJobTrackerLine:
        mapredJobTrackerLine = False
        fhw.write("        <value>" + nodeName + ":42312</value>\n")
        continue
    fhw.write(line)
fhr.close()
fhw.close()
os.remove(mapredFileName)
os.rename(mapredFileNameBack, mapredFileName)

# start hadoop/MapReduce part
hadoop_config = " --config " + confDir
os.chdir(os.environ["HADOOP_HOME"])
run("bin/start-mapred.sh" + hadoop_config)

# run the hadoop job
#command = "bin/hadoop jar hadoop-0.20.2-examples.jar pi 2 100"
command = "hadoop" + hadoop_config + " jar hadoop-0.20.2-examples.jar pi 100 100"
print(command)
run(command)

# shut down hadoop/mapreduce
run("bin/stop-mapred.sh" + hadoop_config)

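hadoop_conf/core-site.xml:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/scratch_global/hadoop-yguan/tmp</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>file:///</value>
    </property>
</configuration>

hadoop_conf/mapred-site.xml:

<configuration>
    <property>
        <name>mapred.child.tmp</name>
        <value>/scratch/hadoop-yguan/tmp</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/scratch_global/hadoop-yguan/tmp/mapred/system</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/scratch/hadoop-yguan/tmp/mapred/local</value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>compute-2-11.local:42312</value>
    </property>
    <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>4</value>
    </property>
    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
    </property>
</configuration>
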
The workflow is simple. First we get the list of slaves, since bsub tells us (through LSB_HOSTS) which nodes we should run on. Then we remove duplicate nodes, because we only need one tasktracker per machine; mapred.tasktracker.{map|reduce}.tasks.maximum determines the number of task JVMs running in parallel on each machine. Next we set mapred.job.tracker in mapred-site.xml to the name of the node our job is running on. Now it's time to call bin/start-mapred.sh to set up the jobtracker and populate the worker nodes with tasktrackers. Then we can run our Hadoop job. After the job is done, we call bin/stop-mapred.sh to stop the jobtracker and tasktrackers.
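
The script above relies on the value line directly following the name line. A less fragile alternative would be to parse the XML; the following is only a sketch of that idea using Python's standard xml.etree.ElementTree, with a made-up function name set_property. Note that round-tripping through ElementTree drops comments and the XML declaration from the file.

```python
import xml.etree.ElementTree as ET

def set_property(xml_text, name, value):
    """Return xml_text with the <value> of the named <property> replaced."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            prop.find("value").text = value
            break
    else:
        # No property with that name was found.
        raise KeyError(name)
    return ET.tostring(root).decode("utf-8")

sample = """<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>compute-2-11.local:42312</value>
    </property>
</configuration>"""

updated = set_property(sample, "mapred.job.tracker", "compute-2-16.local:42312")
```

In the job script, the text would be read from mapred-site.xml and the result written back to the same file.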

Step 3: Now it's time to make MapReduce and HDFS work together. One thing that sets HDFS apart from MapReduce is that you need to format the file system first. When you type the shell command "bin/hadoop namenode -format", it asks you for a "Y" to confirm. So I play a dirty trick here: I keep a file named Y containing a "Y" and a newline, and feed it to the format command as its input. Here is the new version of hadoop_test.py:


#!/usr/bin/python

import os
import platform
import subprocess
import sys


def run(command):
    """Run a shell command and print its combined stdout/stderr."""
    # communicate() drains the pipe for us, so a chatty command cannot block.
    proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    output, _ = proc.communicate()
    print(output)


def setProperty(fileName, propertyName, newValue):
    """Rewrite the value line of one property in a Hadoop config file.

    Here we assume the line right after the name line is the value line.
    """
    backName = fileName + ".back"
    fhr = open(fileName, 'r')
    fhw = open(backName, 'w')
    nameLineSeen = False
    for line in fhr:
        if propertyName + "<" in line:
            nameLineSeen = True
        elif nameLineSeen:
            nameLineSeen = False
            fhw.write("        <value>" + newValue + "</value>\n")
            continue
        fhw.write(line)
    fhr.close()
    fhw.close()
    os.remove(fileName)
    os.rename(backName, fileName)


templateDir = sys.path[0]
print(templateDir)
confDir = os.path.join(templateDir, 'hadoop_conf')
machineFileDir = os.path.join(confDir, 'slaves')

hostList = os.environ["LSB_HOSTS"].split()
hostList = list(set(hostList)) # remove duplicate hosts

# write to slaves file
fh = open(machineFileDir, 'w')
for host in hostList:
    fh.write(host + '\n')
fh.close()

# get the network address for the jobtracker and the namenode
nodeName = platform.node()
setProperty(os.path.join(confDir, "mapred-site.xml"),
            "mapred.job.tracker", nodeName + ":42312")
setProperty(os.path.join(confDir, "core-site.xml"),
            "fs.default.name", "hdfs://" + nodeName)

hadoop_config = " --config " + confDir
os.chdir(os.environ["HADOOP_HOME"])

# format the newly created hdfs; the file named Y (in the current
# directory) holds a "Y" and a newline to answer the prompt
run("bin/hadoop" + hadoop_config + " namenode -format < Y")

# start hadoop/HDFS part, then hadoop/MapReduce part
run("bin/start-dfs.sh" + hadoop_config)
run("bin/start-mapred.sh" + hadoop_config)

# run the hadoop job
#command = "bin/hadoop jar hadoop-0.20.2-examples.jar pi 2 100"
command = "hadoop" + hadoop_config + " jar hadoop-0.20.2-examples.jar pi 100 100"
print(command)
run(command)

# shut down hadoop/mapreduce, then hadoop/hdfs
run("bin/stop-mapred.sh" + hadoop_config)
run("bin/stop-dfs.sh" + hadoop_config)

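core-site.xml:

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/scratch/hadoop-yguan/tmp</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://compute-2-16.local</value>
    </property>
</configuration>

hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.name.dir</name>
        <value>/scratch/yguan/hadoop/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/scratch/yguan/hadoop/data</value>
    </property>
</configuration>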

There are some limitations in this setup. For example, the maximum number of tasks running in parallel on a machine is set to a constant. In fact we could set it to the number of slots assigned by LSF, but this requires each task node to have its own configuration file.
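
As a sketch of how the per-node numbers could be obtained, the following Python counts how often each host appears in LSB_HOSTS. This assumes that LSF repeats a host name once per assigned slot, which is why the scripts above remove duplicates; the function name and the example value are made up.

```python
from collections import Counter

def slots_per_host(lsb_hosts):
    """Count how many times each host name appears in an LSB_HOSTS string."""
    return Counter(lsb_hosts.split())

# Example value; in a real job this would come from os.environ["LSB_HOSTS"].
example = "compute-2-11 compute-2-11 compute-2-11 compute-2-16"
slots = slots_per_host(example)
```

Each count could then be written as mapred.tasktracker.{map|reduce}.tasks.maximum into that node's own mapred-site.xml.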

At this stage, both components of Hadoop can be started and stopped within a job session.
Note that Platform, the company that produces LSF, already has its own Hadoop support.
Note that Hadoop has its own scheduler.

However, I found an error after trying this script several times. When you format the namenode, the datanodes are not aware of it, and since we don't change the path where the datanodes store their data, there will be an error like this:
java.io.IOException: Incompatible namespaceIDs
The solution is either to remove the storage path on the datanode completely, which is fine in our case, or to update the namespaceID on the datanode. Searching the Web for namespaceIDs will turn this up. To remove the directory, we can issue
ssh datanode_name rm -rf /path/to/datanode/storage
for every datanode.
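
To check for the mismatch before starting anything, one could compare the namespaceID recorded on both sides. The sketch below assumes that each storage directory keeps a VERSION file (under current/) with a namespaceID=... line, which is my understanding of the 0.20 layout; the function name and the file contents shown are made up.

```python
def read_namespace_id(version_text):
    """Extract namespaceID from the text of a VERSION file, or None if absent."""
    for line in version_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "namespaceID":
            return value.strip()
    return None

# Made-up file contents for illustration.
namenode_version = "#Fri Dec 23 10:00:00 EST 2011\nnamespaceID=1234\nlayoutVersion=-18\n"
datanode_version = "#Thu Dec 22 09:00:00 EST 2011\nnamespaceID=5678\nstorageType=DATA_NODE\n"

# True means the datanode would refuse to start against this namenode.
mismatch = read_namespace_id(namenode_version) != read_namespace_id(datanode_version)
```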

Another error comes up when we run the script: the datanodes are still starting, but the namenode already accepts service requests. The job then fails with "file could only be replicated to 0 nodes, instead of 1". The solution is to wait a few seconds before submitting the job, since right now there is no easy way to check HDFS health, especially datanode status, from a program. On opportunity.neu.edu, for example, the wait can be as long as 30 seconds at busy times. (Update: with 0.20, this error sometimes occurred even when I had waited long enough for the datanodes. After upgrading to a newer version, the problem seems to have gone away.)
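
A fixed sleep either waits too long or not long enough. If some health check were available, a polling loop with a timeout would be a better fit. The Python sketch below shows only the loop; the check function is left to the caller and is hypothetical, since, as noted above, I have no reliable check to offer.

```python
import time

def wait_until(check, timeout=60.0, interval=2.0, sleep=time.sleep, clock=time.time):
    """Call check() every `interval` seconds until it returns True or time runs out."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In the job script this would sit between starting HDFS and submitting the job, for example wait_until(datanodes_are_up, timeout=30), where datanodes_are_up is whatever check turns out to work on the cluster.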

There are many places where we can continue studying Hadoop. For example, we can get familiar with the HDFS shell from the information at:
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html

Pro Hadoop from Apress is a bad book

"Pro Hadoop" from Apress is a bad book.

 Hadoop is immature, considering the lack of documentation. There is no central place where you can find out which settings can go into the various configuration XML files. (This turned out not to be true after I spent some time with Hadoop on the Web. You can at least get some hints about Hadoop configuration from the default files online. The default configuration values are at:
http://hadoop.apache.org/common/docs/current/hdfs-default.html
http://hadoop.apache.org/common/docs/current/mapred-default.html
http://hadoop.apache.org/common/docs/current/core-default.html
) Still, I want to learn it, so I wanted to grab a book. There are currently two choices: "Pro Hadoop" from Apress and "Hadoop: The Definitive Guide" from O'Reilly. (Actually, I later realized that there is also a book called "Hadoop in Action".)

 I started with "Pro Hadoop" simply because I got it first. Here is my reading experience:

 I read the book from the very beginning. After the "bin/hadoop jar hadoop-*-examples.jar pi 2 2" example, I came across a section of sample code. I typed it in line by line, trying to understand the meaning of every line, and then tried to compile it. Of course, there were some errors related to the classpath. I fixed them, and the code still would not compile. After a while, I realized the code could never compile, since it uses an object before initializing it. Whoa. I had no choice but to switch to the O'Reilly book.....

The story continues. I finally realized that "Hadoop: The Definitive Guide" is actually the right book for every newbie who wants to learn about Hadoop. It's well organized and, most importantly, the ideas flow well, especially compared to "Pro Hadoop".

I suggest that everyone who wants to learn Hadoop keep "Hadoop: The Definitive Guide" at hand.

Friday, October 14, 2011

swap ctrl capslock for windows caps-ctrl-swap.reg

REGEDIT4

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Keyboard Layout]
"Scancode Map"=hex:00,00,00,00,00,00,00,00,03,00,00,00,1d,00,3a,00,3a,00,1d,00,00,00,00,00

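To see what the hex value means, here is a short Python sketch that decodes it. As I understand the layout, the value has 8 header bytes (all zero), a 4-byte little-endian count of entries including the terminator, and then one 4-byte entry per mapping: first the scancode to send, then the scancode of the key being pressed. 0x1d is Left Ctrl and 0x3a is Caps Lock. The function name is made up.

```python
import struct

def decode_scancode_map(hex_text):
    """Decode a 'Scancode Map' hex string into (pressed_key, sent_key) pairs."""
    data = bytes(bytearray(int(b, 16) for b in hex_text.split(",")))
    version, flags, count = struct.unpack_from("<III", data, 0)
    pairs = []
    for n in range(count - 1):           # the last entry is the null terminator
        sent, pressed = struct.unpack_from("<HH", data, 12 + 4 * n)
        pairs.append((pressed, sent))
    return pairs

value = "00,00,00,00,00,00,00,00,03,00,00,00,1d,00,3a,00,3a,00,1d,00,00,00,00,00"
pairs = decode_scancode_map(value)
```

The two decoded pairs say that pressing Caps Lock sends Left Ctrl and pressing Left Ctrl sends Caps Lock, which is the swap.
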
Monday, October 10, 2011

nokia e73 model

documents:

* Java(TM) ME Developer's Library 2.3
