Mahout is a set of machine learning tools built on Hadoop.
--SequenceFile
Hadoop MapReduce processes data as key-value pairs, and SequenceFile is the persistent binary format for storing such pairs. As we might expect, Hadoop jobs commonly use SequenceFiles to pass intermediate results from one job to the next. We do not work with the SequenceFile class itself but with its nested classes: there is a Reader, a Writer, and a Sorter. We do not construct the Writer directly either; instead we call createWriter, which returns the proper writer (for example, depending on the compression settings).
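To make the writer/reader pattern concrete, here is a minimal sketch in plain Java that appends key-value records to a binary file through a factory-obtained writer and then reads them back in order. It is only a model of the idea, not Hadoop's SequenceFile format or API: the class `KeyValueFileSketch` and its methods `createWriter`, `append`, and `readAll` are hypothetical names chosen to mirror the factory style, and the record layout (a UTF string key followed by a double value) is an assumption made for illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a key-value record file; NOT Hadoop's SequenceFile.
public class KeyValueFileSketch {

    // The writer can only be obtained through createWriter (private constructor).
    public static final class Writer implements AutoCloseable {
        private final DataOutputStream out;

        private Writer(DataOutputStream out) {
            this.out = out;
        }

        // Append one record: the key first, then the value.
        public void append(String key, double value) throws IOException {
            out.writeUTF(key);
            out.writeDouble(value);
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    // Factory method: callers never construct a Writer themselves.
    public static Writer createWriter(Path file) throws IOException {
        return new Writer(new DataOutputStream(Files.newOutputStream(file)));
    }

    // Read every record back in the order it was written.
    public static Map<String, Double> readAll(Path file) throws IOException {
        Map<String, Double> records = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            while (true) {
                String key;
                try {
                    key = in.readUTF();
                } catch (EOFException endOfFile) {
                    break; // no more records
                }
                records.put(key, in.readDouble());
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".bin");
        try (Writer writer = createWriter(file)) {
            writer.append("row-1", 0.5);
            writer.append("row-2", 1.5);
        }
        System.out.println(readAll(file));
        Files.delete(file);
    }
}
```

The private constructor is the point of the design: because the only way to get a writer is the factory method, the factory is free to choose which concrete writer to hand back.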
--Mahout Vector
A vector is often used to represent a record in real applications, and Vector is the input format for most Mahout algorithms. There are two kinds of vector implementation: dense and sparse, where the sparse kind comes in random-access and sequential-access variants. So no matter where our data comes from or what format it is in, we have to convert it into the Mahout vector format. Note that the elements of a Mahout vector are always floating-point numbers (doubles). To store vectors in a SequenceFile for Mahout, the key is the record id (row number, text id, you name it) and the value is the Mahout vector.
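The difference between dense and sparse storage can be sketched in plain Java. This is a rough model under stated assumptions, not Mahout's own Vector classes: `VectorStorageSketch`, `toSparse`, and `dot` are hypothetical names, and a sorted map stands in for a sparse vector whose entries are visited in index order.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of dense vs. sparse storage; NOT Mahout's Vector classes.
public class VectorStorageSketch {

    // Dense storage is a plain double[]: one slot per dimension, zeros included.
    // Sparse storage keeps only the non-zero entries, sorted by index.
    public static TreeMap<Integer, Double> toSparse(double[] dense) {
        TreeMap<Integer, Double> nonZeros = new TreeMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                nonZeros.put(i, dense[i]);
            }
        }
        return nonZeros;
    }

    // Dot product that visits only the stored (non-zero) sparse entries.
    public static double dot(TreeMap<Integer, Double> sparse, double[] dense) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : sparse.entrySet()) {
            sum += entry.getValue() * dense[entry.getKey()];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] record = {0.0, 0.0, 3.0, 0.0, 1.5};
        TreeMap<Integer, Double> sparse = toSparse(record);
        System.out.println("stored entries: " + sparse.size() + " of " + record.length);
        System.out.println("dot with itself: " + dot(sparse, record));
    }
}
```

When most elements are zero, as with text records turned into term vectors, the sparse form stores and iterates over far fewer values than the dense one.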