Java: How do I vectorize a text file in Mahout?
I have a text file containing labels and tweets, and I need to convert each line into a vector. If I use the seq2sparse command, the whole document is converted into a vector, but I need one vector per line, not one for the whole document.
Example:
key: positive, value: vector (tweet)
How can we achieve this in Mahout?
/* This is what I have done */
StringTokenizer str = new StringTokenizer(line, ",");
String label = str.nextToken();
while (str.hasMoreTokens()) {
    tweetline = str.nextToken();
    System.out.println("Tweetline" + tweetline);
    StringTokenizer words = new StringTokenizer(tweetline, " ");
    while (words.hasMoreTokens()) {
        featureList.add(words.nextToken());
    }
}
Vector unclassifiedInstanceVector = new RandomAccessSparseVector(tweetline.split(" ").length);
FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder(label);
vectorEncoder.setProbes(1);
System.out.println("Feature List: " + featureList);
for (Object feature : featureList) {
    vectorEncoder.addToVector((String) feature, unclassifiedInstanceVector);
}
context.write(new Text("/" + label), new VectorWritable(unclassifiedInstanceVector));
Thanks in advance.
You can use SequenceFile.Writer to write it to your application's HDFS path:
Configuration conf = HBaseConfiguration.create();
FileSystem fs = FileSystem.get(conf);
String newPath = "/foo/mahouttest/part-r-00000";
Path newPathFile = new Path(newPath);
Text key = new Text();
VectorWritable value = new VectorWritable();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, newPathFile,
        key.getClass(), value.getClass());
.....
// One record per input line: the key is the label, the value is that line's vector.
key.set("c/" + label);
value.set(unclassifiedInstanceVector);
writer.append(key, value);
// Call writer.close() once every line has been appended.
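To make the "one vector per line" idea concrete, here is a minimal sketch that uses only the Java standard library. It is not Mahout code: the class `LineVectorizer`, its methods, and the `"label,tweet"` line layout are hypothetical names and assumptions chosen for illustration, and a plain `Map<Integer, Double>` stands in for a sparse vector. It hashes each token of a line into a fixed-size index space, which is roughly the idea behind hashed feature encoders, and it builds a fresh vector for every line instead of sharing one feature list across lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative stand-in for per-line hashed feature encoding (not a Mahout API). */
final class LineVectorizer {

    /** One labelled sparse vector: index -> accumulated weight. */
    static final class LabeledVector {
        public final String label;
        public final Map<Integer, Double> weights;

        LabeledVector(String label, Map<Integer, Double> weights) {
            this.label = label;
            this.weights = weights;
        }
    }

    /** Hashes every whitespace-separated token of one tweet into [0, cardinality). */
    static Map<Integer, Double> vectorize(String tweet, int cardinality) {
        Map<Integer, Double> weights = new TreeMap<>();
        for (String token : tweet.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // a blank tweet yields an empty vector
            }
            int index = Math.floorMod(token.hashCode(), cardinality);
            weights.merge(index, 1.0, Double::sum); // repeated or colliding tokens add up
        }
        return weights;
    }

    /** Splits "label,tweet text" at the first comma only, so commas inside the tweet survive. */
    static LabeledVector parseLine(String line, int cardinality) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("expected 'label,tweet': " + line);
        }
        String label = line.substring(0, comma).trim();
        String tweet = line.substring(comma + 1);
        return new LabeledVector(label, vectorize(tweet, cardinality));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "positive,great day great mood",
                "negative,bad, really bad service");
        for (String line : lines) {
            LabeledVector v = parseLine(line, 1000); // new vector for each line
            System.out.println("/" + v.label + " -> " + v.weights);
        }
    }
}
```

In a real job, each `LabeledVector` would correspond to one `writer.append(key, value)` call, with the label as the key and the line's vector as the value. The cardinality (1000 here) is an arbitrary choice for the sketch; it is fixed up front rather than derived from the number of words in a line, so that indices from different lines refer to the same feature space.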