How to keep data entry IDs in Mahout K-means clustering


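I am running k-means clustering with Mahout and have run into a problem identifying data entries during clustering. For example, I have 100 data entries: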

id      data
0       0.1 0.2 0.3 0.4
1       0.2 0.3 0.4 0.5
...     ...
100     0.2 0.4 0.4 0.5

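After clustering, I need to get the IDs back from the clustering result to see which point belongs to which cluster, but there seems to be no way to keep the IDs.

In the official Mahout example that clusters the synthetic control data, only the data is fed to Mahout, without any IDs: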

28.7812 34.4632 31.3381 31.2834 28.9207 ...
...
24.8923 25.741  27.5532 32.8217 27.8789 ...

and the clustering result contains only the cluster ID and the point values:

VL-539{n=38 c=[29.950, 30.459, ...
   Weight:  Point:
   1.0: [28.974, 29.026, 31.404, 27.894, 35.985...
   2.0: [24.214, 33.150, 31.521, 31.986, 29.064

No point ID is present, though. Does anyone know how to add and keep point IDs when clustering with Mahout? Many thanks.

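Requests like yours are often overlooked by programmers who are not practitioners themselves... unfortunately. I don't know how to do this (so far), but I started from Apache Commons Math, which includes a K-means with the same shortcoming. I modified it to meet your requirement. You can find it here:

Also, don't forget to normalize your data (linearly) to the interval [0..1]; otherwise features with large numeric ranges dominate the distance measure and any clustering algorithm will produce garbage.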
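As an illustration of the linear normalization mentioned above, here is a minimal plain-Java sketch of min-max scaling applied column by column. The class name `MinMaxNormalizer` and its `normalize` method are hypothetical and not part of Mahout or Commons Math; the sketch assumes all rows have the same number of columns and maps a constant column to 0.

```java
import java.util.Arrays;

/** Hypothetical helper: linear (min-max) scaling of each column to [0, 1]. */
final class MinMaxNormalizer {

    /** Returns a new matrix in which every column is rescaled to [0, 1]. */
    static double[][] normalize(double[][] rows) {
        if (rows.length == 0) {
            return new double[0][];
        }
        int cols = rows[0].length; // assumption: rectangular input
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: find the minimum and maximum of every column.
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }

        // Second pass: rescale each value with (x - min) / (max - min).
        double[][] scaled = new double[rows.length][cols];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols; j++) {
                double range = max[j] - min[j];
                // A constant column has no spread; map it to 0 to avoid dividing by zero.
                scaled[i][j] = range == 0.0 ? 0.0 : (rows[i][j] - min[j]) / range;
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[][] data = {
            {0.1, 0.2, 0.3, 0.4},
            {0.2, 0.3, 0.4, 0.5},
            {0.2, 0.4, 0.4, 0.5},
        };
        for (double[] row : normalize(data)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The scaling has to be computed over the whole data set before clustering, and the IDs should be left untouched so that they can still be matched afterwards.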

The clusteredPoints directory produced by kmeans contains this mapping.

Note that you need to pass the -cl option to get this data.

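To achieve this, I use NamedVector.

As you know, before you can run any clustering on your data, you have to vectorize it.

This means you have to convert your data into Mahout vectors, because that is the data type the clustering algorithms work with.

The vectorization process depends on the nature of your data: vectorizing text is not the same as vectorizing numeric values.

Your data seems easy to vectorize, since each row only has an ID and 4 numeric values.

You can write a Hadoop job that takes your input data as a CSV file and outputs a sequence file in which the data is already vectorized.

Then you apply the Mahout clustering algorithm to that input, and the ID (the vector name) of each vector is kept in the clustering results.

An example job to vectorize your data could be implemented with the following classes: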

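import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n", getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        // new Job(conf, name) is deprecated on Hadoop 2+, where Job.getInstance(conf, name) is preferred.
        Job job = new Job(getConf(), "Create Dense Vectors from CSV input");
        job.setJarByClass(DenseVectorizationDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(DenseVectorizationMapper.class);
        job.setReducerClass(DenseVectorizationReducer.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);

        // Mahout's clustering jobs read their input vectors from sequence files.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Entry point so the job can be launched with "hadoop jar".
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new DenseVectorizationDriver(), args));
    }
}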


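import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationMapper extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {
/*
 * This mapper class takes the input from a CSV file whose fields are separated by TAB and emits
 * the same key it receives (the byte offset of the line, useless in this case) and a NamedVector as value.
 * The "name" of the NamedVector is the ID of each row.
 */
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String[] lineParts = line.split("\t", -1);
        String id = lineParts[0];

        //you should do some checks here to assure that this piece of data is correct

        // lineParts[0] is the ID, so the vector holds the remaining lineParts.length - 1 values.
        Vector vector = new DenseVector(lineParts.length - 1);
        for (int i = 1; i < lineParts.length; i++) {
            // Field i goes to vector index i - 1, because vector indices start at 0.
            vector.set(i - 1, Double.parseDouble(lineParts[i]));
        }

        // Wrapping the vector in a NamedVector carries the row ID through clustering.
        vector = new NamedVector(vector, id);

        context.write(key, new VectorWritable(vector));
    }
}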


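import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationReducer extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {
/*
 * This reducer simply writes the output without doing any computation.
 * Maybe it would be better to define this hadoop job without reduce phase.
 */
    @Override
    public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context) throws IOException, InterruptedException {

        // Keys are byte offsets, which can repeat across input files,
        // so every value is written rather than only the first one.
        for (VectorWritable writeValue : values) {
            context.write(key, writeValue);
        }
    }
}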
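The mapper above leaves the input checks as a comment. One way to approach them is sketched below in plain Java, so that the parsing logic can be tried outside Hadoop. `RowParser`, `Row` and the `expectedValues` parameter are hypothetical names introduced for illustration only; the sketch assumes TAB-separated fields with the ID first, as in the mapper, and returns null for malformed lines so that the caller can skip them.

```java
import java.util.Arrays;

/** Hypothetical helper mirroring the mapper's parsing step, without Hadoop or Mahout types. */
final class RowParser {

    /** Parsed row: the ID from the first field plus the numeric values that follow it. */
    static final class Row {
        final String id;
        final double[] values;

        Row(String id, double[] values) {
            this.id = id;
            this.values = values;
        }
    }

    /** Parses "id TAB v1 TAB v2 ..." and returns null when the line is malformed. */
    static Row parse(String line, int expectedValues) {
        if (line == null) {
            return null;
        }
        String[] parts = line.split("\t", -1);
        // Reject lines with the wrong number of fields or an empty ID.
        if (parts.length != expectedValues + 1 || parts[0].trim().isEmpty()) {
            return null;
        }
        double[] values = new double[expectedValues];
        for (int i = 1; i < parts.length; i++) {
            try {
                // Field i is stored at index i - 1 because the ID occupies field 0.
                values[i - 1] = Double.parseDouble(parts[i].trim());
            } catch (NumberFormatException e) {
                return null; // non-numeric or empty field
            }
        }
        return new Row(parts[0].trim(), values);
    }

    public static void main(String[] args) {
        Row row = parse("1\t0.2\t0.3\t0.4\t0.5", 4);
        System.out.println(row.id + " -> " + Arrays.toString(row.values));
        System.out.println(parse("1\t0.2\tabc\t0.4\t0.5", 4) == null);
    }
}
```

Inside the mapper, a null result would simply mean returning without calling context.write, so a bad row never reaches the clustering step.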


I didn't read through all of your code, but your first line was enough. "NamedVector"!