Java Apache Flink: filter as termination criterion

Tags: java, filter, k-means, apache-flink

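I have defined a filter as the termination criterion for k-means. When I run my application, it always computes only one iteration.

I think the problem is here: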

DataSet<GeoTimeDataCenter> finalCentroids = loop.closeWith(newCentroids, newCentroids.join(loop).where("*").equalTo("*").filter(new MyFilter()));

Or maybe it is the filter function:

public static final class MyFilter implements FilterFunction<Tuple2<GeoTimeDataCenter, GeoTimeDataCenter>> {

    private static final long serialVersionUID = 5868635346889117617L;

    public boolean filter(Tuple2<GeoTimeDataCenter, GeoTimeDataCenter> tuple) throws Exception {
        if(tuple.f0.equals(tuple.f1)) {
            return true;
        }
        else {
            return false;
        }
    }
}

Best regards, Paul

My complete code is as follows:

public void run() {   
    //load properties
    Properties pro = new Properties();
    FileSystem fs = null;
    try {
        pro.load(FlinkMain.class.getResourceAsStream("/config.properties"));
        fs = FileSystem.get(new URI(pro.getProperty("hdfs.namenode")),new org.apache.hadoop.conf.Configuration());
    } catch (Exception e) {
        e.printStackTrace();
    }

    int maxIteration = Integer.parseInt(pro.getProperty("maxiterations"));
    String outputPath = fs.getHomeDirectory()+pro.getProperty("flink.output");
    // set up execution environment
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // get input points
    DataSet<GeoTimeDataTupel> points = getPointDataSet(env);
    DataSet<GeoTimeDataCenter> centroids = null;
    try {
        centroids = getCentroidDataSet(env);
    } catch (Exception e1) {
        e1.printStackTrace();
    }
    // set number of bulk iterations for KMeans algorithm
    IterativeDataSet<GeoTimeDataCenter> loop = centroids.iterate(maxIteration);
    DataSet<GeoTimeDataCenter> newCentroids = points
        // compute closest centroid for each point
        .map(new SelectNearestCenter(this.getBenchmarkCounter())).withBroadcastSet(loop, "centroids")
        // count and sum point coordinates for each centroid
        .groupBy(0).reduceGroup(new CentroidAccumulator())
        // compute new centroids from point counts and coordinate sums
        .map(new CentroidAverager(this.getBenchmarkCounter()));
    // feed new centroids back into next iteration with termination condition
    DataSet<GeoTimeDataCenter> finalCentroids = loop.closeWith(newCentroids, newCentroids.join(loop).where("*").equalTo("*").filter(new MyFilter()));
    DataSet<Tuple2<Integer, GeoTimeDataTupel>> clusteredPoints = points
        // assign points to final clusters
        .map(new SelectNearestCenter(-1)).withBroadcastSet(finalCentroids, "centroids");
    // emit result
    clusteredPoints.writeAsCsv(outputPath+"/points", "\n", " ");
    finalCentroids.writeAsText(outputPath+"/centers");//print();
    // execute program
    try {
        env.execute("KMeans Flink");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static final class MyFilter implements FilterFunction<Tuple2<GeoTimeDataCenter, GeoTimeDataCenter>> {

    private static final long serialVersionUID = 5868635346889117617L;

    public boolean filter(Tuple2<GeoTimeDataCenter, GeoTimeDataCenter> tuple) throws Exception {
        if(tuple.f0.equals(tuple.f1)) {
            return true;
        }
        else {
            return false;
        }
    }
}

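I think the problem is the filter function (modulo the code you haven't posted). Flink's termination criterion works as follows: the criterion is met if the provided termination DataSet is empty. Otherwise, the next iteration is started, provided the maximum number of iterations has not been exceeded.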
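The control flow described above can be modelled in plain Java. This is only a sketch of the semantics, not Flink code: the class `BulkIterationModel`, its methods, and the toy step function are hypothetical names introduced for illustration, and the model assumes the termination set is checked once at the end of every superstep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.UnaryOperator;

// Plain-Java model of a bulk iteration with a termination criterion.
// Not Flink API; all names here are made up for illustration.
class BulkIterationModel {

    /**
     * Runs step at most maxIterations times and stops early as soon as the
     * termination set computed from (previous, next) is empty.
     * Returns the number of supersteps that were executed.
     */
    static int run(List<Integer> initial, int maxIterations,
                   UnaryOperator<List<Integer>> step,
                   BinaryOperator<List<Integer>> terminationSet) {
        List<Integer> current = initial;
        int supersteps = 0;
        while (supersteps < maxIterations) {
            List<Integer> next = step.apply(current);
            supersteps++;
            // An empty termination set means "criterion met, stop iterating".
            boolean stop = terminationSet.apply(current, next).isEmpty();
            current = next;
            if (stop) {
                break;
            }
        }
        return supersteps;
    }

    /** Toy step: every value decreases by one until it reaches a multiple of 10. */
    static List<Integer> step(List<Integer> values) {
        List<Integer> out = new ArrayList<>();
        for (int v : values) {
            out.add(v % 10 == 0 ? v : v - 1);
        }
        return out;
    }

    /** Termination set: values of next that do not occur in previous. */
    static List<Integer> changed(List<Integer> previous, List<Integer> next) {
        List<Integer> out = new ArrayList<>();
        for (Integer v : next) {
            if (!previous.contains(v)) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> start = Arrays.asList(13, 25);
        // Stops once nothing changes any more.
        System.out.println(run(start, 100, BulkIterationModel::step, BulkIterationModel::changed));
        // Stops at the iteration limit, before convergence.
        System.out.println(run(start, 3, BulkIterationModel::step, BulkIterationModel::changed));
    }
}
```

The point of the model is that an empty termination set ends the loop, so the set has to contain the elements that still justify another iteration.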

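Flink's filter function keeps only those elements for which the FilterFunction returns true. With your MyFilter implementation, you therefore keep only the centroids that are identical before and after the iteration. This means you obtain an empty DataSet if all centroids have changed, and the iteration terminates. This is clearly the inverse of the actual termination criterion, which should be: continue with k-means as long as at least one centroid has changed.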
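The inversion can be seen on a small example in plain Java. This is a sketch under assumptions, not the asker's code: `Centroid` is a hypothetical stand-in for GeoTimeDataCenter, `unchanged` models the join on all fields followed by the equals filter (which can only ever pair identical centroids), and `changed` models the criterion the iteration actually needs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Plain-Java illustration of the two candidate termination sets.
// Not Flink API; all names here are made up for illustration.
class TerminationSets {

    /** Hypothetical stand-in for GeoTimeDataCenter: an id plus one coordinate. */
    static final class Centroid {
        final int id;
        final double x;

        Centroid(int id, double x) {
            this.id = id;
            this.x = x;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Centroid)) {
                return false;
            }
            Centroid c = (Centroid) o;
            return id == c.id && Double.compare(x, c.x) == 0;
        }

        @Override
        public int hashCode() {
            return Objects.hash(id, x);
        }
    }

    /** Models join on all fields + equals filter: new centroids that also exist in the old set. */
    static List<Centroid> unchanged(List<Centroid> oldC, List<Centroid> newC) {
        Set<Centroid> old = new HashSet<>(oldC);
        List<Centroid> out = new ArrayList<>();
        for (Centroid c : newC) {
            if (old.contains(c)) {
                out.add(c);
            }
        }
        return out;
    }

    /** Models the desired criterion: new centroids with no equal partner in the old set. */
    static List<Centroid> changed(List<Centroid> oldC, List<Centroid> newC) {
        Set<Centroid> old = new HashSet<>(oldC);
        List<Centroid> out = new ArrayList<>();
        for (Centroid c : newC) {
            if (!old.contains(c)) {
                out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Centroid> before = Arrays.asList(new Centroid(1, 1.0), new Centroid(2, 5.0));
        List<Centroid> moved = Arrays.asList(new Centroid(1, 1.5), new Centroid(2, 5.5));

        // Every centroid moved: "unchanged" is empty, so the loop would stop too early.
        System.out.println("unchanged after move: " + unchanged(before, moved).size());
        System.out.println("changed after move:   " + changed(before, moved).size());

        // Converged: "changed" is empty, which is when the loop should stop.
        System.out.println("unchanged at fixpoint: " + unchanged(moved, moved).size());
        System.out.println("changed at fixpoint:   " + changed(moved, moved).size());
    }
}
```

With the `unchanged` set as criterion, the first iteration of k-means (where typically every centroid moves) already yields an empty set, which matches the symptom of exactly one iteration being computed.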

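You can do this with a coGroup function in which you emit an element if there is no matching centroid in the preceding centroid DataSet. This is similar to a left outer join, except that you discard the non-null matches. In your program, the result of such a coGroup would take the place of the join/filter expression passed as the second argument to closeWith.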

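public static void main(String[] args) throws Exception {
    // set up the execution environment
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<Element> oldDS = env.fromElements(new Element(1, "test"), new Element(2, "test"), new Element(3, "foobar"));
    DataSet<Element> newDS = env.fromElements(new Element(1, "test"), new Element(3, "foobar"), new Element(4, "test"));

    // keep only the elements of newDS that have no equal partner in oldDS
    DataSet<Element> filtered = newDS.coGroup(oldDS).where("*").equalTo("*").with(new FilterCoGroup());

    filtered.print();
}

public static class FilterCoGroup implements CoGroupFunction<Element, Element, Element> {

    @Override
    public void coGroup(
            Iterable<Element> newElements,
            Iterable<Element> oldElements,
            Collector<Element> collector) throws Exception {

        // materialize the old elements so they can be scanned once per new element
        List<Element> persistedElements = new ArrayList<Element>();

        for (Element element : oldElements) {
            persistedElements.add(element);
        }

        for (Element newElement : newElements) {
            boolean contained = false;

            for (Element oldElement : persistedElements) {
                if (newElement.equals(oldElement)) {
                    contained = true;
                }
            }

            // emit only elements without a match in the old DataSet
            if (!contained) {
                collector.collect(newElement);
            }
        }
    }
}

public static class Element implements Key<Element> {
    private int id;
    private String name;

    public Element(int id, String name) {
        this.id = id;
        this.name = name;
    }

    public Element() {
        this(-1, "");
    }

    @Override
    public int hashCode() {
        return 31 + 7 * name.hashCode() + 11 * id;
    }

    @Override
    public boolean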