Apache Flink: Optimizing a Top-N query with Flink SQL


I am trying to run a streaming Top-N query with Flink SQL, but I cannot get the "optimized version" to work. The setup is as follows:

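I have a Kafka topic in which each record contains a tuple (GUID, reached score, maximum possible score). Think of each GUID as a student taking an assessment, with the tuple recording how many points they achieved.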

What I want is a list of the five GUIDs with the highest scores, measured as a percentage (i.e. ordered by SUM(reached_score) / SUM(max_score)).
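To make the intended ranking concrete, here is a minimal plain-Java sketch (no Flink involved) of the same computation over an in-memory list. The class name `TopNReference`, the helper `Score`, and the sample values are hypothetical and only illustrate the ordering I expect from the streaming query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopNReference {

    /** One input record: (guid, reached_score, max_score). Hypothetical helper type. */
    static final class Score {
        final String guid;
        final long reached;
        final long max;

        Score(String guid, long reached, long max) {
            this.guid = guid;
            this.reached = reached;
            this.max = max;
        }
    }

    /** Ratio SUM(reached_score) / SUM(max_score); assumes the max sum is not zero. */
    static double ratio(long[] sums) {
        return sums[0] / (double) sums[1];
    }

    /** Returns up to n GUIDs ordered by their aggregated ratio, highest first. */
    static List<String> topN(List<Score> records, int n) {
        // Aggregate both sums per GUID, like the GROUP BY guid query.
        Map<String, long[]> sums = new LinkedHashMap<>();
        for (Score s : records) {
            long[] acc = sums.computeIfAbsent(s.guid, k -> new long[2]);
            acc[0] += s.reached;
            acc[1] += s.max;
        }
        // Sort GUIDs by ratio descending, like ORDER BY score DESC.
        List<String> guids = new ArrayList<>(sums.keySet());
        guids.sort(Comparator.comparingDouble((String g) -> ratio(sums.get(g))).reversed());
        // Keep the first n, like WHERE row_num <= n.
        return new ArrayList<>(guids.subList(0, Math.min(n, guids.size())));
    }

    public static void main(String[] args) {
        List<Score> records = new ArrayList<>();
        records.add(new Score("a", 112, 200)); // ratio 0.56
        records.add(new Score("b", 76, 200));  // ratio 0.38 so far
        records.add(new Score("b", 278, 200)); // sums become 354 / 400 = 0.885
        System.out.println(topN(records, 5));
    }
}
```

With these sample records, "b" ends up ahead of "a" because its aggregated ratio is higher after the second record arrives.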

I start by aggregating the scores and grouping them by GUID:

EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, bsSettings);

Table scores = tableEnv.fromDataStream(/* stream from kafka */, "guid, reached_score, max_score");
tableEnv.registerTable("scores", scores);

Table aggregatedScores = tableEnv.sqlQuery(
        "SELECT " +
        "  guid, " +
        "  SUM(reached_score) as reached_score, " +
        "  SUM(max_score) as max_score, " +
        "  SUM(reached_score) / CAST(SUM(max_score) AS DOUBLE) as score " +
        "FROM scores " +
        "GROUP BY guid");

tableEnv.registerTable("agg_scores", aggregatedScores);
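The resulting table contains an unsorted list of aggregated scores. I then tried to feed it into a Top-N query, as it is used in the Flink documentation: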

Table topN = tableEnv.sqlQuery(
        "SELECT guid, reached_score, max_score, score, row_num " +
        "FROM (" +
        "   SELECT *," +
        "       ROW_NUMBER() OVER (ORDER BY score DESC) as row_num" +
        "   FROM agg_scores)" +
        "WHERE row_num <= 5");


tableEnv.toRetractStream(topN, Row.class).print();
Then, as suggested in the documentation, I removed row_num from the projection:

Table topN = tableEnv.sqlQuery(
    "SELECT guid, reached_score, max_score, score " +
    "FROM (" +
    "   SELECT *," +
    "       ROW_NUMBER() OVER (ORDER BY score DESC) as row_num" +
    "   FROM agg_scores)" +
    "WHERE row_num <= 5");

I have checked this issue and can reproduce it in my local environment as well. I also did some investigation, and the reason is:

"We do not apply this optimization in some scenarios, and your case appears to be one of them."

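However, according to the user documentation, I think it is a valid request to have this optimization cover your scenario as well. In my opinion this looks like a bug: we claim to perform an optimization, but it does not take effect.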

I have created an issue to track this problem, and hopefully we can fix it in the upcoming 1.9.2 and 1.10.0 releases.

Thanks for reporting this.

Retract stream output from the question, for the Top-N query without row_num in the projection:

// add first entry
4> (true,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56)

// add a second entry with lower score below the first one
5> (true,d7847f58-a4d9-40f8-a38d-161821b48481,76,200,0.38)

// update the second entry with a much higher score
7> (true,d7847f58-a4d9-40f8-a38d-161821b48481,354,400,0.885)
1> (true,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56) <-- ???
8> (false,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56) <-- ???
6> (false,d7847f58-a4d9-40f8-a38d-161821b48481,76,200,0.38)
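To check what the messages above amount to, the following plain-Java sketch replays them and computes the rows that remain. It relies on the retract stream convention that a `true` flag adds a row and a `false` flag retracts it. The class `RetractReplay` and its `Msg` helper are hypothetical names used only for illustration. Because the printed lines come from different print subtasks, the sketch counts adds and retractions per row instead of depending on message order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RetractReplay {

    /** One retract stream message: add == true inserts the row, add == false retracts it. */
    static final class Msg {
        final boolean add;
        final String row;

        Msg(boolean add, String row) {
            this.add = add;
            this.row = row;
        }
    }

    /** Folds all messages into the rows whose net count (adds minus retractions) is positive. */
    static List<String> replay(List<Msg> msgs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Msg m : msgs) {
            counts.merge(m.row, m.add ? 1 : -1, Integer::sum);
        }
        List<String> remaining = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 0) {
                remaining.add(e.getKey());
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        String first = "63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56";
        String secondOld = "d7847f58-a4d9-40f8-a38d-161821b48481,76,200,0.38";
        String secondNew = "d7847f58-a4d9-40f8-a38d-161821b48481,354,400,0.885";

        List<Msg> msgs = new ArrayList<>();
        msgs.add(new Msg(true, first));
        msgs.add(new Msg(true, secondOld));
        msgs.add(new Msg(true, secondNew));
        msgs.add(new Msg(true, first));      // marked ??? in the output
        msgs.add(new Msg(false, first));     // marked ??? in the output
        msgs.add(new Msg(false, secondOld));

        for (String row : replay(msgs)) {
            System.out.println(row);
        }
    }
}
```

The two messages marked `???` cancel each other out, so the net result still holds one row per GUID (0.56 and 0.885). The sketch only shows that the final state is consistent; it does not explain why the extra add/retract pair is emitted.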