Apache Flink: Optimizing a Top-N query using Flink SQL
I'm trying to run a streaming Top-N query using Flink SQL, but I can't get the "optimized version" to work. The setup is as follows:
I have a Kafka topic where each record contains a tuple (GUID, reached score, maximum possible score). Think of a student taking an assessment: the tuple represents how many points the student achieved.
What I want is a list of the five GUIDs with the highest score, measured as a percentage (i.e. sorted by SUM(reached_score) / SUM(max_score)).
I started by aggregating the scores and grouping them by GUID:
EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, bsSettings);
Table scores = tableEnv.fromDataStream(/* stream from kafka */, "guid, reached_score, max_score");
tableEnv.registerTable("scores", scores);
Table aggregatedScores = tableEnv.sqlQuery(
"SELECT " +
" guid, " +
" SUM(reached_score) as reached_score, " +
" SUM(max_score) as max_score, " +
" SUM(reached_score) / CAST(SUM(max_score) AS DOUBLE) as score " +
"FROM scores " +
"GROUP BY guid");
tableEnv.registerTable("agg_scores", aggregatedScores);
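To make the aggregation step concrete, here is a minimal plain-Java sketch of what the GROUP BY query computes per guid. It does not use Flink at all; the class and method names (ScoreAggregation, Totals, add) are made up for illustration, and the second record for the d7847f58 guid (278 of 200) is inferred from the totals in the sample output (76/200 becoming 354/400).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink: keeps running sums per guid.
final class ScoreAggregation {

    /** Running totals for one guid; mirrors SUM(reached_score) and SUM(max_score). */
    static final class Totals {
        long reached;
        long max;

        /** Same as SUM(reached_score) / CAST(SUM(max_score) AS DOUBLE). */
        double score() {
            return reached / (double) max;
        }
    }

    private final Map<String, Totals> byGuid = new LinkedHashMap<>();

    /** Adds one record and returns the updated totals for that guid. */
    Totals add(String guid, long reachedScore, long maxScore) {
        Totals t = byGuid.computeIfAbsent(guid, k -> new Totals());
        t.reached += reachedScore;
        t.max += maxScore;
        return t;
    }

    public static void main(String[] args) {
        ScoreAggregation agg = new ScoreAggregation();
        agg.add("63992935-9684-4285-8c2b-1fd57b51b48f", 112, 200);
        agg.add("d7847f58-a4d9-40f8-a38d-161821b48481", 76, 200);
        // Inferred second record for the same guid: totals become 354 / 400.
        Totals t = agg.add("d7847f58-a4d9-40f8-a38d-161821b48481", 278, 200);
        System.out.println(t.reached + "/" + t.max + " -> " + t.score());
    }
}
```

Every new record for a guid changes that guid's aggregated row, which is why the downstream query sees updates rather than only inserts.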
The resulting table contains an unsorted list of aggregated scores. I then tried to feed it into a Top-N query, as shown in the Flink documentation:
Table topN = tableEnv.sqlQuery(
"SELECT guid, reached_score, max_score, score, row_num " +
"FROM (" +
" SELECT *," +
" ROW_NUMBER() OVER (ORDER BY score DESC) as row_num" +
" FROM agg_scores)" +
"WHERE row_num <= 5");
tableEnv.toRetractStream(topN, Row.class).print();
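For readers less familiar with the ROW_NUMBER() pattern, the following plain-Java sketch shows the ranking the query asks for: order the aggregated rows by score descending and keep the first n. It is only an illustration of the semantics, not how Flink evaluates the query; TopNSketch and topN are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink.
final class TopNSketch {

    /**
     * Returns up to n guids ordered by score descending.
     * The list index + 1 plays the role of row_num in the SQL query.
     */
    static List<String> topN(Map<String, Double> scoreByGuid, int n) {
        List<Map.Entry<String, Double>> rows = new ArrayList<>(scoreByGuid.entrySet());
        // ORDER BY score DESC
        rows.sort(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()));
        List<String> result = new ArrayList<>();
        // WHERE row_num <= n
        for (int i = 0; i < rows.size() && i < n; i++) {
            result.add(rows.get(i).getKey());
        }
        return result;
    }
}
```

In a streaming job this ranking has to be maintained continuously, so a change in one guid's score can shift the row numbers of other guids as well.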
Then, as suggested in the documentation, I removed row_num from the projection:
Table topN = tableEnv.sqlQuery(
"SELECT guid, reached_score, max_score, score " +
"FROM (" +
" SELECT *," +
" ROW_NUMBER() OVER (ORDER BY score DESC) as row_num" +
" FROM agg_scores)" +
"WHERE row_num <= 5");
I have checked this issue and can reproduce it in my local environment. I also did some investigation, and the reason is:
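"We have not implemented this optimization for some scenarios, and your case appears to be one of them."
However, according to the user documentation, I think it is a valid request to have the optimization cover your scenario as well. In my opinion this looks like a bug: we claim to perform an optimization but fail to do so.
I have opened an issue to track this; hopefully we can fix it in the upcoming 1.9.2 and 1.10.0 releases.
Thanks for reporting this.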
Table topN = tableEnv.sqlQuery(
"SELECT guid, reached_score, max_score, score " +
"FROM (" +
" SELECT *," +
" ROW_NUMBER() OVER (ORDER BY score DESC) as row_num" +
" FROM agg_scores)" +
"WHERE row_num <= 5");
// add first entry
4> (true,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56)
// add a second entry with lower score below the first one
5> (true,d7847f58-a4d9-40f8-a38d-161821b48481,76,200,0.38)
// update the second entry with a much higher score
7> (true,d7847f58-a4d9-40f8-a38d-161821b48481,354,400,0.885)
1> (true,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56) <-- ???
8> (false,63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56) <-- ???
6> (false,d7847f58-a4d9-40f8-a38d-161821b48481,76,200,0.38)
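One way to read the output above: toRetractStream emits pairs where true means "add this row" and false means "retract this row". The sketch below is plain Java with hypothetical names (RetractFold, apply, current); it folds such messages into the rows that are currently valid by counting adds and retractions per row. It assumes nothing about why Flink emits the extra pair marked with ???, and note that lines printed by different subtasks (the N> prefix) are not necessarily in emission order, which is why the fold is written to be order-independent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink: materializes a retract stream.
final class RetractFold {

    // Net count per row: adds minus retractions.
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    /** Applies one message: isAdd == true adds the row, false retracts it. */
    void apply(boolean isAdd, String row) {
        int updated = counts.merge(row, isAdd ? 1 : -1, Integer::sum);
        if (updated == 0) {
            counts.remove(row); // row fully retracted
        }
    }

    /** Rows whose net count is non-zero. */
    Map<String, Integer> current() {
        return counts;
    }

    public static void main(String[] args) {
        RetractFold fold = new RetractFold();
        String first = "63992935-9684-4285-8c2b-1fd57b51b48f,112,200,0.56";
        String secondOld = "d7847f58-a4d9-40f8-a38d-161821b48481,76,200,0.38";
        String secondNew = "d7847f58-a4d9-40f8-a38d-161821b48481,354,400,0.885";
        // The six messages from the output above, in printed order.
        fold.apply(true, first);
        fold.apply(true, secondOld);
        fold.apply(true, secondNew);
        fold.apply(true, first);
        fold.apply(false, first);
        fold.apply(false, secondOld);
        fold.current().forEach((row, count) -> System.out.println(count + " x " + row));
    }
}
```

Folded this way, the second add and the retraction of the 0.56 row cancel each other out, so the net result still contains one row per guid; the pair is redundant traffic rather than a wrong final result.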