Java: how do I run a query on a field built by merging two columns?
I'm building a series of distribution analyses with the Java Spark library. This is the actual code I use to fetch the data from a JSON file and save the output:
Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");
List<GenericAnalyticsEntry> menu_distribution= spark
.sql(" ****REQUESTED QUERY ****")
.toJavaRDD()
.map(row -> Triple.of( row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
.map(GenericAnalyticsEntry::of)
.collect();
writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);
How can I get the following output from the code written above?
+------------+--------------------+----------------------+
| FOODS | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza | 2 | 1 |
+------------+--------------------+----------------------+
| Lasagna | 2 | 1 |
+------------+--------------------+----------------------+
| Spaghetti | 2 | 3 |
+------------+--------------------+----------------------+
| Mozzarella | 0 | 2 |
+------------+--------------------+----------------------+
| Pork | 1 | 0 |
+------------+--------------------+----------------------+
Of course I have tried to work out a solution myself, without success. I may be wrong, but I need something like this:
"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"
From the sample data, this looks like it will produce the output you want:
select
foods,
first_count,
second_count
from
(select first_food as food from menus
union select second_food from menus) as f
left join (
select first_food, count(*) as first_count from menus
group by first_food
) as ff on ff.first_food=f.food
left join (
select second_food, count(*) as second_count from menus
group by second_food
) as sf on sf.second_food=f.food
;
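The counting logic of that query (union of both food columns, one count per column) can be sketched in plain Java without Spark. The `MenuCounts` class and its sample rows are illustrative only: the rows are hypothetical data chosen to reproduce the table in the question, not the contents of `foods.json`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MenuCounts {
    // counts[0] = occurrences as first_food, counts[1] = occurrences as second_food
    static Map<String, long[]> count(String[][] menus) {
        Map<String, long[]> counts = new LinkedHashMap<>();
        for (String[] row : menus) {
            counts.computeIfAbsent(row[0], k -> new long[2])[0]++;
            counts.computeIfAbsent(row[1], k -> new long[2])[1]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical (first_food, second_food) rows matching the table above
        String[][] menus = {
            {"Pizza", "Lasagna"}, {"Pizza", "Spaghetti"},
            {"Lasagna", "Mozzarella"}, {"Lasagna", "Spaghetti"},
            {"Spaghetti", "Pizza"}, {"Spaghetti", "Mozzarella"},
            {"Pork", "Spaghetti"}
        };
        count(menus).forEach((food, c) ->
            System.out.println(food + " " + c[0] + " " + c[1]));
    }
}
```

Each food gets a single counter pair, which is what the `union` of the two columns followed by two left joins achieves in SQL.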
This query gives the following result: "Exception in thread "main" org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'menus.first_food' is not an aggregate function. Wrap '(count(1) AS first_count)' in windowing function(s) or wrap 'menus.first_food' in first() (or first_value) if you don't care which value you get."
My mistake: I had left out the GROUP BY clause. Fixed now.
EDIT: I also had to replace "foods" with "food" in the first line of the SELECT, and then it works... but foods that appear in only one of the two columns come out as NULL instead of 0. Is there a way to fix that?
In your SELECT, try "select food, ISNULL(first_count, 0), ISNULL(second_count, 0)".
I used this fix and it is working: COALESCE(first_count, 0). The final query:
select
food as foods,
coalesce(first_count, 0) as first_count,
coalesce(second_count, 0) as second_count
from
(select first_food as food from menus
union select second_food from menus) as f
left join (
select first_food, count(*) as first_count from menus
group by first_food
) as ff on ff.first_food=f.food
left join (
select second_food, count(*) as second_count from menus
group by second_food
) as sf on sf.second_food=f.food
;
A simple combination of flatMap and groupBy should also do the job (sorry, I cannot check right now whether it is 100% correct):
import spark.sqlContext.implicits._
import org.apache.spark.sql.{Row, functions => F}

val df = Seq(("Pizza", "Pasta"), ("Pizza", "Soup")).toDF("first", "second")
df.flatMap { case Row(first: String, second: String) => Seq((first, 1, 0), (second, 0, 1)) }
  .groupBy("_1")
  .agg(F.sum("_2").as("first_count"), F.sum("_3").as("second_count"))
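The same flatMap idea, emitting one marker tuple per column and then merging the markers per food, can also be sketched with plain Java streams. This is not Spark API code; `FlatMapCounts` and its sample rows are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapCounts {
    // Each row emits (first, [1, 0]) and (second, [0, 1]); merging sums the markers.
    static Map<String, long[]> aggregate(List<String[]> rows) {
        return rows.stream()
            .flatMap(r -> Stream.of(
                Map.entry(r[0], new long[] {1, 0}),
                Map.entry(r[1], new long[] {0, 1})))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> new long[] {a[0] + b[0], a[1] + b[1]}));
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"Pizza", "Pasta"},
            new String[] {"Pizza", "Soup"});
        aggregate(rows).forEach((food, c) ->
            System.out.println(food + " " + c[0] + " " + c[1]));
    }
}
```

The merge function of `Collectors.toMap` plays the role of the `groupBy`/`sum` aggregation in the Spark version, and foods missing from one column naturally keep a 0 instead of NULL.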