How can I run a query in Java on a field merged from two columns?

java, sql, apache-spark

I'm using the Java Spark library to build a series of distribution analyses. This is the actual code I use to read the data from a JSON file and save the output:

import java.math.BigDecimal;
import java.util.List;
import org.apache.commons.lang3.tuple.Triple;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read the JSON file and register it as a temporary view so it can be queried with SQL
Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");

List<GenericAnalyticsEntry> menu_distribution = spark
        .sql(" ****REQUESTED QUERY ****")
        .toJavaRDD()
        // Each result row: a food name plus two occurrence counts
        .map(row -> Triple.of(row.getString(0),
                BigDecimal.valueOf(row.getLong(1)),
                BigDecimal.valueOf(row.getLong(2))))
        .map(GenericAnalyticsEntry::of)
        .collect();

writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);
How can I get the output below from the code written above?

+------------+---------------------+----------------------+
| FOODS      | occurrences (First) | occurrences (Second) |
+------------+---------------------+----------------------+
| Pizza      | 2                   | 1                    |
+------------+---------------------+----------------------+
| Lasagna    | 2                   | 1                    |
+------------+---------------------+----------------------+
| Spaghetti  | 2                   | 3                    |
+------------+---------------------+----------------------+
| Mozzarella | 0                   | 2                    |
+------------+---------------------+----------------------+
| Pork       | 1                   | 0                    |
+------------+---------------------+----------------------+
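
For reference, an input foods.json consistent with those counts could look like the lines below. This is hypothetical (the actual file isn't shown in the question); the column names first_food and second_food are taken from the attempted query further down.

{"first_food": "Pizza",     "second_food": "Spaghetti"}
{"first_food": "Pizza",     "second_food": "Mozzarella"}
{"first_food": "Lasagna",   "second_food": "Spaghetti"}
{"first_food": "Lasagna",   "second_food": "Pizza"}
{"first_food": "Spaghetti", "second_food": "Lasagna"}
{"first_food": "Spaghetti", "second_food": "Mozzarella"}
{"first_food": "Pork",      "second_food": "Spaghetti"}
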
Of course I've tried to figure out a solution on my own, but without success. I may be wrong, but I think I need something like this:

"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"

From the sample data, this looks like it would produce the output you want:

select
    foods,
    first_count,
    second_count
from
    (select first_food as food from menus
    union select second_food from menus) as f
    left join (
        select first_food, count(*) as first_count from menus
        group by first_food
        ) as ff on ff.first_food=f.food
    left join (
        select second_food, count(*) as second_count from menus
        group by second_food
        ) as sf on sf.second_food=f.food
 ;

This code gives me the following error: "Exception in thread "main" org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'menu.first_food' is not an aggregate function. Wrap '(count(1) AS first_count)' in windowing function(s) or wrap 'menu.first_food' in first() (or first_value) if you don't care which value you get."

My bad: I had left out the GROUP BY clauses. Fixed now.

Edit: I also had to change foods to food in the first line of the SELECT, and now it's working... but foods that appear in only one of the columns come out as NULL instead of 0. Is there a way to fix that?

In your SELECT, try: select food, ISNULL(first_count, 0), ISNULL(second_count, 0).

Used this fix and it's working: COALESCE(first_count, 0).
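
Putting the fixes from this exchange together (the GROUP BY clauses, foods changed to food, and COALESCE for the missing counts), the final query can be plugged into the spark.sql(...) call from the question. A sketch, assuming the cs_food view exposes first_food and second_food columns; f.food is aliased back to foods only to match the desired output header:

// Final query against the cs_food view registered in the question's code
String query = "select f.food as foods, "
        + "coalesce(first_count, 0) as first_count, "
        + "coalesce(second_count, 0) as second_count "
        + "from (select first_food as food from cs_food "
        + "union select second_food from cs_food) as f "
        + "left join (select first_food, count(*) as first_count from cs_food "
        + "group by first_food) as ff on ff.first_food = f.food "
        + "left join (select second_food, count(*) as second_count from cs_food "
        + "group by second_food) as sf on sf.second_food = f.food";

Dataset<Row> result = spark.sql(query);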

A simple combination of flatMap and groupBy should do the job (sorry, I can't check right now whether it's 100% correct):

import org.apache.spark.sql.{Row, functions => F}
import spark.sqlContext.implicits._

val df = Seq(("Pizza", "Pasta"), ("Pizza", "Soup")).toDF("first", "second")
// One output row per food, with a marker for which column it came from
df.flatMap { case Row(first: String, second: String) => Seq((first, 1, 0), (second, 0, 1)) }
  .groupBy("_1")
  .agg(F.sum("_2").as("first_count"), F.sum("_3").as("second_count"))