Spark DataFrame: reduce multiple records per id to at most one record

Tags: dataframe, apache-spark, pyspark, reduce

Given a table like the following:

+--+------------------+-----------+
|id|     diagnosis_age|  diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 1|  2.80033330216659| 315.320000|
| 1|   2.8222365762732| 315.320000|
| 1|  5.64822705794013| 325.320000|
| 1| 5.686557787521759| 335.320000|
| 2|  5.70572315231258| 315.320000|
| 2| 5.724888517103389| 315.320000|
| 3| 5.744053881894209| 315.320000|
| 3|5.7604813374292005| 315.320000|
| 3|  5.77993740687426| 315.320000|
+--+------------------+-----------+
I am trying to reduce the records to one per id by taking the most frequent diagnosis for each id.
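
For reference, here is a minimal sketch that builds the sample table above as a DataFrame, so the snippets below can be tried directly (it assumes Spark 2.x with an existing SparkSession; on 1.6 you would use sqlContext.createDataFrame instead, and diagnosis is stored as a double here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample rows matching the table above: (id, diagnosis_age, diagnosis)
data = [
    (1, 2.1843037179180302, 315.32), (1, 2.80033330216659, 315.32),
    (1, 2.8222365762732, 315.32), (1, 5.64822705794013, 325.32),
    (1, 5.686557787521759, 335.32), (2, 5.70572315231258, 315.32),
    (2, 5.724888517103389, 315.32), (3, 5.744053881894209, 315.32),
    (3, 5.7604813374292005, 315.32), (3, 5.77993740687426, 315.32),
]
df = spark.createDataFrame(data, ["id", "diagnosis_age", "diagnosis"])
df.show()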

If it were an RDD, something like this would work:

# For each row, key by id with a one-element list of (diagnosis_age, diagnosis),
# concatenate the lists per id, keep only the diagnosis values, and finally pick
# the (count, diagnosis) pair with the highest count for each id.
rdd.map(lambda x: (x["id"], [(x["diagnosis_age"], x["diagnosis"])]))\
   .reduceByKey(lambda x, y: x + y)\
   .map(lambda x: [i[1] for i in x[1]])\
   .map(lambda x: [max(zip((x.count(i) for i in set(x)), set(x)))])
In SQL:

select id, diagnosis, diagnosis_age
from (select id, diagnosis, diagnosis_age, count(*) as cnt,
             row_number() over (partition by id order by count(*) desc) as seqnum
      from t
      group by id, diagnosis, diagnosis_age
     ) da
where seqnum = 1;
Desired output:

+--+------------------+-----------+
|id|     diagnosis_age|  diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 2|  5.70572315231258| 315.320000|
| 3| 5.744053881894209| 315.320000|
+--+------------------+-----------+
If possible, how can the same be achieved using only Spark DataFrame operations, in particular without any RDD operations or SQL?

Thanks

Python: below is a translation of my Scala code.

from pyspark.sql.functions import col, first, count, desc, row_number
from pyspark.sql import Window

df.groupBy("id", "diagnosis").agg(first(col("diagnosis_age")).alias("diagnosis_age"), count(col("diagnosis_age")).alias("cnt")) \
  .withColumn("seqnum", row_number().over(Window.partitionBy("id").orderBy(col("cnt").desc()))) \
  .where("seqnum = 1") \
  .select("id", "diagnosis_age", "diagnosis", "cnt") \
  .orderBy("id") \
  .show(10, False)

Scala: Your query doesn't make sense to me. The groupBy condition makes the count of records always 1. I made some changes in the DataFrame expression, e.g.:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, first, row_number}

df.groupBy("id", "diagnosis").agg(first(col("diagnosis_age")).as("diagnosis_age"), count(col("diagnosis_age")).as("cnt"))
  .withColumn("seqnum", row_number.over(Window.partitionBy("id").orderBy(col("cnt").desc)))
  .where("seqnum = 1")
  .select("id", "diagnosis_age", "diagnosis", "cnt")
  .orderBy("id")
  .show(false)
The result:

+---+------------------+---------+---+
|id |diagnosis_age     |diagnosis|cnt|
+---+------------------+---------+---+
|1  |2.1843037179180302|315.32   |3  |
|2  |5.70572315231258  |315.32   |2  |
|3  |5.744053881894209 |315.32   |3  |
+---+------------------+---------+---+

You can use the count, max and first window functions and filter on count = max:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# window over each (id, diagnosis) group, ordered by diagnosis_age
w = Window().partitionBy("id", "diagnosis").orderBy("diagnosis_age")
# window over each id
w2 = Window().partitionBy("id")

# keep only rows belonging to the most frequent diagnosis of each id,
# then collapse to a single row per id
df.withColumn("count", F.count("diagnosis").over(w))\
  .withColumn("max", F.max("count").over(w2))\
  .filter("count=max")\
  .groupBy("id").agg(F.first("diagnosis_age").alias("diagnosis_age"), F.first("diagnosis").alias("diagnosis"))\
  .orderBy("id").show()

+---+------------------+---------+
| id|     diagnosis_age|diagnosis|
+---+------------------+---------+
|  1|2.1843037179180302|   315.32|
|  2|  5.70572315231258|   315.32|
|  3| 5.744053881894209|   315.32|
+---+------------------+---------+
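
As a quick sanity check (the same check the question's author runs in the comments below), the result should contain exactly one row per id. A small sketch, where result is a hypothetical name for the DataFrame built by the chain above (i.e. assigned instead of calling .show()):

# result is assumed to be the DataFrame from the snippet above, without .show()
assert result.count() == result.select("id").distinct().count()
# and it should match the number of distinct ids in the original data
assert result.count() == df.select("id").distinct().count()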

Comments:

Correct me if I'm wrong: you need the minimum diagnosis_age value per id and the most frequent diagnosis per id? @Mohammad Murtaza Hashmi I only need the most frequent diagnosis per id, regardless of diagnosis_age; I just assumed that in the example table the record with the minimum diagnosis_age would also be the one returned. Does this answer your question?

I couldn't actually run your code. I changed .as to .alias and added \ where the code continues on a new line, but I get an error related to row_number: NameError: name 'row_number' is not defined. When I change row_number to F.row_number (since I import pyspark.sql.functions as F) I get: AttributeError: 'function' object has no attribute 'over'. Could this be related to a version difference, since I'm using 1.6? @mad-a, sorry, that was Scala code; I'll update it with the Python code.

Although your code runs, I don't think it reduces the records per id to one, i.e. makes each id distinct. When I run df.select("id").distinct().count() I get 154957, and when I run count() on your output I get 240438. @mad-a I see, I've updated the solution based on your feedback. Let me know if you try it.
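
For reference, the NameError / AttributeError mentioned in the comments comes from referencing row_number without calling it: F.row_number is the function object itself, while F.row_number() returns a Column that has .over(). A minimal sketch of the corrected call, where df_counts is a hypothetical name for the aggregated (id, diagnosis, diagnosis_age, cnt) DataFrame produced by the groupBy/agg step in the first answer:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("id").orderBy(F.col("cnt").desc())

# call row_number() first, then attach the window with .over()
ranked = df_counts.withColumn("seqnum", F.row_number().over(w))
ranked.where("seqnum = 1").show()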