Apache Spark: check whether a column's value lies within the range given by an array column of the same DataFrame
Tags: apache-spark, pyspark, apache-spark-sql, pyspark-sql

I have a dataframe and I need to compare some values and infer things from them. Say my DF is:
CITY DAY MONTH TAG RANGE VALUE RANK
A 1 01 A [50, 90] 55 1
A 2 02 B [30, 40] 34 3
A 1 03 A [05, 10] 15 20
A 1 04 B [50, 60] 11 10
A 1 05 B [50, 60] 54 4
For each row, I have to check whether the value in VALUE lies between the bounds in RANGE; here arr[0] is the lower bound and arr[1] is the upper bound.

I need to create a new DF:
NEW-DF
TAG Positive Negative
A 1 1
B 2 1
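As a sanity check on those expected counts, the classification rule can be sketched in plain Python (no Spark) over the sample rows above. This assumes, as in the answers below, that "positive" means the value lies strictly inside the range and the rank is below 5:

```python
from collections import defaultdict

# Sample rows from the DF above: (tag, (low, high), value, rank)
rows = [
    ("A", (50, 90), 55, 1),
    ("B", (30, 40), 34, 3),
    ("A", (5, 10), 15, 20),
    ("B", (50, 60), 11, 10),
    ("B", (50, 60), 54, 4),
]

def classify(low, high, value, rank):
    # "positive" when the value lies strictly inside the range and rank < 5
    return "positive" if (low < value < high and rank < 5) else "negative"

counts = defaultdict(lambda: {"positive": 0, "negative": 0})
for tag, (low, high), value, rank in rows:
    counts[tag][classify(low, high, value, rank)] += 1

print(dict(counts))
# {'A': {'positive': 1, 'negative': 1}, 'B': {'positive': 2, 'negative': 1}}
```

The per-tag counts reproduce the NEW-DF table: A gets one positive and one negative, B gets two positives and one negative.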
Positive and Negative are simply the counts of values satisfying each condition. We can use element_at to fetch the element at each position, compare the bounds with the corresponding value in each row (together with the rank condition), and then do a groupby on tag and sum:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Flag rows whose value lies strictly inside the range and whose rank is below 5.
# element_at uses 1-based indexing: position 1 is the lower bound, 2 the upper.
range_df = df.withColumn('in_range',
                         (F.element_at('range', 1).cast(IntegerType()) < F.col('value')) &
                         (F.col('value') < F.element_at('range', 2).cast(IntegerType())) &
                         (F.col('rank') < 5))
range_df.show()

# Positives are rows where in_range is true; negatives are the rest
grouped_df = range_df.groupby('tag').agg(
    F.sum(F.col('in_range').cast(IntegerType())).alias('total_positive'),
    F.sum((~F.col('in_range')).cast(IntegerType())).alias('total_negative'))
grouped_df.show()
You first have to use a UDF to process the range:
val df = Seq(("A","1","01","A","[50,90]","55","1")).toDF("city","day","month","tag","range","value","rank")
+----+---+-----+---+-------+-----+----+
|city|day|month|tag| range|value|rank|
+----+---+-----+---+-------+-----+----+
| A| 1| 01| A|[50,90]| 55| 1|
+----+---+-----+---+-------+-----+----+
def checkRange(range : String, rank : String, value : String) : String = {
  // Strip the surrounding brackets and split "[50,90]" into its two bounds
  val rangeProcess = range.dropRight(1).drop(1).split(",")
  if (rank.toInt > 5) {
    "negative"
  } else {
    // Compare numerically; a plain String comparison would be lexicographic
    if (value.toInt > rangeProcess(0).toInt && value.toInt < rangeProcess(1).toInt) {
      "positive"
    } else {
      "negative"
    }
  }
}
val checkRangeUdf = udf(checkRange _)
df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).show()
+----+---+-----+---+-------+-----+----+--------+
|city|day|month|tag| range|value|rank| Result|
+----+---+-----+---+-------+-----+----+--------+
| A| 1| 01| A|[50,90]| 55| 1|positive|
+----+---+-----+---+-------+-----+----+--------+
val result = df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).groupBy("city","Result").count.show
+----+--------+-----+
|city| Result|count|
+----+--------+-----+
| A|positive| 1|
+----+--------+-----+
Hi, thanks a lot! I'm just a bit unsure, since I'm new to both Python and Spark. I tried using ~ to match the negative condition (for values that do not fall within the given range): range_df = df.withColumn('in_range', 1.cast(IntegerType(.
But that throws a data type mismatch error, because it returns an integer type while the not operator expects a boolean! @nikitap I'm a bit confused. This isn't my answer, is it? Hi, I was just wondering how to write the condition for when a value is not between the bounds of the range. I need it too, since I'm adding some conditions to it. Sorry for the confusion, I figured it out! Thanks! @nikitap Ah, so you were asking an unrelated question? In that case, I think that's exactly where you'd put the ~
(watch its precedence and use parentheses wisely). Hope that helps!
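The comment above recommends parentheses around negated conditions; a minimal plain-Python illustration of the precedence pitfall behind that advice (using ordinary ints, not Spark Columns) is:

```python
a, b = 2, 3

# Bitwise & binds tighter than comparisons, so without parentheses
# `a < b & 1` parses as `a < (b & 1)`, i.e. 2 < (3 & 1) == 2 < 1
unparenthesized = a < b & 1       # False, probably not what was meant
parenthesized = (a < b) & (b > 1)  # True: each comparison wrapped explicitly

# Similarly, ~ applies before a comparison: `~a > 0` parses as `(~a) > 0`,
# which is why PySpark conditions usually wrap the whole expression: ~(...)
assert (~a > 0) == ((~a) > 0)

print(unparenthesized, parenthesized)
# False True
```

The same parsing rules apply to PySpark Column expressions, since `&`, `|`, and `~` are the overloaded bitwise operators, which is why each comparison in the answer above sits in its own parentheses.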