Apache Spark: check whether a column's value lies within the range given by another (array) column of the dataframe

Tags: apache-spark, pyspark, apache-spark-sql, pyspark-sql

I have a dataframe, and I need to compare some of its values and infer a result from them.

For example,

MY DF

CITY DAY MONTH TAG RANGE     VALUE  RANK
A    1    01    A   [50, 90]   55     1
A    2    02    B   [30, 40]   34     3
A    1    03    A   [05, 10]   15    20
A    1    04    B   [50, 60]   11    10 
A    1    05    B   [50, 60]   54    4 
For each row, I have to check whether the value in VALUE lies between the bounds in RANGE; here arr[0] is the lower limit and arr[1] is the upper limit.

I need to create a new DF like this:

NEW-DF

TAG  Positive  Negative
A     1          1
B     2          1 
  • If the value lies within the given range and RANK < 5, I add it to Positive.

  • If the value does not lie within the given range, it counts as Negative.

  • If the value lies within the given range but RANK > 5, I also count it as Negative.

  • Positive and Negative are simply the counts of values satisfying either condition (worked through for the sample data below).
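
For the sample above, tag A has one row in range with RANK < 5 (value 55, range [50, 90]) and one row out of range (value 15, range [05, 10]), giving Positive = 1 and Negative = 1; tag B has two rows in range with RANK < 5 (values 34 and 54) and one row out of range (value 11), giving Positive = 2 and Negative = 1.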

    We can use element_at to pick out each bound, compare the bounds with the corresponding value in each row together with the rank condition, and then do a groupby on the tag followed by a sum:
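
    For reference, a minimal sketch that builds the sample dataframe the snippet below operates on (it assumes RANGE is stored as an array of strings and that a SparkSession named spark already exists; adjust the types to your actual schema):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample rows from the question; 'range' becomes an array column of string bounds.
    data = [
        ("A", 1, "01", "A", ["50", "90"], 55, 1),
        ("A", 2, "02", "B", ["30", "40"], 34, 3),
        ("A", 1, "03", "A", ["05", "10"], 15, 20),
        ("A", 1, "04", "B", ["50", "60"], 11, 10),
        ("A", 1, "05", "B", ["50", "60"], 54, 4),
    ]
    df = spark.createDataFrame(data, ["city", "day", "month", "tag", "range", "value", "rank"])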

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType
    
    range_df = df.withColumn('in_range', (F.element_at('range', 1).cast(IntegerType()) < F.col('value')) & 
                                         (F.col('value') < F.element_at('range', 2).cast(IntegerType())) &
                                         (F.col('rank') < 5))
    
    range_df.show()
    
    grouped_df = range_df.groupby('tag').agg(F.sum(F.col('in_range').cast(IntegerType())).alias('total_positive'), 
                                             F.sum((~F.col('in_range')).cast(IntegerType())).alias('total_negative'))
    
    grouped_df.show()
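
    With the sample data, grouped_df.show() should report the same totals as NEW-DF above: tag A with total_positive 1 and total_negative 1, and tag B with total_positive 2 and total_negative 1.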
    

    You first have to use a UDF to parse the range:

    import org.apache.spark.sql.functions.{col, udf}

    val df = Seq(("A","1","01","A","[50,90]","55","1")).toDF("city","day","month","tag","range","value","rank")
    
    +----+---+-----+---+-------+-----+----+
    |city|day|month|tag|  range|value|rank|
    +----+---+-----+---+-------+-----+----+
    |   A|  1|   01|  A|[50,90]|   55|   1|
    +----+---+-----+---+-------+-----+----+
    
    
      def checkRange(range : String,rank : String, value : String) : String = {
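        // Strip the surrounding brackets from a range string like "[50,90]" and split it into its two bounds.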
        val rangeProcess = range.dropRight(1).drop(1).split(",")
        if (rank.toInt > 5){
          "negative"
        } else {
          if (value.toInt > rangeProcess(0).toInt && value.toInt < rangeProcess(1).toInt){
            "positive"
          } else {
            "negative"
          }
        }
      }
    
      val checkRangeUdf = udf(checkRange _)
    
    df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).show()
    
    +----+---+-----+---+-------+-----+----+--------+
    |city|day|month|tag|  range|value|rank|  Result|
    +----+---+-----+---+-------+-----+----+--------+
    |   A|  1|   01|  A|[50,90]|   55|   1|positive|
    +----+---+-----+---+-------+-----+----+--------+
    
    
    val result = df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).groupBy("city","Result").count.show
    
    +----+--------+-----+
    |city|  Result|count|
    +----+--------+-----+
    |   A|positive|    1|
    +----+--------+-----+
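
    Note that the question asks for counts per tag rather than per city, so use groupBy("tag","Result") instead of groupBy("city","Result") to get the counts in the requested shape.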
    
    Hi, thank you so much! I'm just slightly unsure, since I'm new to both Python and Spark. I tried using ~ to match the negative condition (for values that do not fall within the given range):
    range_df=df.withColumn('in_range',1.cast(IntegerType(.
    However, that throws a data type mismatch error, because it returns an integer type while the not operator expects a boolean! @nikitap I'm a bit confused, this isn't on my answer, is it? Hi, I was just wondering how to write the condition for when a "value" is not between the "range"; I need that too, since I'm adding some more conditions. Sorry for the confusion, I figured it out, thanks! @nikitap Ah, so you were asking about an unrelated problem? In that case, I think that's exactly where you would put ~ (mind the operator precedence and use parentheses judiciously). Hope that helps!
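
    To make the point about ~ concrete, here is a minimal sketch (reusing the in_range, range and value names from the PySpark answer above; those names are assumptions carried over from that answer, not part of any fixed API). Negate the boolean column first and only then cast it, and parenthesize the whole condition before applying ~:

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # Negate the boolean 'in_range' column first, then cast the result to an int.
    negative_flag = (~F.col('in_range')).cast(IntegerType())

    # Building the "not in range" condition inline: wrap the full boolean
    # expression in parentheses before applying ~.
    out_of_range = ~(
        (F.element_at('range', 1).cast(IntegerType()) < F.col('value')) &
        (F.col('value') < F.element_at('range', 2).cast(IntegerType()))
    )

    # Casting to IntegerType first and then applying ~ triggers the data type
    # mismatch mentioned above, because ~ expects a boolean column.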