How to filter a pyspark DataFrame based on the first value of an array column?
Suppose I have this DataFrame in pyspark:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00236|11-03-2014 07:33|[4.5, 2] | 90041 |
| 00234|11-06-2014 05:55|[6.2, 8] | 90037 |
| 00234|11-06-2014 05:55|[5.6, 4] | 90037 |
| 00235|11-09-2014 05:33|[7.5, 6] | 90047 |
+--------+----------------+---------+---------+
How do I write a script that keeps only the rows where the first value in the range array is greater than 6? The output should look like this:
+--------+----------------+---------+---------+
|DeviceID| TimeStamp |range | zipcode |
+--------+----------------+---------+---------+
| 00234|11-06-2014 05:55|[6.2, 8] | 90037 |
| 00235|11-09-2014 05:33|[7.5, 6] | 90047 |
+--------+----------------+---------+---------+
I wrote the following script:
import pyspark.sql.functions as f
df.filter(f.col("range")[0] > 6)
But I get this error:
AnalysisException: u"Can't extract value from range#12989: need struct type but got vector;"
What is your Spark version? Can you paste your schema (df.printSchema()) so we can see whether your range column is an array or a vector? My guess is that range is of vector type, which cannot be indexed with [0]; you have to convert the range column to array type first.
import pyspark.sql.functions as f

df.withColumn("first_element", f.col("range")[0]) \
  .filter(f.col("first_element") > 6.0) \
  .drop("first_element") \
  .show()
+--------+----------------+----------+-------+
|DeviceID| TimeStamp| range|zipcode|
+--------+----------------+----------+-------+
| 00235|11-09-2014 05:33|[7.5, 6.0]| 90047|
| 00234|11-06-2014 05:55|[6.2, 8.0]| 90037|
+--------+----------------+----------+-------+