Nested lists in a DataFrame column: extracting the values of a list in a DataFrame column

list dataframe pyspark

I would like to transform the tags column of the match_event DataFrame below

+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+
|eventId|   eventName|          eventSec|      id|matchId|matchPeriod|playerId|           positions|subEventId|        subEventName|                tags|teamId|
+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+
|      8|        Pass| 1.255989999999997|88178642|1694390|         1H|   26010|[[50, 48], [47, 50]]|        85|         Simple pass|            [[1801]]|  4418|
|      8|        Pass|2.3519079999999803|88178643|1694390|         1H|    3682|[[47, 50], [41, 48]]|        85|         Simple pass|            [[1801]]|  4418|
|      8|        Pass|3.2410280000000284|88178644|1694390|         1H|   31528|[[41, 48], [32, 35]]|        85|         Simple pass|            [[1801]]|  4418|
|      8|        Pass| 6.033681000000001|88178645|1694390|         1H|    7855| [[32, 35], [89, 6]]|        83|           High pass|            [[1802]]|  4418|
|      1|        Duel|13.143591000000015|88178646|1694390|         1H|   25437|  [[89, 6], [85, 0]]|        12|Ground defending ...|     [[702], [1801]]|  4418|
|      1|        Duel|14.138041000000044|88178663|1694390|         1H|   83575|[[11, 94], [15, 1...|        11|Ground attacking ...|     [[702], [1801]]| 11944|
|      3|   Free Kick|27.053005999999982|88178648|1694390|         1H|    7915| [[85, 0], [93, 16]]|        36|            Throw in|            [[1802]]|  4418|
|      8|        Pass| 28.97515999999996|88178667|1694390|         1H|   70090|  [[7, 84], [9, 71]]|        82|           Head pass|    [[1401], [1802]]| 11944|
|     10|        Shot| 31.22621700000002|88178649|1694390|         1H|   25437|  [[91, 29], [0, 0]]|       100|                Shot|[[402], [1401], [...|  4418|
|      9|Save attempt| 32.66416000000004|88178674|1694390|         1H|   83574|[[100, 100], [15,...|        91|        Save attempt|    [[1203], [1801]]| 11944|
+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+

into something like this, where the last item of each list is extracted into its own column, as shown below:

+----+
|tags|
+----+
|1801|
|1801|
|1801|
|1802|
|1801|
|1801|
+----+

The column would then be attached back to the match_event DataFrame, perhaps using withColumn.

I tried the code below:


u = match_event[['tags']].rdd
t=u.map(lambda xs: [n for x in xs[-1:] for n in x[-1:]])
tag = spark.createDataFrame(t, ['tag'])

This is what I got, not yet attached with withColumn (the values are still wrapped in a list):

+------+
|   tag|
+------+
|[1801]|
|[1801]|
|[1801]|
|[1802]|
|[1801]|
|[1801]|
|[1802]|
|[1802]|
|[1801]|
|[1801]|
|[1801]|
|[1801]|
|[1302]|
|[1802]|
|[1801]|
|[1802]|
|[1801]|
|[1801]|
|[1801]|
|[1801]|
+------+

Please help. Thanks in advance.

Try this:

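from pyspark.sql.types import ArrayType, LongType

# Assumes an active SparkSession named `spark`.
columns = ['eventId', 'eventName', 'eventSec', 'id', 'matchId', 'matchPeriod', 'playerId', 'positions', 'subEventId', 'subEventName', 'tags', 'teamId']
vals = [(8, "Pass", 1.255989999999997, 88178642, 1694390, "1H", 26010, [[50, 48], [47, 50]], 85, "Simple pass", [[1801]], 4418),
        (1, "Duel", 13.143591000000015, 88178646, 1694390, "1H", 25437, [[89, 6], [85, 0]], 12, "Ground defending", [[702], [1801]], 4418)
       ]

# Take the last inner list and keep its last element, e.g. [[702], [1801]] -> [1801].
# The return type is declared explicitly; without it the UDF returns a string.
udf1 = spark.udf.register("Lastcol", lambda xs: [n for x in xs[-1:] for n in x[-1:]], ArrayType(LongType()))

df = spark.createDataFrame(vals, columns)
# show() returns None, so keep the DataFrame and display it separately.
df2 = df.withColumn('created_col', udf1('tags'))
df2.show()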
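The comprehension used in the UDF above can be traced in plain Python, without Spark, to see why it produces a one-element list such as [1801] rather than the bare value 1801 asked for in the question. This is only a minimal sketch: the helper names last_tag_list and last_tag_value are hypothetical, and the sample inputs are the two tags values from the answer's test data plus an empty list added as an assumed edge case.

```python
def last_tag_list(tags):
    # Same comprehension as the lambda in the answer: slicing with [-1:]
    # never raises on empty input, but it keeps the list wrapper.
    return [n for x in tags[-1:] for n in x[-1:]]


def last_tag_value(tags):
    # Hypothetical scalar variant: unwrap the single element,
    # or return None when there is nothing to extract.
    items = last_tag_list(tags)
    return items[0] if items else None


# Two values taken from the sample data, plus an assumed empty case.
samples = [[[1801]], [[702], [1801]], []]
for tags in samples:
    print(tags, "->", last_tag_list(tags), "/", last_tag_value(tags))
```

If the scalar variant were wrapped in a UDF, its return type would need to be declared as a numeric type such as LongType, since a PySpark UDF defaults to a string return type.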

For Spark 2.4+, use element_at:

from pyspark.sql import functions as F

df.withColumn("lastItem", F.element_at("tags", -1)[0]).show()

#+---------------+--------+
#|           tags|lastItem|
#+---------------+--------+
#|[[1], [2], [3]]|       3|
#|[[1], [2], [3]]|       3|
#+---------------+--------+