
Getting the first element of an array in PySpark


I want to add two new columns holding the first and second values of the Services array, but I got an error:

AnalysisException: "Field name should be string literal, but it's 0;"


You don't have to use .getItem(0); production_target_datasource_df["Services"][0] is enough.

# Constructing your table:
from pyspark.sql import Row

df = sc.parallelize([Row(cid=1, Services=["2", "serv1"]),
                     Row(cid=1, Services=["3", "serv1"]),
                     Row(cid=1, Services=["4", "serv2"])]).toDF()
df.show()
+---+----------+
|cid|  Services|
+---+----------+
|  1|[2, serv1]|
|  1|[3, serv1]|
|  1|[4, serv2]|
+---+----------+

# Adding the two columns:
new_df = df.withColumn("first_element", df.Services[0])
new_df = new_df.withColumn("second_element", df.Services[1])
new_df.show()

+---+----------+-------------+--------------+
|cid|  Services|first_element|second_element|
+---+----------+-------------+--------------+
|  1|[2, serv1]|            2|         serv1|
|  1|[3, serv1]|            3|         serv1|
|  1|[4, serv2]|            4|         serv2|
+---+----------+-------------+--------------+

As the error says, you need to pass a string, not 0. So the next question is: what string should I pass?

If you follow @pault's advice and call printSchema, you will see which keys the values in the list correspond to.

Here is the documentation for getItem to help you with this.

Another way to find out what to pass is to simply pass any string, for example:

production_target_datasource_df.withColumn("newcol",production_target_datasource_df["Services"].getItem('0'))
The log will tell you which key you need.


Hope this helps ;)

Your question should include the output of production_target_datasource_df.printSchema(). What have you tried so far? Do you have any code to show?