Apache Spark: how to select multiple columns from a DataFrame and dump them into a list in PySpark

I have a DataFrame with multiple columns, and I need to select two of them and dump them into a list. The DataFrame looks like this:

df.show()
+------------------------------------+---------------+---------------+
|email_address                       |topic          |user_id        |
+------------------------------------+---------------+---------------+
|xyz@test.com                        |hello_world    |xyz123         |
|lmn@test.com                        |hello_kitty    |lmn456         |
+------------------------------------+---------------+---------------+
The result I need is a list of tuples:

[(xyz@test.com, xyz123), (lmn@test.com, lmn456)]
What I tried:

tuples = df.select(col('email_address'), col('topic')).rdd.flatMap(lambda x, y: list(x, y)).collect()
It throws an error:

Py4JJavaError  Traceback (most recent call last)
<command-4050677552755250> in <module>()

--> 114 tuples = df.select(col('email_address'), col('topic')).rdd.flatMap(lambda x, y: list(x, y)).collect()
    115 
    116 

/databricks/spark/python/pyspark/rdd.py in collect(self)
    829         # Default path used in OSS Spark / for non-credential passthrough clusters:
    830         with SCCallSiteSync(self.context) as css:
--> 831             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    832         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    833 

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
How do I fix it?

You should use map instead:

tuples = df.select(col('email_address'), col('topic')) \
           .rdd \
           .map(lambda x: (x[0], x[1])) \
           .collect()

print(tuples)

# output
[('xyz@test.com', 'hello_world'), ('lmn@test.com', 'hello_kitty')]
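For context, the function passed to flatMap receives a single Row per element, and flatMap flattens whatever iterable that function returns, so even a correct one-argument lambda would merge all fields into one flat list instead of keeping pairs. A minimal sketch against the same two-column selection (assuming the df shown above):

# flatMap flattens the returned tuple's fields into one flat list,
# losing the per-row pairing:
flat = df.select(col('email_address'), col('topic')) \
         .rdd \
         .flatMap(lambda r: (r[0], r[1])) \
         .collect()

# output
['xyz@test.com', 'hello_world', 'lmn@test.com', 'hello_kitty']
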
Another approach is to collect the Rows of the DataFrame and then loop over them to extract the values:

rows = df.select(col('email_address'), col('topic')).collect()

tuples = [(r.email_address, r.topic) for r in rows]
print(tuples)

# output
[('xyz@test.com', 'hello_world'), ('lmn@test.com', 'hello_kitty')]

The function in flatMap only takes a single argument. For your task, a list comprehension should be enough:

tuples = [(r.email_address, r.user_id) for r in df.select('email_address', 'user_id').collect()]

or:

tuples = [*map(tuple, df.select('email_address', 'user_id').collect())]
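
As a self-contained sketch of both one-liners, with the SparkSession setup and sample rows filled in for illustration (they are not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative sample data matching the question's schema
df = spark.createDataFrame(
    [('xyz@test.com', 'hello_world', 'xyz123'),
     ('lmn@test.com', 'hello_kitty', 'lmn456')],
    ['email_address', 'topic', 'user_id'])

# list comprehension over the collected Rows
tuples = [(r.email_address, r.user_id) for r in df.select('email_address', 'user_id').collect()]

# or: Row is a subclass of tuple, so map(tuple, ...) works as well
tuples = [*map(tuple, df.select('email_address', 'user_id').collect())]

print(tuples)
# [('xyz@test.com', 'xyz123'), ('lmn@test.com', 'lmn456')]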