Python: Get a specific field from a chosen Row in a PySpark DataFrame

python apache-spark dataframe pyspark


I have a Spark DataFrame built through pyspark from a JSON file as follows:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

users_df = sqlc.read.json('users.json')

Now I want to access the data of a chosen user, identified by its _id field. I can do

users_df[users_df._id == chosen_user].show()

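and this gives me the full Row of the user. But suppose I want just one specific field from that Row, say the user's gender. How would I obtain it?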

Just filter and select:

result = users_df.where(users_df._id == chosen_user).select("gender")

or with col:

from pyspark.sql.functions import col

result = users_df.where(col("_id") == chosen_user).select(col("gender"))
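Note that result is still a DataFrame with a single gender column, not a plain Python value. One way to pull the value out on the driver is DataFrame.first(), which returns a Row, or None when nothing matches. The helper below is a minimal sketch of that idea; the name get_field and its id_col parameter are hypothetical and only serve the illustration.

```python
def get_field(df, id_value, field, id_col="_id"):
    """Return `field` from the first row whose `id_col` equals `id_value`.

    `df` is expected to be a PySpark DataFrame. Returns None if no row matches.
    """
    # Filter to the matching rows and keep only the requested column.
    selected = df.where(df[id_col] == id_value).select(field)
    # first() returns a Row, or None when the filtered DataFrame is empty.
    row = selected.first()
    return None if row is None else row[field]


# Hypothetical usage with the DataFrame from the question:
# gender = get_field(users_df, chosen_user, "gender")
```

Checking for None avoids an error when chosen_user is not present in the data.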

Finally, a PySpark Row is just a tuple with some extensions, so you can, for example, flatMap:

result.rdd.flatMap(list).first()

or map with something like this:

result.rdd.map(lambda x: x.gender).first()
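To see why both variants return the same value, here is a small sketch that runs without Spark. It uses collections.namedtuple as a stand-in for pyspark.sql.Row (an assumption for illustration: like Row, it is a tuple whose fields are also reachable by name), and the value "female" is made-up sample data.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: a tuple with named fields (illustration only).
Row = namedtuple("Row", ["gender"])

# Roughly what result.rdd holds after select("gender"): one single-field row per match.
rows = [Row(gender="female")]  # made-up sample data

# flatMap(list): turn each row into a list of its values and concatenate the lists.
flattened = [value for row in rows for value in list(row)]
first_via_flatmap = flattened[0]

# map(lambda x: x.gender): pick the field by name from each row.
first_via_map = [row.gender for row in rows][0]

print(first_via_flatmap, first_via_map)  # prints: female female
```

Because the selected rows have exactly one field, flattening them and reading the field by name yield the same element.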