Apache Spark: how to access a column of a DataFrame created from Rows
I am new to PySpark and want to access a column of a DataFrame that was created from Row objects.
See the code from my .py file below.
It throws the error AttributeError: 'DataFrame' object has no attribute 'product'.
import findspark
findspark.init("/opt/spark")
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import SQLContext
productRevenue = Row("product", "category", "revenue")
spark = SparkSession \
    .builder \
    .appName("DataFrame Learning") \
    .getOrCreate()
sqlContext = SQLContext(spark)
productRevenue1 = productRevenue("product", "Cell phone", 6000)
productRevenue2 = productRevenue("Normal", "Tablet", 1500)
productRevenue3 = productRevenue("Mini", "Tablet", 5500)
productRevenue4 = productRevenue("Ultra thin", "Cell phone", 5000)
productRevenue5 = productRevenue("Very thin", "Cell phone", 6000)
productRevenue6 = productRevenue("Big", "Tablet", 2500)
productRevenue7 = productRevenue("Bendable", "Cell phone", 3000)
productRevenue8 = productRevenue("Foldable", "Cell phone", 3000)
productRevenue9 = productRevenue("Pro", "Tablet", 5500)
productRevenue10 = productRevenue("Pro2", "Tablet", 5500)
productRevenueAll = Row(
    productRevenue=[productRevenue1, productRevenue2, productRevenue3, productRevenue4, productRevenue5,
                    productRevenue6, productRevenue7, productRevenue8, productRevenue9, productRevenue10])
dataFrame = spark.createDataFrame(productRevenueAll)
filter_df = dataFrame.filter((dataFrame.product=="product") )
To create a DataFrame from Rows, one way is to call createDataFrame() with a list of Rows. If you want to create a DataFrame like this:
# +----------+----------+-------+
# | product| category|revenue|
# +----------+----------+-------+
# | product|Cell phone| 6000|
# | Normal| Tablet| 1500|
# | Mini| Tablet| 5500|
# | ... |
# +----------+----------+-------+
# with Schema:
# root
# |-- product: string (nullable = true)
# |-- category: string (nullable = true)
# |-- revenue: long (nullable = true)
then change productRevenueAll into a list of Rows instead of a single Row, like:
productRevenueAll = [
    productRevenue1, productRevenue2, productRevenue3,
    productRevenue4, productRevenue5, productRevenue6,
    productRevenue7, productRevenue8, productRevenue9,
    productRevenue10,
]
dataFrame = spark.createDataFrame(productRevenueAll)
# then use it like:
dataFrame.product
# returns a Column
dataFrame.select(dataFrame.product).show()
# +----------+
# |   product|
# +----------+
# |   product|
# |    Normal|
# |      Mini|
# |       ...|
# +----------+
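As a side note, attribute access is only one way to reference a column; here is a quick sketch of equivalent forms (all assume the flat dataFrame built above):
from pyspark.sql.functions import col

# three equivalent ways to reference the same column
dataFrame.select(dataFrame.product).show()     # attribute access
dataFrame.select(dataFrame["product"]).show()  # bracket access
dataFrame.select(col("product")).show()        # col() function
The bracket form is handy when a column name collides with an existing DataFrame attribute (for example, a column named count).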
However, if your goal is to create a nested structure, for example:
# +-----------------------------+
# | productRevenue |
# +----------+----------+-------+
# | product| category|revenue|
# +----------+----------+-------+
# | product|Cell phone| 6000|
# | Normal| Tablet| 1500|
# | ... |
# +----------+----------+-------+
# with Schema:
# root
# |-- productRevenue: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- product: string (nullable = true)
# | | |-- category: string (nullable = true)
# | | |-- revenue: long (nullable = true)
feed createDataFrame() with a one-item list, like:
productRevenueAllNested = Row(
productRevenue=[
productRevenue1, productRevenue2, productRevenue3,
productRevenue4, productRevenue5, productRevenue6,
productRevenue7, productRevenue8, productRevenue9,
productRevenue10,
])
dataFrameNested = spark.createDataFrame([productRevenueAllNested])
# then access it like
dataFrameNested.printSchema()
dataFrameNested.select(dataFrameNested.productRevenue.product).show()
# +----------------------+
# |productRevenue.product|
# +----------------------+
# | [product, Normal,...|
# +----------------------+
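If you do end up with the nested form and want it flat again, a minimal sketch using explode (assuming the dataFrameNested from above):
from pyspark.sql.functions import explode

# explode turns the array of structs back into one row per struct,
# then the struct fields can be selected individually
flat = dataFrameNested.select(explode("productRevenue").alias("pr"))
flat.select("pr.product", "pr.category", "pr.revenue").show()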
You are nesting with a Row object, which is why Spark generates a struct field. Below are several ways to create a DataFrame from Row objects, with examples:
# using .toDF to create a DataFrame
sc.parallelize([productRevenue1, productRevenue2, productRevenue3, productRevenue4, productRevenue5, productRevenue6, productRevenue7, productRevenue8, productRevenue9, productRevenue10]).toDF().show()
# using spark.createDataFrame to create a DataFrame
spark.createDataFrame([productRevenue1, productRevenue2, productRevenue3, productRevenue4, productRevenue5, productRevenue6, productRevenue7, productRevenue8, productRevenue9, productRevenue10]).show()
# creating a DataFrame from an RDD
productRevenue = sc.parallelize([productRevenue1, productRevenue2, productRevenue3, productRevenue4, productRevenue5, productRevenue6, productRevenue7, productRevenue8, productRevenue9, productRevenue10])
spark.createDataFrame(productRevenue).show()
# creating a DataFrame from a list
productRevenue = [productRevenue1, productRevenue2, productRevenue3, productRevenue4, productRevenue5, productRevenue6, productRevenue7, productRevenue8, productRevenue9, productRevenue10]
spark.createDataFrame(productRevenue).show()
#+----------+----------+-------+
#| product| category|revenue|
#+----------+----------+-------+
#| product|Cell phone| 6000|
#| Normal| Tablet| 1500|
#| Mini| Tablet| 5500|
#|Ultra thin|Cell phone| 5000|
#| Very thin|Cell phone| 6000|
#| Big| Tablet| 2500|
#| Bendable|Cell phone| 3000|
#| Foldable|Cell phone| 3000|
#| Pro| Tablet| 5500|
#| Pro2| Tablet| 5500|
#+----------+----------+-------+
dataFrame = spark.createDataFrame(productRevenue)
dataFrame.filter(dataFrame.product == "product").show()
#+-------+----------+-------+
#|product| category|revenue|
#+-------+----------+-------+
#|product|Cell phone| 6000|
#+-------+----------+-------+
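If you would rather pin the column types instead of relying on inference from the Row objects, here is a sketch with an explicit schema (the StructType below is my assumption about the intended types, not part of the original answer):
from pyspark.sql.types import StructType, StructField, StringType, LongType

# explicit schema so Spark does not have to infer types from the Rows
schema = StructType([
    StructField("product", StringType(), True),
    StructField("category", StringType(), True),
    StructField("revenue", LongType(), True),
])
dataFrame = spark.createDataFrame(productRevenue, schema)
dataFrame.printSchema()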
Thanks @Quar! I was passing productRevenueAll as a single Row; changing it to a list of Rows worked for me! No idea why asking a question earns a downvote of -1 :-) @GaurangPopat Glad it helped :D