Apache Spark: how to access a column of a DataFrame created from Rows


I'm new to PySpark
and want to access a column of a DataFrame created from Rows.
See the code from my .py file below;
it throws: AttributeError: 'DataFrame' object has no attribute 'product'

import findspark

findspark.init("/opt/spark")

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import SQLContext



productRevenue = Row("product", "category", "revenue")
spark = SparkSession \
    .builder \
    .appName("DataFrame Learning") \
    .getOrCreate()

sqlContext = SQLContext(spark)  # not used below; SparkSession already covers this

productRevenue1 = productRevenue("product", "Cell phone", 6000)
productRevenue2 = productRevenue("Normal", "Tablet", 1500)
productRevenue3 = productRevenue("Mini", "Tablet", 5500)
productRevenue4 = productRevenue("Ultra thin", "Cell phone", 5000)
productRevenue5 = productRevenue("Very thin", "Cell phone", 6000)
productRevenue6 = productRevenue("Big", "Tablet", 2500)
productRevenue7 = productRevenue("Bendable", "Cell phone", 3000)
productRevenue8 = productRevenue("Foldable", "Cell phone", 3000)
productRevenue9 = productRevenue("Pro", "Tablet", 5500)
productRevenue10 = productRevenue("Pro2", "Tablet", 5500)

# this wraps all ten Rows inside one outer Row, so the resulting DataFrame
# gets a single nested column rather than product/category/revenue columns
productRevenueAll = Row(
    productRevenue=[productRevenue1, productRevenue2, productRevenue3, productRevenue4, productRevenue5,
                    productRevenue6, productRevenue7, productRevenue8, productRevenue9, productRevenue10])

dataFrame = spark.createDataFrame(productRevenueAll)



filter_df = dataFrame.filter(dataFrame.product == "product")
# AttributeError: 'DataFrame' object has no attribute 'product'


To create a DataFrame from Rows, one way is to call createDataFrame on a list of Rows.

If you want to create a DataFrame like this:

# +----------+----------+-------+
# |   product|  category|revenue|
# +----------+----------+-------+
# |   product|Cell phone|   6000|
# |    Normal|    Tablet|   1500|
# |      Mini|    Tablet|   5500|
# |             ...             |
# +----------+----------+-------+

# with Schema:

# root
#  |-- product: string (nullable = true)
#  |-- category: string (nullable = true)
#  |-- revenue: long (nullable = true)
then change productRevenueAll to a list of Rows, instead of making it a single Row, like:

productRevenueAll = [
    productRevenue1, productRevenue2, productRevenue3,
    productRevenue4, productRevenue5, productRevenue6,
    productRevenue7, productRevenue8, productRevenue9,
    productRevenue10,
]
dataFrame = spark.createDataFrame(productRevenueAll)

# then use it like this:
dataFrame.product
# Column<'product'>
dataFrame.select(dataFrame.product).show()
# +----------+
# |   product|
# +----------+
# |   product|
# |    Normal|
# |      Mini|
# |   ...    |
# +----------+
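
For reference, attribute access is just one of several equivalent ways to refer to a column; a small sketch (assuming the flat dataFrame above) of the interchangeable styles:

from pyspark.sql.functions import col

# all three expressions select the same "product" column:
dataFrame.select(dataFrame.product).show()     # attribute access
dataFrame.select(dataFrame["product"]).show()  # item access, safe if the name clashes with a DataFrame attribute
dataFrame.select(col("product")).show()        # col(), no DataFrame reference needed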

However, if your goal is to create a nested structure, for example:

# +-----------------------------+
# |      productRevenue         |
# +----------+----------+-------+
# |   product|  category|revenue|
# +----------+----------+-------+
# |   product|Cell phone|   6000|
# |    Normal|    Tablet|   1500|
# |             ...             |
# +----------+----------+-------+

# with Schema:

# root
#  |-- productRevenue: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- product: string (nullable = true)
#  |    |    |-- category: string (nullable = true)
#  |    |    |-- revenue: long (nullable = true)
then feed createDataFrame() a single-item list, like:

productRevenueAllNested = Row(
    productRevenue=[
        productRevenue1, productRevenue2, productRevenue3, 
        productRevenue4, productRevenue5, productRevenue6, 
        productRevenue7, productRevenue8, productRevenue9, 
        productRevenue10,
    ])

dataFrameNested = spark.createDataFrame([productRevenueAllNested]) 

# then access it like
dataFrameNested.printSchema()

dataFrameNested.select(dataFrameNested.productRevenue.product).show()
# +----------------------+
# |productRevenue.product|
# +----------------------+
# |  [product, Normal,...|
# +----------------------+
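
If you need to get back from the nested layout to one row per product, explode is the usual tool; a minimal sketch, assuming the dataFrameNested built above:

from pyspark.sql.functions import explode

# one row per element of the productRevenue array, each element still a struct
flattened = dataFrameNested.select(explode(dataFrameNested.productRevenue).alias("pr"))

# expand the struct fields into top-level columns
flattened.select("pr.product", "pr.category", "pr.revenue").show()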


You are nesting the Rows inside another Row object, which is what produces the
struct
field.

  • Here are a few ways to create a DataFrame from Row objects.

Example:

# sc (the SparkContext) is available from the SparkSession:
sc = spark.sparkContext

rows = [productRevenue1, productRevenue2, productRevenue3, productRevenue4, productRevenue5,
        productRevenue6, productRevenue7, productRevenue8, productRevenue9, productRevenue10]

# using .toDF to create a dataframe from an RDD of Rows
sc.parallelize(rows).toDF().show()

# using spark.createDataFrame to create a dataframe
spark.createDataFrame(rows).show()

# creating a dataframe from an rdd
productRevenue = sc.parallelize(rows)

# creating a dataframe from a list
productRevenue = rows

spark.createDataFrame(productRevenue).show()
#+----------+----------+-------+
#|   product|  category|revenue|
#+----------+----------+-------+
#|   product|Cell phone|   6000|
#|    Normal|    Tablet|   1500|
#|      Mini|    Tablet|   5500|
#|Ultra thin|Cell phone|   5000|
#| Very thin|Cell phone|   6000|
#|       Big|    Tablet|   2500|
#|  Bendable|Cell phone|   3000|
#|  Foldable|Cell phone|   3000|
#|       Pro|    Tablet|   5500|
#|      Pro2|    Tablet|   5500|
#+----------+----------+-------+


dataFrame = spark.createDataFrame(productRevenue)

dataFrame.filter(dataFrame.product == "product").show()
#+-------+----------+-------+
#|product|  category|revenue|
#+-------+----------+-------+
#|product|Cell phone|   6000|
#+-------+----------+-------+
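
Once the DataFrame is flat, filtering also works with pyspark.sql.functions.col, which avoids attribute lookups on the DataFrame entirely; a short sketch, assuming the dataFrame above:

from pyspark.sql.functions import col

# equivalent filter without attribute access on the DataFrame
dataFrame.filter(col("product") == "product").show()

# col() also composes with other predicates
dataFrame.filter((col("category") == "Tablet") & (col("revenue") > 5000)).show()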


Thanks @Quar! I had made productRevenueAll a single Row; changing it to a list of Rows worked for me! No idea why asking a question drops your reputation by 1 :-) @GaurangPopat Glad it helped :D