Is it possible to create a StructField of tuple type using PySpark?

pyspark

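I need to create a schema for a DataFrame in Spark. I have no problem creating regular StructFields, such as StringType or IntegerType. However, I want to create a StructField for a tuple.

I tried the following: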

StructType([
             StructField("dst_ip", StringType()),
             StructField("port", StringType())
           ])

However, it throws an error:

'list' object has no attribute 'name'

Is it possible to create a StructField for a tuple type?

You can define a StructType inside a StructField:
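from pyspark.sql.types import StructType, StructField, StringType

# The tuple becomes a nested struct: a StructField whose data type is a StructType
schema = StructType(
    [
        StructField(
            "myTuple",
            StructType(
                [
                    StructField("dst_ip", StringType()),
                    StructField("port", StringType())
                ]
            )
        )
    ]
)
# sqlCtx is an existing SQLContext
df = sqlCtx.createDataFrame([], schema)
df.printSchema()
# root
#  |-- myTuple: struct (nullable = true)
#  |    |-- dst_ip: string (nullable = true)
#  |    |-- port: string (nullable = true)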
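The StructType class used to define the structure of a DataFrame is the data type that represents a row, and it consists of a list of StructFields.

To define a tuple data type for a column (say columnA), you need to wrap the StructType of the tuple's elements in a StructField. Note that StructFields need to have names, because they represent columns.

Define the tuple StructField as a new StructType: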
from pyspark.sql.types import StructType, StructField, StringType, FloatType

columnA = StructField('columnA', StructType([
    StructField("dst_ip", StringType()),
    StructField("port", StringType())
]))

Define a schema containing columnA and columnB (of type FloatType):

mySchema = StructType([columnA, StructField("columnB", FloatType())])

Apply the schema to a DataFrame:
data =[{'columnA': ('x', 'y'), 'columnB': 1.0}] 
# data = [Row(columnA=('x', 'y'), columnB=1.0)] (needs from pyspark.sql import Row)
df = spark.createDataFrame(data, mySchema)
df.printSchema()
# root
#  |-- columnA: struct (nullable = true)
#  |    |-- dst_ip: string (nullable = true)
#  |    |-- port: string (nullable = true)
#  |-- columnB: float (nullable = true)

Show the DataFrame:

df.show()
# +-------+-------+
# |columnA|columnB|
# +-------+-------+
# | [x, y]|    1.0|
# +-------+-------+

(This is just a longer version of the other answer.)