PySpark: StructFields in a StructType are always nullable in Spark 2.1.1


I'm creating a StructType with several StructFields. The names and data types seem to work fine, but no matter how I set nullable to False on each StructField, the resulting schema reports nullable as True for every StructField.

Can anyone explain why? Thanks.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, FloatType, TimestampType

sparkSession = SparkSession.builder \
  .master("local") \
  .appName("SparkSession") \
  .getOrCreate()


dfStruct = StructType().add("date", TimestampType(), False)
dfStruct.add("open", FloatType(), False)
dfStruct.add("high", FloatType(), False)
dfStruct.add("low", FloatType(), False)
dfStruct.add("close", FloatType(), False)
dfStruct.add("ticker",  StringType(), False)

#print elements of StructType -- reports nullable is false
for d in dfStruct: print(d)

#data looks like this:
#date,open,high,low,close,ticker
# 2014-10-14 23:20:32,7.14,9.07,0.0,7.11,ARAY
# 2014-10-14 23:20:36,9.74,10.72,6.38,9.25,ARC
# 2014-10-14 23:20:38,31.38,37.0,28.0,30.94,ARCB
# 2014-10-14 23:20:44,15.39,17.37,15.35,15.3,ARCC
# 2014-10-14 23:20:49,5.59,6.5,5.31,5.48,ARCO

#read csv file and apply dfStruct as the schema
df = sparkSession.read.csv(path = "/<path to file>/stock_data.csv", \
                           schema = dfStruct, \
                           sep = ",", \
                           ignoreLeadingWhiteSpace = True, \
                           ignoreTrailingWhiteSpace = True \
                           )

#reports nullable as True!
df.printSchema()
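
For illustration only (not output captured from the original post), here is a quick way to compare the nullability flags before and after the read, using the dfStruct and df defined above:

#illustrative check -- assumes dfStruct and df from the snippet above
#each StructField keeps the flag it was defined with
print([f.nullable for f in dfStruct.fields])   # [False, False, False, False, False, False]

#but the schema of the DataFrame returned by read.csv reports True,
#per the behaviour described above
print([f.nullable for f in df.schema.fields])  # [True, True, True, True, True, True]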
This is a Spark issue; there is currently work aimed at resolving it. If you really need the fields to be non-nullable, try:

#read csv file and apply dfStruct as the schema
df = sparkSession.read.csv(path = "/<path to file>/stock_data.csv", \
                       schema = dfStruct, \
                       sep = ",", \
                       ignoreLeadingWhiteSpace = True, \
                       ignoreTrailingWhiteSpace = True \
                       ).rdd.toDF(dfStruct)
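
An equivalent way to write that round trip through the RDD, if you prefer the explicit form (a sketch using the standard createDataFrame API, not taken from the original answer):

#sketch: rebuild the DataFrame from its RDD with the desired schema;
#sparkSession and dfStruct are the objects defined earlier
df = sparkSession.createDataFrame(df.rdd, schema = dfStruct)

#printSchema should now report nullable = false for each field
df.printSchema()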

I'm not sure how fast a conversion like this is, so I wouldn't use it on terabytes of data, but if you're just reading a CSV file it should work fine.