PySpark: StructFields in a StructType are always nullable in Spark 2.1.1


I'm creating a StructType with several StructFields. The names and data types seem to work fine, but no matter how I set nullable to False on each StructField, the resulting schema reports nullable as True for every StructField.

Can anyone explain why? Thanks.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, FloatType, TimestampType

sparkSession = SparkSession.builder \
  .master("local") \
  .appName("SparkSession") \
  .getOrCreate()


dfStruct = StructType().add("date", TimestampType(), False)
dfStruct.add("open", FloatType(), False)
dfStruct.add("high", FloatType(), False)
dfStruct.add("low", FloatType(), False)
dfStruct.add("close", FloatType(), False)
dfStruct.add("ticker",  StringType(), False)

#print elements of StructType -- reports nullable is false
for d in dfStruct: print(d)

#data looks like this:
#date,open,high,low,close,ticker
# 2014-10-14 23:20:32,7.14,9.07,0.0,7.11,ARAY
# 2014-10-14 23:20:36,9.74,10.72,6.38,9.25,ARC
# 2014-10-14 23:20:38,31.38,37.0,28.0,30.94,ARCB
# 2014-10-14 23:20:44,15.39,17.37,15.35,15.3,ARCC
# 2014-10-14 23:20:49,5.59,6.5,5.31,5.48,ARCO

#read csv file and apply dfStruct as the schema
df = sparkSession.read.csv(path = "/<path to file>/stock_data.csv", \
                           schema = dfStruct, \
                           sep = ",", \
                           ignoreLeadingWhiteSpace = True, \
                           ignoreTrailingWhiteSpace = True \
                           )

#reports nullable as True!
df.printSchema()
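
For illustration only (not output captured from the original post), here is a quick way to compare the nullability flags before and after the read, using the dfStruct and df defined above:

#illustrative check -- assumes dfStruct and df from the snippet above
#each StructField keeps the flag it was defined with
print([f.nullable for f in dfStruct.fields])   # [False, False, False, False, False, False]

#but the schema of the DataFrame returned by read.csv reports True,
#per the behaviour described above
print([f.nullable for f in df.schema.fields])  # [True, True, True, True, True, True]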
This is a Spark issue; there is currently work aimed at resolving it. If you really need the fields to be non-nullable, try:

#read csv file and apply dfStruct as the schema
df = sparkSession.read.csv(path = "/<path to file>/stock_data.csv", \
                       schema = dfStruct, \
                       sep = ",", \
                       ignoreLeadingWhiteSpace = True, \
                       ignoreTrailingWhiteSpace = True \
                       ).rdd.toDF(dfStruct)
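
An equivalent way to write that round trip through the RDD, if you prefer the explicit form (a sketch using the standard createDataFrame API, not taken from the original answer):

#sketch: rebuild the DataFrame from its RDD with the desired schema;
#sparkSession and dfStruct are the objects defined earlier
df = sparkSession.createDataFrame(df.rdd, schema = dfStruct)

#printSchema should now report nullable = false for each field
df.printSchema()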

I'm not sure how fast a conversion like this is, so I wouldn't use it on terabytes of data, but if you're just reading a CSV file it should work fine.