PySpark type error

Writing a simple CSV to Parquet conversion.

The CSV file has a few timestamps in it, so I get a type error when I try to write.

To get around that, I tried this line to identify the timestamp columns and run to_timestamp on them:

rdd = sc.textFile("../../../Downloads/test_type.csv").map(lambda line: [to_timestamp(i) if instr(i,"-")==5 else i for i in line.split(",")])
I get this error:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:/yy/xx/Documents/gg/csv_to_parquet/csv_to_parquet.py", line 55, in <lambda>
    rdd = sc.textFile("../../../test/test.csv").map(lambda line: [to_timestamp(i) if (instr(i,"-")==5) else i for i in line.split(",")])

AttributeError: 'NoneType' object has no attribute '_jvm'
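The '_jvm' error comes from calling to_timestamp and instr inside the RDD map: the functions in pyspark.sql.functions build Column expressions through the driver-side JVM, but the lambda runs on the executors, where no SparkContext is active, so the wrapper finds None and fails. A minimal sketch of a plain-Python alternative for that same line (parse_field is a hypothetical helper; the quote-stripping assumes the quoted sample format shown further down):

from datetime import datetime

def parse_field(value):
    # Hypothetical helper: strip the surrounding quotes and try a plain-Python
    # timestamp parse; fall back to the raw string when the value is not a timestamp.
    cleaned = value.strip('"')
    try:
        return datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return cleaned

rdd = sc.textFile("../../../Downloads/test_type.csv") \
        .map(lambda line: [parse_field(i) for i in line.split(",")])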
I first modified the code to read everything as StringType, and then changed the data types on the dataframe:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(appName="CSV2Parquet")
sqlContext = SQLContext(sc)
schema = StructType([
    StructField("header__change_seq", StringType(), True),
    StructField("header__change_oper", StringType(), True),
    StructField("header__change_mask", StringType(), True),
    StructField("header__stream_position", StringType(), True),
    StructField("header__operation", StringType(), True),
    StructField("header__transaction_id", StringType(), True),
    StructField("header__timestamp", StringType(), True),
    StructField("l_en_us", StringType(), True),
    StructField("priority", StringType(), True),
    StructField("typecode", StringType(), True),
    StructField("retired", StringType(), True),
    StructField("name", StringType(), True),
    StructField("id", StringType(), True),
    StructField("description", StringType(), True),
    StructField("l_es_ar", StringType(), True),
    StructField("adw_updated_ts", StringType(), True),
    StructField("adw_process_id", StringType(), True)
])
rdd = sc.textFile("../../../Downloads/pctl_jobdatetype.csv").map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, schema)
df2 = df.withColumn('header__timestamp', df['header__timestamp'].cast('timestamp'))
df2 = df.withColumn('adw_updated_ts', df['adw_updated_ts'].cast('timestamp'))
df2 = df.withColumn('priority', df['priority'].cast('double'))
df2 = df.withColumn('id', df['id'].cast('double'))
df2.write.parquet('../../../Downloads/input parquet')
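As a side note on that snippet: every withColumn call starts again from the original df, so only the cast in the last assignment survives in df2. A quick check (a minimal sketch, using the variables above):

# Shows 'id' as double but the other columns still as string, because each
# withColumn above was applied to df rather than to the previous df2.
df2.printSchema()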
Sample data:

"header__change_seq","header__change_oper","header__change_mask","header__stream_position","header__operation","header__transaction_id","header__timestamp","l_en_us","priority","typecode","retired","name","id","description","l_es_ar","adw_updated_ts","adw_process_id"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Effective Date","10.0","Effective","0","Effective Date","10001.0","Effective Date","Effective Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Written Date","20.0","Written","0","Written Date","10002.0","Written Date","Written Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Reference Date","30.0","Reference","0","Reference Date","10003.0","Reference Date","Reference Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"


After I changed the dataframe name to df2 in lines 3-6 below, it seems to work fine, and Athena returns results as well:

df = sqlContext.createDataFrame(rdd, schema)
df2 = df.withColumn('header__timestamp', df['header__timestamp'].cast('timestamp'))
df2 = df2.withColumn('adw_updated_ts', df['adw_updated_ts'].cast('timestamp'))
df2 = df2.withColumn('priority', df['priority'].cast('double'))
df2 = df2.withColumn('id', df['id'].cast('double'))
df2.write.parquet('../../../Downloads/input parquet')
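Since the sample rows are quoted CSV with empty fields, an alternative worth noting (just a sketch, not the code from the post; it assumes Spark 2.x, where the DataFrameReader has a csv method) is to let Spark's CSV reader handle the quoting and the type conversion through a typed schema, so the split(",") and the later casts are not needed:

from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

# Same columns as the schema above, but the timestamp and numeric columns are
# already typed, so no casts are needed after the read.
typed_schema = StructType([
    StructField("header__change_seq", StringType(), True),
    StructField("header__change_oper", StringType(), True),
    StructField("header__change_mask", StringType(), True),
    StructField("header__stream_position", StringType(), True),
    StructField("header__operation", StringType(), True),
    StructField("header__transaction_id", StringType(), True),
    StructField("header__timestamp", TimestampType(), True),
    StructField("l_en_us", StringType(), True),
    StructField("priority", DoubleType(), True),
    StructField("typecode", StringType(), True),
    StructField("retired", StringType(), True),
    StructField("name", StringType(), True),
    StructField("id", DoubleType(), True),
    StructField("description", StringType(), True),
    StructField("l_es_ar", StringType(), True),
    StructField("adw_updated_ts", TimestampType(), True),
    StructField("adw_process_id", StringType(), True)
])

df = sqlContext.read.csv("../../../Downloads/pctl_jobdatetype.csv",
                         header=True,
                         schema=typed_schema,
                         timestampFormat="yyyy-MM-dd HH:mm:ss.SSS")
df.printSchema()  # quick check that the timestamp/double columns came out as expected
df.write.parquet('../../../Downloads/input parquet')

The timestampFormat matches the "1970-01-01 00:00:01.000" values in the sample data above.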
