Apache Spark / PySpark: converting a string column to a date
I'm trying to convert a "dob" column from string to the date data type so I can perform some basic operations in PySpark. My input is:
long_name age dob wage_eur
Cristiano Ronaldo dos Santos Aveiro 32 05-02-1985 565000
Lionel Andrés Messi Cuccittini 30 24-06-1987 565000
I created a custom schema that declares the dob column as date instead of string, and used it to read the dataframe as follows:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
from pyspark.sql.types import IntegerType , DateType , StringType , TimestampType, StructType , StructField
peopleschema = StructType([StructField("long_name",StringType(),True),
StructField("age",IntegerType(),True),
StructField("dob",DateType(),True),
StructField("wage_eur",IntegerType(),True)
])
file_location = "/FileStore/tables/Fifa_data_Dateconvertion-1.csv"
file_type = "csv"
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.schema(peopleschema) \
.load(file_location)
display(df)
long_name:string
age:integer
dob:date
wage_eur:integer
long_name age dob wage_eur
Cristiano Ronaldo dos Santos Aveiro 32 0010-07-09 565000
Lionel Andrés Messi Cuccittini 30 0029-11-08 565000
The schema for the dob column was converted, but the values came out mangled, as shown above.
So when I query the dataframe to retrieve the year, I get the wrong values:
from pyspark.sql.functions import col, year, to_date
df1 = df.withColumn('birth_year',year(df.dob))
df1.show()
The result I get is:
+--------------------+---+----------+--------+----------+
|           long_name|age|       dob|wage_eur|birth_year|
+--------------------+---+----------+--------+----------+
|Cristiano Ronaldo...| 32|0010-07-09|  565000|        10|
|Lionel Andrés Mes...| 30|0029-11-08|  565000|        29|
+--------------------+---+----------+--------+----------+
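Those values are consistent with lenient legacy parsing: with spark.sql.legacy.timeParserPolicy=LEGACY, the default pattern yyyy-MM-dd reads "05-02-1985" as year 5, month 2, day 1985, and the out-of-range day count simply rolls the date forward. A plain-Python sketch of that rollover (an illustration of the arithmetic, not Spark's actual code path):

```python
from datetime import date, timedelta

# Lenient parse of "05-02-1985" against yyyy-MM-dd: year=5, month=2,
# day=1985 -- the 1984 excess days roll the date forward.
rolled = date(5, 2, 1) + timedelta(days=1985 - 1)
print(rolled)  # 0010-07-09, the same corrupted value as above

# Same rollover for "24-06-1987": year=24, month=6, day=1987.
print(date(24, 6, 1) + timedelta(days=1987 - 1))  # 0029-11-08
```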
Can someone point me in the right direction?
Thanks.
If the column is not in the standard date format (yyyy-MM-dd), you can't declare it as a date type in the schema. Instead, read it in as a string column and then convert it to a date with to_date:
import pyspark.sql.functions as F
file_location = "/FileStore/tables/Fifa_data_Dateconvertion-1.csv"
df = spark.read.csv(file_location, header=True, inferSchema=True)
df1 = df.withColumn(
    'dob', F.to_date('dob', 'dd-MM-yyyy')  # pattern matches values like 05-02-1985
).withColumn(
    'birth_year', F.year('dob')
)