Apache Spark / PySpark: converting a string column to a date
I'm trying to convert a "dob" column from string to the date data type so I can perform some basic operations in PySpark. My input is:
long_name age dob wage_eur
Cristiano Ronaldo dos Santos Aveiro 32 05-02-1985 565000
Lionel Andrés Messi Cuccittini 30 24-06-1987 565000
I created a custom schema that declares the dob column as date instead of string, and used it to read the dataframe as follows:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
from pyspark.sql.types import IntegerType , DateType , StringType , TimestampType, StructType , StructField
peopleschema = StructType([StructField("long_name",StringType(),True),
StructField("age",IntegerType(),True),
StructField("dob",DateType(),True),
StructField("wage_eur",IntegerType(),True)
])
file_location = "/FileStore/tables/Fifa_data_Dateconvertion-1.csv"
file_type = "csv"
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.schema(peopleschema) \
.load(file_location)
display(df)
long_name:string
age:integer
dob:date
wage_eur:integer
long_name age dob wage_eur
Cristiano Ronaldo dos Santos Aveiro 32 0010-07-09 565000
Lionel Andrés Messi Cuccittini 30 0029-11-08 565000
The schema for the dob column was converted, but the values came out mangled, as shown above.
So when I query the dataframe to retrieve the year, I get the wrong values:
from pyspark.sql.functions import col, year, to_date
df1 = df.withColumn('birth_year',year(df.dob))
df1.show()
The result I get is:
+--------------------+---+----------+--------+----------+
|           long_name|age|       dob|wage_eur|birth_year|
+--------------------+---+----------+--------+----------+
|Cristiano Ronaldo...| 32|0010-07-09|  565000|        10|
|Lionel Andrés Mes...| 30|0029-11-08|  565000|        29|
+--------------------+---+----------+--------+----------+
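Those values are consistent with lenient legacy parsing: with spark.sql.legacy.timeParserPolicy=LEGACY, the default pattern yyyy-MM-dd reads "05-02-1985" as year 5, month 2, day 1985, and the out-of-range day count simply rolls the date forward. A plain-Python sketch of that rollover (an illustration of the arithmetic, not Spark's actual code path):

```python
from datetime import date, timedelta

# Lenient parse of "05-02-1985" against yyyy-MM-dd: year=5, month=2,
# day=1985 -- the 1984 excess days roll the date forward.
rolled = date(5, 2, 1) + timedelta(days=1985 - 1)
print(rolled)  # 0010-07-09, the same corrupted value as above

# Same rollover for "24-06-1987": year=24, month=6, day=1987.
print(date(24, 6, 1) + timedelta(days=1987 - 1))  # 0029-11-08
```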
Can someone point me in the right direction?
Thanks.
If the column is not in the standard date format (yyyy-MM-dd), you can't declare it as a date type in the schema. Instead, read it in as a string column and then convert it to a date with to_date:
import pyspark.sql.functions as F
file_location = "/FileStore/tables/Fifa_data_Dateconvertion-1.csv"
df = spark.read.csv(file_location, header=True, inferSchema=True)
df1 = df.withColumn(
    'dob', F.to_date('dob', 'dd-MM-yyyy')  # pattern matches values like 05-02-1985
).withColumn(
    'birth_year', F.year('dob')
)