
Apache Spark: converting a string column to date in PySpark


I am trying to convert the "dob" column from string to date type so I can perform some basic operations on it in PySpark. My input is:

long_name                            age  dob         wage_eur
Cristiano Ronaldo dos Santos Aveiro  32   05-02-1985  565000
Lionel Andrés Messi Cuccittini       30   24-06-1987  565000
I created a custom schema that changes the dob column from string to date type and used it to load the dataframe, as shown below:

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
from pyspark.sql.types import IntegerType, DateType, StringType, TimestampType, StructType, StructField

peopleschema = StructType([StructField("long_name", StringType(), True),
                           StructField("age", IntegerType(), True),
                           StructField("dob", DateType(), True),
                           StructField("wage_eur", IntegerType(), True)
                          ])

file_location = "/FileStore/tables/Fifa_data_Dateconvertion-1.csv"
file_type = "csv"

infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .schema(peopleschema) \
  .load(file_location)

display(df)

long_name:string
age:integer
dob:date
wage_eur:integer


long_name                            age  dob         wage_eur
Cristiano Ronaldo dos Santos Aveiro  32   0010-07-09  565000
Lionel Andrés Messi Cuccittini       30   0029-11-08  565000
The schema is applied to the dob column, but the values come out wrong, as shown above.

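For context on why "05-02-1985" turns into 0010-07-09: with the legacy (SimpleDateFormat-style) parser, the default pattern yyyy-MM-dd reads "05" as the year, "02" as the month and "1985" as the day, and lenient parsing rolls the out-of-range day count forward. A plain-Python sketch (an illustration of the arithmetic, not Spark code) reproduces both observed values:

```python
from datetime import date, timedelta

# "05-02-1985" read against yyyy-MM-dd: year 5, month 2, day 1985.
# Lenient legacy parsing rolls the 1985 "days" forward from 0005-02-01.
print(date(5, 2, 1) + timedelta(days=1985 - 1))    # 0010-07-09
# Likewise "24-06-1987": year 24, month 6, day 1987.
print(date(24, 6, 1) + timedelta(days=1987 - 1))   # 0029-11-08
```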
So when I retrieve the year by querying the dataframe, I get the wrong values:

from pyspark.sql.functions import year

df1 = df.withColumn('birth_year', year(df.dob))

df1.show()

The result I get is:

+--------------------+---+----------+--------+----------+
|           long_name|age|       dob|wage_eur|birth_year|
+--------------------+---+----------+--------+----------+
|Cristiano Ronaldo...| 32|0010-07-09|  565000|        10|
|Lionel Andrés Mes...| 30|0029-11-08|  565000|        29|
+--------------------+---+----------+--------+----------+

Can someone guide me?

Thanks,
aa

You cannot declare the column as date type if it is not in the standard date format (yyyy-MM-dd). But you can read it in as a string column and then convert it to date type with
to_date

import pyspark.sql.functions as F

file_location = "/FileStore/tables/Fifa_data_Dateconvertion-1.csv"

# Read dob as a plain string first, then parse it with an explicit pattern.
df = spark.read.csv(file_location, header=True, inferSchema=True)
df1 = df.withColumn(
    'dob', F.to_date('dob', 'dd-MM-yyyy')   # source data is day-month-year
).withColumn(
    'birth_year', F.year('dob')
)
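As a quick sanity check outside Spark (a plain-Python sketch, not part of the original answer), the Spark pattern dd-MM-yyyy corresponds to strptime's %d-%m-%Y, and parsing the sample values with it recovers the expected years:

```python
from datetime import datetime

# Spark's 'dd-MM-yyyy' pattern maps to Python's '%d-%m-%Y'.
for s in ("05-02-1985", "24-06-1987"):
    d = datetime.strptime(s, "%d-%m-%Y").date()
    print(d.year)  # 1985, then 1987
```

Alternatively, Spark's CSV reader accepts a `dateFormat` option (e.g. `.option("dateFormat", "dd-MM-yyyy")`), which would let the original DateType schema parse the column directly at load time.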