Apache Spark: converting a date from string to date format in a DataFrame
I am trying to convert a column in string format to date format using the to_date function, but it returns null values.
df.createOrReplaceTempView("incidents")
spark.sql("select Date from incidents").show()
+----------+
|      Date|
+----------+
|08/26/2016|
|08/26/2016|
|08/26/2016|
|06/14/2016|
spark.sql("select to_date(Date) from incidents").show()
+---------------------------+
|to_date(CAST(Date AS DATE))|
+---------------------------+
|                       null|
|                       null|
|                       null|
|                       null|
The Date column is of string type:
|-- Date: string (nullable = true)
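Use to_date with a Java SimpleDateFormat pattern: parse the string with unix_timestamp, cast the result to a timestamp, then apply to_date.
For example:
spark.sql("""
SELECT TO_DATE(CAST(UNIX_TIMESTAMP('08/26/2016', 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate
""").show()
+----------+
|   newdate|
+----------+
|2016-08-26|
+----------+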
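The same idea can be sketched outside Spark in plain Python. This is only an illustration, not Spark code: the function names are hypothetical, and %m/%d/%Y is the strptime counterpart of the MM/dd/yyyy pattern. A parser that only understands yyyy-MM-dd rejects 08/26/2016, while an explicit pattern parses it.

```python
from datetime import date, datetime


def parse_iso_or_none(text):
    """Mimic a default parser that only understands yyyy-MM-dd."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        # Comparable to the null values returned in the question.
        return None


def parse_us_date(text):
    """Parse a month/day/year string with an explicit pattern."""
    return datetime.strptime(text, "%m/%d/%Y").date()


print(parse_iso_or_none("08/26/2016"))  # None
print(parse_us_date("08/26/2016"))      # 2016-08-26
```

The point is that the pattern has to describe the text as it is actually written; the default only covers the ISO form.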
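I solved the same problem without a temp table/view, using DataFrame functions.
Of course, I found that only one format works with this solution, and that is yyyy-MM-dd.
For example: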
val df = sc.parallelize(Seq("2016-08-26")).toDF("Id")
val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
val df3 = df2.withColumn("Date", (col("Id").cast("date")))
df3.printSchema
root
|-- Id: string (nullable = true)
|-- Timestamp: timestamp (nullable = true)
|-- Date: date (nullable = true)
df3.show
+----------+--------------------+----------+
|        Id|           Timestamp|      Date|
+----------+--------------------+----------+
|2016-08-26|2016-08-26 00:00:...|2016-08-26|
+----------+--------------------+----------+
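The timestamp of course has 00:00:00.0 as its time value.
You can also run this query (note that from_unixtime returns a formatted string, not a date):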
sqlContext.sql("""
select from_unixtime(unix_timestamp('08/26/2016', 'MM/dd/yyyy'), 'yyyy:MM:dd') as new_format
""").show()
Since your main goal is to convert the type of a DataFrame column from string to date or timestamp, I think this approach is better:
import org.apache.spark.sql.functions.{to_date, to_timestamp}
val modifiedDF = DF.withColumn("Date", to_date($"Date", "MM/dd/yyyy"))
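You can also use to_timestamp (I think this is available from Spark 2.x) if you need a fine-grained timestamp.
dateid is an int column that contains the date in int format. Note the uppercase MM for month; lowercase mm means minutes:
spark.sql("SELECT from_unixtime(unix_timestamp(cast(dateid as varchar(10)), 'yyyyMMdd'), 'yyyy-MM-dd') from XYZ").show(50, false)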
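As a plain-Python illustration of the same conversion (not Spark code; the function name and the sample value 20180712 are made up for this sketch), an integer date can be turned into text and parsed with a compact year-month-day pattern:

```python
from datetime import date, datetime


def date_from_int(date_id):
    """Convert an integer such as 20180712 (yyyyMMdd) to a date."""
    # %m is the month; %M would be minutes, the same trap as mm versus MM.
    return datetime.strptime(str(date_id), "%Y%m%d").date()


print(date_from_int(20180712).isoformat())  # 2018-07-12
```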
You can also pass the date format:
df.withColumn("Date",to_date(unix_timestamp(df.col("your_date_column"), "your_date_format").cast("timestamp")))
For example:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq("06 Jul 2018")).toDF("dateCol")
df.withColumn("Date",to_date(unix_timestamp(df.col("dateCol"), "dd MMM yyyy").cast("timestamp")))
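The solution above, suggested by Sai Kiriti Badam, worked for me.
I am using Azure Databricks to read data captured from an EventHub. It contains a string column named EnqueuedTimeUtc with the following format
12/7/2018 12:54:13 PM
I am using a Python notebook and used the following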
import pyspark.sql.functions as func
sports_messages = sports_df.withColumn("EnqueuedTimestamp", func.to_timestamp("EnqueuedTimeUtc", "MM/dd/yyyy hh:mm:ss aaa"))
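... to create a new column EnqueuedTimestamp of type "timestamp", with data in the following format
2018-12-07 12:54:13
Personally, I have found some errors when using unix_timestamp-based date conversions from dd-MMM-yyyy format to yyyy-MM-dd with Spark 1.6, and this may extend to more recent versions. Below I explain a way to solve the problem using java.time that should work in all versions of Spark:
I saw errors when doing:
from_unixtime(unix_timestamp(StockMarketClosingDate, 'dd-MMM-yyyy'), 'yyyy-MM-dd') as FormattedDate
Below is code that illustrates the error, and my solution.
First I read in stock market data in a common standard file format: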
import sys.process._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
import sqlContext.implicits._
val EODSchema = StructType(Array(
StructField("Symbol" , StringType, true), //$1
StructField("Date" , StringType, true), //$2
StructField("Open" , StringType, true), //$3
StructField("High" , StringType, true), //$4
StructField("Low" , StringType, true), //$5
StructField("Close" , StringType, true), //$6
StructField("Volume" , StringType, true) //$7
))
val textFileName = "/user/feeds/eoddata/INDEX/INDEX_19*.csv"
// below is code to read using later versions of spark
//val eoddata = spark.read.format("csv").option("sep", ",").schema(EODSchema).option("header", "true").load(textFileName)
// here is code to read using 1.6, via, "com.databricks:spark-csv_2.10:1.2.0"
val eoddata = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("delimiter", ",") //.option("dateFormat", "dd-MMM-yyyy") failed to work
.schema(EODSchema)
.load(textFileName)
eoddata.registerTempTable("eoddata")
Here are the problematic date conversions:
%sql
-- notice there are errors around the turn of the year
Select
e.Date as StringDate
, cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date) as ProperDate
, e.Close
from eoddata e
where e.Symbol = 'SPX.IDX'
order by cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)
limit 1000
A chart made in Zeppelin shows spikes, which are errors.
Here is the check that shows the date conversion errors:
// shows the unix_timestamp conversion approach can create errors
val result = sqlContext.sql("""
Select errors.* from
(
Select
t.*
, substring(t.OriginalStringDate, 8, 11) as String_Year_yyyy
, substring(t.ConvertedCloseDate, 0, 4) as Converted_Date_Year_yyyy
from
( Select
Symbol
, cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date) as ConvertedCloseDate
, e.Date as OriginalStringDate
, Close
from eoddata e
where e.Symbol = 'SPX.IDX'
) t
) errors
where String_Year_yyyy <> Converted_Date_Year_yyyy
""")
//df.withColumn("tx_date", to_date(unix_timestamp($"date", "M/dd/yyyy").cast("timestamp")))
result.registerTempTable("SPX")
result.cache()
result.show(100)
result: org.apache.spark.sql.DataFrame = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
res53: result.type = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
+-------+------------------+------------------+-------+----------------+------------------------+
| Symbol|ConvertedCloseDate|OriginalStringDate|  Close|String_Year_yyyy|Converted_Date_Year_yyyy|
+-------+------------------+------------------+-------+----------------+------------------------+
|SPX.IDX|        1997-12-30|       30-Dec-1996| 753.85|            1996|                    1997|
|SPX.IDX|        1997-12-31|       31-Dec-1996| 740.74|            1996|                    1997|
|SPX.IDX|        1998-12-29|       29-Dec-1997| 953.36|            1997|                    1998|
|SPX.IDX|        1998-12-30|       30-Dec-1997| 970.84|            1997|                    1998|
|SPX.IDX|        1998-12-31|       31-Dec-1997| 970.43|            1997|                    1998|
|SPX.IDX|        1998-01-01|       01-Jan-1999|1229.23|            1999|                    1998|
+-------+------------------+------------------+-------+----------------+------------------------+
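One plausible explanation for these rows, offered as an assumption rather than something the post states: in Java-style date patterns, uppercase YYYY is the week-based year, while lowercase yyyy is the calendar year. The two differ only for a few days around New Year, which is where every mismatch above sits. The plain-Python sketch below (not Spark code; the function name is made up) compares the calendar year with the ISO week-based year for some of the dates in the table.

```python
from datetime import date


def calendar_and_week_year(d):
    """Return (calendar year, ISO week-based year) for a date."""
    iso_year = d.isocalendar()[0]
    return d.year, iso_year


for d in (date(1996, 12, 30), date(1997, 12, 29),
          date(1999, 1, 1), date(1996, 6, 14)):
    # The two years agree except close to the turn of the year.
    print(d.isoformat(), calendar_and_week_year(d))
```

If this is indeed the cause, writing the output pattern with lowercase yyyy should avoid the problem without a custom function.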
I then register my java.time-based conversion function, fromEODDate, for use in SQL:
sqlContext.udf.register("fromEODDate", fromEODDate(_:String))
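The body of fromEODDate does not appear in this thread. As a rough idea of what such a conversion does, here is a hypothetical plain-Python stand-in (the name from_eod_date and its internals are assumptions, not the author's code). It reads dd-MMM-yyyy text and returns yyyy-MM-dd, always working with the calendar year.

```python
from datetime import date

# English month abbreviations, spelled out so parsing does not depend on locale.
_MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def from_eod_date(text):
    """Hypothetical stand-in: turn 'dd-MMM-yyyy' into 'yyyy-MM-dd'."""
    day, month_abbr, year = text.strip().split("-")
    # date() validates the components; no week-based year is involved.
    return date(int(year), _MONTHS[month_abbr.title()], int(day)).isoformat()


print(from_eod_date("30-Dec-1996"))  # 1996-12-30
print(from_eod_date("01-Jan-1999"))  # 1999-01-01
```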
Then I check the results and rerun the test:
val results = sqlContext.sql("""
Select
e.Symbol as Symbol
, e.Date as OrigStringDate
, Cast(fromEODDate(e.Date) as Date) as ConvertedDate
, e.Open
, e.High
, e.Low
, e.Close
from eoddata e
order by Cast(fromEODDate(e.Date) as Date)
""")
results.printSchema()
results.cache()
results.registerTempTable("results")
results.show(10)
results: org.apache.spark.sql.DataFrame = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
root
|-- Symbol: string (nullable = true)
|-- OrigStringDate: string (nullable = true)
|-- ConvertedDate: date (nullable = true)
|-- Open: string (nullable = true)
|-- High: string (nullable = true)
|-- Low: string (nullable = true)
|-- Close: string (nullable = true)
res79: results.type = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
+--------+--------------+-------------+-------+-------+-------+-------+
|  Symbol|OrigStringDate|ConvertedDate|   Open|   High|    Low|  Close|
+--------+--------------+-------------+-------+-------+-------+-------+
|ADVA.IDX|   01-Jan-1996|   1996-01-01|    364|    364|    364|    364|
|ADVN.IDX|   01-Jan-1996|   1996-01-01|   1527|   1527|   1527|   1527|
|ADVQ.IDX|   01-Jan-1996|   1996-01-01|   1283|   1283|   1283|   1283|
|BANK.IDX|   01-Jan-1996|   1996-01-01|1009.41|1009.41|1009.41|1009.41|
| BKX.IDX|   01-Jan-1996|   1996-01-01|  39.39|  39.39|  39.39|  39.39|
|COMP.IDX|   01-Jan-1996|   1996-01-01|1052.13|1052.13|1052.13|1052.13|
| CPR.IDX|   01-Jan-1996|   1996-01-01|  1.261|  1.261|  1.261|  1.261|
|DECA.IDX|   01-Jan-1996|   1996-01-01|    205|    205|    205|    205|
|DECN.IDX|   01-Jan-1996|   1996-01-01|    825|    825|    825|    825|
|DECQ.IDX|   01-Jan-1996|   1996-01-01|    754|    754|    754|    754|
+--------+--------------+-------------+-------+-------+-------+-------+
only showing top 10 rows
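This looks good, so I rerun my chart to see whether there are any errors/spikes:
As you can see, there are no more spikes or errors. I now use a UDF, as shown, to apply my date conversions to the standard yyyy-MM-dd format, and have not had spurious errors since. :-)
The code below might be helpful for you.
import org.apache.spark.sql.functions.{to_date, unix_timestamp}
import spark.implicits._
val stringDate = spark.sparkContext.parallelize(Seq("12/16/2019")).toDF("StringDate")
// The input is month/day/year, so the pattern is MM/dd/yyyy (mm would mean minutes)
val dateConversion = stringDate.withColumn("dateColumn", to_date(unix_timestamp($"StringDate", "MM/dd/yyyy").cast("Timestamp")))
dateConversion.show(false)
+----------+----------+
|StringDate|dateColumn|
+----------+----------+
|12/16/2019|2019-12-16|
+----------+----------+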
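Pattern letters are case-sensitive, and the month/minute pair is easy to mix up: in Spark's Java-style patterns MM is the month and mm is minutes. The plain-Python sketch below (not Spark code; parse_with is a made-up helper) shows the analogous mistake with strptime, where %m is the month and %M is minutes. With the minute letter the month is never read and falls back to January, so 12/16/2019 silently becomes 12 January 2019.

```python
from datetime import datetime


def parse_with(pattern, text):
    """Parse text with a strptime pattern and return the ISO date part."""
    return datetime.strptime(text, pattern).date().isoformat()


# %M is minutes: day=12, minute=16, and the month defaults to January.
print(parse_with("%d/%M/%Y", "12/16/2019"))  # 2019-01-12
# %m is the month, given in the order the text actually uses.
print(parse_with("%m/%d/%Y", "12/16/2019"))  # 2019-12-16
```

No error is raised in the first case, which is what makes this kind of bug hard to spot.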
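In PySpark, use the function below to convert columns to the required data type.
Here I convert every column of type date to a timestamp column.
from pyspark.sql.functions import col

def change_dtype(df):
    # Cast each date column to timestamp; other columns are left unchanged
    for name, dtype in df.dtypes:
        if dtype == "date":
            df = df.withColumn(name, col(name).cast("timestamp"))
    return df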
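You can simply do df.withColumn("date", date_format(col("string"), "yyyy-MM-dd HH:mm:ss.SSSSSS")).show()
Try this: raw_data['mycl'] = pd.to_datetime(raw_data['mycl'], format='%d%b%Y:%H:%M.%S.%f')
Both of these comments point to answers that use pandas DataFrames rather than Spark DataFrames. While the two DataFrame formats are interchangeable, converting to pandas is expensive on large datasets and negates many of the benefits Spark provides (such as running the transformation on a distributed Spark cluster).
Both functions, to_date with a format argument and to_timestamp, are available only from version 2.2 onward.
Error: value to_date is not a member of org.apache.spark.sql.DataFrame. Any suggestions?
@MapReddy please import org.apache.spark.sql.functions._
@AmitDubey, what if I only need the timestamp as "yyyy-MM-dd-HH", that is, down to the hour? If I do to_timestamp(current_timestamp(), "yyyy-MM-dd-HH") I get something like "2018-11-26 02:36:26". How can I get it in "yyyy-MM-dd-HH" format?
I found that to_date(my_string_column, 'yyyyMMdd') as my_date_column works well in Spark 2.3.2; of course, you can substitute your own date format for yyyyMMdd.
The linked answer works: df.withColumn("tx_date", to_date(unix_timestamp($"date", "M/dd/yyyy").cast("timestamp")))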
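On the question above about keeping only "yyyy-MM-dd-HH": a timestamp value carries no display format of its own, so an hour-level form has to be produced as a string. In Spark, date_format is the function that formats a timestamp as a string. The plain-Python sketch below (not Spark code; hour_key is a made-up name) shows the same distinction with strftime.

```python
from datetime import datetime


def hour_key(ts):
    """Format a timestamp as 'yyyy-MM-dd-HH'; the result is a string."""
    return ts.strftime("%Y-%m-%d-%H")


print(hour_key(datetime(2018, 11, 26, 2, 36, 26)))  # 2018-11-26-02
```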