
How does Spark handle the timestamp type during dataframe conversion?

Tags: python, datetime, numpy, apache-spark, pyspark


I have a pandas dataframe with a timestamp column of type pandas.tslib.Timestamp. I checked the pyspark source code for createDataFrame, and it seems to convert the data to a numpy record array and then to a list:

data = [r.tolist() for r in data.to_records(index=False)]
However, in the process the timestamp type is converted to a list of longs:

> df = pd.DataFrame(pd.date_range(start=datetime.datetime.now(),periods=5,freq='s'))
> df
0 2017-07-25 11:53:29.353923
1 2017-07-25 11:53:30.353923
2 2017-07-25 11:53:31.353923
3 2017-07-25 11:53:32.353923
4 2017-07-25 11:53:33.353923
> df.to_records(index=False).tolist()
[(1500983799614193000L,), (1500983800614193000L,), (1500983801614193000L,), (1500983802614193000L,), (1500983803614193000L,)]
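The longs appear because pandas stores datetimes as numpy datetime64[ns], and Python's datetime.datetime only resolves down to microseconds, so numpy's tolist() falls back to raw integer nanoseconds since the epoch. A minimal sketch of that numpy behavior (dates chosen for illustration):

> import numpy as np
> # microsecond precision converts cleanly to datetime.datetime ...
> np.array(['2017-07-25'], dtype='datetime64[us]').tolist()
[datetime.datetime(2017, 7, 25, 0, 0)]
> # ... but nanosecond precision (pandas' default) yields raw integers
> np.array(['2017-07-25'], dtype='datetime64[ns]').tolist()
[1500940800000000000]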
Now, if I pass such a list to an RDD, do some operations (without touching the timestamp column), and then call

> spark.createDataFrame(rdd,schema) // with schema mentioning that column as TimestampType
TypeError: TimestampType can not accept object 1465197332112000000L in type <type 'long'>
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
What should I do (before converting the list to an RDD) to preserve the datetime type?

Edit 1

Some approaches I know of that involve post-processing after the dataframe is created:

  • Add timezone information to the datetime objects in pandas. This seems unnecessary, though, and may cause errors depending on your own timezone.

  • Use the datetime library to convert the longs back to timestamps. Assuming tstampl is the input: tstamp = datetime(1970, 1, 1) + timedelta(microseconds=tstampl / 1000). (A sketch follows this list.)

  • Convert the datetime to a string on the pandas dataframe side, then convert it back to datetime on the Spark dataframe side, as described in Suresh's answer below.
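A minimal sketch of that second approach, assuming the longs are the nanosecond values produced by to_records() and that the timestamp sits in the first field of each row (the column position, rdd, and schema are placeholders for illustration):

from datetime import datetime, timedelta

def long_to_timestamp(tstampl):
    # tstampl is nanoseconds since the epoch; timedelta only supports
    # microseconds, so divide by 1000 (sub-microsecond precision is lost)
    return datetime(1970, 1, 1) + timedelta(microseconds=tstampl / 1000)

rdd = rdd.map(lambda row: (long_to_timestamp(row[0]),) + tuple(row[1:]))
sdf = spark.createDataFrame(rdd, schema)  # TimestampType now accepts the values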


However, I am looking for a simpler method that takes care of all the processing before the dataframe is created.
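One way to do the conversion up front (a sketch, not taken from the thread: the column name 'ts' and the spark/sc variables are assumptions) is to skip to_records() entirely and build the rows from Series.dt.to_pydatetime(), which returns plain datetime.datetime objects that TimestampType accepts directly:

import datetime
import pandas as pd
from pyspark.sql.types import StructType, StructField, TimestampType

df = pd.DataFrame(pd.date_range(start=datetime.datetime.now(), periods=5, freq='s'),
                  columns=['ts'])

# dt.to_pydatetime() yields datetime.datetime objects (truncated to microsecond
# precision), avoiding the raw nanosecond longs from to_records().tolist()
rows = [(ts,) for ts in df['ts'].dt.to_pydatetime()]

schema = StructType([StructField('ts', TimestampType(), True)])
sdf = spark.createDataFrame(sc.parallelize(rows), schema)
sdf.printSchema()  # root |-- ts: timestamp (nullable = true)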

I tried converting the timestamp column to string type and applying tolist() on the pandas Series, then using the list to create a Spark dataframe and converting the column back to timestamp:

    >>> df = pd.DataFrame(pd.date_range(start=datetime.datetime.now(),periods=5,freq='s'))
    >>> df
                        0
    0 2017-07-25 21:51:53.963
    1 2017-07-25 21:51:54.963
    2 2017-07-25 21:51:55.963
    3 2017-07-25 21:51:56.963
    4 2017-07-25 21:51:57.963
    
    >>> df1 = df[0].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
    >>> type(df1)
    <class 'pandas.core.series.Series'>
    >>> df1.tolist()
    ['2017-07-25 21:51:53', '2017-07-25 21:51:54', '2017-07-25 21:51:55', '2017-07-25 21:51:56', '2017-07-25 21:51:57']
    
     >>> from pyspark.sql.types import StringType, TimestampType
     >>> sdf = spark.createDataFrame(df1.tolist(),StringType())
     >>> sdf.printSchema()
     root
        |-- value: string (nullable = true)
     >>> sdf = sdf.select(sdf['value'].cast('timestamp'))
     >>> sdf.printSchema()
     root
        |-- value: timestamp (nullable = true)
    
     >>> sdf.show(5,False)
     +---------------------+
     |value                |
     +---------------------+
     |2017-07-25 21:51:53.0|
     |2017-07-25 21:51:54.0|
     |2017-07-25 21:51:55.0|
     |2017-07-25 21:51:56.0|
     |2017-07-25 21:51:57.0|
     +---------------------+
    
Yes, I am aware of this approach, and of another that re-converts the longs to timestamps (which I am using at the moment). The problem is that all of these methods need some kind of post-processing after the dataframe conversion, and that is exactly what I want to avoid.

Do you do the long-to-timestamp conversion in Spark, dividing by 1000000000 and casting to timestamp?

No. Assuming tstampl is the input: tstamp = datetime(1970, 1, 1) + timedelta(microseconds=tstampl / 1000). I will add that to the edit. I am really hoping for a more efficient answer.