Apache Spark: time zone problem with spark.sql time functions

Tags: apache-spark, datetime, pyspark, apache-spark-sql, timezone

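I am writing some code in a Jupyter notebook using Spark 2.4.7 and PySpark, running in standalone mode.
I need to convert some timestamps to Unix time to perform some operations, but I noticed some strange behavior. The code I am running is as follows: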

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from datetime import datetime, timedelta, date

spark = SparkSession.builder \
        .appName("test") \
        .master(n_spark_master)\
        .config("spark.total.executor.cores",n_spark_cores_max)\
        .config("spark.cores.max", n_spark_cores_max)\
        .config("spark.executor.memory",n_spark_executor_memory)\
        .config("spark.executor.cores",n_spark_executor_cores)\
        .enableHiveSupport() \
        .getOrCreate()

print(datetime.now().astimezone().tzinfo)

df = spark.createDataFrame([
    (1, "a"),
    (2, "b"),
    (3, "c"), ], ["dummy1", "dummy2"])

epoch = datetime.utcfromtimestamp(0)
df = df.withColumn('epoch', F.lit(epoch))
timeFmt = '%Y-%m-%dT%H:%M:%S'
df = df.withColumn('unix_time_epoch', F.unix_timestamp('epoch', format=timeFmt))
df.show()
Output:

CET
+------+------+-------------------+---------------+
|dummy1|dummy2|              epoch|unix_time_epoch|
+------+------+-------------------+---------------+
|     1|     a|1970-01-01 00:00:00|          -3600|
|     2|     b|1970-01-01 00:00:00|          -3600|
|     3|     c|1970-01-01 00:00:00|          -3600|
+------+------+-------------------+---------------+
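According to the Spark 2.4.7 documentation:

pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
Convert time string with given pattern ('yyyy-MM-dd HH:mm:ss', by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.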

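The output of the earlier command print(datetime.now().astimezone().tzinfo) is CET. It should give me the local time zone, and this is indeed the correct time zone for the machine, since I am in UTC+1.
In the Spark UI I can also clearly see user.timezone=Europe/Rome.

Nevertheless, Spark still appears to convert from UTC+1 to UTC, so I get unix_time_epoch = -3600, whereas I expected unix_time_epoch = 0.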
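
To make the arithmetic concrete, here is a minimal plain-Python sketch (not Spark itself) of what happens when the wall-clock value 1970-01-01 00:00:00 is interpreted in a UTC+1 zone rather than in UTC. The fixed offset named cet is an assumption for illustration and stands in for the machine's CET zone.

```python
from datetime import datetime, timedelta, timezone

# Wall-clock value with no zone attached, like the 'epoch' column above.
naive_epoch = datetime(1970, 1, 1, 0, 0, 0)

# Assumption for illustration: a fixed UTC+1 offset standing in for CET.
cet = timezone(timedelta(hours=1), "CET")

# Attach a zone to the same wall-clock value, then convert to Unix seconds.
as_utc = int(naive_epoch.replace(tzinfo=timezone.utc).timestamp())
as_cet = int(naive_epoch.replace(tzinfo=cet).timestamp())

print(as_utc)  # 0
print(as_cet)  # -3600
```

Midnight in a UTC+1 zone is 23:00 UTC on the previous day, which is one hour (3600 seconds) before the Unix epoch.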

I tried switching to UTC, as suggested in other threads, as follows:

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from datetime import datetime, timedelta, date
import os
import time

os.environ['TZ'] = 'Europe/London'
time.tzset()

spark = SparkSession.builder \
        .appName("test") \
        .master(n_spark_master)\
        .config("spark.total.executor.cores",n_spark_cores_max)\
        .config("spark.cores.max", n_spark_cores_max)\
        .config("spark.executor.memory",n_spark_executor_memory)\
        .config("spark.executor.cores",n_spark_executor_cores)\
        .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC') \
        .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC') \
        .config('spark.sql.session.timeZone', 'UTC') \
        .enableHiveSupport() \
        .getOrCreate()

print(datetime.now().astimezone().tzinfo)

df = spark.createDataFrame([
    (1, "a"),
    (2, "b"),
    (3, "c"),
], ["dummy1", "dummy2"])

epoch = datetime.utcfromtimestamp(0)
df = df.withColumn('epoch', F.lit(epoch))
timeFmt = '%Y-%m-%dT%H:%M:%S'
df = df.withColumn('unix_time_epoch',F.unix_timestamp('epoch', format=timeFmt))
df.show()
But the result is:

GMT
+------+------+-------------------+---------------+
|dummy1|dummy2|              epoch|unix_time_epoch|
+------+------+-------------------+---------------+
|     1|     a|1969-12-31 23:00:00|          -3600|
|     2|     b|1969-12-31 23:00:00|          -3600|
|     3|     c|1969-12-31 23:00:00|          -3600|
+------+------+-------------------+---------------+
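What I want to achieve is to evaluate everything in UTC, without taking time zone offsets into account, because here in Rome the local time zone changes between UTC+1 and UTC+2 over the course of the year. The expected output should look like this: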

+------+------+-------------------+---------------+
|dummy1|dummy2|              epoch|unix_time_epoch|
+------+------+-------------------+---------------+
|     1|     a|1970-01-01 00:00:00|              0|
|     2|     b|1970-01-01 00:00:00|              0|
|     3|     c|1970-01-01 00:00:00|              0|
+------+------+-------------------+---------------+

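You should use os.environ['TZ'] = 'UTC' instead of 'Europe/London'.

In 1970 the United Kingdom was in the middle of its "British Standard Time" experiment: from 27 October 1968 to 31 October 1971, the UK stayed on GMT+1. That is why your time is off by one hour.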
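
The historical offset can be checked from plain Python, independently of Spark. The sketch below assumes Python 3.9+ (for the standard-library zoneinfo module) and an installed tz database; the helper names utc_offset and unix_seconds are made up for this illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+; relies on an installed tz database


def utc_offset(year, tz_name):
    """Return the UTC offset in effect on 1 January of `year` in `tz_name`."""
    return datetime(year, 1, 1, tzinfo=ZoneInfo(tz_name)).utcoffset()


def unix_seconds(naive, tz_name):
    """Interpret a naive datetime as wall-clock time in `tz_name`, as Unix seconds."""
    return int(naive.replace(tzinfo=ZoneInfo(tz_name)).timestamp())


# During the British Standard Time experiment, London was one hour ahead of UTC.
print(utc_offset(1970, "Europe/London"))  # 1:00:00
print(utc_offset(1972, "Europe/London"))  # 0:00:00

# The same wall-clock value maps to different Unix times in the two zones.
print(unix_seconds(datetime(1970, 1, 1), "Europe/London"))  # -3600
print(unix_seconds(datetime(1970, 1, 1), "UTC"))  # 0
```

Because Europe/London only coincides with UTC for part of its history, setting TZ to UTC is the safer choice when the goal is to evaluate everything in UTC.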