Apache Spark: replace negative values with zero in PySpark
I'm looking for help replacing negative values (differences between timestamps) with zero. Running Python 3 on Spark. Here is my code:
from pyspark.sql.functions import col, lit, unix_timestamp, when

timeFmt = "yyyy-MM-dd HH:mm:ss"  # mm = minutes; MM would parse months

time_diff_1 = when(
    col("time1").isNotNull() & col("time2").isNotNull(),
    (unix_timestamp("time2", format=timeFmt) - unix_timestamp("time1", format=timeFmt)) / 60
).otherwise(lit(0))
time_diff_2 = when(
    col("time2").isNotNull() & col("time3").isNotNull(),
    (unix_timestamp("time3", format=timeFmt) - unix_timestamp("time2", format=timeFmt)) / 60
).otherwise(lit(0))
time_diff_3 = when(
    col("time3").isNotNull() & col("time4").isNotNull(),
    (unix_timestamp("time4", format=timeFmt) - unix_timestamp("time3", format=timeFmt)) / 60
).otherwise(lit(0))

df = (df
      .withColumn('time_diff_1', time_diff_1)
      .withColumn('time_diff_2', time_diff_2)
      .withColumn('time_diff_3', time_diff_3)
      )

df = (df
      .withColumn('time_diff_1', when(col('time_diff_1') < 0, 0).otherwise(col('time_diff_1')))
      .withColumn('time_diff_2', when(col('time_diff_2') < 0, 0).otherwise(col('time_diff_2')))
      .withColumn('time_diff_3', when(col('time_diff_3') < 0, 0).otherwise(col('time_diff_3')))
      )
When I run the code above, I get an error. Here is the error:

Py4JJavaError: An error occurred while calling o1083.showString:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 56.0 failed 4 times, most recent failure: Lost task
0.3 in stage 56.0 (TID 7246, fxhclxcdh8.dftz.local, executor 21):
org.codehaus.janino.JaninoRuntimeException: failed to compile:
org.codehaus.janino.JaninoRuntimeException: Code of method
"apply9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
of class
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
grows beyond 64 KB
(generated-code dump omitted)
Can anyone help? — I think the simpler way is to write a small UDF (user-defined function) and apply it to the columns you need. Here is sample code:
import pyspark.sql.functions as f
from pyspark.sql.types import LongType

correctNegativeDiff = f.udf(lambda diff: 0 if diff < 0 else diff, LongType())
df = df.withColumn('time_diff_1', correctNegativeDiff(df.time_diff_1))\
       .withColumn('time_diff_2', correctNegativeDiff(df.time_diff_2))\
       .withColumn('time_diff_3', correctNegativeDiff(df.time_diff_3))
Thanks, this code helped me solve the problem. Just one small issue: instead of returning 0 it returns null. — You're welcome. I think that's because of LongType(); changing it to IntegerType() or FloatType() may solve your problem! — I changed it to DoubleType, since my values are doubles, but I still get null instead of 0. Thanks. — Changing it like this may solve it: f.udf(lambda diff: 0 if diff