
Apache Spark: create a Spark UDF to iterate over a byte array and convert it to a number


I have a dataframe in Spark (Python) containing a byte array column.

I am trying to convert this array to the string

'008F2B9C80'
and then get back the numeric value:

int('008F2B9C80',16)/1000000
> 2402.0
I found some UDF samples, so I can already extract a portion of the array, like this:

import pyspark.sql.functions as f  # 'f' alias used below

u = f.udf(lambda a: format(a[1], 'x'))
DF.select(u(DF['myfield'])).show()
+------------------+                                                            
|<lambda>(myfield) |
+------------------+
|                8f|
+------------------+
Now, how do I iterate over the whole array? Is it possible to do everything that needs to be coded inside the UDF function?

Maybe there is a better way to do the cast?


Thanks for your help.
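One cast-only route that may avoid a UDF entirely is Spark's built-in hex and conv functions — a minimal sketch, assuming myfield is a BinaryType column (conv goes through a string representation and only handles values that fit in 64 bits):

import pyspark.sql.functions as f

# hex() renders the binary column as a string such as '008F2B9C80';
# conv() then reinterprets that string from base 16 to base 10.
DF.select(
    (f.conv(f.hex(DF['myfield']), 16, 10).cast('double') / 1000000)
    .alias('myfield_num')
).show()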

Here is a Scala DataFrame solution. You need to import scala.math.BigInt.

scala> val df = Seq((Array("00","8F","2B","9C","80"))).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: array<string>]

scala> df.withColumn("idstr",concat_ws("",'id)).show
+--------------------+----------+
|                  id|     idstr|
+--------------------+----------+
|[00, 8F, 2B, 9C, 80]|008F2B9C80|
+--------------------+----------+


scala> import scala.math.BigInt
import scala.math.BigInt

scala> def convertBig(x:String):String = BigInt(x.sliding(2,2).map( x=> Integer.parseInt(x,16)).map(_.toByte).toArray).toString
convertBig: (x: String)String

scala> val udf_convertBig =  udf( convertBig(_:String):String )
udf_convertBig: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.withColumn("idstr",concat_ws("",'id)).withColumn("idBig",udf_convertBig('idstr)).show(false)
+--------------------+----------+----------+
|id                  |idstr     |idBig     |
+--------------------+----------+----------+
|[00, 8F, 2B, 9C, 80]|008F2B9C80|2402000000|
+--------------------+----------+----------+



There is no Spark equivalent of Scala's BigInt, so I cast the udf() result to a String.
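If a numeric column is needed downstream, the string can be cast back from PySpark — a minimal sketch, assuming the idBig column produced above and that decimal(20,0) is wide enough for the values involved:

from pyspark.sql.functions import col

# idBig comes back as a string; casting to a sufficiently wide decimal
# recovers a numeric column (decimal(20,0) is an assumption, size to your data).
df_num = df.withColumn("idNum", col("idBig").cast("decimal(20,0)"))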

I also found a Python solution:

from pyspark.sql.functions import udf
spark.udf.register('ByteArrayToDouble', lambda x: int.from_bytes(x, byteorder='big', signed=False) / 10e5)
spark.sql('select myfield, ByteArrayToDouble(myfield) myfield_python, convert_binary(hex(myfield))/1000000 myfield_scala from my_table').show(1, False)
+-------------+-----------------+----------------+
|myfield      |myfield_python   |myfield_scala   |
+-------------+-----------------+----------------+
|[52 F4 92 80]|1391.76          |1391.76         |
+-------------+-----------------+----------------+
only showing top 1 row
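For reference, the same conversion can also be used through the DataFrame API instead of SQL — a minimal sketch, assuming the table is loaded as a DataFrame df with a BinaryType column myfield (byte_array_to_double is just an illustrative name):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Same logic as the registered SQL UDF: big-endian bytes -> int -> scaled value
byte_array_to_double = udf(
    lambda b: int.from_bytes(b, byteorder='big', signed=False) / 1000000,
    DoubleType()
)

df.select('myfield', byte_array_to_double('myfield').alias('myfield_python')).show()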
I can now test both solutions.


Thanks for your valuable help.

I came across this while answering your latest question.

Assuming your df is set to
+--------------------+
|             myfield|
+--------------------+
|[00, 8F, 2B, 9C, 80]|
|    [52, F4, 92, 80]|
+--------------------+
Now you can use the following lambda function:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def func(val):
    return int("".join(val), 16) / 1000000

func_udf = udf(lambda x: func(x), FloatType())
To create the output, use

df = df.withColumn("myfield1", func_udf("myfield"))
This produces:

+--------------------+--------+
|             myfield|myfield1|
+--------------------+--------+
|[00, 8F, 2B, 9C, 80]|  2402.0|
|    [52, F4, 92, 80]| 1391.76|
+--------------------+--------+

Sounds interesting — I will now try to call the Scala UDF function from my PySpark project. Thanks! I created a UDF function and it compiles successfully:
package com.mycompany.spark.udf

import org.apache.spark.sql.api.java.UDF1
import scala.math.BigInt
import scala.util.Try

class ConvertBinaryDecimal extends UDF1[String, String] {
  override def call(TableauBinary: String): String =
    BigInt(TableauBinary.sliding(2, 2).map(x => Integer.parseInt(x, 16)).map(_.toByte).toArray).toString
}
My last question is how to call it directly with the dataframe binary field. Is it possible to convert the binary to a string inside the function? — That is what I do in the UDF; idstr is a string in my answer. — I missed convertBig(_:String):String, sorry! In fact, I would like to know whether the udf can be called directly with the binary field as its argument, rather than with the "withColumn" string result. Something like
df.withColumn("idBig", udf_convertBig('id)).show(false)
to send the binary array 'id' to the ConvertBinaryDecimal function.
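A sketch of that "binary in, number out" call without changing the Scala class: wrap the hex() conversion at the call site, assuming ConvertBinaryDecimal has been registered under the name convert_binary as in the earlier query:

from pyspark.sql.functions import expr

# hex() bridges BinaryType to the String the Scala UDF expects,
# so the caller can pass the binary column 'myfield' directly.
df_out = df.withColumn("myfield_num", expr("convert_binary(hex(myfield)) / 1000000"))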