PySpark: generate a row hash of specific columns and add it as a new column

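I am using Spark 2.2.0 and PySpark 2.

I have created a DataFrame df and am now trying to add a new column "rowhash" that is the SHA-2 hash of specific columns in the DataFrame.

For example, say that df has the columns (column1, column2, ..., column10).

I need sha2((column2||column3||column4||column8), 256) in a new column "rowhash".

So far I have tried the following:

1) The hash() function, but since it returns an integer it is not much use here.

2) The sha2() function, but it fails.

Say columnarray holds the array of columns I need:

    def concat(columnarray):
        concat_str = ''
        for val in columnarray:
            concat_str = concat_str + '||' + str(val)
        concat_str = concat_str[2:]
        return concat_str

and then

    df1 = df1.withColumn("row_sha2", sha2(concat(columnarray), 256))

This fails with a "cannot resolve" error.

Thanks for your answer. Since I have to hash only specific columns, I created a list of those column names (in hash_col) and changed your function to:

    def sha_concat(row, columnarray):
        row_dict = row.asDict()  # transform the row to a dict
        concat_str = ''
        for v in columnarray:
            concat_str = concat_str + '||' + str(row_dict.get(v))
        concat_str = concat_str[2:]
        # preserve the concatenated value for testing (can be removed later)
        row_dict["sha_values"] = concat_str
        row_dict["sha_hash"] = hashlib.sha256(concat_str).hexdigest()
        return Row(**row_dict)

and then called it as:

    df1.rdd.map(lambda row: sha_concat(row, hash_col)).toDF().show(truncate=False)

However, it now fails with the error:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 8: ordinal not in range(128)

I can see the value \ufffd in one of the columns, so I am not sure whether there is a way to handle this?
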
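If you want a hash computed over the values in different columns of your dataset, you can apply a self-designed function via map to the RDD of your DataFrame.

    import hashlib
    from pyspark.sql import Row  # needed to rebuild the row below

    test_df = spark.createDataFrame([
        (1, "2", 5, 1), (3, "4", 7, 8),
    ], ("col1", "col2", "col3", "col4"))

    def sha_concat(row):
        row_dict = row.asDict()        # transform the row to a dict
        columnarray = row_dict.keys()  # get the column names
        concat_str = ''
        for v in row_dict.values():
            concat_str = concat_str + '||' + str(v)  # concatenate the values
        concat_str = concat_str[2:]    # drop the leading '||'
        row_dict["sha_values"] = concat_str  # preserve the concatenated value for testing (can be removed later)
        # hashlib needs bytes on Python 3; encoding ASCII text is harmless on Python 2
        row_dict["sha_hash"] = hashlib.sha256(concat_str.encode("utf-8")).hexdigest()  # calculate sha256
        return Row(**row_dict)

    test_df.rdd.map(sha_concat).toDF().show(truncate=False)

The result looks like this:

    +----+----+----+----+----------------------------------------------------------------+----------+
    |col1|col2|col3|col4|sha_hash                                                        |sha_values|
    +----+----+----+----+----------------------------------------------------------------+----------+
    |1   |2   |5   |1   |1be4b8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd    |1||2||5||1|
    |3   |4   |7   |8   |cb8f8c5d9fd7165cf3c0f019e0fb10fa0e8f147960c715b7f6a60e149d3923a5|8||4||7||3|
    +----+----+----+----+----------------------------------------------------------------+----------+

The values are joined in the dict's iteration order, which need not match the column order: the second row gives 8||4||7||3, not 3||4||7||8.
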
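One possible way to handle both points raised in the follow-up (hashing only selected columns, and the UnicodeEncodeError) is sketched here in plain Python, without Spark, so the per-row logic can be checked on its own. The helper name hash_selected and the sample row are made up for illustration, and the sketch assumes Python 3. On Python 2, which the question appears to use, it is str() that raises on u'\ufffd', so unicode(...) would be needed in its place.

```python
import hashlib


def hash_selected(row_dict, columns, sep="||"):
    """Join the chosen column values in the given order and return a SHA-256 hex digest."""
    # Iterate over the explicit column list, not the dict, so the order is fixed.
    joined = sep.join(str(row_dict[c]) for c in columns)
    # Encode explicitly as UTF-8 so non-ASCII characters never hit the ASCII codec.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical row containing the replacement character from the error message.
row = {"col1": 1, "col2": "caf\ufffd", "col3": 5, "col4": 1}
digest = hash_selected(row, ["col2", "col3"])
print(digest)
```

The same helper could be called from sha_concat on row.asDict(), with the list of column names passed in as the second argument.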
You can use pyspark.sql.functions.concat_ws to concatenate the columns and pyspark.sql.functions.sha2 to get the SHA256 hash.

Using the data from @gaw:

    from pyspark.sql.functions import sha2, concat_ws

    df = spark.createDataFrame(
        [(1, "2", 5, 1), (3, "4", 7, 8)],
        ("col1", "col2", "col3", "col4")
    )

    df.withColumn("row_sha2", sha2(concat_ws("||", *df.columns), 256)).show(truncate=False)
    #+----+----+----+----+----------------------------------------------------------------+
    #|col1|col2|col3|col4|row_sha2                                                        |
    #+----+----+----+----+----------------------------------------------------------------+
    #|1   |2   |5   |1   |1b0ae4bb8c