How to create a numeric signature for one-hot encoded values and hashlib hashes in Python?
I want to assign a number to each one-hot encoded value in a DataFrame:
import pandas as pd
scale = df.ServiceSubCodeKey.max() + 1
onehot = []
for claimid, ssc in df.groupby('ClaimId').ServiceSubCodeKey:
    ssc_list = ssc.to_list()
    onehot.append([claimid,
                   ''.join(['1' if i in ssc_list else '0' for i in range(1, scale)])])
onehot = pd.DataFrame(onehot, columns=['ClaimId', 'onehot'])
print(onehot)
onehot
Out[25]:
ClaimId onehot
0 1902659 0000000000000000000000000000000000000000000000...
1 1902663 0000000000000000000000000000000000000000000000...
2 1902674 0000000000010000000000100000000000000000100000...
3 1904129 0000000000000000000000100000000000000000000000...
4 1904130 0000000000000000000010000000000000000000000000...
... ...
626853 2592904 0000000000000000000000100000000000000000000000...
626854 2592920 0000000000000000000000100000000000000000000000...
626855 2593386 0000000000000000000000000000000000000000000000...
626856 2593387 0000000000000000000000000000000000000000000000...
626857 2593533 0000000000000000000000000000000000000000000000...
I want each one-hot encoded value to map to a unique number, with duplicate values sharing the same number. How can I do this?
Similarly, I created hashes with hashlib:
import hashlib
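# Combine the two ID columns into one integer, then hash its string form
hashes1 = df2.apply(lambda x: hashlib.sha1(str(x['ClaimId'] * 1024 + x['SubDiagnosisId']).encode('utf8')).hexdigest(), axis=1)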
# Create a DataFrame from the above Series
df_hash = pd.DataFrame(hashes1, columns=['hash'])
df2 = df2.join(df_hash)
df2
Out[24]:
ClaimId SubDiagnosisId hash
0 2094825 141 ad0334de4a944401aa6c847b06246d553362b45a
1 2259956 155 8b9eb6f311d4a9f98dedb32dae7a2effeaf46fe9
2 2327668 583 ef87b808734992ddfd480a87eb1fe7269111062f
3 1985370 100 7a0907f4818a3edb3414b51c85a85605bc367787
4 2417177 47 24fa886d4e01f5c581ae171ffe5ce1323e3201b0
... ... ...
1063955 1958912 355 de0c5fb7ee479c8b7a174f517349fcb5edea4602
1063956 1994638 163 300c0845403d9936cb80d1afa898452fd11a606c
1063957 2371059 74 87f0c57ac85a169c425f2d31e70011f9bd0db366
1063958 2522719 155 b2c5114e4de1be96959d0425711b926d350fe3f0
1063959 2349207 18 b829ce393ac5c1e5948c3b72f7f000f9737ca005
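I also want to assign a unique number to each of these hashes. How can I do that?

What you are trying to do is called label encoding. You can do it with scikit-learn. Try this: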
# Import the label encoder
from sklearn import preprocessing
# The LabelEncoder object maps each distinct label to an integer.
label_encoder = preprocessing.LabelEncoder()
# Encode the labels in column 'hash'; identical hashes get the same number.
df2['uniquevalue'] = label_encoder.fit_transform(df2['hash'])
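
If you would rather stay within pandas, the same idea can be sketched with pandas.factorize, and it applies to the one-hot strings from the question as well. This is only a minimal sketch: the toy `onehot` frame and the `as_int` column name are made up for illustration, and `uniquevalue` simply reuses the column name from above.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's `onehot` frame.
onehot = pd.DataFrame({
    "ClaimId": [1, 2, 3, 4],
    "onehot": ["0010", "0100", "0010", "0000"],
})

# pd.factorize numbers distinct values in order of first appearance,
# so identical strings always receive the same code.
codes, uniques = pd.factorize(onehot["onehot"])
onehot["uniquevalue"] = codes

# Alternative: read the bit string itself as a base-2 integer.
# Python ints are arbitrary precision, so long bit strings still convert.
onehot["as_int"] = onehot["onehot"].map(lambda bits: int(bits, 2))

print(onehot)
```

Note that factorize numbers values in order of first appearance, while LabelEncoder numbers them in sorted order; passing sort=True to factorize should give the sorted ordering. The same call can be pointed at the 'hash' column instead of 'onehot'.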