Python: How do I cluster a DataFrame according to a given dictionary?
I have a pandas DataFrame that I am trying to cluster according to the dict below. For example, the dictionary lists 7 values for the "Info 1" cluster, but only 4 of them appear in my DataFrame. Clustering on that basis should give the output below.
Input:
PII Counts
CREDIT_CARD 158
DATE_TIME 544
DOMAIN_NAME 609
EMAIL_ADDRESS 90
IP_ADDRESS 405
LOCATION 346
PERSON 202
BANK_NUMBER 202
PASSPORT 130
NHS 6
NRP 20
dict = {'Info 1': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN',
'DRIVER_LICENSE'],
'Info 2': ['NHS'],
'Info 3': ['IP_ADDRESS', 'DOMAIN_NAME'],
'Info 4': ['CRYPTO', 'DATE_TIME', 'NRP'],
'Info 5': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'CODE']}
Output:
Names Count Info
0 Info 5 [158, 202] ['CREDIT_CARD','BANK_NUMBER']
1 Info 2 [6] ['NHS']
2 Info 3 [405, 609] ['IP_ADDRESS','DOMAIN_NAME']
3 Info 4 [20, 544] ['NRP','DATE_TIME']
4 Info 1 [202, 346, 90, 130] ['PERSON','LOCATION','EMAIL_ADDRESS','PASSPORT']
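First, don't name a variable dict, because that shadows Python's built-in dict type.
Then flatten the dict of lists into a flat dict with keys and values swapped, so that each PII label points to its cluster name. Map the PII column through that dict, group by the mapped values, and aggregate each group with list: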
d = {'Info 1': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS', 'PASSPORT', 'SSN',
'DRIVER_LICENSE'],
'Info 2': ['NHS'],
'Info 3': ['IP_ADDRESS', 'DOMAIN_NAME'],
'Info 4': ['CRYPTO', 'DATE_TIME', 'NRP'],
'Info 5': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'CODE']}
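# Invert the mapping: each PII label becomes a key pointing to its cluster name.
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
# Group by the mapped cluster name; sort=False keeps groups in order of first appearance.
df1 = (df.groupby(df['PII'].map(d1).rename('Names'), sort=False)
         .agg(list)
         .reset_index())
print(df1)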
Names PII Counts
0 Info 5 [CREDIT_CARD, BANK_NUMBER] [158, 202]
1 Info 4 [DATE_TIME, NRP] [544, 20]
2 Info 3 [DOMAIN_NAME, IP_ADDRESS] [609, 405]
3 Info 1 [EMAIL_ADDRESS, LOCATION, PERSON, PASSPORT] [90, 346, 202, 130]
4 Info 2 [NHS] [6]
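The answer's snippet assumes df already exists. Below is a minimal self-contained sketch that rebuilds the input from the question and applies the same invert, map, and group steps. The names label_to_cluster and result are illustrative and not from the original answer, and the use of named aggregation to produce the Count and Info column names from the requested output is an assumption about what the asker wants, not part of the original solution.

```python
import pandas as pd

# Input data copied from the question.
df = pd.DataFrame({
    "PII": ["CREDIT_CARD", "DATE_TIME", "DOMAIN_NAME", "EMAIL_ADDRESS",
            "IP_ADDRESS", "LOCATION", "PERSON", "BANK_NUMBER",
            "PASSPORT", "NHS", "NRP"],
    "Counts": [158, 544, 609, 90, 405, 346, 202, 202, 130, 6, 20],
})

d = {
    "Info 1": ["PERSON", "LOCATION", "PHONE_NUMBER", "EMAIL_ADDRESS",
               "PASSPORT", "SSN", "DRIVER_LICENSE"],
    "Info 2": ["NHS"],
    "Info 3": ["IP_ADDRESS", "DOMAIN_NAME"],
    "Info 4": ["CRYPTO", "DATE_TIME", "NRP"],
    "Info 5": ["CREDIT_CARD", "BANK_NUMBER", "ITIN", "CODE"],
}

# Invert the dict of lists: PII label -> cluster name (illustrative name).
label_to_cluster = {
    label: cluster for cluster, labels in d.items() for label in labels
}

# Map each row to its cluster, group on that, and collect values into lists.
# Named aggregation is used here only to get the Count/Info column names.
result = (
    df.groupby(df["PII"].map(label_to_cluster).rename("Names"), sort=False)
      .agg(Count=("Counts", list), Info=("PII", list))
      .reset_index()
)
print(result)
```

Dictionary labels that never occur in the DataFrame (such as PHONE_NUMBER or SSN) simply produce no rows. Conversely, a PII value missing from the dictionary would map to NaN, and groupby drops NaN keys by default, so such a row would silently disappear from the result.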