Python: one-hot encoding from a DataFrame string column with multiple values
I have a DataFrame "df1" with 1245 rows and two columns, text (object dtype) and topic (object dtype). The topic column contains the different numbers corresponding to the labels of the text. Here is an example:
text topic
1207 June 2019: The French Facility for Global Envi... 3 12 7
1208 May 2019: Participants from multi-stakeholder ... 8
1209 2 July 2019: UN Member States have reached agr... 1 7
1210 30 June 2019: The G20 Leaders’ Summit and asso... 7 8 9 11 12 13 14 15 17
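I would like to obtain a one-hot encoded form of this data, with an "S" added before the number in each column name.
The "difficulty" here is that my text is multi-label, so the code for simple one-hot encoding does not apply to my case.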
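To illustrate why plain one-hot encoding falls short here, the following is a minimal sketch on three topic strings taken from the example above. The variable name `naive` is only illustrative. `pd.get_dummies` treats each whole string as one category, so a multi-label cell such as "3 12 7" becomes a single column instead of three.

```python
import pandas as pd

# Three topic strings in the same format as the example above.
df = pd.DataFrame({'topic': ['3 12 7', '8', '1 7']})

# Plain one-hot encoding: every distinct string is its own category.
naive = pd.get_dummies(df['topic'], prefix='S', prefix_sep='')

print(list(naive.columns))  # ['S1 7', 'S3 12 7', 'S8']
```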
Any ideas?

Using only pandas, you can do something like the following:
import pandas as pd

data = [['June 2019: The French Facility for Global Envi...', '3 12 7'],
        ['May 2019: Participants from multi-stakeholder ...', '8'],
        ['2 July 2019: UN Member States have reached agr...', '1 7'],
        ['30 June 2019: The G20 Leaders’ Summit and asso...', '7 8 9 11 12 13 14 15 17']]
df = pd.DataFrame(data, columns=['text', 'topic'])

# list of strings where each value is one number from the topic column
# (numbers may repeat; re-creating an existing column below is harmless)
unique_values = ' '.join(df['topic'].values.tolist()).split(' ')

# create a new column, initialised to 0, for each value in unique_values
for number in unique_values:
    df[f'S{number}'] = 0

# change 0 to 1 in every S<number> column whose number appears in topic
for idx, row in df.iterrows():
    for number in row['topic'].split(' '):
        df.loc[idx, f'S{number}'] = 1

df.drop('topic', axis=1, inplace=True)
Result:
                                                text  S3  S12  S7  S8  S1  S9  S11  S13  S14  S15  S17
0  June 2019: The French Facility for Global Envi...   1    1   1   0   0   0    0    0    0    0    0
1  May 2019: Participants from multi-stakeholder ...   0    0   0   1   0   0    0    0    0    0    0
2  2 July 2019: UN Member States have reached agr...   0    0   1   0   1   0    0    0    0    0    0
3  30 June 2019: The G20 Leaders’ Summit and asso...   0    1   1   1   0   1    1    1    1    1    1
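As a possible alternative to the explicit loops, pandas has `Series.str.get_dummies`, which splits each string on a separator and builds the indicator columns in one step. The sketch below assumes the topic values are separated by single spaces, as in the example. The short placeholder texts and the name `encoded` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd'],  # placeholder texts, not the original data
    'topic': ['3 12 7', '8', '1 7', '7 8 9 11 12 13 14 15 17'],
})

# Split each topic string on spaces; one 0/1 indicator column per label.
dummies = df['topic'].str.get_dummies(sep=' ')

# The labels come back sorted as strings ('1', '11', '12', ...), so
# reorder them numerically before adding the "S" prefix.
dummies = dummies[sorted(dummies.columns, key=int)].add_prefix('S')

# Replace the topic column with the indicator columns.
encoded = pd.concat([df.drop(columns='topic'), dummies], axis=1)
print(encoded)
```

Note that the columns come out in numeric order here (S1, S3, S7, ...), whereas the loop version keeps the order of first appearance.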
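Using slightly modified data (for readability reasons...):

from io import StringIO
import numpy as np
import pandas as pd

s = """id,text,topic
1207,One,1 2 5
1208,Two,3
1209,Three,1 4
1210,Four,1 2 3"""

df = pd.read_csv(StringIO(s))

# turn each topic string into a list of integers
df.topic = df.topic.str.split(" ").apply(lambda x: [int(y) for y in x])

# one row per record, one column per possible topic number (0..max)
b = np.zeros((df.topic.size, max(max(x) for x in df.topic) + 1))
for i in df.index:
    b[i, df.topic[i]] = 1

# build the result, skipping the unused column 0
idx = {'id': df.id, 'text': df.text}
idx.update({f'S{i}': b[:, i] for i in range(1, b.shape[1])})
df = pd.DataFrame(idx)

# to_markdown() requires the optional tabulate package
print(df.set_index('id').to_markdown())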
This gives you:
| id | text | S1 | S2 | S3 | S4 | S5 |
|-----:|:-------|-----:|-----:|-----:|-----:|-----:|
| 1207 | One | 1 | 1 | 0 | 0 | 1 |
| 1208 | Two | 0 | 0 | 1 | 0 | 0 |
| 1209 | Three | 1 | 0 | 0 | 1 | 0 |
| 1210 | Four | 1 | 1 | 1 | 0 | 0 |
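The same table can probably also be built without NumPy. The sketch below is one way to do it, assuming space-separated topics as in the data above. It uses `Series.explode` to get one row per label, then `pd.get_dummies` and a group-wise maximum to collapse back to one row per record. The names `labels`, `onehot` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1207, 1208, 1209, 1210],
    'text': ['One', 'Two', 'Three', 'Four'],
    'topic': ['1 2 5', '3', '1 4', '1 2 3'],
})

# One row per (record, label) pair; the original row index is repeated.
labels = df['topic'].str.split(' ').explode().astype(int)

# Indicator columns per label, then collapse back to one row per record.
onehot = (
    pd.get_dummies(labels, prefix='S', prefix_sep='', dtype=int)
    .groupby(level=0)
    .max()
)

result = df.drop(columns='topic').join(onehot)
print(result.set_index('id'))
```

Unlike the zero-matrix approach, this only creates columns for labels that actually occur in the data, which may or may not be what you want.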
Are you expecting an answer in Python (the question's tag)? The code you tried is based on an R library, and I only use Python. Now I know why this code was not working...