Python: one-hot encoding from a DataFrame string column with multiple values

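I have a dataframe "df1" with 1245 rows and two columns, text (object dtype) and topic (object dtype). The topic column holds the numbers of the labels assigned to each text, separated by spaces. Here is an example: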

        text                                                topic
1207    June 2019: The French Facility for Global Envi...   3 12 7
1208    May 2019: Participants from multi-stakeholder ...   8
1209    2 July 2019: UN Member States have reached agr...   1 7
1210    30 June 2019: The G20 Leaders’ Summit and asso...   7 8 9 11 12 13 14 15 17

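I would like to get a one-hot encoded form like this (with an "S" added in front of the number in each column name):

The "difficulty" here is that my texts are multi-label, so the usual code for simple one-hot encoding does not work in my case.

Any ideas?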

Using only pandas, you can do something like this:

import pandas as pd


data = [['June 2019: The French Facility for Global Envi...', '3 12 7'],
       ['May 2019: Participants from multi-stakeholder ...','8'],
       ['2 July 2019: UN Member States have reached agr...','1 7'],
       ['30 June 2019: The G20 Leaders’ Summit and asso...','7 8 9 11 12 13 14 15 17']]
df = pd.DataFrame(data , columns=['text', 'topic'])

# collecting the distinct topic numbers (as strings), in first-seen order
unique_values = list(dict.fromkeys(' '.join(df['topic']).split()))

# creating new column for each value in unique_values
for number in unique_values:
    df[f'S{number}'] = 0
    
# changing 0 to 1 for every Snumber column where topic contains number
for idx, row in df.iterrows():
    for number in row['topic'].split(' '):
        df.loc[idx, f'S{number}'] = 1
df.drop('topic', axis=1, inplace=True)

Result:


    text                                                S3  S12 S7  S8  S1  S9  S11 S13 S14 S15 S17
0   June 2019: The French Facility for Global Envi...   1   1   1   0   0   0   0   0   0   0   0
1   May 2019: Participants from multi-stakeholder ...   0   0   0   1   0   0   0   0   0   0   0
2   2 July 2019: UN Member States have reached agr...   0   0   1   0   1   0   0   0   0   0   0
3   30 June 2019: The G20 Leaders’ Summit and asso...   0   1   1   1   0   1   1   1   1   1   1
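As a possible shorter alternative to the explicit loops above, pandas has `Series.str.get_dummies`, which splits each string on a separator and builds the indicator columns in one call. The following is only a sketch: the `text` values are shortened placeholders, and sorting the columns numerically is my own choice, not something the question asks for.

```python
import pandas as pd

# Placeholder texts; the topic strings are the ones from the question.
df = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd'],
    'topic': ['3 12 7', '8', '1 7', '7 8 9 11 12 13 14 15 17'],
})

# One indicator column per distinct space-separated token.
dummies = df['topic'].str.get_dummies(sep=' ')

# The column labels are strings, so sort them by numeric value, then add "S".
dummies = dummies[sorted(dummies.columns, key=int)].add_prefix('S')

result = pd.concat([df[['text']], dummies], axis=1)
print(result)
```

This assumes the topic numbers are separated by exactly one space, as in the sample rows.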

Using slightly modified data (for readability reasons...):

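from io import StringIO

import numpy as np
import pandas as pd

# sample data; the topic column holds space-separated topic numbers
s = """id,text,topic
1207,One,1 2 5
1208,Two,3
1209,Three,1 4
1210,Four,1 2 3"""
df = pd.read_csv(StringIO(s))

# turn each topic string into a list of ints
df['topic'] = df['topic'].str.split(" ").apply(lambda x: [int(y) for y in x])

# one row per record, one column per topic number (column 0 stays unused)
b = np.zeros((df['topic'].size, max(max(x) for x in df['topic']) + 1), dtype=int)
for i in df.index:
    b[i, df['topic'][i]] = 1

# rebuild the frame with one S<number> column per topic
idx = {'id': df['id'], 'text': df['text']}
idx.update({f'S{i}': b[:, i] for i in range(1, b.shape[1])})
df = pd.DataFrame(idx)
print(df.set_index('id').to_markdown())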

This gives you:

|   id | text   |   S1 |   S2 |   S3 |   S4 |   S5 |
|-----:|:-------|-----:|-----:|-----:|-----:|-----:|
| 1207 | One    |    1 |    1 |    0 |    0 |    1 |
| 1208 | Two    |    0 |    0 |    1 |    0 |    0 |
| 1209 | Three  |    1 |    0 |    0 |    1 |    0 |
| 1210 | Four   |    1 |    1 |    1 |    0 |    0 |
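The same table can probably also be produced without NumPy or an explicit loop, by exploding the topic lists into one row per (id, topic) pair and cross-tabulating. This is a sketch on the sample data above, not the answerer's method; the names `long`, `onehot` and `result` are my own.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1207, 1208, 1209, 1210],
    'text': ['One', 'Two', 'Three', 'Four'],
    'topic': ['1 2 5', '3', '1 4', '1 2 3'],
})

# One row per (id, topic) pair.
long = df.assign(topic=df['topic'].str.split()).explode('topic')
long['topic'] = long['topic'].astype(int)

# Count the pairs, cap at 1 in case a topic is listed twice, prefix with "S".
onehot = pd.crosstab(long['id'], long['topic']).clip(upper=1).add_prefix('S')

# Attach the indicator columns back to the id and text columns.
result = df[['id', 'text']].join(onehot, on='id')
print(result)
```

Unlike the NumPy version, this only creates columns for topic numbers that actually occur in the data, and a row with no topics at all would end up with missing values after the join.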

Are you expecting an answer in Python (the question's tag)? The code you tried is indeed from an R library, and I only use Python. Now I understand why that code did not work...