Python: Dynamically creating columns in pandas using keyword extraction

python pandas

I have a pandas DataFrame that looks like this:

Col1,Col2,Col3
1,"this is a text","more text"
2,"this is another text","even more"
3,"here is one more","something also here"
4,"let's get another one","we are close"
5,"one last text","finally"

Then I apply named entity recognition to these texts and extract some important keywords, like this:

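# nlp is not defined in the question; it is presumably an already loaded
# spaCy pipeline, e.g. nlp = spacy.load("en_core_web_sm") (model name assumed).
def get_entities(ocr, title):
    # Run NER once over both text fields joined into a single string
    doc = nlp(' '.join([ocr, title]))
    entities = []
    for ent in doc.ents:
        # Keyword format is "<label>_<text>"
        entity = '_'.join([ent.label_, ent.text])
        entities.append(entity)
    return set(entities)

df['entities'] = df.apply(lambda row: get_entities(row.Col2, row.Col3), axis=1)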

The above creates a new column named entities, whose row values are lists of distinct keywords. Let's say:

Col1,Col3
1,['key1', 'key2']
2,['key3', 'key2']
3,['key4', 'key1']
4,['key3', 'key4']
5,['key5', 'key2']

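Now I am trying to apply get_dummies to that column and create one column for every possible keyword, with 0/1 row values. For the example above, that would be: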

Col1,Col3,key1,key2,key3,key4,key5
1,['key1', 'key2'],1,1,0,0,0
2,['key3', 'key2'],0,1,1,0,0
3,['key4', 'key1'],1,0,0,1,0
4,['key3', 'key4'],0,0,1,1,0
5,['key5', 'key2'],0,1,0,0,1

Of course, applying get_dummies directly to the list column does not work:

df = pd.concat([df,pd.get_dummies(df['entities'], prefix='entities')],axis=1)

Any ideas would be much appreciated.

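The simplest solution is to change what your function returns. str.get_dummies can handle delimiter-separated strings, and returning such strings from your get_entities function is straightforward.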
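As a rough illustration of that idea, the sketch below joins each row's keywords into one comma-separated string and then splits them into indicator columns. The helper name join_keywords and the sample data are assumptions made for this example; in the question's get_entities, the equivalent change would be to return the joined string instead of the set. It also assumes that no keyword contains the separator itself, which may not hold for real entity text.

```python
import pandas as pd


def join_keywords(keywords, sep=","):
    """Hypothetical helper: turn a collection of keywords into one delimited string."""
    # sorted() only makes the string deterministic; get_dummies does not need it.
    return sep.join(sorted(keywords))


# Sample data standing in for the output of the entity extraction step.
df = pd.DataFrame({
    "Col1": [1, 2, 3, 4, 5],
    "Col3": [{"key1", "key2"}, {"key3", "key2"}, {"key4", "key1"},
             {"key3", "key4"}, {"key5", "key2"}],
})

# Store strings such as "key1,key2" instead of sets.
df["Col3"] = df["Col3"].apply(join_keywords)

# Series.str.get_dummies splits each string on the separator
# and builds one 0/1 column per distinct token.
dummies = df["Col3"].str.get_dummies(",")
print(dummies)
```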

Now you can use get_dummies directly on the result. Taking your second example frame, you would get:

df['Col3'].str.get_dummies(',')

If you do not want to change what the function returns, add a str.join step before calling get_dummies.
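A minimal sketch of that extra step, assuming the column holds lists of keyword strings as in the question's example table (the sample data and variable names are made up for illustration):

```python
import pandas as pd

# Sample data mirroring the question's example table.
df = pd.DataFrame({
    "Col1": [1, 2, 3, 4, 5],
    "Col3": [["key1", "key2"], ["key3", "key2"], ["key4", "key1"],
             ["key3", "key4"], ["key5", "key2"]],
})

# Series.str.join concatenates the items of each list with the separator,
# leaving the original column untouched.
joined = df["Col3"].str.join(",")

# Split the joined strings into 0/1 indicator columns and attach them to df.
result = df.join(joined.str.get_dummies(","))
print(result)
```

The original list column stays in place, so nothing downstream that relies on it has to change.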

Try explode, str.get_dummies and join:

df.join(df.Col3.explode().str.get_dummies().groupby(level=0).max())

Out[206]:
   Col1          Col3  key1  key2  key3  key4  key5
0     1  [key1, key2]     1     1     0     0     0
1     2  [key3, key2]     0     1     1     0     0
2     3  [key4, key1]     1     0     0     1     0
3     4  [key3, key4]     0     0     1     1     0
4     5  [key5, key2]     0     1     0     0     1
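To make the chain easier to follow, here is the same approach unpacked into separate steps. It is only a sketch on made-up sample data; the collapse step is written as groupby(level=0).max(), which, as far as I know, is the spelling that recent pandas releases accept now that the level argument of max has been dropped.

```python
import pandas as pd

# Sample data mirroring the question's example table.
df = pd.DataFrame({
    "Col1": [1, 2, 3, 4, 5],
    "Col3": [["key1", "key2"], ["key3", "key2"], ["key4", "key1"],
             ["key3", "key4"], ["key5", "key2"]],
})

# explode() emits one row per list item and repeats the original index label.
exploded = df["Col3"].explode()

# Each exploded row holds a single keyword, so it gets exactly one 1.
per_item = exploded.str.get_dummies()

# Taking the max within each original index label merges those rows back together.
dummies = per_item.groupby(level=0).max()

result = df.join(dummies)
print(result)
```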

If df.Col3 is a Series of sets, you need agg or str.join before get_dummies, and then join the result back to df:

df.join(df.Col3.agg('|'.join).str.get_dummies())

Out[224]:
   Col1          Col3  key1  key2  key3  key4  key5
0     1  {key1, key2}     1     1     0     0     0
1     2  {key2, key3}     0     1     1     0     0
2     3  {key1, key4}     1     0     0     1     0
3     4  {key4, key3}     0     0     1     1     0
4     5  {key5, key2}     0     1     0     0     1
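For the set case, a step-by-step variant might look like the following. It uses apply with an explicit join rather than agg, purely as a conservative choice for this sketch; the sample data is made up, and sorted() is optional and only there to make the joined strings deterministic.

```python
import pandas as pd

# Sample data: the column holds sets, as returned by the question's function.
df = pd.DataFrame({
    "Col1": [1, 2, 3, 4, 5],
    "Col3": [{"key1", "key2"}, {"key3", "key2"}, {"key4", "key1"},
             {"key3", "key4"}, {"key5", "key2"}],
})

# Join each set into a "|"-separated string; "|" is the default separator
# of Series.str.get_dummies, so no argument is needed below.
joined = df["Col3"].apply(lambda keys: "|".join(sorted(keys)))

# Build the indicator columns and attach them to the original frame.
result = df.join(joined.str.get_dummies())
print(result)
```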

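Unfortunately, his function returns a set, and sets cannot be exploded. You would need to apply(list) first, which would be very slow.
@user3483203: Oh, my answer was based on the example table with the Col1 and Col3 columns, where the values are lists of key1, key2. Hmm... I need to take another look at the question.
Your function returns a set, but in the description and the example output you say they are lists. Please clarify whether they are lists or sets, and update your question to reflect this.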
Output of df['Col3'].str.get_dummies(','):

   key1  key2  key3  key4  key5
0     1     1     0     0     0
1     0     1     1     0     0
2     1     0     0     1     0
3     0     0     1     1     0
4     0     1     0     0     1
