Python: an efficient way to cross-reference multiple columns using a list of strings


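I need to search a text column of a dataframe for country names or capital names, then save the hits in a new column. My current solution works, but it takes a long time. I would like to know whether it can be made more efficient, ideally in a vectorized way.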

The list of countries and capitals is stored in a separate countries dataframe.

My main dataframe, df:

    date             text
0   2016-01-01       Bla bla bla bla
1   2016-01-01       Blu blu Nigeria
2   2016-01-01       Hey ho Norway
3   2016-01-01       This is text Paris
4   2016-01-01       Lorem lorem ipsum

The countries dataframe:

    name             capital
0   France           Paris
1   Germany          Berlin
2   Norway           Oslo
3   China            Beijing

My current solution:

def extract_countries(row):
    # Collect every country whose name or capital appears in the row's text.
    matches = []
    for country, capital in countries[['name', 'capital']].values:
        if country in row.text or capital in row.text:
            matches.append(country)
    return ', '.join(matches)

df['countries'] = df.apply(extract_countries, axis=1)

Expected result:

    date             text                      countries
0   2016-01-01       Bla bla bla bla           NaN
1   2016-01-01       Blu blu Nigeria           Nigeria
2   2016-01-01       Hey ho Norway             Norway
3   2016-01-01       This is text Paris        France
4   2016-01-01       Lorem lorem ipsum         NaN
5   2016-01-01       Germany attacked Benin    Germany, Benin

Here is one way. Note that NaN ("Not a Number") is not appropriate for string columns, so I have left an empty string wherever no match is found.

import pandas as pd

df = pd.DataFrame([['2016-01-01', 'Bla bla bla bla'], ['2016-01-01', 'Blu blu Nigeria'],
                   ['2016-01-01', 'Hey ho Norway'], ['2016-01-01', 'This is text Paris'],
                   ['2016-01-01', 'Lorem lorem ipsum']], columns=['date', 'text'])

countries = pd.DataFrame([['France', 'Paris'], ['Germany', 'Berlin'], ['Norway', 'Oslo'],
                          ['China', 'Beijing']], columns=['name', 'capital'])

ctry_set = set(countries.name)
cap_set = set(countries.capital)

df['countries'] = df['text'].apply(lambda x: ', '.join(i for i in ctry_set if i in x))
df['capitals'] = df['text'].apply(lambda x: ', '.join(i for i in cap_set if i in x))

#          date                text countries capitals
# 0  2016-01-01     Bla bla bla bla                   
# 1  2016-01-01     Blu blu Nigeria                   
# 2  2016-01-01       Hey ho Norway    Norway         
# 3  2016-01-01  This is text Paris              Paris
# 4  2016-01-01   Lorem lorem ipsum
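Since the question asks for something closer to a vectorized solution, one alternative that is not part of the answer above is to build a single regular expression from the name list and let pandas apply it with `str.findall`. This is a minimal sketch, assuming whole-word matching is acceptable (the substring test above would also match inside longer words); `build_pattern` is a hypothetical helper introduced here for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'date': ['2016-01-01'] * 5,
    'text': ['Bla bla bla bla', 'Blu blu Nigeria', 'Hey ho Norway',
             'This is text Paris', 'Lorem lorem ipsum'],
})
countries = pd.DataFrame({
    'name': ['France', 'Germany', 'Norway', 'China'],
    'capital': ['Paris', 'Berlin', 'Oslo', 'Beijing'],
})


def build_pattern(names):
    """Hypothetical helper: one alternation pattern covering all names."""
    # Escape each name and try longer names first, so a longer name is
    # preferred over a shorter name that happens to be its prefix.
    escaped = [re.escape(n) for n in sorted(names, key=len, reverse=True)]
    return r'\b(?:' + '|'.join(escaped) + r')\b'


# findall returns a list of matches per row; str.join turns it into a string.
df['countries'] = df['text'].str.findall(build_pattern(countries['name'])).str.join(', ')
df['capitals'] = df['text'].str.findall(build_pattern(countries['capital'])).str.join(', ')
```

Unlike the set-based version, matches appear in the order they occur in the text, and a name that occurs twice in one row is listed twice. Whether this is faster in practice depends on the data, so it is worth timing both on a real sample.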

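Please add some sample data along with the result you want.

Added more context, thanks.

Works great! In the end I only needed to create a mapping from capitals to countries and combine the countries and capitals columns.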
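The last step mentioned in the comments, mapping capitals back to their countries and merging both kinds of hits into one column, could be sketched roughly as follows. The names `capital_to_country` and `find_countries` are assumptions made for this illustration and do not come from the thread; lists are used instead of sets so that the output order is deterministic.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2016-01-01'] * 5,
    'text': ['Bla bla bla bla', 'Blu blu Nigeria', 'Hey ho Norway',
             'This is text Paris', 'Lorem lorem ipsum'],
})
countries = pd.DataFrame({
    'name': ['France', 'Germany', 'Norway', 'China'],
    'capital': ['Paris', 'Berlin', 'Oslo', 'Beijing'],
})

names = list(countries['name'])
capitals = list(countries['capital'])
# Hypothetical lookup table: capital -> country it belongs to.
capital_to_country = dict(zip(capitals, names))


def find_countries(text):
    """Return the countries whose name or capital occurs in the text."""
    hits = [n for n in names if n in text]
    hits += [capital_to_country[c] for c in capitals if c in text]
    # dict.fromkeys drops duplicates while keeping first-seen order,
    # e.g. when a text mentions both 'France' and 'Paris'.
    return ', '.join(dict.fromkeys(hits))


df['countries'] = df['text'].apply(find_countries)
```

With the four-row sample list, "Nigeria" is not matched because it is not in the countries dataframe; the expected result in the question presumably relies on a longer list.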