Python: retrieve words from a column that match words in a list

python pandas

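I have a column that contains text. This text can contain the names of countries. I would like to list all the countries mentioned in a column on the same row as the text. I already have a series of the countries I want to extract.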

    SomeText                          | ... | .... | CountryInText
    Something Canada                  |     |      |   
    RUSSIAAreACountry                 |     |      |   
    Mexicoand Brazil is South of USA  |     |      |

Desired output:

    SomeText                          | ... | .... | CountryInText
    Something Canada                  |     |      |  Canada 
    RUSSIAAreACountry                 |     |      |  Russia
    Mexicoand Brazil is South of USA  |     |      |  Mexico, Brazil, USA

I tried:

pd.Series(df['SomeText'].str.findall(f"({'|'.join(countryname['CommonName'])})"))

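However, this gives me a list of objects that I cannot match back to the original DataFrame. countryname['CommonName'] is a series of country names.

Can anyone help me?

Thanks in advance.

A solution using the re package (with a small test example), for more flexibility: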

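import pandas as pd
import re

df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry"]})
# countryname['CommonName'] is the list of country names to look for
countryname = pd.Series({"CommonName": ["Canada", "Russia"]})
# title-case the text, then collect every case-insensitive match per row
df["CountryInText"] = df["SomeText"].str.title().map(lambda x:
    re.findall("|".join(countryname['CommonName']), x, re.I))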

Update (based on Erfan's feedback in the comments):

import pandas as pd
import re

df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry"]})
countryname = pd.Series({"CommonName": ["Canada", "Russia"]})
# same idea, using the native pandas string method instead of map + re.findall
df["CountryInText"] = df["SomeText"].str.title().str.findall("|".join(countryname['CommonName']), flags=re.I)
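On the question's point that the findall result could not be matched back to the DataFrame: the Series returned by Series.str.findall keeps the index of df, so it can be assigned directly as a column. A minimal sketch, assuming a plain list common_names stands in for countryname['CommonName'] and that a comma-separated string is wanted rather than a list. The re.escape call is an added precaution that is not in the answer above.

```python
import re
import pandas as pd

df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry"]})
common_names = ["Canada", "Russia"]  # hypothetical stand-in for countryname['CommonName']

# Escape each name so characters such as "." are matched literally
pattern = "|".join(re.escape(name) for name in common_names)

# One list of matches per row, carrying the same index as df
matches = df["SomeText"].str.title().str.findall(pattern, flags=re.I)

# Turn each list into a single comma-separated string
df["CountryInText"] = matches.str.join(", ")
print(df)
```

Because matches shares its index with df, no merge or join step is needed before the assignment.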

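Update 2 (based on the helpful additional test case posted by the OP):

The approaches above would return Usa instead of USA. The approach below fixes that: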

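import pandas as pd

df = pd.DataFrame({"SomeText": ["Something Canada",
                                "RUSSIAAreACountry",
                                "Mexicoand Brazil is South of USA"]})
countryname = pd.Series({"CommonName": ["Canada", "Russia", "Mexico", "Brazil", "USA"]})
# keep each name, spelled as in the list, whose lowercase form occurs in the lowercased text
df["CountryInText"] = df["SomeText"].map(lambda x: [c for c in countryname['CommonName']
                                                    if c.lower() in x.lower()])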
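The substring test above reports countries in the order of the country list. As an alternative sketch (not part of the original answer), a case-insensitive regex can find the names in the order they appear in the text, and a lookup table can restore the spelling used in the list. The names common_names, canonical, and countries_in are hypothetical.

```python
import re
import pandas as pd

common_names = ["Canada", "Russia", "Mexico", "Brazil", "USA"]  # assumed list

# Map the lowercase form back to the spelling used in the list
canonical = {name.lower(): name for name in common_names}

# Try longer names first so a longer name wins over a shorter one at the same position
ordered = sorted(common_names, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(n) for n in ordered), flags=re.I)

def countries_in(text):
    """Return the listed countries found in text, in order of appearance."""
    found = []
    for hit in pattern.findall(text):
        name = canonical[hit.lower()]
        if name not in found:  # drop repeated mentions
            found.append(name)
    return ", ".join(found)

df = pd.DataFrame({"SomeText": ["Something Canada",
                                "RUSSIAAreACountry",
                                "Mexicoand Brazil is South of USA"]})
df["CountryInText"] = df["SomeText"].map(countries_in)
print(df)
```

Like the substring test, this still matches names embedded in longer words, which the RUSSIAAreACountry row requires.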

A bit too late and a bit of a duplicate, but I wrote the code, so I might as well post it :)

This will give you:

countryname:

            Name CommonName
0  Rep. of Congo      Congo
1    Russia Long     Russia
2    Canada Long     Canada

df:

names:

Congo|Russia|Canada
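The setup code that produced these objects does not appear in this copy of the answer. A plausible reconstruction, assuming countryname is a DataFrame with the two columns printed above and df holds the four SomeText rows that appear in the result table further down:

```python
import re
import pandas as pd

# Assumed reconstruction of the inputs; the values are taken from the printed tables
countryname = pd.DataFrame({
    "Name": ["Rep. of Congo", "Russia Long", "Canada Long"],
    "CommonName": ["Congo", "Russia", "Canada"],
})
df = pd.DataFrame({"SomeText": ["Something Canada", "RUSSIAAreACountry",
                                "Rep ofIreland", "Unrelated"]})

# Alternation pattern of all common names, used by findall in the next step
names = "|".join(countryname["CommonName"])
print(names)
```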

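Then, using findall and a simple function, you can find all occurrences of the common names in the string. If any are found, the first one is picked and converted to title case; if none are found, an empty string is returned. This approach ignores the all-caps case and changes everything to title case. After writing this answer I also saw the addition to the question with several names in one row, so it is wrong for that case as well.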

# re.I is there to do case insensitive matching
df["CountryInText"] = df["SomeText"].str.findall(names, flags = re.I)
def cleanup(country_list):
    if len(country_list) > 0:
        return str(country_list[0])
    return ""
df["CountryInText"] = df["CountryInText"].apply(cleanup).apply(str.title)

Now df:

            SomeText CountryInText
0   Something Canada        Canada
1  RUSSIAAreACountry        Russia
2      Rep ofIreland              
3          Unrelated

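Is this what you are looking for?

Why use findall? What happens if there are two country names in SomeText?

It looks like what you actually want may differ from how you worded it. Based on your example, it seems you want the rightmost column of a row to contain all the countries that appear in the leftmost column of that row. Is that right?

@accumulation Yes, that is correct, sorry. I am updating the question now.

Better to use the native pandas method: Series.str.findall

Yes. Besides that, as a general comment, it makes no sense to use re.findall when we have a native pandas method. @QuangHoang

Why .title() and .lower()?

@AMC title() is used to return the country name with only its first letter capitalized (e.g. the OP's example RUSSIA -> Russia). But that approach does not correctly handle the scenario added later (USA -> Usa). The last approach handles that test case as well; it uses lower() to make the matching case-insensitive.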
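The difference described in the last comment can be checked directly on plain strings. A small sketch using only built-in str methods, with the sample text taken from the question:

```python
text = "Mexicoand Brazil is South of USA"

# title() capitalizes the first letter of each word and lowercases the rest,
# which turns RUSSIA into Russia but also USA into Usa
russia_titled = "RUSSIA".title()
usa_titled = "USA".title()
print(russia_titled, usa_titled)

# Lowercasing both sides only affects the comparison; the name from the
# list keeps its original spelling when it is returned
name = "USA"
found = name.lower() in text.lower()
print(name if found else "")
```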
