Searching for keywords in pandas DataFrame cells

python python-3.x pandas numpy

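I currently have a DataFrame with a column that contains some words or characters, and I am trying to categorize each row by searching for keywords in the corresponding cell.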

Example:

  words             |   category
-----------------------------------
im a test email     |  email
here is my handout  |  handout

Here is what I have so far:

import numpy as np

conditions = [
    df['words'].str.contains('flyer', case=False, regex=True),
    df['words'].str.contains('report', case=False, regex=True),
    df['words'].str.contains('form', case=False, regex=True),
    df['words'].str.contains('scotia', case=False, regex=True),
    df['words'].str.contains('news', case=False, regex=True),
    # Raw string so that \. is passed to the regex engine unchanged
    df['words'].str.contains(r'questions.*\.pdf', case=False, regex=True),
    # ... (more conditions)
]
choices = [
    'open house flyer',
    'report',
    'form',
    'report',
    'news',
    'question',
    # ... (more choices, in the same order as the conditions)
]
# np.select uses the first condition that is True for each row
df['category'] = np.select(conditions, choices, default='others')

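This works well, but the problem is that I have a lot of keywords (probably more than 120 or so), so maintaining this keyword list is very difficult. Is there a better way? By the way, I am using Python 3.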

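Note: I am looking for an easier way to manage a large keyword list, which is a different question from how to simply find a keyword.

You can build the conditions list dynamically. If you have a list of keywords, say keywords, you can loop over it with a for loop and append a condition such as (df['words'].str.contains(keywords[iter], False, regex=True)) to the conditions list.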
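The loop described above could be sketched roughly as follows. This is only an illustration: the patterns dictionary and the sample rows are made up for the example, and keeping each pattern next to its category in one dict is a design choice of this sketch, not something stated in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical pattern -> category mapping. Insertion order sets priority,
# because np.select uses the first condition that is True for each row.
patterns = {
    "flyer": "open house flyer",
    "report": "report",
    r"questions.*\.pdf": "question",
}

# Made-up sample data for the illustration.
df = pd.DataFrame(
    {"words": ["Open House FLYER", "annual report", "questions_v2.pdf", "hello"]}
)

# One boolean Series per pattern, built in the same order as the dict.
conditions = [
    df["words"].str.contains(pattern, case=False, regex=True)
    for pattern in patterns
]
choices = list(patterns.values())

df["category"] = np.select(conditions, choices, default="others")
print(df)
```

Because each keyword sits beside its category, adding or removing an entry cannot leave the two lists out of step, which seems to be the main maintenance problem with 120 or more keywords.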
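You can join all the keywords into a single pattern and use str.findall, then map the result to a dictionary of keywords and categories. Note that the mapping only succeeds when a row matches exactly one keyword, because the joined string of several matches is not a key in the dictionary: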
import pandas as pd

df = pd.DataFrame({"words":["im a test email",
                            "here is my handout",
                            "This is a flyer"]})

choices = {"flyer":"open house flyer",
           "email":"email from someone",
           "handout":"some handout"}

df["category"] = df["words"].str.findall("|".join(choices.keys())).str.join(",").map(choices)

print(df)

# Output:
                words            category
0     im a test email  email from someone
1  here is my handout        some handout
2     This is a flyer    open house flyer
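If rows may contain several keywords, none at all, or different capitalization, one possible variation is to take only the first match with str.extract and fall back to a default. This is a sketch under assumptions: the sample rows are invented, the "others" default is borrowed from the question, and the dictionary keys are assumed to be lowercase.

```python
import re
import pandas as pd

# Keys are assumed to be lowercase so the lowercased match can be looked up.
choices = {
    "flyer": "open house flyer",
    "email": "email from someone",
    "handout": "some handout",
}

# Made-up sample data: mixed case, two keywords in one row, and no keyword.
df = pd.DataFrame(
    {"words": ["im a test EMAIL", "flyer and handout", "nothing relevant"]}
)

# A single capture group containing every keyword, escaped to match literally.
pattern = "(" + "|".join(re.escape(key) for key in choices) + ")"

# str.extract returns the leftmost match in each cell, or NaN if none is found.
first_match = df["words"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

df["category"] = first_match.str.lower().map(choices).fillna("others")
print(df)
```

Here "first" means the leftmost keyword in the text, not the first key in the dictionary, so this does not reproduce the priority order that np.select gives.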

You can use flashtext:
import pandas as pd
from flashtext import KeywordProcessor

# Category -> list of keywords that belong to it
keyword_dict = {
    'programming': ['python', 'pandas', 'java', 'java_football'],
    'sport': ['cricket', 'football', 'baseball'],
}

kp = KeywordProcessor()
kp.add_keywords_from_dict(keyword_dict)
df = pd.DataFrame(['i love working in python',
                   'pandas is very popular library',
                   'i love playing football'],
                  columns=['text'])

# span_info=True also returns the start and end position of each match
df['category'] = df['text'].apply(lambda x: kp.extract_keywords(x, span_info=True))


That leaves the problem of words like "todayIgotAemailReport". The wordninja package may help to split this kind of unknown concatenated word:
import wordninja
' '.join(wordninja.split('todayIgotAemailReport'))

# This breaks the string into its component words, which makes searching easier.
# Output:
'today I got A email Report' 
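For comparison, a plain-regex version that keeps the same category-to-keywords dictionary shape might look like the sketch below. It is not part of either answer: the dictionary contents, the sample rows, and the helper name categories are all hypothetical. Because it matches substrings rather than whole words, it also finds keywords embedded in a longer string.

```python
import re
import pandas as pd

# Same shape as keyword_dict above; the contents here are hypothetical.
keyword_dict = {
    "email": ["email", "e-mail"],
    "report": ["report"],
}

# Invert to keyword -> category so a match can be looked up directly.
keyword_to_category = {
    keyword.lower(): category
    for category, keywords in keyword_dict.items()
    for keyword in keywords
}

# Try longer keywords first so a short keyword cannot shadow a longer one.
alternation = "|".join(
    re.escape(keyword)
    for keyword in sorted(keyword_to_category, key=len, reverse=True)
)
pattern = re.compile(alternation, flags=re.IGNORECASE)


def categories(text):
    """Return the distinct categories found in text, in order of appearance."""
    found = []
    for match in pattern.finditer(text):
        category = keyword_to_category[match.group(0).lower()]
        if category not in found:
            found.append(category)
    return found


# Made-up sample data, including the embedded-word case from the discussion.
df = pd.DataFrame(
    {"text": ["todayIgotAemailReport", "please send an E-Mail", "no keywords"]}
)
df["category"] = df["text"].apply(categories)
print(df)
```

Substring matching cuts both ways: a keyword such as "form" would also match inside "information", so whether this is preferable to whole-word matching depends on the data.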

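Does this answer your question?

No, that only works for a small number of keywords. I am looking for an easier way to handle a large number of keywords.

In this case I still need to match the order in the "choices" list, which is supposed to work like a list of categories. I was hoping a dict could replace the "choices" and "conditions" lists.

Can this be done with a dictionary? When I put them together, I ran into problems matching the sequence of keywords to the corresponding categories, because there are too many of them.

I will test it today and get back here, thanks @Henry Yik

Some words are embedded, for example "todayIgotAemailReport", and this does not return the "email" category. I guess this method does not apply regex. Is there any way around that?

Why doesn't it return the email category?

I think "todayIgotAemailReport" is treated as a single word that does not match the word "email". My original method had regex enabled, so this was not a problem.

I tried this method, but java_football was not recognized.

I did some more testing. This method only finds the first match, right? Also, I noticed that flashtext seems faster, but is there a way to do this with regex? Right now, when I test, "todayIgotAemailReport" is not recognized as the email category.

No, it gives you all the matches. kp.extract_keywords(x) returns a list, and I picked the item at index zero. That is why it throws an error when no keyword is found: the list is empty.

@ikel I modified the code to include span_info=True, so that you can find the position of the word.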