Extract keywords from title, relevant and final columns

Tags: string, python-2.7, pandas, dataframe, text-extraction


I have a DataFrame structured as follows:

 Title;         Total Visits;    Rank
 The dog;       8           ;    4
 The cat;       9           ;    4
 The dog cat;   10          ;    3

A second DataFrame contains:

Keyword;     Rank
snail ;      5
dog   ;      1
cat   ;      2

What I am trying to achieve is:

 Title;         Total Visits;    Rank  ; Keywords    ; Score
 The dog;       8           ;    4     ; dog         ; 1
 The cat;       9           ;    4     ; cat         ; 2
 The dog cat;   10          ;    3     ; dog,cat     ; 1.5

I have tried the following, but for some rows

df['Tweet'].map(lambda x: tuple(re.findall(r'({})'.format('|'.join(w.values)), x)))

returns null. Any help would be greatly appreciated.

You can use:

# list of all keywords
wants = df2.Keyword.tolist()
# dict mapping each keyword to its rank
d = df2.set_index('Keyword')['Rank'].to_dict()
# split each title on whitespace and stack into one Series (one word per row)
s = df1.Title.str.split(expand=True).stack()
# keep only the words that are keywords
s = s[s.isin(wants)]
print (s)
0  1    dog
1  1    cat
2  1    dog
   2    cat
dtype: object

#create new columns
df1['Keywords'] = s.groupby(level=0).apply(','.join)
df1['Score'] = s.map(d).groupby(level=0).mean()

print (df1)
         Title  Total Visits  Rank Keywords  Score
0      The dog             8     4      dog    1.0
1      The cat             9     4      cat    2.0
2  The dog cat            10     3  dog,cat    1.5
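The snippet above assigns the grouped results back by index, so a title that contains no keyword ends up with NaN in both new columns. Below is a minimal, self-contained sketch of that behaviour. The fourth title, 'The bird', and its visit and rank values are hypothetical additions for illustration, and the `fillna('')` step is only one possible way to tidy the result.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical title with no keyword.
df1 = pd.DataFrame({
    'Title': ['The dog', 'The cat', 'The dog cat', 'The bird'],
    'Total Visits': [8, 9, 10, 7],
    'Rank': [4, 4, 3, 2],
})
df2 = pd.DataFrame({'Keyword': ['snail', 'dog', 'cat'], 'Rank': [5, 1, 2]})

wants = df2['Keyword'].tolist()
d = df2.set_index('Keyword')['Rank'].to_dict()

# One word per row, indexed by (title row, word position).
s = df1['Title'].str.split(expand=True).stack()
s = s[s.isin(wants)]

# Assignment aligns on the row index, so rows without any keyword get NaN.
df1['Keywords'] = s.groupby(level=0).apply(','.join)
df1['Score'] = s.map(d).groupby(level=0).mean()

# Optional: show an empty string instead of NaN in the Keywords column.
df1['Keywords'] = df1['Keywords'].fillna('')
print(df1)
```

The Score of the unmatched row stays NaN, which seems a reasonable marker for "no keyword found".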

Another solution, using list operations:

wants = df2.Keyword.tolist()
d = df2.set_index('Keyword')['Rank'].to_dict()
# split each title into a list of words
df1['Keywords'] = df1.Title.str.split()
# keep only the words that are keywords
df1['Keywords'] = df1.Keywords.apply(lambda x: [item for item in x if item in wants])
# map each keyword to its rank
df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])

# join the keywords into one string
df1['Keywords'] = df1.Keywords.apply(','.join)
# mean rank (raises ZeroDivisionError if a title has no keyword)
df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))

print (df1)
         Title  Total Visits  Rank Keywords  Score
0      The dog             8     4      dog    1.0
1      The cat             9     4      cat    2.0
2  The dog cat            10     3  dog,cat    1.5
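The list-based version divides by `len(l)`, which is zero for any title without a matching keyword. A small sketch of one way to guard against that follows. The helper name `score_title`, its parameter names and the 'The bird' title are hypothetical, not part of the answer.

```python
def score_title(title, rank_by_keyword):
    """Return (keywords, mean rank) for one title.

    The mean is NaN when no word of the title is a keyword,
    which avoids dividing by zero.
    """
    found = [w for w in title.split() if w in rank_by_keyword]
    if not found:
        return '', float('nan')
    ranks = [rank_by_keyword[w] for w in found]
    return ','.join(found), sum(ranks) / float(len(ranks))


d = {'snail': 5, 'dog': 1, 'cat': 2}
print(score_title('The dog cat', d))  # ('dog,cat', 1.5)
print(score_title('The bird', d))     # no keyword: empty string and NaN
```

With pandas, something like `df1.Title.apply(lambda t: score_title(t, d))` should give one tuple per row, which can then be split into the two columns.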

Timings:

In [96]: %timeit (a(df11, df22))
100 loops, best of 3: 3.71 ms per loop

In [97]: %timeit (b(df1, df2))
100 loops, best of 3: 2.55 ms per loop

Test code:

df11 = df1.copy()    
df22 = df2.copy() 

def a(df1, df2):
    wants = df2.Keyword.tolist()
    d = df2.set_index('Keyword')['Rank'].to_dict()
    s = df1.Title.str.split(expand=True).stack()
    s = s[s.isin(wants)]
    df1['Keywords'] = s.groupby(level=0).apply(','.join)
    df1['Score'] = s.map(d).groupby(level=0).mean()
    return (df1)

def b(df1,df2):   
    wants = df2.Keyword.tolist()
    d = df2.set_index('Keyword')['Rank'].to_dict()
    df1['Keywords'] = df1.Title.str.split()
    df1['Keywords'] = df1.Keywords.apply(lambda x: [item for item in x if item in wants])
    df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
    df1['Keywords'] = df1.Keywords.apply(','.join)
    df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
    return (df1)

print (a(df11, df22))    
print (b(df1, df2))

Edit, following a comment:

If some keywords contain more than one word, you can use a list comprehension:

print (df1)
         Title  Total Visits  Rank
0      The dog             8     4
1      The cat             9     4
2  The dog cat            10     3

print (df2)
   Keyword  Rank
0    snail     5
1      dog     1
2      cat     2
3  The dog     8
4  the Dog     1
5  The Dog     3

wants = df2.Keyword.tolist()
print (wants)
['snail', 'dog', 'cat', 'The dog', 'the Dog', 'The Dog']

d = df2.set_index('Keyword')['Rank'].to_dict()
df1['Keywords'] = df1.Title.apply(lambda x: [item for item in wants if item in x])
df1['Score'] = df1.Keywords.apply(lambda x: [d[item] for item in x])
df1['Keywords'] = df1.Keywords.apply(','.join)
df1['Score'] = df1.Score.apply(lambda l: sum(l) / float(len(l)))
print (df1)
         Title  Total Visits  Rank         Keywords     Score
0      The dog             8     4      dog,The dog  4.500000
1      The cat             9     4              cat  2.000000
2  The dog cat            10     3  dog,cat,The dog  3.666667
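Note that `item in x` is a plain substring test, so 'cat' would also match inside a title such as 'The category'. If that matters, one possible alternative is a regular expression with word boundaries. This is only a sketch: the function name `find_keywords` is hypothetical, the 'Star Wars' keyword is added for illustration, and `\b` behaves as intended only for keywords that start and end with a letter, digit or underscore.

```python
import re


def find_keywords(title, keywords):
    """Return the keywords that occur in title as whole words or phrases."""
    found = []
    for kw in keywords:
        # re.escape protects regex metacharacters inside a keyword;
        # the \b anchors stop 'cat' from matching inside 'category'.
        if re.search(r'\b' + re.escape(kw) + r'\b', title):
            found.append(kw)
    return found


wants = ['snail', 'dog', 'cat', 'The dog', 'Star Wars']
print(find_keywords('The dog cat', wants))           # ['dog', 'cat', 'The dog']
print(find_keywords('Star Wars: Rogue One', wants))  # ['Star Wars']
print(find_keywords('The category', wants))          # []
```

In the pandas code above, `df1.Title.apply(lambda x: find_keywords(x, wants))` could stand in for the list comprehension. Passing `flags=re.IGNORECASE` to `re.search` would make the match case-insensitive, at the cost of returning all case variants of a keyword.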

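Comments:

Thanks for your reply. The first option produces NaN for Keywords and Score, except for one result that shows a single keyword (although it should have two). The list-operation option ends with ZeroDivisionError: float division by zero. The problem I am running into is this: if the string contains, for example, "Star Wars: Rogue One" and the keyword is "Star Wars", the string is stored as a list of single words ("Star", "Wars", "Rogue", "One"), so there is no match.

If keywords consist of two or more words, the solution is more complicated. The main problem arises when one-word keywords are mixed with keywords of two or more words and the Title column of df1 is split: splitting on whitespace only finds the one-word keywords.

Is it possible to work around this? Unfortunately some of the keywords are compound words, and I have not found a way to adjust the titles that contain compound words. If there is a way to replicate

checkResult = []
mList = ["dog", "cat", "apple", "dog", "dog"]
mString = "The dog is chasing the cat"
for item in mList:
    if item in mString:
        checkResult.append(item)

with pandas, I think that would solve the problem. Thanks.

But I am away visiting for the whole weekend, so it is better to post a new question or wait until Monday, sorry.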