Python: combine str.contains and merge in pandas

python regex pandas dataframe merge

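I have two dataframes that look roughly like the following (in df1, the Content column actually holds the full text of an article, not a single sentence as in my example):

(total: 5709 entries)

(total: 10228 entries)

I want to merge the two dataframes by searching for each Title from df2 in the Content of df1. If a title appears anywhere in the first 2500 characters of the content, it counts as a match.
Note: it is very important to keep all entries from df1. By contrast, I only want to keep the entries from df2 that match (i.e. a left join).
Note: all titles are unique values.
Desired output (column order does not matter):
I think I need some combination of pd.merge and str.contains, but I can't figure out how to combine them.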
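Warning: this solution may be slow :).

1. Get the list of titles.

2. Create an index for df1 based on the order of the title list.

3. Concatenate df1 and df2 on idx.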

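  import itertools
  import pandas as pd

  # Lower-cased titles; Title (and Content) must hold strings, not NaN
  lst = [item.lower() for item in df2.Title.tolist()]
  # Unmatched rows get fresh index values past the end of df2
  unmatched = itertools.count(len(lst))
  def func(content):
    content = str(content)[:2500].lower()
    for i, item in enumerate(lst):
      if item in content:
        return i  # position of the first title (in list order) found in the content
    return next(unmatched)
  df1 = df1.assign(idx=df1.Content.apply(func))

  # Align on idx; this requires each title to match at most one row of df1
  res = pd.concat([df1.set_index('idx'), df2], axis=1)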

Output:
      PDF                                            Content    Author  \
0  1111.0  Johannes writes about apples and oranges and t...  Johannes
1  1234.0  This article is about bananas and pears and gr...     Peter
2  3993.0  There is an interesting piece on bananas plus ...    Hannah
3     NaN                                                NaN    Helena
4  8000.0  Content that cannot be matched to the anything...       NaN

                          Title
0            Apples and oranges
1  Bananas and pears and grapes
2            Bananas plus kiwis
3            Mangos and peaches
4                           NaN
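The follow-up comments ask for the title that appears first by position in the string, and for a true left join that keeps every row of df1. The block below is a minimal sketch of one way to do that, not the answerer's method. The sample frames are hypothetical stand-ins (the original example data is not shown here), and the helper name `earliest_title` and its `limit` parameter are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df1 = pd.DataFrame({
    "PDF": [1111, 1234, 3993, 8000],
    "Content": [
        "Johannes writes about apples and oranges and nothing else.",
        "This article is about bananas and pears and grapes, and also apples and oranges.",
        "There is an interesting piece on bananas plus kiwis in here.",
        "Content that cannot be matched to any title.",
    ],
})
df2 = pd.DataFrame({
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
})

lowered_titles = df2["Title"].str.lower().tolist()

def earliest_title(content, limit=2500):
    """Return the title occurring earliest in the first `limit` characters, or None."""
    head = str(content)[:limit].lower()
    best_pos, best_title = None, None
    for original, lowered in zip(df2["Title"], lowered_titles):
        pos = head.find(lowered)
        if pos != -1 and (best_pos is None or pos < best_pos):
            best_pos, best_title = pos, original
    return best_title

# Attach the matched title to df1, then left-join df2 on it so that
# unmatched df1 rows survive with missing values in the df2 columns.
df1_matched = df1.assign(Title=df1["Content"].apply(earliest_title))
result = df1_matched.merge(df2, on="Title", how="left")
```

Because the match is stored as an ordinary column rather than as the index, two articles that mention the same title do not break the join. The loop is still one substring search per (article, title) pair, so it is not faster than the approach above.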

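You can do a full Cartesian join (cross product) and then filter. Since no hash lookup is possible for a substring match anyway, this should not be slower than an equivalent "join" statement: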
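import pandas as pd

# Both frames need the same constant key so the merge yields every (df1, df2) pair
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# Reuse 'key' as a boolean mask: True where the title occurs in the first 2500 characters
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]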

This produces the following table:
       PDF    Author                         Title  \
0   1234.0  Johannes            Apples and oranges
1   1234.0     Peter  Bananas and pears and grapes
4   1111.0  Johannes            Apples and oranges
14  3993.0    Hannah            Bananas plus kiwis

                                              Content
0   This article is about bananas and pears and gr...
1   This article is about bananas and pears and gr...
4   Johannes writes about apples and oranges and t...
14  There is an interesting piece on bananas plus ...
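The filtered cross product only contains matching pairs, so articles without any matching title drop out, while the question asks for a left join. One possible way to restore them is to merge the matches back onto df1. The block below is a sketch under assumptions: the sample frames are hypothetical, and PDF is assumed to identify each row of df1 uniquely.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df1 = pd.DataFrame({
    "PDF": [1111, 1234, 3993, 8000],
    "Content": [
        "Johannes writes about apples and oranges and nothing else.",
        "This article is about bananas and pears and grapes, and also apples and oranges.",
        "There is an interesting piece on bananas plus kiwis in here.",
        "Content that cannot be matched to any title.",
    ],
})
df2 = pd.DataFrame({
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
})

# Cross product via a shared constant key, added on copies so df1/df2 stay unchanged.
pairs = df1.assign(key=1).merge(df2.assign(key=1), on="key").drop(columns="key")

# Keep only the pairs whose title occurs in the first 2500 characters of the content.
hit = [t.lower() in c[:2500].lower() for t, c in zip(pairs["Title"], pairs["Content"])]
matches = pairs.loc[hit, ["PDF", "Title", "Author"]]

# Left-join the matches back onto df1 (assumes PDF is unique per df1 row).
result = df1.merge(matches, on="PDF", how="left")
```

An article that matches several titles appears once per match here (PDF 1234 yields two rows in this sample), so a tie-breaking rule still has to be applied afterwards if only one title per article is wanted. The intermediate frame holds one row per (article, title) pair, which at roughly 5709 by 10228 rows of full article text may not fit comfortably in memory.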

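What behavior do you want/expect if there are multiple matches?

All entries in the Title column are unique. As for the Content column, I want each title to be matched with the first match found in the content entry.

"First match found" as in...? First in the dataset (row by row), or by position in the string?

Try a full Cartesian join and then design your own filter?

I edited my question; see PDF 1234, which mentions both "Bananas and pears and grapes" and "Apples and oranges". First by position in the string. Although I must say it is unlikely that two titles both appear within the first 2500 characters.

I get the following error, even though both dataframes originally contain only non-null objects:

    AttributeError                            Traceback (most recent call last)
    in ()
          2 # in the first 2500 characters of the second df.
          3
    ----> 4 lst = [item.lower() for item in df2.Title.tolist()]
          5 end = len(lst)
          6 def func(row):

    AttributeError: 'float' object has no attribute 'lower'

Any ideas?

@NynkeLys convert Content to str.

I have, using the following command, but I still get the same error: df1.Content = df1.Content.astype('str')

@NynkeLys convert Title to str.

@NynkeLys For the code to run, both Title and Content must be strings. :)

Thanks a lot. I tried that, but got the following error: ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series. Any ideas?

Any ideas? Running the code keeps producing an error. I am using Python 2.7, and it happens even with exactly the same dfs I created for my question.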
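The AttributeError above is what Python raises when `.lower()` is called on a missing value, because pandas represents it as a float NaN. The sketch below shows one way to clean both columns before matching. The sample Series are hypothetical, and whether to drop or blank out missing values is an assumption about what the data needs.

```python
import numpy as np
import pandas as pd

# Hypothetical columns containing a missing value each.
titles = pd.Series(["Apples and oranges", np.nan, "Bananas plus kiwis"])
contents = pd.Series(["A piece on bananas plus kiwis.", np.nan])

# Caution: converting NaN with str() yields the text "nan",
# which would then match inside words such as "bananas".
nan_as_text = str(np.nan).lower()

# Drop missing titles instead of stringifying them, then lower-case the rest.
clean_titles = titles.dropna().astype(str).str.lower().tolist()

# Replace missing contents with an empty string, which no title can match.
clean_contents = contents.fillna("").astype(str).str[:2500].str.lower()
```

Dropping the missing titles before building the list also avoids a spurious match for every article that happens to contain the letters "nan".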