Python: pattern-matching string search between DataFrame columns

python pandas

I have a table containing strings:

a = pd.DataFrame({"strings_to_search" : ["AA1 BB2 CVC GF2","AR1 KP1","PL3 4OR 91K GZ3"]})

and a table of search parameters, which are regular expressions:

re = pd.DataFrame({"regex_search" : ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

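My goal is to match the strings against the search parameters (a pattern counts when it is found in the string). I want to compare every string with every pattern and join each string to its matching pattern, like this: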

| AA1 BB2 CVC GF2 | ^(?=.*AA1).*$
| PL3 4OR 91K GZ3 | ^(?=.*4OR)(?=.*GZ3).*$

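Is there a way to do this in pandas? I implemented something similar in Spark SQL with the rlike function, but Spark does not do well when joining large tables.

Since pandas has no function like rlike, my approach is to cross join the two tables and then compare the columns: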

a["key"] = 0
re["key"] = 0
res = a.merge(re, on="key")

But how do I search the strings in column strings_to_search using the regex in column regex_search?

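You can combine the dataframes and then use the apply function to run the regex search. In this example I renamed your re dataframe to r, because re is the name of a module. First, take the Cartesian product of the two dataframes. Then, in the lambda, evaluate the regex in regex_search on each row and produce a boolean output: True if the expression is found in strings_to_search, False if it is not. Finally, filter the dataframe down to the rows where a match occurred, group by strings_to_search, and produce a list of all matching regex_search values.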

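import pandas as pd
import re

# a is the string table from the question; r is the regex table, renamed from re
a = pd.DataFrame({"strings_to_search" : ["AA1 BB2 CVC GF2","AR1 KP1","PL3 4OR 91K GZ3"]})
r = pd.DataFrame({"regex_search" : ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

# Cartesian product of the two tables via a constant helper key
a["idx"] = 1
r["idx"] = 1
df = a.merge(r, on="idx").drop("idx", axis=1)

# True where the row's regex is found in the row's string
df["output"] = df.apply(lambda x: bool(re.compile(x["regex_search"]).search(x["strings_to_search"])), axis=1)

# keep the matching rows and list the matching regexes for each string
df[df["output"]].groupby("strings_to_search")["regex_search"].apply(list)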
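As a hedged variation on the steps above (a sketch, not part of the original answer), the Cartesian product can also be built with merge(how="cross"), which assumes pandas 1.2 or newer, and each pattern can be compiled once up front. The names compiled, pairs, mask and matches are illustrative only.

```python
import re
import pandas as pd

a = pd.DataFrame({"strings_to_search": ["AA1 BB2 CVC GF2", "AR1 KP1", "PL3 4OR 91K GZ3"]})
r = pd.DataFrame({"regex_search": ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

# Compile each distinct pattern once (illustrative helper dict).
compiled = {p: re.compile(p) for p in r["regex_search"].unique()}

# how="cross" builds the Cartesian product directly; assumes pandas >= 1.2.
pairs = a.merge(r, how="cross")

# Test every (string, pattern) pair with a plain loop instead of a row-wise apply.
mask = [
    compiled[p].search(s) is not None
    for s, p in zip(pairs["strings_to_search"], pairs["regex_search"])
]

# Keep only the pairs where the pattern was found in the string.
matches = pairs[mask].reset_index(drop=True)
print(matches)
```

This leaves one row per matching (string, pattern) pair, so the same groupby as above can still be applied to matches if a list per string is wanted.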

This will get you your result, but it is slow:

import re
import pandas as pd

a = pd.DataFrame({"strings_to_search" : ["AA1 BB2 CVC GF2","AR1 KP1","PL3 4OR 91K GZ3"]})
b = pd.DataFrame({"regex_search" : ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

# new empty column that will hold the matching regex
a.insert(1, 'regex', '')

# test every regex against every string
for item in b.regex_search:
    for s in a.strings_to_search:
        if re.match(item, s):
            # a single .loc call with row mask and column label avoids chained assignment
            a.loc[a.strings_to_search == s, 'regex'] = item

print(a)
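The nested loop above stores one regex per string, so a string that matches several patterns keeps only the last one. A possible alternative, sketched here under the assumption that one row per matching pair is acceptable, loops over the patterns only and lets Series.str.contains test all strings at once. The names rows, mask and result are illustrative.

```python
import pandas as pd

a = pd.DataFrame({"strings_to_search": ["AA1 BB2 CVC GF2", "AR1 KP1", "PL3 4OR 91K GZ3"]})
b = pd.DataFrame({"regex_search": ["^(?=.*AA1).*$", "^(?=.*AR1)(?=.*PL3).*$", "^(?=.*4OR)(?=.*GZ3).*$"]})

rows = []
for pattern in b["regex_search"]:
    # Boolean mask over all strings for this one pattern.
    mask = a["strings_to_search"].str.contains(pattern, regex=True)
    # Record every (string, pattern) pair that matched.
    for s in a.loc[mask, "strings_to_search"]:
        rows.append((s, pattern))

# Two-column result; stays a valid empty frame when nothing matches.
result = pd.DataFrame(rows, columns=["strings_to_search", "regex_search"])
print(result)
```

The Python-level loop now runs once per pattern rather than once per pair, which may help when there are far more strings than patterns.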

If you want to compare every string against every regex, use a list comprehension with re.match:

import re
result = [string+' | '+reg for reg in r['regex_search'] for string in a['strings_to_search']
          if re.compile(reg).match(string)]
result
['AA1 BB2 CVC GF2 | ^(?=.*AA1).*$', 'PL3 4OR 91K GZ3 | ^(?=.*4OR)(?=.*GZ3).*$']

If you need a new dataframe:

new_df = pd.DataFrame({'matches': result })
new_df
         matches
0   AA1 BB2 CVC GF2 | ^(?=.*AA1).*$
1   PL3 4OR 91K GZ3 | ^(?=.*4OR)(?=.*GZ3).*$
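The answers on this page mix re.search and re.match. For the anchored lookahead patterns in the question the two agree, but they can differ for unanchored patterns, because match only tries the start of the string while search scans through it. A minimal sketch of the difference follows; the sample text and the unanchored pattern are hypothetical and not taken from the question.

```python
import re

# Hypothetical sample string in which the token is not at the start.
text = "BB2 AA1 CVC"

# Anchored lookahead pattern in the style of the question: both calls agree.
anchored = re.compile(r"^(?=.*AA1).*$")
anchored_match = bool(anchored.match(text))
anchored_search = bool(anchored.search(text))

# Hypothetical unanchored pattern: match tries position 0 only, search scans.
plain = re.compile(r"AA1")
plain_match = bool(plain.match(text))
plain_search = bool(plain.search(text))

print(anchored_match, anchored_search, plain_match, plain_search)
```

With patterns written like the ones in the question, either call should give the same pairs; the choice only matters once unanchored patterns are added to the regex table.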

Do you want to check each string against only its corresponding regex, or against all regexes?

I want to find the matching regexes and join them to the strings.

Do you care about speed? If not, my answer should work for you.

@Daniel, I have attempted an answer. Let me know whether it is what you need.

Thanks for your reply. Unfortunately, concat does not work here, because I need to check all regexes, not only the corresponding one. When I apply the df["output"] code to a cross-joined df, I get the error TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')

Ah, okay, that requirement was not clear in the OP. I have updated the answer to include a Cartesian join at the start, so that all regexes are searched. I then added a groupby to summarize all positive matches for each string you are searching.