Python: detecting compound-word patterns in a large dataset with pandas


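Suppose I have two lists of words, where a word from the first list is immediately followed by a word from the second. The two words are joined by a space or a hyphen. For simplicity, both lists contain the same words: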

First=['Derp','Foo','Bar','Python','Monte','Snake']
Second=['Derp','Foo','Bar','Python','Monte','Snake']

Only some of the possible word combinations actually exist (marked "Yes" in the combination matrix):

I have a dataset like this, in which I want to detect specific word combinations:

df=pd.DataFrame({'Name': [ 'Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda', 'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
                 'Class': ['Politician','L','H','L','L','H', 'H','L','L','Circus']})

If I use regular expressions and flag every row that matches one of the patterns, it looks like this:

import pandas as pd


df=pd.DataFrame({'Name': [ 'Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake', 'Python Anaconda', 'Python-Pandas', 'Derp Bar', 'Derp Python', 'JavaScript', 'Python Monte'],
                 'Class': ['Politician','L','H','L','L','H', 'H','L','L','Circus']})
df['status']=''

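# Raw strings keep the backslash in \s from being read as a string escape.
patterns = [r'^Derp(-|\s)(Foo|Bar|Snake)$',
            r'^Foo(-|\s)(Bar|Python|Monte)$',
            r'^Python(-|\s)(Derp|Foo|Bar|Snake)$',
            r'^Monte(-|\s)(Derp|Foo|Bar|Python|Snake)$']

# Flag every row whose Name matches any of the patterns.
for pattern in patterns:
    df.loc[df.Name.str.contains(pattern), 'status'] = 'Found'

print(df)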

This prints:


        Class             Name status
0  Politician          Al Gore       
1           L          Foo-Bar  Found
2           H     Monte-Python  Found
3           L     Python Snake  Found
4           L  Python Anaconda       
5           H    Python-Pandas       
6           H         Derp Bar  Found
7           L      Derp Python       
8           L       JavaScript       
9      Circus     Python Monte       

[10 rows x 3 columns]

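For larger datasets, writing out every regex pattern by hand does not seem feasible. Is there a way to have a loop, or something similar, walk through the combination matrix, pick up the patterns that exist (marked "Yes" in the matrix) and skip those that do not (marked "No")? I know the itertools library has a function called combinations that can generate all possible patterns in a loop.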
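As a rough illustration of the itertools idea: itertools.product yields every ordered (first, second) pair, whereas itertools.combinations would yield each unordered pair only once. The sketch below enumerates all candidates for the two lists above. It cannot know which pairs are actually allowed; that information has to come from the combination matrix. The variable names are illustrative.

```python
from itertools import product

first = ['Derp', 'Foo', 'Bar', 'Python', 'Monte', 'Snake']
second = ['Derp', 'Foo', 'Bar', 'Python', 'Monte', 'Snake']

# Every ordered (first, second) pair, joined by a space or a hyphen.
candidates = [w1 + sep + w2
              for w1, w2 in product(first, second)
              for sep in (' ', '-')]

print(len(candidates))   # 6 * 6 pairs * 2 separators = 72
print(candidates[:4])
```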

I don't think generating these regexes from the combination matrix is difficult:

import pandas as pd

# Read the combination matrix from the clipboard
# (rows: first word, columns: second word, cells: Yes/No):
pattern_mat = pd.read_clipboard()
# Map each first word to the second words that may follow it:
w2_dict = {}
for w1, row in pattern_mat.iterrows():
    w2_dict[w1] = list(row.loc[row == 'Yes'].index)
# Print all the resulting regexes.
# (The raw-string prefix keeps the backslash in \s from needing an escape.)
for w1, w2_list in w2_dict.items():
    pattern = r"^{w1}(-|\s)({w2s})$".format(w1=w1, w2s='|'.join(w2_list))
    print(pattern)

Output:

^Monte(-|\s)(Foo|Bar)$
^Snake(-|\s)(Derp|Bar|Python|Monte)$
^Bar(-|\s)(Derp|Foo|Python|Monte|Snake)$
^Foo(-|\s)(Derp|Python|Monte|Snake)$
^Python(-|\s)(Foo|Bar|Monte|Snake)$
^Derp(-|\s)(Bar|Python|Monte|Snake)$
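Once the mapping exists, the per-word patterns can be merged into one regex and applied in a single pass. The following is a minimal sketch under stated assumptions: the w2_dict here is copied from the four hand-written regexes in the question rather than read from the real matrix, and the names alternatives and combined are made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({'Name': ['Al Gore', 'Foo-Bar', 'Monte-Python', 'Python Snake',
                            'Python Anaconda', 'Python-Pandas', 'Derp Bar',
                            'Derp Python', 'JavaScript', 'Python Monte']})

# Assumed mapping, copied from the hand-written regexes in the question;
# in practice this would be the w2_dict built from the combination matrix.
w2_dict = {
    'Derp': ['Foo', 'Bar', 'Snake'],
    'Foo': ['Bar', 'Python', 'Monte'],
    'Python': ['Derp', 'Foo', 'Bar', 'Snake'],
    'Monte': ['Derp', 'Foo', 'Bar', 'Python', 'Snake'],
}

# One alternative per first word; re.escape guards against regex metacharacters.
# Non-capturing groups avoid the pandas warning about match groups.
alternatives = [
    r'{}[-\s](?:{})'.format(re.escape(w1), '|'.join(re.escape(w2) for w2 in w2s))
    for w1, w2s in w2_dict.items()
]
combined = '^(?:' + '|'.join(alternatives) + ')$'

# A single vectorized pass instead of one str.contains call per pattern.
df['status'] = ''
df.loc[df['Name'].str.contains(combined, regex=True), 'status'] = 'Found'
print(df)
```

With this assumed mapping, the same four rows are flagged as in the question's printed table: Foo-Bar, Monte-Python, Python Snake and Derp Bar.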

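Thanks for the reply. So what you are doing is basically this: read in the table, loop over its rows looking for the Yes values, store them in a dictionary keyed by the row label, and then use join to build the patterns.

@ccsv Yes, that is basically it. There may be more elegant ways to walk the pattern matrix than iterrows, but I think it is all fairly straightforward. You may just want to expand some of the more complex one-liners into several lines for readability.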
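One possible way to avoid iterrows, and regex alternation altogether, is to turn the matrix into a set of allowed (first, second) pairs and compare the split names against it. This is only a sketch: the 3x3 matrix and the sample names below are hypothetical, and it assumes each name consists of exactly two word-character tokens joined by one space or hyphen.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 excerpt of a Yes/No combination matrix
# (rows: first word, columns: second word); the values are illustrative.
pattern_mat = pd.DataFrame(
    [['No', 'Yes', 'Yes'],
     ['No', 'No', 'Yes'],
     ['No', 'No', 'No']],
    index=['Derp', 'Foo', 'Bar'],
    columns=['Derp', 'Foo', 'Bar'],
)

# Positions of the Yes cells, translated back to (row label, column label).
rows, cols = np.nonzero(pattern_mat.eq('Yes').to_numpy())
allowed = set(zip(pattern_mat.index[rows], pattern_mat.columns[cols]))

names = pd.Series(['Foo-Bar', 'Derp Foo', 'Bar Derp', 'JavaScript'])

# Split each name into two words around a single space or hyphen;
# names that do not fit this shape yield NaN in both columns.
parts = names.str.extract(r'^(\w+)[-\s](\w+)$')

# A name is found when its (first, second) pair is in the allowed set.
found = [pair in allowed for pair in zip(parts[0], parts[1])]
print(found)
```

Set membership keeps the cost per name roughly constant regardless of how many combinations the matrix allows, which may matter more than regex elegance on a large dataset.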
