Using a dictionary as a regular expression in Python

python regex list parsing dictionary

I have a Python question and would appreciate some help.

Let's start with the important part. Here is my current code:

import re #for regex
import numpy as np #for matrix

f1 = open('file-to-analyze.txt','r') #file to analyze

#convert files of words into arrays. 
#These words are used to be matched against in the "file-to-analyze"
math = open('sample_math.txt','r')
matharray = list(math.read().split())
math.close()

logic = open('sample_logic.txt','r')
logicarray = list(logic.read().split())
logic.close()

priv = open ('sample_priv.txt','r')
privarray = list(priv.read().split())
priv.close()

... Read in 5 more files and make associated arrays

#convert arrays into dictionaries
math_dict = dict()
math_dict.update(dict.fromkeys(matharray,0))

logic_dict = dict()
logic_dict.update(dict.fromkeys(logicarray,1))

...Make more dictionaries from the arrays (8 total dictionaries - the same number as there are arrays)

#create big dictionary of all keys
word_set = dict(math_dict.items() + logic_dict.items() + priv_dict.items() ... )

statelist = list()

for line in f1:
     for word in word_set:
         for m in re.finditer(word, line):
            print word.value()

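The goal of the program is to take a large text file and analyze it. Essentially, I want the program to loop through the text file, match its words against the words in the Python dictionaries, associate each match with a category, and keep track of the categories in a list.

For example, say I'm parsing the file and run across the word "ADD". "ADD" is listed under the "math" or "0" category of words. The program should then record in a list that it ran across a 0 category, and continue parsing the file. This produces a large list that looks like [0,4,6,7,4,3,4,1,2,7,1,2,2,4…], with each number corresponding to a particular state or word category as described above. For the sake of understanding, we'll call this large list the "state list".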

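As you can tell from my code, so far I can take the file to analyze as input, read the text files that contain the word lists into arrays, and from there store them in dictionaries with the correct corresponding category value (a numeric value from 0 to 7). However, I'm having trouble with the analysis portion.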
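The file-to-dictionary steps described above could be condensed into two small helpers. This is only a sketch, assuming Python 3; the names `read_words` and `build_word_set` are hypothetical and the sample words are invented for illustration. It also sidesteps `dict(a.items() + b.items())`, which only works in Python 2.

```python
def read_words(path):
    """Return the whitespace-separated words in a text file."""
    with open(path, "r") as handle:
        return handle.read().split()


def build_word_set(words_by_category):
    """Map every word to its category number.

    words_by_category maps a category number to an iterable of words.
    If a word appears in more than one category, the last one seen wins.
    """
    word_set = {}
    for category, words in words_by_category.items():
        word_set.update(dict.fromkeys(words, category))
    return word_set


# Invented sample data; in practice each list would come from
# read_words('sample_math.txt'), read_words('sample_logic.txt'), etc.
word_set = build_word_set({
    0: ["ADD", "SUB"],
    1: ["AND", "OR"],
})
```

With one combined dictionary built this way, the per-category dictionaries are no longer needed for the parsing step.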

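As you can also see, I'm trying to go through the text file line by line and run a regex for each dictionary word. This is done with a loop and an additional, ninth dictionary that is more or less a "super" dictionary to help simplify the parsing.

However, I'm having trouble matching all of the words in the file and, when I find a word, mapping it to the dictionary value rather than the key. That is, when the program runs across "ADD", it should add 0 to the list, because "ADD" is part of the 0 or "math" category.

Could someone help me figure out how to write this script? I'd really appreciate it! Sorry for the long post, but the code needs a lot of explanation so you know what's going on. Thanks very much for your help.

The simplest change to your existing code is to keep track of both the word and the category in the loop: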

对现有代码最简单的更改就是跟踪循环中的单词和类别：

for line in f1:
    for word, category in word_set.iteritems():
        for m in re.finditer(word, line):
            print word, category
            statelist.append(category)
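One thing to note about the loop above: it scans each line once per dictionary word, records matches in dictionary order rather than in the order they occur in the line, and treats each word as a regex pattern, so "ADD" also matches inside "ADDRESS". A possible alternative is to compile all words into a single pattern. This is a sketch for Python 3 under assumptions that are not in the original post: the function name `categorize` and the sample data are hypothetical, matching is whole-word and case-sensitive, and the dictionary words consist of ordinary word characters.

```python
import re


def categorize(lines, word_set):
    """Return the categories of dictionary words in order of appearance."""
    if not word_set:
        return []
    # Longest words first so that a longer alternative is tried before
    # any shorter word that is a prefix of it.
    alternatives = sorted(word_set, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in alternatives) + r")\b"
    )
    statelist = []
    for line in lines:
        for match in pattern.finditer(line):
            # The matched text is a dictionary key; look up its category.
            statelist.append(word_set[match.group(0)])
    return statelist


# Invented sample data for illustration only.
sample_words = {"ADD": 0, "SUB": 0, "AND": 1}
sample_lines = ["SUB r1 AND r2", "ADD r3 ADDRESS"]
print(categorize(sample_lines, sample_words))  # [0, 1, 0]
```

Because each line is scanned once, the resulting list follows the order of the words in the file, which seems to be what the "state list" is meant to capture.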

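Update: using this loop prints the word names, but not the category values. [for line in f1: for word in word_set: for m in re.finditer(word, line): statelist.append(word)] ^ Sorry, I tried to make that a code block in the comment, but it didn't work.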
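The loop quoted in the update iterates over `word_set` directly, which yields only the keys (the words), so `statelist.append(word)` stores words rather than categories. Looking the key up in the dictionary gives the category. A minimal sketch, assuming Python 3 and using invented sample data in place of the real file:

```python
import re

# Invented sample data for illustration only.
word_set = {"ADD": 0, "AND": 1}
lines = ["ADD r1 AND r2"]

statelist = []
for line in lines:
    for word in word_set:  # iterating a dict yields its keys
        # re.escape keeps any regex metacharacters in the word literal.
        for m in re.finditer(re.escape(word), line):
            statelist.append(word_set[word])  # look up the category

print(statelist)  # [0, 1]
```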