Regex: re extracts data from one tag but not the other

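I am trying to write a program that can parse HTML-like tags; it is for the TREC collection. I don't program very often, apart from databases, and I am getting stuck on the syntax. Here is my current code: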

parseTREC ('LA010189.txt')

#Following Code-re P worked in Python
def parseTREC (atext):
  atext=open(atext, "r")
  filePath= "testLA.txt"
  docID= []
  docTXT=[]
  p = re.compile ('<DOCNO>(.*?)</DOCNO>', re.IGNORECASE)
  m= re.compile ('<P>(.*?)</P>', re.IGNORECASE)
  for aline in atext:
    values=str(aline)
    if p.findall(values):
      docID.append(p.findall(values))
      if m.findall(values):
        docID.append(p.findall(values))
  print docID
  atext.close()

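The p regex pulls the DOCNO values as intended. The m regex, however, extracts nothing and prints an empty list. I am fairly sure there is whitespace, and also a newline, inside the <P> tags. I tried re.M, but that did not help pull data from the other lines. Ideally, I would like to get to the point where I store the results in a dictionary {DOCNO, Count}. The count would be determined by summing every word that appears both inside the P tags and in a list []. Any suggestions would be appreciated.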

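If you think the newlines are affecting the regex results, you can try removing all newlines from the file. Also, make sure there are no nested <p> tags, because your regex may not match them as expected. For example: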

<p>
  <p>
    <p>here's some data</p>
    And some more data.
  </p>
  And even more data.
</p>
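
To make the nesting problem concrete, here is a small sketch that runs the question's pattern (with re.DOTALL added so that "." can cross newlines) against the nested example above. The variable names are only illustrative.

```python
import re

# Same pattern as in the question, plus DOTALL so "." also matches newlines.
p_tag = re.compile(r'<p>(.*?)</p>', re.IGNORECASE | re.DOTALL)

nested = """<p>
  <p>
    <p>here's some data</p>
    And some more data.
  </p>
  And even more data.
</p>"""

matches = p_tag.findall(nested)
# The lazy match starts at the outermost <p> and stops at the first </p>,
# so the captured text still contains the two inner opening tags, and the
# text after the first </p> is never captured.
print(matches)
```

A regex has no notion of matching pairs, so the text that follows the first closing tag is lost. If the collection really nests tags, a parser is the safer choice.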

Shouldn't the last line be:

docID.append(m.findall(values))

Add the re.DOTALL flag, like this:

m = re.compile('<P>(.*?)</P>', re.IGNORECASE | re.DOTALL)

You may also want to add it to the other regex.
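
One caveat: re.DOTALL only helps when the regex sees text that actually spans several lines. The loop in the question passes one line at a time to findall, so a <P> and its </P> on different lines never appear in the same string. The sketch below contrasts the two approaches; the sample text and its DOCNO value are made up for illustration and only assume the layout described in the question.

```python
import re

p_tag = re.compile(r'<P>(.*?)</P>', re.IGNORECASE | re.DOTALL)

# Hypothetical sample; the real layout of LA010189.txt is assumed, not known.
sample = (
    "<DOC>\n"
    "<DOCNO> LA010189-0001 </DOCNO>\n"
    "<P>\n"
    "First paragraph.\n"
    "</P>\n"
    "</DOC>\n"
)

# Line by line: no single line holds both <P> and </P>, so nothing matches.
per_line = [hit for line in sample.splitlines() for hit in p_tag.findall(line)]

# Whole text at once: DOTALL lets "." cross the newlines inside the tag.
whole = [hit.strip() for hit in p_tag.findall(sample)]

print(per_line)
print(whole)
```

With a real file, the whole-text variant would correspond to calling findall on the result of read() instead of iterating over the file object.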

from xml.dom.minidom import parseString
import re

def parseTREC2(atext):
    # Read the whole file and wrap it in a single root element,
    # because the XML parser requires exactly one parent tag.
    fc = open(atext, 'r').read()
    fc = '<DOCS>\n' + fc + '\n</DOCS>'
    dom = parseString(fc)
    w_re = re.compile('[a-z]+', re.IGNORECASE)
    doc_nodes = dom.getElementsByTagName('DOC')
    for doc_node in doc_nodes:
        docno = doc_node.getElementsByTagName('DOCNO')[0].firstChild.data
        cnt = 1  # running paragraph number within this document
        for p_node in doc_node.getElementsByTagName('P'):
            p = p_node.firstChild.data
            words = w_re.findall(p)
            print("\t".join([docno, str(cnt), p]))
            print(words)
            cnt += 1

parseTREC2('LA010189.txt')

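The code wraps the document in a <DOCS> tag, because the file has no parent tag. The program then retrieves the information through the XML parser.

The <P> tag and its text are on different lines, so re.DOTALL does not solve my problem. Thanks for the suggestion.

Thanks. Yes, the last line was a typo.

My former programming teacher taught me to use the code I wrote in the answer post.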
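
Building on the minidom approach in parseTREC2, the {DOCNO: count} dictionary the question asks for could be sketched roughly as follows. The function name count_terms, the term list, and the sample text are hypothetical. The sketch also assumes that the text parses as XML and that words should be compared case-insensitively, neither of which the thread confirms.

```python
import re
from xml.dom.minidom import parseString

def count_terms(raw_text, terms):
    """Return {DOCNO: number of words inside <P> tags that appear in terms}."""
    wanted = {t.lower() for t in terms}  # assumption: case-insensitive match
    word_re = re.compile(r'[a-z]+', re.IGNORECASE)
    # Wrap everything in one root element, as parseTREC2 does.
    dom = parseString('<DOCS>\n' + raw_text + '\n</DOCS>')
    counts = {}
    for doc in dom.getElementsByTagName('DOC'):
        docno = doc.getElementsByTagName('DOCNO')[0].firstChild.data.strip()
        total = 0
        for p_node in doc.getElementsByTagName('P'):
            # Only the first text node of each <P> is read, as in parseTREC2.
            text = p_node.firstChild.data if p_node.firstChild else ''
            total += sum(1 for w in word_re.findall(text) if w.lower() in wanted)
        counts[docno] = total
    return counts

# Hypothetical sample in the layout the question describes.
sample = """<DOC>
<DOCNO> D1 </DOCNO>
<P>
The cat sat on the mat.
</P>
<P>
A cat again.
</P>
</DOC>
<DOC>
<DOCNO> D2 </DOCNO>
<P>
No match here.
</P>
</DOC>"""

result = count_terms(sample, ['cat', 'mat'])
print(result)
```

Keeping the wanted words in a set makes each membership check cheap, which matters once the word list or the collection grows.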
