Python: Reading part of a text file


Hi all,

I'm new to Python and to programming. I need to read chunks of a large text file, formatted as shown below:

<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>

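Any feedback or criticism would be much appreciated.

Thanks

I would suggest using the regular expression module:

Maybe something like this: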

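#!/usr/bin/python
import re

if __name__ == '__main__':
    # Read the whole file at once; 'x' is the input file name.
    with open('x') as f:
        data = f.read()
    # [^"]* keeps each group inside its own pair of quotes.
    RE = re.compile(r'form="([^"]*)" lemma="([^"]*)" postag="([^"]*)"')
    matches = RE.findall(data)
    for m in matches:
        print(m)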

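This does assume that each <word> element sits on a single line, that the attributes appear in exactly this order, and that you don't need to deal with full XML parsing.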
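If the attribute order cannot be relied on, one option is to collect every name="value" pair on a line into a dict first. This is only a sketch: the helper name `attributes` and the sample line are made up for illustration, and it still assumes double-quoted values with no escaped quotes inside them.

```python
import re

# Matches one name="value" pair; assumes double quotes and no escaped quotes.
ATTR = re.compile(r'([\w-]+)="([^"]*)"')

def attributes(line):
    """Return the attributes found on one line as a dict (hypothetical helper)."""
    return dict(ATTR.findall(line))

# Made-up sample line with lemma and form swapped relative to the question.
line = '<word id="8" lemma="hibernus1" form="hibernis" postag="n-p---nb-"/>'
attrs = attributes(line)
print(attrs['form'], attrs['lemma'], attrs['postag'])
```

Looking values up by name means a line with the attributes in a different order is handled the same way.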

Is your file proper XML? If so, try a SAX parser:

import xml.sax

class Handler(xml.sax.ContentHandler):
    def startElement(self, tag, attrs):
        # Called once for every opening tag; only <word> elements matter here.
        if tag == 'word':
            print('form=', attrs['form'])
            print('lemma=', attrs['lemma'])
            print('postag=', attrs['postag'])

ch = Handler()
with open('myfile', 'rb') as f:
    xml.sax.parse(f, ch)

(This is rough. It may not be entirely correct.)
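A self-contained variant of the same SAX idea that collects the values instead of printing them might look like the following. The class name `WordHandler`, the `words` list and the in-memory sample document are assumptions made for the sketch; a real file would go through xml.sax.parse() as shown above.

```python
import xml.sax

class WordHandler(xml.sax.ContentHandler):
    """Collects (form, lemma, postag) from every <word> element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.words = []

    def startElement(self, tag, attrs):
        if tag == 'word':
            self.words.append((attrs['form'], attrs['lemma'], attrs['postag']))

# Made-up sample document standing in for the real file.
SAMPLE = (b'<words><word id="8" form="hibernis" lemma="hibernus1" '
          b'postag="n-p---nb-" head="7" relation="ADV"/></words>')

handler = WordHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.words)
```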

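In addition to the usual regex answers: since this appears to be a form of XML, you might try something like BeautifulSoup.

It's very easy to use, and it can find tags and attributes in things like HTML/XML even if they aren't "well formed". Might be worth a look.

Parsing XML by hand is usually the wrong thing to do. For one, your code will break if there is an escaped quote in any of the attributes. Getting the attributes from an XML parser is probably cleaner and less error-prone.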
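To illustrate the escaped-quote point, here is a small comparison on a made-up line whose form attribute contains `&quot;`. It is only a sketch; the sample line and variable names are invented for the example.

```python
import re
from xml.etree import ElementTree

# Made-up line whose form attribute contains an escaped double quote.
line = '<word id="1" form="a&quot;b" lemma="x" postag="n"/>'

by_regex = re.search(r'form="([^"]*)"', line).group(1)
by_parser = ElementTree.fromstring(line).attrib['form']

print(by_regex)   # the raw, still-escaped text
print(by_parser)  # the decoded attribute value
```

The regex hands back the text exactly as written in the file, while the parser returns the value with the entity already expanded.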

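This approach can also run into problems when parsing the whole file if some lines don't match the format. You can handle that by writing a parse-line function, something like:

def parse(line):
    try:
        # Placeholder: put the actual parsing of the line here.
        return parsed_values(line)
    except Exception:
        # Assumed handling: report an unparseable line as None.
        return None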

You can also simplify this with the filter and map functions:

lines = filter( lambda line: parseable(line), f.readlines())
values = map (parse, lines)
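A runnable sketch of the filter/map idea, with one possible pair of `parseable` and `parse` functions. Both bodies are assumptions (the answer leaves them open), and the pattern expects the three attributes in this fixed order on one line.

```python
import re

# Hypothetical pattern: the three attributes in this fixed order on one line.
PATTERN = re.compile(r'form="([^"]*)".*lemma="([^"]*)".*postag="([^"]*)"')

def parseable(line):
    """True when the line contains the three attributes in the expected order."""
    return PATTERN.search(line) is not None

def parse(line):
    """Return (form, lemma, postag) for a line that parseable() accepted."""
    return PATTERN.search(line).groups()

lines = [
    '<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>',
    'this line does not match the format',
]
values = list(map(parse, filter(parseable, lines)))
print(values)
```

In Python 3, filter and map return lazy iterators, which is why the result is wrapped in list() before printing.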

To highlight your problem:

finished = False
counter = 0
while not finished:
   counter += 1
   finished=True
print(counter)

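For the regex, this is the gist (you can do the file.readline() part):

import re
line = '<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>'
r = re.compile('form="([^"]*)".*lemma="([^"]*)".*postag="([^"]*)"')
match = r.search(line)
print(match.groups())
>>> 
('hibernis', 'hibernus1', 'n-p---nb-')
>>>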

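First, don't spend a lot of time rewriting your file. It's generally a waste of time. Cleaning up and parsing the tags is so fast that you'll be perfectly happy working from the source file every time.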

source = open("blank.txt", "r")
for line in source:
    # line has a tag-line structure
    # <word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>
    # Assumption -- no spaces in the quoted strings.
    # Drop the newline and the closing "/>" so the last value is a clean quoted string.
    parts = line.strip().rstrip('/>').split()
    # parts is [ '<word', 'id="8"', 'form="hibernis"', ... ]
    assert parts[0] == "<word"
    nameValueList = [ part.partition('=') for part in parts[1:] ]
    # nameValueList is [ ('id','=','"8"'), ('form','=','"hibernis"'), ... ]
    attrs = dict( (n,eval(v)) for n, _, v in nameValueList )
    # attrs is { 'id':'8', 'form':'hibernis', ... }
    print(attrs['form'], attrs['lemma'], attrs['postag'])
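A variant of the same split-based idea that avoids eval by stripping the surrounding quotes instead. It is a sketch only: `parse_word_line` is a made-up name, and it assumes no spaces inside the quoted values and that every attribute is written as name="value".

```python
def parse_word_line(line):
    """Split one <word .../> line into a dict of attributes (hypothetical helper)."""
    # Drop surrounding whitespace and the closing "/>" before splitting.
    body = line.strip()
    if body.endswith('/>'):
        body = body[:-2]
    parts = body.split()
    assert parts[0] == '<word'
    attrs = {}
    for part in parts[1:]:
        name, _, value = part.partition('=')
        attrs[name] = value.strip('"')  # plain string handling instead of eval
    return attrs

line = '<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>\n'
attrs = parse_word_line(line)
print(attrs['form'], attrs['lemma'], attrs['postag'])
```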

Wow, you guys are fast :)

If you want all the attributes as a list (and the order is known), you can use something like this:
import re
print(re.findall('"(.+?)"', INPUT))

where INPUT is a line such as:
<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>

If it's XML, use ElementTree to parse it:
from xml.etree import ElementTree

line = '<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>'

element = ElementTree.fromstring(line)

So if you have a document with a bunch of word XML elements, something like this will extract the information you want from each one:
from xml.etree import ElementTree

XML = '''
<words>
    <word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>
</words>'''

root = ElementTree.fromstring(XML)

for element in root.findall('word'):
    form = element.attrib['form']
    lemma = element.attrib['lemma']
    postag = element.attrib['postag']

    print(form, lemma, postag)
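Since the question mentions large files, a streaming version of the loop above could use ElementTree.iterparse, which yields elements as they are read rather than building the whole tree first. This is a sketch under assumptions: the in-memory `source` stands in for a real file (iterparse also accepts a filename), and `results` is a made-up name.

```python
import io
from xml.etree import ElementTree

# Made-up in-memory stand-in for a large file; iterparse also accepts a filename.
source = io.BytesIO(b'''<words>
    <word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV"/>
</words>''')

results = []
for event, element in ElementTree.iterparse(source, events=('end',)):
    if element.tag == 'word':
        results.append((element.attrib['form'],
                        element.attrib['lemma'],
                        element.attrib['postag']))
        element.clear()  # release the element's contents once it has been read

print(results)
```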


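If you only have a filename, use parse() instead of fromstring().

Thanks, recompile. Just tried your code and it's exactly what I need. Thanks very much for your help. I did try the re module first and got as far as this expression: for line in f: if re.match("(.*)(f|1)orm(.*)", line): print >>rfformat, line, but foolishly abandoned that approach in favour of the list method. I'm going to study the re module now and make sure I understand what your code (and mine below) is doing with the regex.

However, the assumption about the order of the attributes may not hold for all entries.

Using regular expressions to parse XML is rarely a good idea. For instance, this RE will fail if the attributes are delimited with single quotes, and it won't expand character entities in the text, which the application may need.

If your problem is reading XML and you try to use regular expressions, you now have 3 problems: the original problem, trying to force regexes to solve it, and not even knowing you're on the wrong path.

Actually, your answer is the best one. :) None of the others tried to correct the code.

Hi, yes, the files are all XML. I'll have to look up the SAX parser now; it will probably make things simpler. Thanks for your help.

Keep in mind that BeautifulSoup is not part of the standard Python distribution (in case you have to use this script in an environment where you don't have permission to add packages).

Is eval really needed here? Wouldn't strip('"') be a better option?

@SilentGhost: It's six of one, half a dozen of the other. Some people like to claim that "eval is evil", which is largely nonsense. However, it is also a coincidence that the strings shown in the example happen to be valid Python strings. The file could use escape sequences that differ from Python's, and eval would then fail on the non-Python string syntax.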
The re.findall call shown earlier prints:
['8', 'hibernis', 'hibernus1', 'n-p---nb-', '7', 'ADV']

For the single-line ElementTree example above, the parsed element looks like this:

>>> element.tag
'word'
>>> element.attrib
{'head': '7', 'form': 'hibernis', 'postag': 'n-p---nb-', 'lemma': 'hibernus1', 'relation': 'ADV', 'id': '8'}
