Parsing a large text file with Python

python regex file text

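I am trying to parse a large text file (~20,000 lines) with Python. It is an exam, so each block of text is formatted like this:

(3.1.1.1) The first question contains several lines

(3.1.1.2) The next question contains more lines

I am trying to split the text by matching the pattern (3.1.* with a regular expression, using the following code: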

for line in data(0,10):    #start with the first 10 lines to check it
    results = re.match("^(3.1.*", line)
    if len(results.group()) != 0:
        print line

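I can handle the rest after the split (moving it into a dictionary and so on), but I need some help getting started with splitting on the pattern. Thanks.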

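Use an expression that matches from "(3.1." up to just before the next

```
\(\d+\.\d+\.
```
which starts another question, or
```
\Z
```
the end of the file.

You also need to set the appropriate flags on the pattern.

The key to this solution is the use of *?, a lazy (non-greedy) quantifier. Basically, it matches as little as possible.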
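As a rough illustration of the lazy-match idea, here is a minimal sketch. The full pattern, the choice of `re.DOTALL`, and the sample `data` are assumptions made for this example, not the answer's exact expression or flags.

```python
import re

# Hypothetical sample data in the format described in the question.
data = """(3.1.1.1) The first question contains several lines.
more text for the first question

(3.1.1.2) The next question contains more lines.
more text for the second question
"""

# Assumed pattern: start at "(3.1.", then lazily consume characters until
# the lookahead sees the next "(digits.digits." or the end of the string.
# re.DOTALL lets "." match newlines so one match can span several lines.
pattern = re.compile(r"\(3\.1\..*?(?=\(\d+\.\d+\.|\Z)", re.DOTALL)

blocks = [m.group().strip() for m in pattern.finditer(data)]

for block in blocks:
    print(repr(block))
```

With a greedy `.*` instead of `.*?`, the first match would run all the way to the end of the file, which is why the lazy quantifier matters here. Note that a parenthesized number such as "(1.2." inside a question body would also end a block early.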

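You can also split the string into question numbers and question texts, then iterate over the resulting list and store the pairs in a dictionary: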

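import re

data = """(3.1.1.1) The first question contains several lines.

(3.1.1.2) The next question contains more lines."""

# The capturing group keeps each question number in the result list.
splitted = re.split(r'\(([\d.]+)\)', data)

paired = {}
# splitted[0] is the (empty) text before the first question number;
# after it, question numbers and question texts alternate.
for i in range(1, len(splitted) - 1, 2):
    paired[splitted[i]] = splitted[i + 1].strip()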
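The same pairing can be written without index arithmetic. The following is only a sketch: the helper name `split_questions` is hypothetical, and it assumes every question number consists of digits and dots inside parentheses.

```python
import re


def split_questions(text):
    """Return {question number: question text} for the given exam text."""
    # With a capturing group, re.split keeps the captured numbers:
    # [text before first number, number, body, number, body, ...]
    parts = re.split(r'\(([\d.]+)\)', text)
    numbers = parts[1::2]                            # every odd index
    bodies = [body.strip() for body in parts[2::2]]  # every even index after 0
    return dict(zip(numbers, bodies))


data = """(3.1.1.1) The first question contains several lines.

(3.1.1.2) The next question contains more lines."""

questions = split_questions(data)
print(questions)
```

For a real exam file, the text could be read in one go (for example with `open(path).read()`) and passed to the function; at roughly 20,000 lines this comfortably fits in memory.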

The following builds a list of question blocks (parsed), starting a new block whenever the question-number pattern appears:

import re
import pprint

parsed = []     # finished question blocks
lastblock = []  # lines of the block currently being read
# A new block starts with a question number such as "(3.1.1.1)".
newblockregex = re.compile(r'^\(\d+\.\d+\.\d+\.\d+\)')
with open('data.txt') as exam:
    for line in exam:  # read line by line instead of loading everything
        line = line.rstrip('\n')
        if newblockregex.match(line):
            if lastblock:
                parsed.append(lastblock)
            lastblock = [line]
        else:
            lastblock.append(line)
if lastblock:  # do not forget the final block
    parsed.append(lastblock)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(parsed)

Sample data:

(splitexam)macbook:splitexam joeyoung$ cat data.txt
(3.1.1.1) The first question contains several lines.
line1 words1
line2 words
line3 words

(3.1.1.2) The next question contains more lines.
line1 words2
line2 words
line3 words

(3.1.1.3) The next question contains more lines.
line1 words3
line2 words
line3 words

Output:

[   [   '(3.1.1.1) The first question contains several lines.',
        'line1 words1',
        'line2 words',
        'line3 words',
        ''],
    [   '(3.1.1.2) The next question contains more lines.',
        'line1 words2',
        'line2 words',
        'line3 words',
        ''],
    [   '(3.1.1.3) The next question contains more lines.',
        'line1 words3',
        'line2 words',
        'line3 words',
        '']]
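Since the question mentions moving the result into a dictionary, here is one possible way to do that from a list of blocks shaped like the output above. It is a sketch only: `blocks_to_dict` and `number_re` are names made up for this example, and the sample list is a shortened copy of the output.

```python
import re

# Shortened copy of the block list shown in the output above.
parsed = [
    ['(3.1.1.1) The first question contains several lines.',
     'line1 words1', 'line2 words', 'line3 words', ''],
    ['(3.1.1.2) The next question contains more lines.',
     'line1 words2', 'line2 words', 'line3 words', ''],
]

# Captures the dotted number inside the leading parentheses.
number_re = re.compile(r'^\((\d+(?:\.\d+)+)\)\s*')


def blocks_to_dict(blocks):
    """Map each question number to the text of its block."""
    questions = {}
    for block in blocks:
        match = number_re.match(block[0]) if block else None
        if match is None:
            continue  # skip anything that precedes the first question
        first_line = block[0][match.end():]
        questions[match.group(1)] = '\n'.join([first_line] + block[1:]).strip()
    return questions


questions = blocks_to_dict(parsed)
print(questions['3.1.1.1'])
```

Blocks whose first line does not start with a question number (for instance a preamble before the first question) are skipped rather than raising an error; whether that is the right choice depends on the file.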

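Perhaps it should also try to avoid tripping on parentheses inside a question, e.g.

(3.1.1.1) The first question contains several lines (too many)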
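One way to reduce the risk from parentheses inside a question is to accept a question number only at the start of a line. The sketch below assumes that every question number does begin a line in the real file; the sample `data` is invented for illustration.

```python
import re

# Invented sample: the first question contains extra parentheses,
# including a parenthesized number in the middle of a line.
data = """(3.1.1.1) The first question contains several lines (too many)
and refers to section (2.4) in passing.

(3.1.1.2) The next question contains more lines."""

# With re.MULTILINE, "^" matches at the start of every line, so only
# numbers that begin a line are treated as separators.
parts = re.split(r'^\((\d+(?:\.\d+)+)\)', data, flags=re.MULTILINE)

questions = {
    number: body.strip()
    for number, body in zip(parts[1::2], parts[2::2])
}

for number, body in questions.items():
    print(number, '->', repr(body))
```

A parenthesized number that happens to start a line inside a question body would still be taken as a new question, so this narrows the problem rather than eliminating it.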
