Python regular expression for fields in arbitrary order

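This is a real situation I ran into: the information for each book needs to be extracted. In the original text, each book's information is separated from the other text by the ENTER key (line breaks).

Every book has a title, but the author, format, and other fields may be omitted. Any fields that are present may be separated by a line break or by whitespace. The hardest part for me is that the fields can appear in any order, so let me give an example: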

title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300

This should be recognized as 3 books. The Python code I would like to write:

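for item in re.findall("title:.{1,}((?=.*author:.{1,})){0,1}((?=.*language:.{1,})){0,1}((?=.*format:.{1,})){0,1}((?=.*pages:.{1,})){0,1}\n", subject, re.IGNORECASE | re.VERBOSE):
    print(item['author'] or 'Unknown author')
    print(item['title'])
    print(item['pages'] or 'Unknown pages')
    print("\n")
# what I expect is
# Mike
# book 1
# 12
# Jack
# book 2
# Unknown pages
# Unknown author
# book 3
# 300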

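Please note two things:

For book 2, the author comes before the title in the text. This is what I mean by arbitrary order.

For book 3, the pages information is not on a new line. Since the labels ("author:", "title:", and so on; sorry, I don't know what to call them in English) never appear inside the other information, it is safe to read this as book 3 with 300 pages, not as a title that contains "pages: 300".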

I have read up on this, experimented, and arrived at the regex above. But as you can see, it is wrong:

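import re

subject = '''
title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300
'''
result = re.findall("title:.{1,}((?=.*author:.{1,})){0,1}((?=.*language:.{1,})){0,1}((?=.*format:.{1,})){0,1}((?=.*pages:.{1,})){0,1}\n", subject, re.IGNORECASE | re.VERBOSE)
for i in result:
    print(i)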

which produces

('', '', '', '')
('', '', '', '')
('', '', '', '')

Any help? Thanks.
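One likely reason for the empty tuples can be sketched on a single line of the sample. Every capture group in the attempted pattern wraps only a lookahead, and a lookahead consumes no characters, so such a group has nothing to capture. The shortened patterns and the names line, lookahead_only and captured below are illustrative assumptions, not the exact code from the question.

```python
import re

line = "title: book 3 pages: 300\n"

# A group that wraps only a lookahead is zero-width, so it can only ever
# yield the empty string, whether or not the lookahead succeeds.
lookahead_only = re.findall(r"title:.+((?=.*pages:.+)){0,1}", line)
print(lookahead_only)  # ['']

# Capturing the value itself needs a group that consumes characters.
captured = re.findall(r"pages:\s*(\d+)", line)
print(captured)  # ['300']
```

In the first pattern, "title:.+" has already consumed the rest of the line by the time the lookahead is tried, so there is no "pages:" left ahead of it either.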

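If you don't have to use a regex, you can check whether ': ' appears within the first, I don't know, 10 or so characters of a line. If it does, assume the line is a property of a book. If it doesn't, the properties of the current book have ended, so you now have all of that book's properties. Then add them to some kind of "final" list of books.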

Your data as a string:

subject = '''
title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300
'''

Some code:

from copy import copy

books = []
book_properties = []

lines = subject.splitlines()

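for i, line in enumerate(lines, start=1):
    # A label such as "title: " near the start marks a book property line.
    if ": " in line[:10]:
        book_properties.append(line)
        # Flush the last book if the text ends on a property line.
        if i == len(lines):
            book = copy(book_properties)
            books.append(book)
    else:
        # Any other line ends the current book, if one is in progress.
        if len(book_properties) > 0:
            book = copy(book_properties)
            books.append(book)
            book_properties.clear()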

print(books)

Result:

[['title: book 1', 'author: Mike', 'Language: Eng', 'format: pdf', 'pages: 12'],
 ['author: Jack', 'title: book 2'],
 ['title: book 3 pages: 300']]
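The grouped lines still hold raw text, and the last book has two fields on one line. A minimal sketch of one way to turn each group into a dictionary is to split every line on the labels with re.split and a capturing group. The names LABEL and to_record are hypothetical, and the sketch assumes, as the question states, that the labels never occur inside a value.

```python
import re

# Result of the grouping step above, copied here so the sketch runs on its own.
books = [
    ['title: book 1', 'author: Mike', 'Language: Eng', 'format: pdf', 'pages: 12'],
    ['author: Jack', 'title: book 2'],
    ['title: book 3 pages: 300'],
]

# The capturing group makes re.split keep the matched label in its output.
LABEL = re.compile(r"\s*(title|author|language|format|pages):\s*", re.IGNORECASE)

def to_record(lines):
    """Build one dict per book; fields that are missing stay absent."""
    record = {}
    for line in lines:
        parts = LABEL.split(line)
        # parts looks like ['', 'title', 'book 3', 'pages', '300']:
        # labels sit at odd positions, their values right after them.
        for label, value in zip(parts[1::2], parts[2::2]):
            record[label.lower()] = value.strip()
    return record

records = [to_record(lines) for lines in books]
for rec in records:
    print(rec.get("author", "Unknown author"))
    print(rec["title"])
    print(rec.get("pages", "Unknown pages"))
```

Using dict.get with a default covers the omitted fields, so the order in which the labels appear no longer matters.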

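This is a somewhat complicated hybrid solution: I use a regex, but not only a regex. I split the text into blocks and apply the regex to each line of a block.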

import re

text="""

title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300

title: Adventures of Huckleberry Finn author: Mark Twain pages: 500

title: Captain Python

"""

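recs=[[]]  # one list of (label, value) pairs per block
last=recs[-1]
for line in text.splitlines():

    line=line.strip()
    if not line:
        # A blank line closes the current block, unless it is still empty.
        if not last:
            continue
        recs.append([])
        last=recs[-1]
        continue

    # Each label captures the text up to the next label or the end of the line.
    founds= re.findall(r"(?m)(title|author|pages):(.*?)(?:$|(?=title:|author:|pages:))",line)
    if founds and founds[0]:
        last.extend(founds)


for l in recs:
    if l:
        # Start from defaults so that missing fields print as "unknown".
        d={"title":"unknown", "author":"unknown", "pages":"unknown"}
        d.update( dict(l) )
        print(d)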

Output:

{'title': ' book 1', 'author': ' Mike', 'pages': ' 12'}
{'title': ' book 2', 'author': ' Jack', 'pages': 'unknown'}
{'title': ' book 3 ', 'author': 'unknown', 'pages': ' 300'}
{'title': ' Adventures of Huckleberry Finn ', 'author': ' Mark Twain ', 'pages': ' 500'}
{'title': ' Captain Python', 'author': 'unknown', 'pages': 'unknown'}
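The values in this output keep the spaces that surround them in the text. A small variation, offered only as a sketch, moves the whitespace outside the capture group so the values come out trimmed. The name FIELD is hypothetical and the pattern is otherwise the one used above.

```python
import re

# Same idea as the pattern above, with \s* placed outside the capture group
# so that surrounding spaces are not part of the captured value.
FIELD = re.compile(r"(title|author|pages):\s*(.*?)\s*(?:$|(?=title:|author:|pages:))")

line = "title: Adventures of Huckleberry Finn author: Mark Twain pages: 500"
record = dict(FIELD.findall(line))
print(record)
# {'title': 'Adventures of Huckleberry Finn', 'author': 'Mark Twain', 'pages': '500'}
```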

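Which book do "some other text" and "some other text 2" belong to? As stated, this problem cannot be solved. You must have a clear delimiter between one book's details and the next.

@Jarad Just skip "some other text" and "some other text 2", because neither contains a "title: xxx".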
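The rule from this exchange can be sketched directly: treat blank lines as the delimiter between blocks, then keep only the blocks that contain a title label. The names blocks and book_blocks are hypothetical, and the sketch assumes the blank-line layout shown in the question's sample.

```python
import re

text = """
title: book 1
author: Mike
Language: Eng
format: pdf
pages: 12

some other text

author: Jack
title: book 2

some other text 2

title: book 3 pages: 300
"""

# One or more blank lines separate the blocks.
blocks = re.split(r"\n\s*\n", text.strip())

# Every book has a title, so a block without "title:" is not a book.
book_blocks = [b for b in blocks if re.search(r"\btitle:", b, re.IGNORECASE)]

for block in book_blocks:
    print(repr(block))
```

With the sample text this keeps three of the five blocks; the two free-text blocks are dropped because they contain no title.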
