Python: How to search a file for all text beginning with a number in a specific format and move it to a new line


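I am searching a flat text file version of the KJV Bible for a word or group of words, to get a match that returns the line, book, chapter and verse where the word was found. My problem is that I had to manually find the line number where each book starts and put those numbers in a dictionary, but at the time I did not account for the file having messy lines such as: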

1:16 And God made two great lights; the greater light to rule the day,
and the lesser light to rule the night: he made the stars also.

1:17 And God set them in the firmament of the heaven to give light
upon the earth, 1:18 And to rule over the day and over the night, and
to divide the light from the darkness: and God saw that it was good.
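So if I search for God, a hit on the line after 1:16 is listed as chapter 1, verse 16, and the same goes for 1:17... but the line where 1:18 begins would be listed as chapter 1, verse 17.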

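I need to figure out how to find every reference like 1:18 that sits in the middle of a line and move it to a new line. Obviously the line numbers in the first_lines dictionary in the code below will change, but that is secondary (I will simply go back to the text file and look up the starting line numbers by hand). I would really appreciate any help. The Bible text can be found here: Also, the code is below: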

import os
import sys
import re


print "%30s %-3s %s %4s\n" % ("","King", "James", "Bible")
word_search = raw_input(r'Enter a word to search: ')
book = open("KJV.txt", "rb")
first_lines = {36: 'Genesis', 4812: 'Exodus', 8867: 'Leviticus', 11749: 'Numbers', 15718: 'Deuteronomy',
           18909: 'Joshua', 21070: 'Judges', 23340: 'Ruth', 23651: 'I Samuel', 26641: 'II Samuel',
           29094: 'I Kings', 31990: 'II Kings', 34706: 'I Chronicles', 37378: 'II Chronicles',
           40502: 'Ezra', 41418: 'Nehemiah', 42710: 'Esther', 43352: 'Job', 45937: 'Psalms', 53537: 'Proverbs',
           56015: 'Ecclesiastes', 56711: 'Song of Solomon', 57076: 'Isaiah', 61550: 'Jeremiah',
           66480: 'Lamentations', 66961: 'Ezekiel', 71548: 'Daniel', 72933: 'Hosea', 73620: 'Joel',
           73874: 'Amos', 74359: 'Obadiah', 74441: 'Jonah', 74604: 'Micah', 74985: 'Nahum', 75160: 'Habakkuk',
           75348: 'Zephaniah',75550: 'Haggai', 75676: 'Zechariah', 76428: 'Malachi', 76646: 'Matthew',
           79708: 'Mark', 81680: "Luke", 85006: 'John', 87543: 'Acts', 90654: 'Romans', 91851: 'I Corinthians',
           93065: 'II Corinthians', 93830: 'Galatians', 94257: 'Ephesians', 94612: 'Philippians', 94896: 'Colossians',
           95145: 'I Thessalonians', 95390: 'II Thessalonians', 95515: 'I Timothy', 95833: 'II Timothy',
           96063: 'Titus', 96183: 'Philemon', 96243: 'Hebrews', 97113: 'James', 97430: 'I Peter', 97719: 'II Peter',
           97906: 'I John', 98249: 'II John', 98295: 'III John', 98340: 'Jude', 98427: 'Revelation'}

for ln, line in enumerate(book):
     match = re.match(r'(\d+):(\d+)', line)

     if match:
          chapter = match.group(1)
          verse = match.group(2)

     if word_search in line: 
          first_line = max(l for l in first_lines if l < ln)
          bibook = first_lines[first_line]

          template = "\nLine: {0}\nString: {1}\nBook: {2}\nChapter: {3}\nVerse: {4}\n"
          output = template.format(ln, line, bibook, chapter, verse)
          print output
Try changing your regular expression to:

^(\d+):(\d+)


The ^ should anchor the match to the beginning of the text.
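To see what the anchored pattern does and does not catch, here is a minimal sketch in Python 3 (the thread's own code is Python 2), run on two lines from the question's excerpt. Note that `re.match` already matches only at the start of the string, so the `^` makes the intent explicit rather than changing the result. The names `lines`, `pattern` and `found` are only illustrative.

```python
import re

# Two sample lines taken from the excerpt in the question.
lines = [
    "1:17 And God set them in the firmament of the heaven to give light",
    "upon the earth, 1:18 And to rule over the day and over the night, and",
]

pattern = re.compile(r'^(\d+):(\d+)')

# Collect the (chapter, verse) pair found at the start of each line, if any.
found = [m.groups() if m else None for m in map(pattern.match, lines)]
print(found)  # [('1', '17'), None]
```

The reference 1:18 in the middle of the second line is not picked up, which is the case the question is really about.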

Here is a regular expression that (I think!) matches the chapter:verse headings that do not start a line:

r'[^\n\d](\d+:\d+)'
If you want them grouped, as in your code:

r'[^\n\d](\d+):(\d+)'
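I reformatted the Project Gutenberg text with the code below. This still leaves some awkward line breaks, though, and it does not give one verse per line.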

>>> import re
>>> with open('pg10.txt', 'r') as kjb_file:
...     kjb_text = kjb_file.read()
... 
>>> kjb_text = re.sub(r'[^\n\d](\d+:\d+)', r'\r\n\r\n\g<1>', kjb_text)
>>> with open('kjb_new.txt', 'w') as kjb_new:
...     kjb_new.write(kjb_text)
... 
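As a small check of that substitution, the sketch below applies the same pattern to three lines from the question's excerpt instead of the whole file. It is written for Python 3 and, as a simplification that is my assumption rather than part of the answer, uses `\n\n` instead of `\r\n\r\n` in the replacement; `sample` and `moved` are illustrative names.

```python
import re

# Three lines from the question's excerpt, with 1:18 buried mid-line.
sample = (
    "1:17 And God set them in the firmament of the heaven to give light\n"
    "upon the earth, 1:18 And to rule over the day and over the night, and\n"
    "to divide the light from the darkness: and God saw that it was good.\n"
)

# [^\n\d] consumes the single character before the reference (here a space),
# so a reference that already starts a line or the file is left alone.
moved = re.sub(r'[^\n\d](\d+:\d+)', r'\n\n\g<1>', sample)
print(moved)
```

After the substitution 1:18 starts its own line, while the verse text itself is still wrapped across lines, which matches the caveat about awkward line breaks.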

Let's look at the snippet around 9:3:

stand before the children of Anak! 9:3 Understand therefore this day,

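If you search for "children of Anak", the code you posted (assuming the regex could be fixed) would return 9:3, even though it should be 9:2. So we need to rethink how to approach this problem.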
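The sketch below reproduces that failure mode in Python 3. It assumes a "fixed" line-by-line scanner that picks up references anywhere in a line; `verse_by_line` is a hypothetical helper, and the first sample line is a placeholder standing in for the earlier part of verse 9:2.

```python
import re

# Placeholder for the start of 9:2, followed by the line from the snippet.
lines = [
    "9:2 (earlier part of the verse)",
    "stand before the children of Anak! 9:3 Understand therefore this day,",
]

def verse_by_line(lines, phrase):
    """Label each hit with the last reference seen up to and including its line."""
    current = None
    hits = []
    for line in lines:
        refs = re.findall(r'\d+:\d+', line)
        if refs:
            current = refs[-1]  # the line is treated as one unit
        if phrase in line:
            hits.append(current)
    return hits

print(verse_by_line(lines, "children of Anak"))  # ['9:3']
```

Because the whole line is the unit of search, the phrase is credited to 9:3 even though it appears before that reference and belongs to 9:2.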

I suggest:

contents=book.read()
re.split(r'(\d+:\d+)',contents)
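This splits the entire text on the chapter:verse numbers. Because the pattern is wrapped in a capturing group, the numbers themselves are kept in the resulting list.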
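For illustration, here is what the split produces on two lines from the question's excerpt. This is a Python 3 sketch; `sample`, `parts` and `pairs` are names chosen for the example only.

```python
import re

sample = (
    "1:17 And God set them in the firmament of the heaven to give light\n"
    "upon the earth, 1:18 And to rule over the day and over the night, and\n"
)

# The capturing group keeps each reference as its own list item, so after
# the leading empty string the list alternates reference, text, reference, ...
parts = re.split(r'(\d+:\d+)', sample)

# Pair each reference with the text that follows it and collapse the
# line breaks inside the verse text into single spaces.
pairs = [(ref, ' '.join(text.split()))
         for ref, text in zip(parts[1::2], parts[2::2])]

for ref, text in pairs:
    print(ref, text)
```

Each verse now carries its own reference, no matter where the line breaks fell in the file.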

import re
import itertools
import textwrap

if __name__=='__main__':
    print "{0:^78}".format("King James Bible")

    books=iter(['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy', 'Joshua',
           'Judges', 'Ruth', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings',
           'I Chronicles', 'II Chronicles', 'Ezra', 'Nehemiah', 'Esther', 'Job', 'Psalms',
           'Proverbs', 'Ecclesiastes', 'Song of Solomon', 'Isaiah', 'Jeremiah',
           'Lamentations', 'Ezekiel', 'Daniel', 'Hosea', 'Joel', 'Amos', 'Obadiah',
           'Jonah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah',
           'Malachi', 'Matthew', 'Mark', 'Luke', 'John', 'Acts', 'Romans', 'I Corinthians',
           'II Corinthians', 'Galatians', 'Ephesians', 'Philippians',
           'Colossians', 'I Thessalonians', 'II Thessalonians', 'I Timothy', 'II Timothy',
           'Titus', 'Philemon', 'Hebrews', 'James', 'I Peter', 'II Peter', 'I John',
           'II John', 'III John', 'Jude', 'Revelation'])

    with open("KJV.txt", "rb") as book:
        contents=book.read()
        data=re.split(r'(\d+:\d+)',contents)[1:]    
        del contents

    word_search = raw_input(r'Enter a word to search: ')

    for chapter_verse, line in itertools.izip(*[iter(data)]*2):
        if chapter_verse=='1:1':
            book=next(books)
        line=' '.join(line.split())
        if word_search in line:
            line=textwrap.fill(line,width=78)
            print('''\
{b} {c}
{l}
'''.format(b=book,c=chapter_verse,l=line))
Running test.py with the search phrase "consuming fire" yields

% test.py 
                               King James Bible                               
Enter a word to search: consuming fire
Deuteronomy 4:24
For the LORD thy God is a consuming fire, even a jealous God.

Deuteronomy 9:3
Understand therefore this day, that the LORD thy God is he which goeth over
before thee; as a consuming fire he shall destroy them, and he shall bring
them down before thy face: so shalt thou drive them out, and destroy them
quickly, as the LORD hath said unto thee.

Hebrews 12:29
For our God is a consuming fire.

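Note: hard-coding the first line number of each book is fragile; don't use them. (What happens if someone decides to strip the header text that comes with the Gutenberg file, or accidentally inserts a few blank lines somewhere?)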


All you really need is the order of the books, since every new book starts at chapter:verse 1:1.
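A compact Python 3 sketch of that idea follows. `search_verses` is a hypothetical helper, and the miniature `text` and the names 'Book A' and 'Book B' are made up for the example; they are not the real file or book list.

```python
import re

def search_verses(text, book_names, phrase):
    """Yield (book, reference, verse_text) for each verse containing phrase."""
    # Drop everything before the first reference (e.g. a file header).
    parts = re.split(r'(\d+:\d+)', text)[1:]
    books = iter(book_names)
    book = None
    for ref, body in zip(parts[0::2], parts[1::2]):
        if ref == '1:1':
            # A new book starts; None if there are more books than names.
            book = next(books, None)
        verse = ' '.join(body.split())
        if phrase in verse:
            yield book, ref, verse

# Made-up miniature text: two "books", with verses wrapped across lines.
text = (
    "some header\n"
    "1:1 first verse of the first book 1:2 second\n"
    "verse 1:1 first verse of the\nsecond book\n"
)

results = list(search_verses(text, ['Book A', 'Book B'], 'first verse'))
for hit in results:
    print(hit)
```

No line numbers appear anywhere, so removing the header or adding blank lines does not change which book a verse is credited to.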
