Python regex: stop at the first line with >20 chars

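I have a letter from which I need to extract a certain part. The beginning and end are marked by clear opening/closing expressions (letter_beg/letter_end). My problem is that the "recording" of the text needs to end right before the first line that follows the letter_end match and has more than 20 characters. In my code, it ends after two new lines instead. Here is my sample text and the code I have so far: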

import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line    """

letter_begin = ["dear", "to our", "fellow investors"] # All expressions for "beginning" of Letter to the Shareholders (LttS)
openings = "|".join(letter_begin)
letter_end = ["sincerely", "best regards", "cordially,"] # All expressions for "ending" of Letter to the Shareholders (LttS)
closings = "|".join(letter_end)
# After the closing expression, (?:\n.*){0,2} always takes up to two more lines, regardless of their length
regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}"
output = re.findall(regex, sample_text, re.IGNORECASE) # record all text between Regex (beginning and end expressions)
print(output)

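I'm not entirely sure what your expected output is, but this is very simple to achieve without regular expressions (that way you only have one problem to solve).

The solution below assumes that sample_text contains \n (newline characters). If sample_text is one long line (i.e. without any \n), the solution will not work.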
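A minimal sketch of a line-based approach of this kind. The helper name extract_letter, the min_len parameter, and the use of str.startswith for prefix matching are assumptions made for illustration, not code confirmed by this thread:

```python
sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]


def extract_letter(text, begins, ends, min_len=21):
    """Return the first letter found in text, or "" if none is found.

    Hypothetical helper: a line starts the letter if it begins with an
    opening expression, and the letter stops right before the first line
    of at least min_len characters that follows the closing expression.
    """
    lines = text.split("\n")
    begins, ends = tuple(begins), tuple(ends)
    start = end = None
    for index, line in enumerate(lines):
        stripped = line.strip()
        lowered = stripped.lower()
        if start is None:
            # Still looking for the opening expression.
            if lowered.startswith(begins):
                start = index
        elif end is None:
            # Inside the letter, looking for the closing expression.
            if lowered.startswith(ends):
                end = index
        elif len(stripped) >= min_len:
            # First long line after the closing: stop before it.
            break
        else:
            # Short line after the closing (e.g. a signature): keep it.
            end = index
    if start is None or end is None:
        return ""
    return "\n".join(line.rstrip() for line in lines[start : end + 1])


print(extract_letter(sample_text, letter_begin, letter_end))
```

Because str.startswith only matches at the beginning of a line, this sketch misses expressions that appear mid-line.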

The output is

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

Edit

Based on your last comment, I can think of two approaches. Hopefully one of them solves your problem.

Option 1

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

lines = sample_text.strip().split("\n")

target_start_indexes = []
target_end_indexes = []

# Record the index of every line that contains an opening or a closing expression
for index, line in enumerate(lines):
    line = line.lower()

    if any(beg in line for beg in letter_begin):
        target_start_indexes.append(index)
        continue

    if any(end in line for end in letter_end):
        target_end_indexes.append(index)
        continue

# Move each end index forward to the last line before the first line
# with more than 20 characters that follows the closing expression
for target_index, target_end_idx in enumerate(target_end_indexes):
    for line_index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) > 20:
            target_end_idx += line_index
            target_end_indexes[target_index] = target_end_idx
            break


# Pair each start with its end and join the lines of each letter
target = []
if target_start_indexes and target_end_indexes:
    for target_start_idx, target_end_idx in zip(
        target_start_indexes, target_end_indexes
    ):
        target.append("\n".join(lines[target_start_idx : target_end_idx + 1]))

    print("\n".join(target))

Output

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

Option 2
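A minimal sketch of what this second approach could look like, assuming it records from the first line containing an opening expression to the last line containing a closing expression, and then keeps any short lines that follow. The helper name extract_outer_letter and the min_len parameter are hypothetical:

```python
sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]


def extract_outer_letter(text, begins, ends, min_len=21):
    """Return the text from the first opening to the last closing expression.

    Hypothetical helper: lines shorter than min_len that directly follow
    the last closing line (e.g. a signature) are kept as well.
    """
    lines = text.strip().split("\n")
    lowered = [line.lower() for line in lines]

    # Indexes of all lines containing an opening / closing expression.
    starts = [i for i, line in enumerate(lowered) if any(b in line for b in begins)]
    stops = [i for i, line in enumerate(lowered) if any(e in line for e in ends)]
    if not starts or not stops or stops[-1] < starts[0]:
        return ""

    start, end = starts[0], stops[-1]
    # Extend the end over trailing short lines, stopping before a long one.
    while end + 1 < len(lines) and len(lines[end + 1]) < min_len:
        end += 1
    return "\n".join(lines[start : end + 1])


print(extract_outer_letter(sample_text, letter_begin, letter_end))
```

Unlike Option 1, this sketch returns a single block, so any text between two letters ends up in the result.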

Output

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

If you insist on a single monolithic regex, add a lookahead for a line with more than 20 characters as the ending:

(?=[^\n]{21,})
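One way such a lookahead might be combined with the pattern from the question, as a sketch rather than a definitive regex. Three details are assumptions added here: the leading \n inside the lookahead (so the match ends at the end of the previous line), the \Z alternative (so a letter at the very end of the text still matches), and the use of re.escape on the expressions:

```python
import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

# Escape the expressions so characters such as "," or "." are taken literally.
openings = "|".join(re.escape(expr) for expr in letter_begin)
closings = "|".join(re.escape(expr) for expr in letter_end)

pattern = re.compile(
    # Opening expression, then lazily everything up to a closing expression.
    r"(?:" + openings + r")[\s\S]*?(?:" + closings + r")"
    # Keep consuming lazily until the next line has 21+ characters
    # (or the text ends); the lookahead itself consumes nothing.
    r"[\s\S]*?(?=\n[^\n]{21,}|\Z)",
    re.IGNORECASE,
)

matches = pattern.findall(sample_text)
print(matches)
```

Since [\s\S] already matches newlines, this sketch does not depend on re.DOTALL. Note that the expressions match anywhere in the text, including inside longer words.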

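You may also need to add the flags:

re.IGNORECASE | re.DOTALL

Thank you very much for your help, Dušan! My problem is that letter_begin or letter_end do not necessarily have to be at the beginning of a line; they can be somewhere within the line. Is there a way to check line.contains(beg/end) instead of line.startswith(beg/end)?

Sure, just use

if any(beg in line for beg in letter_begin)

and

if any(end in line for end in letter_end)

Hi Dušan, great solution! One thing I noticed while working with it: is there a way to fix the start/end indexes you set to the first/last occurrence of letter_begin/letter_end in the text (e.g. consider a letter in which more than one of the expressions defined in the lists occurs)? In that case I would "record" the text from the first occurrence of a letter_begin item until the last occurrence of a letter_end item.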