Python Regex - at line with >20 chars

Tags: python, regex, text-extraction

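I have a letter from which I need to extract a certain part. The beginning and end are marked by clear opening/closing expressions (letter_beg / letter_end). My problem is that the "recording" of the text needs to end just before the first line after the letter_end match that has more than 20 characters. In my code, it instead stops after two further lines. Here is my sample text and my code so far: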

import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line"""

# All expressions for the "beginning" of a Letter to the Shareholders (LttS)
letter_begin = ["dear", "to our", "fellow investors"]
openings = "|".join(letter_begin)
# All expressions for the "ending" of a Letter to the Shareholders (LttS)
letter_end = ["sincerely", "best regards", "cordially,"]
closings = "|".join(letter_end)
# After the closing expression, this always takes up to two more lines,
# regardless of their length.
regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}"
# Record all text between the beginning and end expressions
output = re.findall(regex, sample_text, re.IGNORECASE)
print(output)

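I'm not entirely sure what output you expect, but this is quite simple to do without a regular expression (which gives you one problem fewer to solve).

The solution below assumes that sample_text contains \n (newline characters). If sample_text is one long line (i.e. it has no \n at all), the solution will not work.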
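A minimal sketch of such a line-based approach, assuming case-insensitive substring matching on each line. The function name `extract_letter` and the `min_len` parameter are illustrative assumptions, not taken from the original answer, and so is the choice to keep everything up to the end of the text when no long line follows the closing.

```python
def extract_letter(text, begin_markers, end_markers, min_len=20):
    """Return the first letter found in text, or "" if none is found."""
    lines = text.strip().split("\n")
    start = end = None
    for index, line in enumerate(lines):
        lowered = line.lower()
        if start is None:
            # Still looking for the opening expression.
            if any(beg in lowered for beg in begin_markers):
                start = index
        elif any(closing in lowered for closing in end_markers):
            end = index
            break
    if start is None or end is None:
        return ""
    # Keep short trailing lines (e.g. a signature) and stop just before
    # the first line that is longer than min_len characters.
    for offset, line in enumerate(lines[end + 1:]):
        if len(line.strip()) > min_len:
            end += offset
            break
    else:
        # Assumption: with no long line after the closing, keep the rest.
        end = len(lines) - 1
    return "\n".join(lines[start:end + 1])


sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

result = extract_letter(sample_text, letter_begin, letter_end)
print(result)
```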

The output is

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

Edit

Based on your last suggestion, I can think of two approaches. Hopefully one of them solves your problem.

Option 1

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

lines = sample_text.strip().split("\n")

target_start_indexes = []
target_end_indexes = []

for index, line in enumerate(lines):
    line = line.lower()

    if any(beg in line for beg in letter_begin):
        target_start_indexes.append(index)
        continue

    if any(end in line for end in letter_end):
        target_end_indexes.append(index)
        continue

# Extend each end index up to (but not including) the first later line
# that is longer than 20 characters.
for target_index, target_end_idx in enumerate(target_end_indexes):
    for line_index, line in enumerate(lines[target_end_idx + 1 :]):
        if len(line) > 20:
            target_end_indexes[target_index] = target_end_idx + line_index
            break


target = []
if target_start_indexes and target_end_indexes:
    for target_start_idx, target_end_idx in zip(
        target_start_indexes, target_end_indexes
    ):
        target.append("\n".join(lines[target_start_idx : target_end_idx + 1]))

    print("\n".join(target))

Output

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

Option 2
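A minimal sketch of this option, assuming it spans from the first line containing a `letter_begin` expression to the last line containing a `letter_end` expression and then keeps any short trailing lines. The function name `extract_first_to_last` and the `min_len` parameter are illustrative assumptions, not taken from the original answer.

```python
def extract_first_to_last(text, begin_markers, end_markers, min_len=20):
    """Return the text from the first opening to the last closing, or ""."""
    lines = text.strip().split("\n")
    lowered = [line.lower() for line in lines]
    starts = [i for i, line in enumerate(lowered)
              if any(beg in line for beg in begin_markers)]
    ends = [i for i, line in enumerate(lowered)
            if any(closing in line for closing in end_markers)]
    if not starts or not ends or ends[-1] < starts[0]:
        return ""
    start, end = starts[0], ends[-1]
    # Keep short trailing lines and stop just before the first line
    # that is longer than min_len characters.
    for offset, line in enumerate(lines[end + 1:]):
        if len(line) > min_len:
            end += offset
            break
    else:
        # Assumption: with no long line after the closing, keep the rest.
        end = len(lines) - 1
    return "\n".join(lines[start:end + 1])


sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

result = extract_first_to_last(sample_text, letter_begin, letter_end)
print(result)
```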

Output

Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director

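If you insist on a single monolithic regex, add a lookahead at the end for a line containing more than 20 characters:

(?=[^\n]{21,})

You may also need to add the flags:

re.IGNORECASE | re.DOTALL

Thank you very much for your help, Dusan! My problem is that letter_begin or letter_end do not necessarily have to be at the start of a line; they can be anywhere within it. Is there a way to check line.contains(beg/end) instead of line.startswith(beg/end)?

Sure, just use
if any(beg in line for beg in letter_begin)
and
if any(end in line for end in letter_end)

Hi Dusan, great solution! One thing I noticed while working with it: is there a way to fix the start/end indexes you set to the first/last occurrence of letter_begin/letter_end in the text (e.g. consider a letter in which more than one of the expressions defined in the lists occurs)? In that case I would "record" the text from the first occurrence of a letter_begin item up to the last occurrence of a letter_end item.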
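The single-regex suggestion (a lookahead for a line of 21 or more characters, combined with re.IGNORECASE | re.DOTALL) can be sketched as follows. Anchoring the lookahead on a preceding `\n`, adding the `\Z` alternative for text that ends without a long line, and escaping the expressions with `re.escape` are assumptions added here, not part of the original suggestion.

```python
import re

sample_text = """Some random text right here
.........
Dear Shareholders: We are pleased to provide you with this semiannual report for the fund.
Best regards
Douglas - Director
Other random text with more than 20 chars in this line"""

letter_begin = ["dear", "to our", "fellow investors"]
letter_end = ["sincerely", "best regards", "cordially,"]

# Escape the expressions so that any regex metacharacters are taken literally.
openings = "|".join(re.escape(beg) for beg in letter_begin)
closings = "|".join(re.escape(end) for end in letter_end)

# Lazily match from an opening to a closing, then keep going until the
# next line has at least 21 characters (or the text ends).
pattern = rf"(?:{openings}).*?(?:{closings}).*?(?=\n[^\n]{{21,}}|\Z)"

matches = re.findall(pattern, sample_text, re.IGNORECASE | re.DOTALL)
for match in matches:
    print(match)
```

Because `re.DOTALL` lets `.` match newlines, the lazy `.*?` after the closing expression grows one character at a time until the lookahead finds a sufficiently long next line, so short signature lines are kept and the long line is excluded.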