Python: How can I make this regular expression more inclusive and accurate?


python regex


I am using Python 2.7 to find text within a larger block of text. The following is part of the text I am extracting from:

Item 1 for Product A: Flour
Solution 1 for Product A: Water
Items 2 for Product B: Milk
Solution 2 for Product B: Oil
Item 3 for Product C: Onions

Method

I have the following Python code to extract the specific pieces of information I want:

extract = re.findall(r"(?<=Item|s\s).*(?=\sSolution)", page_content)

Any help improving the regular expression would be greatly appreciated.

If your input looks like

Item 1 for Product A: FlourSolution 1 for Product A: WaterItems 2 for Product B: MilkSolution 2 for Product B: OilItem 3 for Product C: Onions

Method

the following pattern gives you the desired output:

r'(Item[s]{0,1}.*?\:\s[A-Z][a-z]*[^A-Z])'
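As a rough illustration of how this pattern behaves, here is a minimal sketch that runs it against the concatenated sample. The variable name `page_content` is borrowed from the question, and the `strip()` call is an addition that is not part of the original pattern: the final `[^A-Z]` also matches a newline, so a match that ends just before a line break picks up trailing whitespace.

```python
import re

# Sample text as it comes out of the PDF conversion:
# the line breaks between entries are lost.
page_content = (
    "Item 1 for Product A: FlourSolution 1 for Product A: Water"
    "Items 2 for Product B: MilkSolution 2 for Product B: Oil"
    "Item 3 for Product C: Onions\n\nMethod"
)

pattern = r'(Item[s]{0,1}.*?\:\s[A-Z][a-z]*[^A-Z])'

# The closing [^A-Z] may consume a newline, so strip each match
# (this cleanup step is an assumption, not part of the pattern).
extract = [m.strip() for m in re.findall(pattern, page_content)]
print(extract)
```

On this sample the list contains the three `Item`/`Items` entries and none of the `Solution` entries. Note that the pattern assumes each value is a single capitalized word, so values such as `Olive Oil` or `flour` would be cut short or missed.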


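Does every line always end with a newline?

Not necessarily. In fact, when I use a Python library to convert the PDF to text, the output runs the lines together, so the raw text is actually: Item 1 for Product A: FlourSolution 1 for Product A: Water, and so on.

Thanks for your reply. I have adapted your expression for another query, see link: but I think it is a bit hit or miss. Any suggestions?

@qbbq The pattern above does not work on your example because it relies on an uppercase letter as the hint for where a match ends. In the linked case, there can be uppercase letters between the ":" and the lawyer term. I have not found a way to distinguish them properly, but a quick and dirty way is to grab the whole phrase and then manually substitute the lawyer term with an empty string in Python.

Ahh, that is a good way of doing it. Thank you so much, you have helped me move in the right direction.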
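The quick and dirty approach from the comments could look roughly like the sketch below. The linked text is not available here, so this uses the sample from the question and treats `Solution` as a stand-in for the terminating keyword; both the keyword and the variable names are assumptions for illustration only.

```python
import re

text = (
    "Item 1 for Product A: FlourSolution 1 for Product A: Water"
    "Items 2 for Product B: MilkSolution 2 for Product B: Oil"
    "Item 3 for Product C: Onions"
)

# Step 1: grab everything from "Item"/"Items" up to and including the
# next terminating keyword, or up to the end of the text if none follows.
raw = re.findall(r"Items?\s.*?(?:Solution|$)", text, flags=re.DOTALL)

# Step 2: remove the trailing keyword by substituting an empty string.
items = [re.sub(r"Solution$", "", chunk).strip() for chunk in raw]
print(items)
```

Because the match ends at a known keyword rather than at an uppercase letter, values that contain capitals or several words are kept whole. The trade-off is that the keyword has to be known in advance.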
