Text pattern recognition in Python

python regex machine-learning

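Suppose you have a set of very noisy texts and want to pick out one defined pattern every time, for example \d{3}(?:\.|\s)\d{3}. The problem is that this pattern can appear in many contexts, such as "443 440$", "923 140€", "923 140 EUR", "product id 001 012", "id product 001 012", or "product 001 012", sometimes within the same text. As we can see, the pattern matches all of these. For example: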

text1 = "Here it is simple because my text includes only one regexp matching which is 443 440 ID"
text2 = "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023"
text3 = "Also, it can be found that the product 001.079 has a profit of 900 000 $USD"
text4 = "It can be analyzed that the 001789 product contains 001 000 components"
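To make the ambiguity concrete, here is a minimal sketch that runs the pattern over the four sample texts. It assumes the pattern is `\d{3}(?:\.|\s)\d{3}`; the names `pattern`, `texts`, and `matches` are only illustrative.

```python
import re

# Assumed form of the pattern from the question: three digits,
# a dot or a whitespace character, then three more digits.
pattern = re.compile(r'\d{3}(?:\.|\s)\d{3}')

texts = [
    "Here it is simple because my text includes only one regexp matching which is 443 440 ID",
    "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023",
    "Also, it can be found that the product 001.079 has a profit of 900 000 $USD",
    "It can be analyzed that the 001789 product contains 001 000 components",
]

# Collect every match per text; amounts and IDs come back mixed together.
matches = [pattern.findall(t) for t in texts]
for m in matches:
    print(m)
```

The second and third texts each return two candidates (an amount and an ID), and in the fourth text the pattern misses 001789 entirely because it has no separator, while still matching the unrelated 001 000.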

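Here, I want to be sure I collect the right thing, the product IDs:

[443 440, 001 023, 001.079, 001789]

How would you handle this?

In the real world, some features might help decide whether a number is actually a product ID: the position of the match in the text (usually near the beginning) and constant discriminating words (EUR, $, ...).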
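One rough way to act on such features is to score each candidate by the words around it. The sketch below is only an illustration under stated assumptions: the cue lists `ID_CUES` and `MONEY_CUES`, the 12-character `window`, and the helper names `score` and `pick_product_id` are all hypothetical, and the candidate regex is loosened to allow a missing separator.

```python
import re

# Candidate numbers: two groups of three digits, optionally separated
# by "." or a whitespace character (assumption: IDs may lack a separator).
CANDIDATE = re.compile(r'\b\d{3}[.\s]?\d{3}\b')

# Hypothetical cue lists; extend them from your own data.
ID_CUES = ('ID', 'product')
MONEY_CUES = ('EUR', 'USD', '$', '€')

def score(text, match, window=12):
    """Score one candidate from the text within `window` characters of it."""
    before = text[max(0, match.start() - window):match.start()]
    after = text[match.end():match.end() + window]
    points = 0
    if any(cue in before or cue in after for cue in ID_CUES):
        points += 1  # an ID-like word sits next to the number
    if any(cue in before or cue in after for cue in MONEY_CUES):
        points -= 1  # a currency marker suggests an amount, not an ID
    return points

def pick_product_id(text):
    """Return the best-scoring candidate in `text`, or None if there is none."""
    candidates = list(CANDIDATE.finditer(text))
    if not candidates:
        return None
    return max(candidates, key=lambda m: score(text, m)).group()

texts = [
    "Here it is simple because my text includes only one regexp matching which is 443 440 ID",
    "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023",
    "Also, it can be found that the product 001.079 has a profit of 900 000 $USD",
    "It can be analyzed that the 001789 product contains 001 000 components",
]
picked = [pick_product_id(t) for t in texts]
print(picked)
```

The fixed character window is fragile (a slightly longer window would also see "product" near 001 000 in the last text), so with enough labeled examples the same context features could instead be fed to a trained classifier.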

You can try the following:

import re

text1 = "Here it is simple because my text includes only one regexp matching which is 443 440 ID"
text2 = "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023"
text3 = "Also, it can be found that the product 001.079 has a profit of 900 000 $USD"
text4 = "It can be analyzed that the 001789 product contains 001 000 components"
s = [text1, text2, text3, text4]
# Grab runs of digits, whitespace, and dots that sit directly before or after "ID" or "product".
pattern = re.compile(r'[\d\s.]+(?=ID)|(?<=ID)\s*[\d\s.]+|[\d\s.]+(?=product)|(?<=product)\s*[\d\s.]+')
final_ids = [pattern.findall(i) for i in s]
# Keep the first candidate per text that contains a digit, stripped of surrounding whitespace.
# Note: [0] raises IndexError for a text with no such candidate.
new_final_ids = [[b.strip() for b in i if re.search(r'\d', b)][0] for i in final_ids]
print(new_final_ids)

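You can also use a regex generator that builds regular expressions from example data. With a large enough training set, it should do the job.

For your four examples, it generated the following:

001[^\d]\d++

Of course, it does not work in every case, but with more examples you would probably get better results.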
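A quick check of that generated expression against the four sample texts might look like the sketch below. One assumption: the possessive `\d++` is written as plain `\d+` here, because Python's `re` module only accepts possessive quantifiers from version 3.11 on, and with nothing following the quantifier both forms match the same strings. The names `generated` and `found` are illustrative.

```python
import re

# Generated expression with "\d++" replaced by "\d+" for portability
# (assumption: equivalent here, since nothing follows the quantifier).
generated = re.compile(r'001[^\d]\d+')

texts = [
    "Here it is simple because my text includes only one regexp matching which is 443 440 ID",
    "But in some other texts, the regexp can be corresponding to a product profit 956.000 EUR for the product ID 001 023",
    "Also, it can be found that the product 001.079 has a profit of 900 000 $USD",
    "It can be analyzed that the 001789 product contains 001 000 components",
]

# One list of matches per text.
found = [generated.findall(t) for t in texts]
for f in found:
    print(f)
```

It recovers 001 023 and 001.079, but finds nothing in the first text and picks 001 000 instead of 001789 in the last one, which illustrates why four examples are too few for the generator.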

For reference, new_final_ids from the lookaround snippet above evaluates to:

['443 440', '001 023', '001.079', '001789']