Python: assigning the same category to similar messages

python regex

I have a simple classification program:

import re


class Categorizer:

    '''
    Classifies messages according to a pre-defined list of regex-es
    If no match is found, an automatic algorithm is used to decide the category
    '''

    def __init__(self, categories, message):
        self._categories = categories
        self._message = message

    def _auto_detect_category(self):
        # TODO: auto-detect category based on message (message distance / bayes classifier / ...)
        # Until then, fall back to using the message itself as its own category.
        return self._message

    @property
    def category(self):
        '''Returns the first matching category, or an automatically generated one if no match found'''
        for category, regex in self._categories.items():
            if regex.search(self._message):
                return category
        return self._auto_detect_category()


CATEGORIES = {
    "aaa": "aaa.*AAA",
    "bbb": "bbb.*BBB",
}


# Pre-compile each pattern once so it can be reused for every message.
categories = {category: re.compile(regex) for category, regex in CATEGORIES.items()}

MESSAGES = [
    "aaa 12345 AAA",
    "aaa 66666 AAA",
    "bbb 12345 BBB",
    "bbb 66666 BBB",
    "ccc 12345 CCC",
    "ccc 66666 CCC",
]


for message in MESSAGES:
    print("{} -> {}".format(Categorizer(categories, message).category, message))

This gives me:

aaa -> aaa 12345 AAA
aaa -> aaa 66666 AAA
bbb -> bbb 12345 BBB
bbb -> bbb 66666 BBB
ccc 12345 CCC -> ccc 12345 CCC
ccc 66666 CCC -> ccc 66666 CCC

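My goal is that messages matching no pre-configured pattern can still be classified, so that similar messages are assigned to the same category. But I don't know how to define "similar", or what implementation to use so that Categorizer handles unknown messages well.

The messages are log entries. They contain general information, but also some specific data that is irrelevant to choosing the category.

Basically, I would be happy with output like this: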

aaa -> aaa 12345 AAA
aaa -> aaa 66666 AAA
bbb -> bbb 12345 BBB
bbb -> bbb 66666 BBB
auto1 -> ccc 12345 CCC
auto1 -> ccc 66666 CCC

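That is, the last two messages would be automatically classified as auto1.

Judging from your comments, I don't think regular expressions are the tool you want.

Have you tried the fuzzywuzzy library? It lets you measure how "close" two strings are. It is not perfect, but in your case I would try comparing each log entry's closeness against a set of standards. Here I classify each message by its closest match: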

from fuzzywuzzy import fuzz

messages = [
    'aaa 12345 AAA',
    'aaa 66666 AAA',
    'bbb 12345 BBB',
    'bbb 66666 BBB',
    'ccc 12345 CCC',
    'ccc 66666 CCC',
]

def classify(input_str):
    # Define standard classifications, and reset them to zero score.
    standards = {
        'AAA': 0,
        'BBB': 0,
        'CCC': 0,
    }

    # Score each classification according to how well it matches.
    for key in standards:
        standards[key] = fuzz.ratio(key, input_str)

    # Return the classification with the highest score.
    return max(standards, key=lambda k: standards[k])


for msg in messages:
    print('{msg}: {standard}'.format(msg=msg, standard=classify(msg)))
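
The snippet above needs every standard to be known in advance, whereas the question asks for categories such as auto1 to appear on their own. The following is a minimal sketch of one way to do that, not a definitive implementation. It makes several assumptions that are not in the original posts: it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio, it treats runs of digits as the message-specific data and masks them before comparing, and it uses a cut-off of 0.8. The names normalize, similarity and AutoCategorizer are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: runs of digits are the message-specific data to ignore.
_DIGITS = re.compile(r"\d+")


def normalize(message):
    """Replace every run of digits with a single '#' placeholder."""
    return _DIGITS.sub("#", message)


def similarity(a, b):
    """Return a score in [0, 1]; 1.0 means the normalized strings are equal."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


class AutoCategorizer:
    """Greedy grouping: reuse a category when a message is close to its exemplar."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cut-off; tune it on real logs
        self.exemplars = {}         # category name -> first message seen

    def categorize(self, message):
        best_name, best_score = None, 0.0
        for name, exemplar in self.exemplars.items():
            score = similarity(message, exemplar)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= self.threshold:
            return best_name
        # Nothing is close enough: open a new automatic category.
        name = "auto{}".format(len(self.exemplars) + 1)
        self.exemplars[name] = message
        return name


auto = AutoCategorizer()
for msg in ["ccc 12345 CCC", "ccc 66666 CCC", "ddd 12345 DDD"]:
    print("{} -> {}".format(auto.categorize(msg), msg))
```

An instance like this could serve as the fallback of the Categorizer class from the question once no regex matches. Because the grouping is greedy, the result depends on the order in which messages arrive, and each category is represented only by the first message that created it.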

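Please give some examples of "similar" messages, and some examples of messages that look similar but do not belong to the same category.

@JanneKarila The messages are already listed above: the "aaa" messages are similar to each other, so are the "bbb" messages and the "ccc" messages, and the three groups differ from one another. What I am looking for is a way to compute a "distance" between messages and use a threshold to decide when they are the "same" message. "Similar" has no definition beyond "below a threshold according to the metric".

Given these examples, a simple solution would be to extract the first word of each message and classify on that alone.

@JanneKarila The messages are similar not because they start with the same letters, but because they are "close to each other according to a metric". I would like to know which text metrics are available and how to compute them in Python.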
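
As one concrete example of a text metric, here is a minimal pure-Python sketch of the Levenshtein (edit) distance, together with a normalized similarity that can be compared against a threshold. This is only one of several possible metrics, and the function names are illustrative; the standard library's difflib.SequenceMatcher offers a different ratio without any extra code.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("ccc 12345 CCC", "ccc 66666 CCC"))  # 5
print(levenshtein("ccc 12345 CCC", "bbb 12345 BBB"))  # 6
```

On the toy messages the numeric part accounts for a large share of the characters, so the two "ccc" messages are only slightly closer to each other (distance 5) than a "ccc" message is to a "bbb" message (distance 6). With real log entries, any threshold would have to be tuned on the data.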