Speeding up a function that compares sentences in Python


I have a dataframe with shape (789174, 9). There is a column named resolution that contains sentences less than 139 characters long. I built a function that uses the difflib library to find sentences with a similarity score above 0.9. I have a virtual machine with 96 CPUs and 384 GB of RAM. The function has been running for over 2 hours and still has not gotten past i = 1000. I am worried it will take far too long, and I would like to know whether there is a way to speed it up.

import difflib
import time

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(len(input_list)):
            if i < j and difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping
Is it possible to speed up the latter function?

Below is a sample of some of the data in the dataframe:

d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}

df = pd.DataFrame(data=d)
How I define similarity:

Similarity is really defined by the overall action taken. For example, replaced scanner and replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use: the overall action of the longer string is replacing the scanner, so the two strings are very similar. That is why I chose to use the partial_ratio function, since this pair scores 100.
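
For reference, a minimal sketch of that check, assuming the fuzzywuzzy package is installed; the variable names are illustrative, and the question reports a partial_ratio score of 100 for this pair:

from fuzzywuzzy import fuzz

short_sentence = 'replaced scanner'
long_sentence = ('replaced the scanner for the user with a properly working one from the cage '
                 'replaced the wire on the damaged one and stored it for later use')

# partial_ratio scores the best-matching region of the longer string
print(fuzz.partial_ratio(short_sentence, long_sentence))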

Note:

See the second function, cluster_resolution, as that is the function I want to speed up; the latter function will not work.

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(i+1, len(input_list)):
            if -15 < len(input_list[i]) - len(input_list[j]) < 15:
                if difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                    input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Though this may not be a practical solution, since it would still take around 90 years if each iteration takes 0.1 seconds, it is still a much more optimized solution.

Regarding your last edit, I would make a few changes (mainly using fuzzywuzzy.process instead of fuzzywuzzy.fuzz):
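
The code for that change is not included in this copy of the answer, so what follows is only a hypothetical sketch of the idea; map_to_representatives is an illustrative helper name, while process.extractOne and fuzz.partial_ratio are the fuzzywuzzy calls being suggested:

from fuzzywuzzy import fuzz, process

def map_to_representatives(sentences, threshold=90):
    # Hypothetical helper: map each sentence to the first earlier
    # representative scoring above the threshold, instead of calling
    # fuzz pairwise by hand.
    representatives = []
    mapping = {}
    for s in sentences:
        match = process.extractOne(
            s, representatives,
            scorer=fuzz.partial_ratio,
            score_cutoff=threshold,
        )
        if match is None:
            representatives.append(s)
            mapping[s] = s
        else:
            mapping[s] = match[0]  # extractOne returns (choice, score)
    return mapping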

But I think you could look deeper into other solutions, such as a CountVectorizer with whatever metric applies. It is a way to gain speed (since it is vectorized), although the results may not be perfect. Note that a CountVectorizer could be a good solution for you, since you have already chosen partial_ratio.

For example, something like this:

from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan

df = pd.DataFrame(d)

cv = CountVectorizer(stop_words="english")
transformed = cv.fit_transform(df['resolution'])
transformed = pd.DataFrame(
        transformed.toarray(), 
        columns=cv.get_feature_names(),
        index=df['resolution'])

# keep only tokens that appear more than twice across the corpus
transformed = transformed[transformed.columns[transformed.sum()>2]]

#compute the distance matrix
d = pdist(transformed, metric="hamming") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
clusterer.fit_predict(s)

df['labels'] = clusterer.labels_

print(df.sort_values('labels'))
I don't think this is perfect yet (it is my first try at text clustering...). You could also add your own list of stopwords for the CountVectorizer, which would help the algorithm. At the very least, it could help you pre-cluster your dataset before using your previous function, for example:

df.groupby('labels')['resolution'].apply(cluster_resolution)
(this way, if your first clustering is roughly correct, you will only check each value against all the other values in its cluster, rather than against all values)
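
A minimal sketch of that pre-clustering step, assuming the generate_mapping function from the question and the labels column produced above; the resolution_mapped column name is just illustrative, and note that all noise points get label -1 and would still be compared against each other:

def map_within_cluster(series):
    # Build the expensive difflib mapping only inside one cluster
    mapping = generate_mapping(series.tolist())
    return series.map(mapping)

# Run the pairwise comparison per cluster instead of over the whole column
df['resolution_mapped'] = df.groupby('labels')['resolution'].transform(map_within_cluster)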

Credit for the distance-matrix computation goes to @anon01; it seems to give slightly better results than hdbscan's default.

Edit:

Another try, including:

  • a change of metric
  • an added step with a TF-IDF model
  • and an added step to lemmatize the words using the nltk package
Which gives:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# nltk data required: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')

d = {...}
df = pd.DataFrame(d)

lemmatizer = WordNetLemmatizer()

def lemmatization(sentence):
    
    tag_dict = {
                "J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV,
                }

    # Tokenize the sentence
    wordsList = nltk.word_tokenize(sentence) 
    
    # Find the right token
    tagged = nltk.pos_tag(wordsList)   
    
    # Convert the list of (token, tag) to lemmatized tokens
    lems = [
            lemmatizer.lemmatize(token, tag_dict.get(tag[0], wordnet.NOUN) )
            for token, tag
            in tagged
            ]

    lems = ' '.join(lems)
    return lems

df['lemmatized'] = df['resolution'].apply(lemmatization)

corpus = df['lemmatized']
pipe = Pipeline(
        [
                ('cv', CountVectorizer(stop_words="english")),
                ('tfid', TfidfTransformer())
         ]).fit(corpus)

transformed = pipe.transform(corpus)
transformed = pd.DataFrame(
        transformed.toarray(), 
        columns=pipe.named_steps['cv'].get_feature_names(),
        index=df['resolution'])

d = pdist(transformed, metric="cosine") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
clusterer.fit_predict(s)

df['labels'] = clusterer.labels_

print(df.sort_values('labels'))
You could also add some more specific code, since your examples seem to concern very specific maintenance logs.

For example, you could add new features to the transformed dataframe based on small lists of hardware/software keywords:

import numpy as np

# To create a feature about OS:
cols = ['os', 'linux', 'window']
transformed[cols[0]] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

# To create a feature about hardware:
cols = ["laptop", "printer", "scanner"]
transformed["hardware"] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

This step might help you get better results, but it may not be necessary. I am not sure how it would perform compared to FuzzyWuzzy for matching strings, but I would be interested in your feedback.

You are planning to do 600 billion sentence comparisons. That simply is not going to work; you need to rethink your whole strategy. Perhaps you can find a way to assign each sentence a score, then sort and compare the numbers (a sketch of this idea follows these comments). By the way, no matter how many CPUs you have, this code only uses one. If each sentence comparison takes 10 ms, it will take 200 years to run. Could you provide a small (~25 example) dataset? And what more can you say about how you define similarity?

@anon01 I added a set of sample data; let me know if you want to see more examples. I also added how I define similarity.

Computing an affinity matrix (or any full pairwise comparison metric) will be very slow; that is probably not the best approach. Is it accurate to say that what you want to optimize is f(test_sentence) -> ...?

@justanewb Generally I would not recommend this, but to iterate over data this large and get a result in under a year, I would strongly suggest using C, C++, or some language faster than python3.

Thank you for your answer, but could you take a look at the new code? I was able to speed things up with this approach.
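
As a purely hypothetical illustration of the "assign a score, then sort" suggestion: the sentence length is one cheap score, and sorting by it lets the ±15-character filter from the updated function stop each inner loop early instead of scanning every pair. The function name replace_similars_sorted is illustrative only:

import difflib

def replace_similars_sorted(input_list, max_len_diff=15, threshold=0.9):
    # Cheap score: the original sentence length. Sorting by it keeps
    # candidates within the +/-15 character band next to each other.
    lengths = [len(s) for s in input_list]
    order = sorted(range(len(input_list)), key=lengths.__getitem__)
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            # Later candidates only get longer, so stop once the band is exceeded
            if lengths[j] - lengths[i] >= max_len_diff:
                break
            if difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= threshold:
                input_list[j] = input_list[i]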