Optimizing count() for n-grams in Python


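I am trying to count the items in a list of strings with the count function and sort the results from largest to smallest. The function performs reasonably well on small lists, but it scales poorly: in the small experiment below it gets through 5 rounds of growing the input, and a 6th round takes too long to wait for. Is there a way to optimize the first list comprehension, or is there an alternative approach to counting that scales better?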

import nltk
from operator import itemgetter
import time

t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."

unigrams = nltk.word_tokenize(t.lower())

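for size in range(1, 6):

    # Note: this multiplies the already-grown list, so the input grows
    # cumulatively (1x, 2x, 6x, 24x, 120x), faster than the printed label suggests.
    unigrams = unigrams*size

    start = time.time()

    # count() rescans the whole list for every token, so this step is quadratic
    unigram_freqs = [unigrams.count(word) for word in unigrams]
    # set() drops the duplicate (word, count) pairs
    freq_pairs = set((zip(unigrams, unigram_freqs)))
    # sort by count, largest first
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]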

    end = time.time()

    time_elapsed = round(end-start, 3)

    print("Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")

# Runtime: 0.001s for 1x the size
# Runtime: 0.003s for 2x the size
# Runtime: 0.022s for 3x the size
# Runtime: 0.33s for 4x the size 
# Runtime: 8.065s for 5x the size

Using Counter from collections and sorting with its most_common member function, I get close to 0 seconds regardless of size:

import nltk
nltk.download('punkt')


from operator import itemgetter
from collections import Counter
import time
t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."

unigrams = nltk.word_tokenize(t.lower())

for size in range(1, 5):

    unigrams = unigrams*size

    start = time.time()

    unigram_freqs = [unigrams.count(word) for word in unigrams]    
    freq_pairs = set((zip(unigrams, unigram_freqs)))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]

    end = time.time()

    time_elapsed = round(end-start, 3)

    print("Slow Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")

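    start = time.time()
    # Counter tallies every token in a single pass; most_common() returns
    # (word, count) pairs sorted by count, largest first
    a = Counter(unigrams).most_common()
    #print(a)
    end = time.time()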

    time_elapsed = round(end-start, 3)

    print("Fast Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")
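
# Slow Runtime: 0.003s for 1x the size
# Fast Runtime: 0.0s for 1x the size
# Slow Runtime: 0.006s for 2x the size
# Fast Runtime: 0.0s for 2x the size
# Slow Runtime: 0.157s for 3x the size
# Fast Runtime: 0.0s for 3x the size
# Slow Runtime: 1.891s for 4x the size
# Fast Runtime: 0.001s for 4x the size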
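
The gap comes from how much work each approach does: the list comprehension calls count() once per token, and every call scans the whole list, so the cost grows quadratically with the list length, while Counter tallies everything in one pass. The sketch below illustrates this without nltk and extends the same idea to longer n-grams. The helper name ngram_counts and the sample sentence are made up for illustration; they are not part of the question or the answer above.

```python
from collections import Counter


def ngram_counts(tokens, n=1):
    """Return (ngram, count) pairs, most frequent first, in a single pass."""
    # Zipping n shifted views of the list yields every run of n consecutive tokens.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common()


# Hypothetical sample input, pre-tokenized with str.split() instead of nltk.
tokens = "the cat sat on the mat and the cat slept".split()

# The count()-based approach gives the same unigram counts, only more slowly.
# Keys are 1-tuples here because ngram_counts always returns tuples.
slow = {(word,): tokens.count(word) for word in tokens}
fast = dict(ngram_counts(tokens, 1))
assert slow == fast

print(ngram_counts(tokens, 1)[:2])  # [(('the',), 3), (('cat',), 2)]
print(ngram_counts(tokens, 2)[:1])  # [(('the', 'cat'), 2)]
```

For unigrams only, Counter(tokens).most_common() as in the answer is enough. The tuple-based helper is just one way to reuse the same counting step for bigrams and beyond.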


Why not use Counter?
"Regardless of time" -> "regardless of size"?