Python: optimizing code to display word counts


I just finished a program that reads the text of books and plots their word counts, with the count of each word in one book on the x-axis and its count in the second book on the y-axis. It works, but it's surprisingly slow, and I'm hoping for some tips on how to optimize it. I think my main concern is building one dictionary for the words the two books share and another for the words that appear in one book but not the other. That part added a lot of runtime to the program, and I'd like to find a Pythonic way to improve it. Here's the code:

import re   # regular expressions
import io
import collections
from matplotlib import pyplot as plt

# xs=[x1,x2,...,xn]
# Number of occurrences of the word in book 1

# use

# ys=[y1,y2,...,yn]
# Number of occurrences of the word in book 2

# plt.plot(xs,ys)
# save as svg or pdf files

word_pattern = re.compile(r'\w+')
# with version ensures closing even if there are failures
with io.open("swannsway.txt") as f:
    text = f.read() # read as a single large string
    book1 = word_pattern.findall(text)  # pull out words
    book1 = [w.lower() for w in book1 if len(w)>=3]

with io.open("moby_dick.txt") as f:
    text = f.read() # read as a single large string
    book2 = word_pattern.findall(text)  # pull out words
    book2 = [w.lower() for w in book2 if len(w)>=3]


#Convert these into relative percentages/total book length

wordcount_book1 = {}
for word in book1:
    if word in wordcount_book1:
        wordcount_book1[word]+=1
    else:
        wordcount_book1[word]=1

'''
for word in wordcount_book1:
    wordcount_book1[word] /= len(wordcount_book1)

for word in wordcount_book2:
    wordcount_book2[word] /= len(wordcount_book2)
'''

wordcount_book2 = {}
for word in book2:
    if word in wordcount_book2:
        wordcount_book2[word]+=1
    else:
        wordcount_book2[word]=1


common_words = {}

for i in wordcount_book1:
    for j in wordcount_book2:
        if i == j:
            common_words[i] = [wordcount_book1[i], wordcount_book2[j]]
            break

book_singles= {}
for i in wordcount_book1:
    if i not in common_words:
        book_singles[i] = [wordcount_book1[i], 0]
for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]

wordcount_book1 = collections.Counter(book1)
wordcount_book2 = collections.Counter(book2)

# how many words of different lengths?

word_length_book1 = collections.Counter([len(word) for word in book1])
word_length_book2 = collections.Counter([len(word) for word in book2])

print(wordcount_book1)

#plt.plot(list(word_length_book1.keys()),list(word_length_book1.values()), list(word_length_book2.keys()), list(word_length_book2.values()), 'bo')
for i in range(len(common_words)):
    plt.plot(list(common_words.values())[i][0], list(common_words.values())[i][1], 'bo', alpha = 0.2)
for i in range(len(book_singles)):
    plt.plot(list(book_singles.values())[i][0], list(book_singles.values())[i][1], 'ro', alpha = 0.2)
plt.ylabel('Swannsway')
plt.xlabel('Moby Dick')
plt.show()
#key:value

Here are some tips for optimizing your code:

Count word occurrences. Use the Counter class from the collections library (see the docs):

from collections import Counter
wordcount_book1 = Counter(book1)
wordcount_book2 = Counter(book2)
Find common and unique words. Use the set class: all of the words are the union, the common words are the intersection, and the unique words are the set differences:

book1_words = set(wordcount_book1.keys())
book2_words = set(wordcount_book2.keys())
all_words = book1_words | book2_words
common_words = book1_words & book2_words
book_singles = [book1_words - common_words, book2_words - common_words]
Count word lengths. First compute the length of each word, then multiply by each book's word counts:

word_length = {w: len(w) for w in all_words}
word_length_book1 = {w: word_length[w] * wordcount_book1[w] for w in book1_words}
word_length_book2 = {w: word_length[w] * wordcount_book2[w] for w in book2_words}
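
If what you actually want is the histogram from the question (how many word tokens of each length), the token lists already give it directly; a minimal sketch with the same Counter class, assuming book1 and book2 are the word lists read from the files:

from collections import Counter

# count word tokens of each length, including repeated words, as in the original code
word_length_book1 = Counter(len(w) for w in book1)
word_length_book2 = Counter(len(w) for w in book2)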

Maybe the plotting should also be loop-free, but unfortunately I don't understand what you're trying to plot.
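
For the scatter in the question (count in book 1 vs. count in book 2), the Counters and sets above are already enough to plot each group in a single call; a minimal sketch, assuming wordcount_book1/wordcount_book2 are the Counters and common_words/book_singles the sets defined above:

from matplotlib import pyplot as plt

# common words: one point per word, counts taken from each book
xs = [wordcount_book1[w] for w in common_words]
ys = [wordcount_book2[w] for w in common_words]
plt.plot(xs, ys, 'bo', alpha=0.2)

# words unique to one book: a Counter returns 0 for the book that lacks the word
singles = book_singles[0] | book_singles[1]
xs = [wordcount_book1[w] for w in singles]
ys = [wordcount_book2[w] for w in singles]
plt.plot(xs, ys, 'ro', alpha=0.2)

plt.show()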

Most of your code has only small issues that I've tried to address. Your biggest delay was in plotting the singles, and I believe I've fixed that. The details: I switched this:

word_pattern = re.compile(r'\w+')
to:

word_pattern = re.compile(r'[a-zA-Z]{3,}')

since book_singles was big enough as it was, and this also leaves out numbers! By including a minimum size in the pattern, we eliminate the need for this loop:

book1 = [w.lower() for w in book1 if len(w)>=3]
and its counterpart for book 2. Here:

book1 = word_pattern.findall(text)  # pull out words
book1 = [w.lower() for w in book1 if len(w)>=3]
I moved the .lower() call so it runs once over the whole text:

book1 = word_pattern.findall(text.lower())  # pull out words
book1 = [w for w in book1 if len(w) >= 3]
Since .lower() is likely implemented in C, this is probably a win.
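If you want to check that on your own machine, here is a quick micro-benchmark sketch with timeit (it reuses one of the input files from the question; the repeat count is arbitrary):

import re
import timeit

word_pattern = re.compile(r'[a-zA-Z]{3,}')
with open("moby_dick.txt") as f:
    text = f.read()

# one .lower() over the whole text vs. one .lower() per extracted word
t_whole = timeit.timeit(lambda: word_pattern.findall(text.lower()), number=10)
t_per_word = timeit.timeit(lambda: [w.lower() for w in word_pattern.findall(text)], number=10)
print(t_whole, t_per_word)

This: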

wordcount_book1 = {}
for word in book1:
    if word in wordcount_book1:
        wordcount_book1[word]+=1
    else:
        wordcount_book1[word]=1
I switched to using a defaultdict, since you'd already imported collections:

wordcount_book1 = collections.defaultdict(int)
for word in book1:
    wordcount_book1[word] += 1
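
If you'd rather skip the loop entirely, collections.Counter does the same counting in one call, as the other answer also suggests; a sketch, assuming import collections as above:

wordcount_book1 = collections.Counter(book1)
wordcount_book2 = collections.Counter(book2)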
For these loops:

common_words = {}

for i in wordcount_book1:
    for j in wordcount_book2:
        if i == j:
            common_words[i] = [wordcount_book1[i], wordcount_book2[j]]
            break

book_singles= {}
for i in wordcount_book1:
    if i not in common_words:
        book_singles[i] = [wordcount_book1[i], 0]
for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]
I rewrote the first loop, which was a disaster (a nested scan over both dictionaries), and made it do double duty, since it was already doing the work of the second loop:

common_words = {}
book_singles = {}

for i in wordcount_book1:
    if i in wordcount_book2:
        common_words[i] = [wordcount_book1[i], wordcount_book2[i]]
    else:
        book_singles[i] = [wordcount_book1[i], 0]

for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]
Finally, the plotting loops are hugely inefficient: they walk over common_words.values() and book_singles.values() again and again, plotting only one point at a time:

for i in range(len(common_words)):
    plt.plot(list(common_words.values())[i][0], list(common_words.values())[i][1], 'bo', alpha = 0.2)
for i in range(len(book_singles)):
    plt.plot(list(book_singles.values())[i][0], list(book_singles.values())[i][1], 'ro', alpha = 0.2)
I changed them to:

counts1, counts2 = zip(*common_words.values())
plt.plot(counts1, counts2, 'bo', alpha=0.2)

counts1, counts2 = zip(*book_singles.values())
plt.plot(counts1, counts2, 'ro', alpha=0.2)
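
The zip(*...) idiom just transposes the list of [count_in_book1, count_in_book2] pairs into two parallel tuples, so each group becomes a single plot call; a tiny illustration:

pairs = [[1, 10], [2, 20], [3, 30]]
xs, ys = zip(*pairs)
print(xs)  # (1, 2, 3)
print(ys)  # (10, 20, 30)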
The complete reworked code, leaving out the things you computed but never used in your example:

import re  # regular expressions
import collections
from matplotlib import pyplot as plt

# xs=[x1,x2,...,xn]
# Number of occurrences of the word in book 1

# use

# ys=[y1,y2,...,yn]
# Number of occurrences of the word in book 2

# plt.plot(xs,ys)
# save as svg or pdf files

word_pattern = re.compile(r'[a-zA-Z]{3,}')

# with ensures closing of file even if there are failures
with open("swannsway.txt") as f:
    text = f.read() # read as a single large string
    book1 = word_pattern.findall(text.lower())  # pull out words

with open("moby_dick.txt") as f:
    text = f.read() # read as a single large string
    book2 = word_pattern.findall(text.lower())  # pull out words

# Convert these into relative percentages/total book length

wordcount_book1 = collections.defaultdict(int)
for word in book1:
    wordcount_book1[word] += 1

wordcount_book2 = collections.defaultdict(int)
for word in book2:
    wordcount_book2[word] += 1

common_words = {}
book_singles = {}

for i in wordcount_book1:
    if i in wordcount_book2:
        common_words[i] = [wordcount_book1[i], wordcount_book2[i]]
    else:
        book_singles[i] = [wordcount_book1[i], 0]

for i in wordcount_book2:
    if i not in common_words:
        book_singles[i] = [0, wordcount_book2[i]]

counts1, counts2 = zip(*common_words.values())
plt.plot(counts1, counts2, 'bo', alpha=0.2)

counts1, counts2 = zip(*book_singles.values())
plt.plot(counts1, counts2, 'ro', alpha=0.2)

plt.xlabel('Moby Dick')
plt.ylabel('Swannsway')
plt.show()
Output:

(scatter plot: common words in blue, words unique to one book in red)

You may want to cut out some of the highest-scoring words to bring out the more interesting data.
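
One possible way to do that, as a sketch built on the reworked code above (the cutoff of 50 words is arbitrary): drop the most frequent common words, which are mostly "the", "and", and similar, before plotting:

N = 50  # arbitrary cutoff; tune to taste
top = set(sorted(common_words, key=lambda w: sum(common_words[w]), reverse=True)[:N])
filtered = {w: counts for w, counts in common_words.items() if w not in top}

counts1, counts2 = zip(*filtered.values())
plt.plot(counts1, counts2, 'bo', alpha=0.2)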

Nice answer. This code runs amazingly fast with your changes. Thanks for the detailed explanation.