Python: calculating the ratio of vowels to word length for each word in a list


Here is the code for my function:

def calcVowelProportion(wordList):
    """
    Calculates the proportion of vowels in each word in wordList.
    """

    VOWELS = 'aeiou'
    ratios = []

    for word in wordList:
        numVowels = 0
        for char in word:
            if char in VOWELS:
                numVowels += 1
        ratios.append(numVowels/float(len(word)))

    return ratios

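I am now working with a list of more than 87,000 words, and this algorithm is obviously very slow.

Is there a better way?

Edit:

I tested the algorithms suggested by @ExP using the following class: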

    import time

    class vowelProportions(object):
        """
        A series of methods that all calculate the vowel/word length ratio
        in a list of words.
        """

        WORDLIST_FILENAME = "words_short.txt"

        def __init__(self):
            self.wordList = self.buildWordList()
            print "Original: " + str(self.calcMeanTime(10000, self.cvpOriginal, self.wordList))
            print "Generator: " + str(self.calcMeanTime(10000, self.cvpGenerator, self.wordList))
            print "Count: " + str(self.calcMeanTime(10000, self.cvpCount, self.wordList))
            print "Translate: " + str(self.calcMeanTime(10000, self.cvpTranslate, self.wordList))

        def buildWordList(self):
            inFile = open(self.WORDLIST_FILENAME, 'r', 0)
            wordList = []
            for line in inFile:
                wordList.append(line.strip().lower())
            return wordList

        def cvpOriginal(self, wordList):
            """ My original, slow algorithm"""
            VOWELS = 'aeiou'
            ratios = []

            for word in wordList:
                numVowels = 0
                for char in word:
                    if char in VOWELS:
                        numVowels += 1
                ratios.append(numVowels/float(len(word)))

            return ratios

        def cvpGenerator(self, wordList):
            """ Using a generator expression """
            return [sum(char in 'aeiou' for char in word)/float(len(word)) for word in wordList]

        def cvpCount(self, wordList):
            """ Using str.count() """
            return [sum(word.count(char) for char in 'aeiou')/float(len(word)) for word in wordList]

        def cvpTranslate(self, wordList):
            """ Using str.translate() """
            return [len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))/float(len(word)) for word in wordList]

        def timeFunc(self, func, *args):
            start = time.clock()
            func(*args)
            return time.clock() - start

        def calcMeanTime(self, numTrials, func, *args):
            times = [self.timeFunc(func, *args) for x in range(numTrials)]
            return sum(times)/len(times)

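The output (for a list of 200 words):

Surprisingly, Generator and Count are even slower than the original (please let me know if my implementation is incorrect).

I would like to test @John's solution, but I don't know anything about trees.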
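For readers on Python 3, where `time.clock()` and the two-argument `str.translate(None, ...)` are no longer available, a comparable measurement can be sketched with the standard `timeit` module. This is only an illustration: the function names, the sample words, and the `mean_time` helper are assumptions rather than part of the original post, and the numbers it prints will vary by machine.

```python
import timeit

VOWELS = "aeiou"


def cvp_original(word_list):
    """Nested loops, as in the question."""
    ratios = []
    for word in word_list:
        num_vowels = 0
        for char in word:
            if char in VOWELS:
                num_vowels += 1
        ratios.append(num_vowels / len(word))
    return ratios


def cvp_generator(word_list):
    """Sum a generator expression of booleans for each word."""
    return [sum(char in VOWELS for char in word) / len(word) for word in word_list]


def cvp_count(word_list):
    """Call str.count() once per vowel for each word."""
    return [sum(word.count(v) for v in VOWELS) / len(word) for word in word_list]


def mean_time(func, word_list, repeats=5, number=50):
    """Best of `repeats` runs, reported as average seconds per call."""
    timer = timeit.Timer(lambda: func(word_list))
    return min(timer.repeat(repeat=repeats, number=number)) / number


if __name__ == "__main__":
    # Hypothetical sample data standing in for words_short.txt.
    sample = ["python", "vowel", "ratio", "rhythm", "queue"] * 40
    for func in (cvp_original, cvp_generator, cvp_count):
        print(func.__name__, mean_time(func, sample))
```

Taking the minimum over several repeats, rather than the mean of many single runs, tends to reduce noise from other processes.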

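You should optimize the innermost loop.

I'm fairly sure there are several alternative approaches. Here is what I can think of right now. I'm not sure how they compare in speed (relative to each other and to your solution).

Using a generator expression: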

numVowels = sum(x in 'aeiou' for x in word)

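Using `str.count()`:

for word in wordlist:
    numVowels = 0
    for letter in VOWELS:
        numVowels += word.count(letter)
    ratios.append(numVowels/float(len(word)))

Using `str.translate()` (assuming there are no uppercase letters or special symbols):

numVowels = len(word.translate(None, 'bcdfghjklmnpqrstvwxyz'))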

With any of these, you can even write the whole function in a single line, without `list.append()`.

I'd be curious to know which one is the fastest.
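As a rough illustration of the "whole function in one line" idea, the three approaches might look like this in Python 3. The function names are made up for this sketch. Note one deliberate change: in Python 3 `str.translate()` takes a table built with `str.maketrans()`, and this version deletes the vowels and subtracts, instead of deleting the consonants, so it does not depend on listing every non-vowel character.

```python
VOWELS = "aeiou"

# Python 3 translation table that deletes every vowel.
DROP_VOWELS = str.maketrans("", "", VOWELS)


def ratios_generator(word_list):
    # True counts as 1, so summing the membership tests counts the vowels.
    return [sum(c in VOWELS for c in w) / len(w) for w in word_list]


def ratios_count(word_list):
    # Five str.count() calls per word, one for each vowel.
    return [sum(w.count(v) for v in VOWELS) / len(w) for w in word_list]


def ratios_translate(word_list):
    # Vowel count = original length minus the length with vowels removed.
    return [(len(w) - len(w.translate(DROP_VOWELS))) / len(w) for w in word_list]


print(ratios_translate(["banana", "sky", "queue"]))
```

All three assume lowercase, non-empty words, as in the question.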

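Since you only care about the ratio of vowels to letters in each word, you can first replace all vowels with `a`. Now you can try a few things that might be faster:

At each step you test for one letter instead of five. That is bound to be faster.

You can sort the whole list and search for the point where it goes from vowels (now all classified as `a`) to non-vowels. This is a tree structure: the number of letters in a word is the level in the tree, and the number of vowels is the number of left branches.

Fewer decisions should mean less time, and this also uses built-in functionality, which I believe will be faster.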
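The first suggestion (collapse every vowel to `a`, then look for a single letter) could be sketched as follows in Python 3. This covers only the replacement step, not the sort or tree idea, and the names `TO_A` and `vowel_ratios_collapsed` are invented for the example.

```python
# Map every other vowel onto 'a' (Python 3 translation table).
TO_A = str.maketrans("eiou", "aaaa")


def vowel_ratios_collapsed(word_list):
    ratios = []
    for word in word_list:
        collapsed = word.translate(TO_A)  # e.g. "queue" -> "qaaaa"
        # Only one letter has to be counted now instead of five.
        ratios.append(collapsed.count("a") / len(word))
    return ratios


print(vowel_ratios_collapsed(["queue", "sky", "area"]))
```

Whether the extra `translate()` pass pays for itself would need to be measured; it still walks each word twice.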

Use a regular expression to match the vowels and count the number of matches:

>>> import re
>>> s = 'supercalifragilisticexpialidocious'
>>> len(re.findall('[aeiou]', s))
16
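Applied to a whole word list, the same idea might look like the sketch below, with the pattern compiled once up front. The names `VOWEL_RE` and `vowel_ratios_regex` are assumptions for illustration.

```python
import re

# Compile once so the pattern is not looked up again for every word.
VOWEL_RE = re.compile(r"[aeiou]")


def vowel_ratios_regex(word_list):
    # findall() returns one list entry per vowel matched in the word.
    return [len(VOWEL_RE.findall(word)) / len(word) for word in word_list]


print(vowel_ratios_regex(["supercalifragilisticexpialidocious", "sky"]))
```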

Here is how to compute it on Linux with a single command line:

cat wordlist.txt | tr -d aeiouAEIOU | paste - wordlist.txt | gawk '{ FS="\t"; RATIO = length($1)/length($2); print $2, RATIO }'

Output:

aa 0
ab 0.5
abs 0.666667

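Note: each line in wordlist.txt contains a single word. Blank lines will produce a division-by-zero error. Because `tr -d` deletes the vowels, the RATIO printed here is the share of non-vowel characters (hence `aa 0`); subtract it from 1 to get the vowel proportion.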
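For comparison, a Python 3 sketch of the same line-by-line processing is shown below. It skips blank lines so they cannot cause a division by zero, and it yields the vowel proportion itself. The function name and the default `vowels` argument are assumptions made for this example.

```python
def vowel_ratios_from_lines(lines, vowels="aeiouAEIOU"):
    """Yield (word, vowel_ratio) pairs, skipping blank lines."""
    for line in lines:
        word = line.strip()
        if not word:
            # A blank line has length 0 and would divide by zero.
            continue
        num_vowels = sum(ch in vowels for ch in word)
        yield word, num_vowels / len(word)


# Works on any iterable of lines, for example an open file object.
for word, ratio in vowel_ratios_from_lines(["aa\n", "\n", "ab\n", "abs\n"]):
    print(word, ratio)
```

With a real file, the same generator could be fed from `open("wordlist.txt")` inside a `with` block.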

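I think it's worth noting that you should either (1) use `VOWELS = 'aeiouAEIOU'` or (2) call `word.lower()` on each word in `wordList` first, unless the list already contains only lowercase words.

@ryrich The list I'm working with contains only lowercase words, but thank you. I forgot to mention that.

There are a couple of small optimizations to try: 1. Try creating the `ratios` list in a single step, e.g. `ratios = [None] * len(wordList)`. Appending may require resizing the backing array, which is slow. 2. Try turning `VOWELS` into a set. In my experience, testing whether a value is in a set is much faster than testing against a list. Of course, you would need to test this to confirm.

Replacing letters with other letters takes time... and so does sorting. Still, it's good to have all these ideas written down.

That's true. I would hope the built-in search/replace is very efficient, and it would cut the loop executions by 80%. (Unfortunately, I can't verify that right now.)

I think this answer would be the most productive, but with the letter replacement done outside Python, `cat wordlist | tr aeiou a | tr qwrtpsdfgfhjklzxcvbm b > new_wordlist`, and then processing the result.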
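The two small optimizations from the comments (preallocating the result list and using a set for the vowels) might be combined as in this Python 3 sketch. The function name is hypothetical, and as the commenter says, any speed-up would have to be confirmed by measurement.

```python
# A frozenset gives constant-time membership tests.
VOWELS = frozenset("aeiou")


def vowel_ratios_prealloc(word_list):
    # Allocate the result list once instead of growing it with append().
    ratios = [None] * len(word_list)
    for i, word in enumerate(word_list):
        num_vowels = 0
        for char in word:
            if char in VOWELS:
                num_vowels += 1
        ratios[i] = num_vowels / len(word)
    return ratios


print(vowel_ratios_prealloc(["ab", "aa", "xyz"]))
```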

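`numVowels` can be reduced to `sum(x in vowels for x in word)`, because this requires only one iteration over the word, whereas `str.count` causes 5 iterations.

@Ashwinichaudy Good point. I edited my answer to reflect your first comment.

`len(x for x in word if x in 'aeiou')` will not work: genexps have no length. You need `len` of a listcomp, or `sum(1 for ...)`, etc.

@DSM Thanks, I hadn't noticed that.

Surprisingly, `translate` is about three times faster than the first generator expression. The second one is the slowest. A regular expression is probably overkill. I don't think the `re` matching engine is fast enough to compete with his initial solution.

@ExP, I think the `re` code is highly optimized C. It's certainly worth a try. If I had a few seconds I might try benchmarking it.

@MarkRansom: a compiled regex is only slightly slower than the generator expression.

With Python it's more complicated; pipe it into: `python -c "import sys; tmp = lambda x: sys.stdout.write(x.split('\t')[1] + str(float(len(x.split('\t')[0])) / float(len(x.split('\t')[1]) - 1)) + '\n'); map(tmp, sys.stdin)"`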
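To illustrate the point made in the comments about generator expressions having no length, here is a small Python 3 sketch. The variable names are chosen only for this example.

```python
word = "queue"

try:
    # A generator expression is lazy, so it cannot report a length.
    len(x for x in word if x in "aeiou")
except TypeError as exc:
    error_message = str(exc)

# Either build a list first and take its length...
count_listcomp = len([x for x in word if x in "aeiou"])

# ...or add up 1 for every vowel without building a list at all.
count_sum = sum(1 for x in word if x in "aeiou")

print(error_message, count_listcomp, count_sum)
```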

import timeit

words = 'This is a test string'

def vowelProportions(words):
    counts, vowels = {}, 'aeiou'
    wordLst = words.lower().split()
    for word in wordLst:
        counts[word] = float(sum(word.count(v) for v in vowels)) / len(word)
    return counts

def f():
    return vowelProportions(words)

print timeit.timeit(stmt = f, number = 17400) # 5 (len of words) * 17400 = 87,000
# 0.838676
