Python: average length of the preceding word


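I have to create a function that takes a single argument word and returns the average length (in characters) of the word that precedes word in the text. If word happens to be the first word occurring in the text, the length of the preceding word for that occurrence should be zero. For example: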

>>> average_length("the")
4.4
>>> average_length('whale')
False
>>> average_length('ship.')
3.0
This is what I have written so far:

def average_length(word):
    text = "Call me Ishmael. Some years ago - never mind how long..........."
    words = text.split()
    wordCount = len(words)

    Sum = 0
    for word in words:
        ch = len(word)
        Sum = Sum + ch
    avg = Sum/wordCount
    return avg
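I know this is not correct at all, but I am having a hard time approaching the problem properly. The question asks me to find every instance of the word in the text and, for each one, to take the length of the word immediately preceding it. Not every word from the beginning up to that word, just the one.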

I should also mention that all the tests will only use the first paragraph of Moby Dick to test my code:

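“Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.”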


It seems you could save a lot of computation time by going over the data only once:

from collections import defaultdict
prec = defaultdict(list)
text = "Call me Ishmael. Some years ago..".split()
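Create two iterators over the list. We call next on the second iterator, so that from now on, whenever we take one element from each iterator, we get a word and its successor.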

first, second = iter(text), iter(text)
next(second)
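Zipping the two iterators ("abc", "def" → "ad", "be", "cf"), we append the length of the first word to the list of preceding lengths of the second word. This works because we are using a defaultdict(list), which returns an empty list for any key that does not exist yet.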

for one, two in zip(first, second):  # pairwise
    prec[two].append(len(one))
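To see the pairing in isolation, here is a small sketch on a three-word list (the sample list is made up for illustration):

```python
words = ["Call", "me", "Ishmael."]

# Two independent iterators over the same list.
first, second = iter(words), iter(words)

# Advance the second iterator by one so it runs one word ahead.
next(second)

# zip stops as soon as the shorter iterator (second) is exhausted.
pairs = list(zip(first, second))
print(pairs)  # [('Call', 'me'), ('me', 'Ishmael.')]
```

Each tuple is (preceding word, word), which is exactly what the loop above consumes.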
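Finally, we can create a new dictionary mapping each word to the average length of its preceding words: the sum divided by the number of entries.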

# avg_prec_len = {key: sum(prec[key]) / len(prec[key]) for key in prec}
avg_prec_len = {}
for key in prec:
    # prec[key] is a list of lengths
    avg_prec_len[key] = sum(prec[key]) / len(prec[key])
Then you can just look the word up in that dictionary.


(If you are using Python 2, use izip instead of zip, and use from __future__ import division.)
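Putting the steps together, and additionally recording a preceding length of 0 for the very first word as the question requires, might look like the sketch below. The function name build_avg_prec_len and the sample sentence are assumptions for illustration, and Python 3 division is assumed.

```python
from collections import defaultdict


def build_avg_prec_len(text):
    """Map each word to the average length of the word preceding it."""
    words = text.split()
    prec = defaultdict(list)
    if words:
        # The first word has no predecessor: count that occurrence as length 0.
        prec[words[0]].append(0)
    # Pair every word with its successor in a single pass.
    for one, two in zip(words, words[1:]):
        prec[two].append(len(one))
    return {key: sum(lengths) / len(lengths) for key, lengths in prec.items()}


table = build_avg_prec_len("the cat saw the dog")
print(table["the"])               # (0 + 3) / 2 = 1.5
print(table.get("whale", False))  # False, the word never occurs
```

Using dict.get with a default of False gives the return value the question asks for when the word is missing.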

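Based on your requirement for no imports and a simple approach, the following function does it without any imports. The comments and variable names should make the function logic pretty clear: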

def match_previous(lst, word):
    # keep count of how many times we find a match and the total of the lengths
    matches_count = total_length_sum = 0.0
    # pull first element from list to use as preceding word
    previous_word = lst[0]
    # slice rest of words from the list 
    # so we always compare two consecutive words
    rest_of_words = lst[1:]
    # catch where first word is "word" and add 1 to matches_count
    if previous_word == word:
        matches_count += 1
    for current_word in rest_of_words:
        # if the current word matches our "word"
        # add length of previous word to total_length_sum
        # and increase matches_count.
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
        # always update to keep track of word just seen
        previous_word = current_word
    # if  matches_count is 0 we found no word in the text that matched "word"
    return total_length_sum / matches_count if matches_count else False
It takes two arguments, the split list of words and the word to search for:

In [41]: text = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

In [42]: match_previous(text.split(),"the")
Out[42]: 4.4

In [43]: match_previous(text.split(),"ship.")
Out[43]: 3.0

In [44]: match_previous(text.split(),"whale")
Out[44]: False

In [45]: match_previous(text.split(),"Call")
Out[45]: 0.0
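Obviously you can do the same as your own function: take a single argument and split the text inside the function. The only way False is returned is if we find no match for the word. You can see that the call with "Call" returns 0.0, because it is the first word in the text.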
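As a rough sketch of that single-argument variant: average_length below reads the text from a module-level constant and starts with an empty previous word, so a match on the first word contributes a length of 0. The constant name TEXT and the shortened text are assumptions; the real tests would use the full paragraph. Python 3 division is assumed.

```python
# Shortened sample text (assumption); substitute the full paragraph for the real tests.
TEXT = "Call me Ishmael. Some years ago - never mind how long precisely"


def average_length(word):
    matches_count = 0
    total_length_sum = 0
    previous_word = ""  # the first word has no predecessor, so its length is 0
    for current_word in TEXT.split():
        if current_word == word:
            total_length_sum += len(previous_word)
            matches_count += 1
        # always remember the word just seen
        previous_word = current_word
    # no match at all means the word is not in the text
    return total_length_sum / matches_count if matches_count else False


print(average_length("Call"))   # 0.0
print(average_length("me"))     # 4.0
print(average_length("whale"))  # False
```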

If we add some prints to the code and use enumerate:

def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    previous_word = lst[0]
    rest_of_words = lst[1:]
    if previous_word == word:
        print("First word matches.")
        matches_count += 1
    for ind, current_word in enumerate(rest_of_words, 1):
        print("On iteration {}.\nprevious_word = {} and current_word = {}.".format(ind, previous_word, current_word))
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
            print("We found a match at index {} in our list of words.".format(ind-1))
        print("Updating previous_word from {} to {}.".format(previous_word, current_word))
        previous_word = current_word
    return total_length_sum / matches_count if matches_count else False
Running it with a small sample list, we can see what happens:

In [59]: match_previous(["bar","foo","foobar","hello", "world","bar"],"bar")
First word matches.
On iteration 1.
previous_word = bar and current_word = foo.
Updating previous_word from bar to foo.
On iteration 2.
previous_word = foo and current_word = foobar.
Updating previous_word from foo to foobar.
On iteration 3.
previous_word = foobar and current_word = hello.
Updating previous_word from foobar to hello.
On iteration 4.
previous_word = hello and current_word = world.
Updating previous_word from hello to world.
On iteration 5.
previous_word = world and current_word = bar.
We found a match at index 4 in our list of words.
Updating previous_word from world to bar.
Out[59]: 2.5
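The advantage of using iter is that we don't need to create a new list by slicing the remainder. To use it in the code, you just need to change the start of the function to: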

def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    # create an iterator
    _iterator = iter(lst)
    # pull first word from iterator
    previous_word = next(_iterator)
    if previous_word == word:
        matches_count += 1
    # _iterator will give us all bar the first word we consumed with  next(_iterator)
    for current_word in _iterator:
Each time we consume an element from the iterator, we move on to the next element:

In [61]: l = [1,2,3,4]

In [62]: it = iter(l)

In [63]: next(it)
Out[63]: 1

In [64]: next(it)
Out[64]: 2
# consumed two of four so we are left with two
In [65]: list(it)
Out[65]: [3, 4]
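For reference, the complete function with that change applied could read as follows; the loop body is the same as in the list-slicing version above. As a sketch it assumes a non-empty list, since next() on an empty iterator raises StopIteration.

```python
def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    # create an iterator and pull the first word from it
    _iterator = iter(lst)
    previous_word = next(_iterator)
    # a match on the first word counts, with a preceding length of 0
    if previous_word == word:
        matches_count += 1
    # the iterator now yields everything except the first word
    for current_word in _iterator:
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
        previous_word = current_word
    return total_length_sum / matches_count if matches_count else False


print(match_previous(["bar", "foo", "foobar", "hello", "world", "bar"], "bar"))  # 2.5
```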
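The only case where a dict really makes sense is if you take multiple words into your function, which you can do like this:

Then just pass in the text and all the words you want to search for. You can call list on the result:

Or iterate over it: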

In [70]: for tup in match_previous_generator("the","Call", "whale", "ship."):
   ....:     print(tup)
   ....:     
('the', 4.4)
('Call', 0.0)
('whale', False)
('ship.', 3.0)
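The calls above rely on a match_previous_generator whose definition is not shown. A minimal sketch of what it could look like follows; the signature (text first, then the words) and the single-pass dict bookkeeping are assumptions, not the answerer's code.

```python
def match_previous_generator(text, *words):
    # one running total and one match count per requested word
    totals = {word: 0 for word in words}
    counts = {word: 0 for word in words}
    previous_word = ""  # the first word has no predecessor
    for current_word in text.split():
        if current_word in totals:
            totals[current_word] += len(previous_word)
            counts[current_word] += 1
        previous_word = current_word
    # yield (word, average) pairs in the order the words were passed in
    for word in words:
        yield word, (totals[word] / counts[word] if counts[word] else False)


sample = "Call me Ishmael. Call it a day"
print(list(match_previous_generator(sample, "Call", "it", "whale")))
# [('Call', 4.0), ('it', 4.0), ('whale', False)]
```

The text is scanned once regardless of how many words are requested, which is where the dict pays off.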

I suggest splitting this task into some atomic parts:

from __future__ import division  # int / int should result in float

# Input data:
text = "Lorem ipsum dolor sit amet dolor ..."
word = "dolor"

# First of all, let's extract words from string
words = text.split()

# Find indices of picked word in words
indices = [i for i, some_word in enumerate(words) if some_word == word]

# Find indices of preceding words
preceding_indices = [i-1 for i in indices]

# Find preceding words, handle first word case
preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]

# Calculate mean of words length
mean = sum(len(w) for w in preceding_words) / len(preceding_words)

# Check if result is correct
# (len('ipsum') + len('amet')) / 2 = 9 / 2 = 4.5
assert mean == 4.5
Obviously, we can wrap this into a function. I removed the comments here:

def mean_length_of_preceding_words(word, text):
    words = text.split()
    indices = [i for i, some_word in enumerate(words) if some_word == word]
    preceding_indices = [i-1 for i in indices]
    preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]
    mean = sum(len(w) for w in preceding_words) / len(preceding_words)
    return mean
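Obviously, performance is not the point here. I tried to use only builtins (from __future__ … counts as a builtin in my opinion) and to keep the intermediate steps clean and self-explanatory.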

Some test cases:

assert mean_length_of_preceding_words("Lorem", "Lorem ipsum dolor sit amet dolor ...") == 0.0
assert mean_length_of_preceding_words("dolor", "Lorem ipsum dolor sit amet dolor ...") == 4.5
mean_length_of_preceding_words("E", "A B C D")  # ZeroDivisionError - average length of zero words does not exist
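If you want to handle punctuation somehow, you should adjust the splitting step (words = …). The specification does not mention it, so I kept it plain and simple.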
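One possible adjustment, sketched under the assumption that punctuation should simply be stripped from both ends of each token and that tokens consisting only of punctuation (such as a lone dash) should be dropped. The helper name split_words is made up here.

```python
import string


def split_words(text):
    # strip leading/trailing punctuation; inner apostrophes (people's) are kept
    stripped = (w.strip(string.punctuation) for w in text.split())
    # drop tokens that were nothing but punctuation
    return [w for w in stripped if w]


print(split_words("years ago - never mind; take to the ship."))
# ['years', 'ago', 'never', 'mind', 'take', 'to', 'the', 'ship']
```

With this in place, the search word would be "ship" rather than "ship.", which differs from the examples in the question.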

I don't like changing the return type for a special case, but if you insist, you can exit early:

def mean_length_of_preceding_words(word, text):
    words = text.split()
    if word not in words:
        return False
    indices = [i for i, some_word in enumerate(words) if some_word == word]
    preceding_indices = [i-1 for i in indices]
    preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]
    mean = sum(len(w) for w in preceding_words) / len(preceding_words)
    return mean
The last test case changes to:

assert mean_length_of_preceding_words("E", "A B C D") is False

This answer is based on the assumption that you want to strip all punctuation and keep only the words.

I prepend an empty string to the list of words, which satisfies your requirement about the predecessor of the first word of the text.

The result is computed using some smart indexing provided by numpy.

class Preceding_Word_Length():
    def __init__(self, text):
        import numpy as np
        self.words = np.array(
            ['']+[w.strip(''',.?!'":''') for w in text.split() if w != '-'])
        self.indices = np.arange(len(self.words))
        self.lengths = np.fromiter((len(w) for w in self.words), float)
    def mean(self, word):
        import numpy as np
        if word not in self.words:
            return 0.0
        return np.average(self.lengths[self.indices[word==self.words]-1])

text = '''Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.'''

ishmael = Preceding_Word_Length(text)

print(ishmael.mean('and'))   # -> 6.28571428571
print(ishmael.mean('Call'))  # -> 0.0
print(ishmael.mean('xyz'))   # -> 0.0

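I would like to point out that implementing this behavior in a class makes it easy to cache some of the repeated computations for successive analyses of the same text.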
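As a sketch of that caching idea, here in plain Python rather than numpy so it runs without extra dependencies: the word pairs are built once in __init__, and each computed mean is memoized in a dict. The class name CachedPrecedingWordLength and the _cache attribute are hypothetical.

```python
class CachedPrecedingWordLength:
    def __init__(self, text):
        # same cleaning as above: strip punctuation, drop lone dashes,
        # and prepend "" as the predecessor of the first word
        words = [""] + [w.strip(",.?!'\":") for w in text.split() if w != "-"]
        # (preceding word, word) pairs, computed once per text
        self._pairs = list(zip(words, words[1:]))
        self._cache = {}

    def mean(self, word):
        # compute each word's mean at most once, then reuse it
        if word not in self._cache:
            lengths = [len(prev) for prev, curr in self._pairs if curr == word]
            self._cache[word] = sum(lengths) / len(lengths) if lengths else 0.0
        return self._cache[word]


sample = CachedPrecedingWordLength("Call me Ishmael. Some call me Al.")
print(sample.mean("me"))    # 4.0
print(sample.mean("Call"))  # 0.0
print(sample.mean("xyz"))   # 0.0
```

Like the numpy version, this returns 0.0 for a word that does not occur, which makes a missing word indistinguishable from a first-word match.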

Very similar to my previous answer, but without importing numpy:

def average_length(text, word):
    words = ['']+[w.strip(''',.?!'":''') for w in text.split() if w != '-']
    if word not in words: return False
    match = [len(prev) for prev, curr in zip(words[:-1],words[1:]) if curr==word]
    return 1.0*sum(match)/len(match)
