Python: How do I find the count of a word in a string?

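I have the string "Hello I am going to I with hello am". I want to find how many times a word occurs in the string; for example, hello occurs 2 times. I tried this approach, but it only prints characters: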
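def countWord(input_string):
    d = {}
    # Note: iterating over a string yields single characters, not words,
    # which is why this attempt counts characters.
    for word in input_string:
        try:
            d[word] += 1
        except KeyError:
            d[word] = 1

    for k in d.keys():
        print("%s: %d" % (k, d[k]))

countWord("Hello I am going to I with Hello am")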

I want to learn how to find the word count.
from collections import Counter
import re

Counter(re.findall(r"[\w']+", text.lower()))

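Using re.findall is more versatile than using split, because otherwise you cannot account for contractions such as "don't" and "I'll".
Demo (using your example):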
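A minimal sketch of such a demo, assuming Python 3 and the sentence from the question. The helper name `word_counts` is hypothetical and only wraps the one-liner above.

```python
import re
from collections import Counter


def word_counts(text):
    """Build a case-insensitive word-frequency table in a single pass."""
    # [\w']+ keeps apostrophes, so contractions such as "don't" stay whole.
    return Counter(re.findall(r"[\w']+", text.lower()))


counts = word_counts("Hello I am going to I with hello am")

# Each lookup is a dictionary access; the text is not scanned again.
print(counts["hello"])
print(counts["i"])
print(counts["missing"])  # a Counter yields 0 for absent keys instead of raising
```

Building the table once and indexing into it is what keeps repeated lookups cheap.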
If you expect to make many queries like this, this does the O(N) work only once, rather than O(N * #queries) work.
If you want to find the count of a single word, just use count:
input_string.count("Hello")
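One caveat is worth checking before relying on this: `str.count` counts substrings, not whole words. The short sketch below (the variable names are illustrative only) contrasts it with counting in the list of split words.

```python
text = "am ham"

# str.count matches substrings, so the "am" inside "ham" is counted too.
substring_hits = text.count("am")

# list.count on the split words only matches whole, whitespace-separated tokens.
word_hits = text.split().count("am")

print(substring_hits, word_hits)
```

Both calls are also case sensitive, so lowercase the text first if "Hello" and "hello" should count as the same word.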

Use collections.Counter and split():
from collections import Counter

words = input_string.split()
wordCount = Counter(words)
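Keep in mind that whitespace splitting leaves punctuation attached to the words. A small sketch of the difference, assuming a made-up sentence (it is not from the question) and the regex tokenizer shown in the earlier answer:

```python
import re
from collections import Counter

sentence = "Hello, hello! Don't stop."

# Whitespace splitting leaves punctuation glued to the words.
split_counts = Counter(sentence.lower().split())

# A regex tokenizer drops the punctuation but keeps the apostrophe.
regex_counts = Counter(re.findall(r"[\w']+", sentence.lower()))

print(split_counts)
print(regex_counts)
```

For text without punctuation, such as the sentence in the question, both approaches give the same counts.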

collections.Counter is your friend:
>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())

Here is an alternative, case-insensitive approach:
sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2

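It matches by converting both the string and the target to lowercase.
PS: This also takes care of the "am ham".count("am") == 2 problem with str.count() that @DSM pointed out below :)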
Treating Hello and hello as the same word, irrespective of case:
>>> from collections import Counter
>>> strs="Hello I am going to I with hello am"
>>> Counter(map(str.lower,strs.split()))
Counter({'i': 2, 'am': 2, 'hello': 2, 'to': 1, 'going': 1, 'with': 1})

The vector of word occurrence counts is called a bag-of-words.
Scikit-learn provides a nice module to compute it. For example:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,  # 1 is the default; the original used 0
                             max_features=50)

text = ["Hello I am going to I with hello am"]

# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
vocab = vectorizer.get_feature_names_out()

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)

# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

Output:
2 am
1 going
2 hello
1 to
1 with
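Note that `I` is missing from the output: the vectorizer's default token pattern only keeps tokens of two or more word characters. As a rough, stdlib-only sketch of what the counting step amounts to (this is not scikit-learn's implementation, and `bag_of_words` is a hypothetical helper), assuming lowercasing and that same two-character rule:

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, counts) with counts summed over all documents."""
    totals = Counter()
    for doc in documents:
        # Keep only tokens of two or more word characters, lowercased.
        totals.update(re.findall(r"\b\w\w+\b", doc.lower()))
    vocab = sorted(totals)  # alphabetical vocabulary
    return vocab, [totals[word] for word in vocab]


vocab, dist = bag_of_words(["Hello I am going to I with hello am"])
for tag, count in zip(vocab, dist):
    print(count, tag)
```

If single-character words matter for your use case, a plain `Counter` over your own tokenizer may be the simpler choice.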

Part of this code was taken from another source.
FYI: you can use Python's regular expression library re to find all matches of a substring and return them as a list.
import re

input_string = "Hello I am going to I with Hello am"

print(len(re.findall('hello', input_string.lower())))

Prints:
2
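A bare pattern like this matches substrings, so it would also count the "hello" inside a longer word. A hedged sketch of a whole-word variant, assuming a made-up input string and using `\b` word boundaries:

```python
import re

input_string = "Hello Othello said hello"

# Plain substring search also matches the "hello" inside "othello".
substring_hits = len(re.findall("hello", input_string.lower()))

# \b anchors the match to word boundaries; re.escape guards against
# regex metacharacters in the search term.
word = "hello"
pattern = r"\b" + re.escape(word) + r"\b"
word_hits = len(re.findall(pattern, input_string, flags=re.IGNORECASE))

print(substring_hits, word_hits)
```

Using `re.IGNORECASE` avoids having to lowercase the input first.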

You can split the string into elements and count how many there are:
count = len(my_string.split())
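Using count alone can lead to unexpected results, though: "am ham".count("am") == 2
@DSM: Good point. In any case, I was not happy with this solution because it is case sensitive, and I am now looking for an alternative.
Are Hello and hello the same?
Depending on your use case, there is one more thing you may need to consider: some words change meaning depending on their capitalization, like Polish and polish. Maybe that does not matter to you, but it is worth remembering.
Can you define the data set a bit more for us? Are you worried about punctuation, as in I'll, don't, etc.? Some of these questions are raised in the comments below. And what about differences in case?
Is the collections module part of the base Python installation?
I am copying part of a comment @DSM left for me, since I also used str.count() as my initial solution: it has a problem, because "am ham".count("am") yields 2 rather than 1.
@Varun: I believe collections is in Python 2.4 and higher.
@Levon: You are absolutely correct. I believe using Counter with a regex word gatherer is probably the best option. I will edit my answer accordingly.
Credit to @DSM, who made me aware of this in the first place (since I was also using str.count()).
I would use Counter(strs.lower().split()). It removes some overhead for a faster running time.
Isn't this Martijn Pieters' solution now?
@DSM: I somehow did not see his solution; I am reverting my solution to the original version. :)
+1 for re. The split-based solutions cannot handle phrases that contain punctuation.
This is the best answer in my opinion. +1
Hello, and welcome to SO. Your answer contains only code. It would be better if you could also add some commentary explaining what it does and how. Could you edit your answer to add that? Thank you!
Code-only answers are considered low quality: make sure to provide an explanation of what your code does and how it solves the problem. Adding more information to your post will help both the asker and future readers. See also "Explaining entirely code-based answers".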

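def countSub(pat, string):
    # Count occurrences of pat in string, including overlapping ones,
    # by comparing pat against every possible start position.
    result = 0
    for i in range(len(string) - len(pat) + 1):
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break
        else:
            # The inner loop finished without a break: full match at i.
            result += 1
    return result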
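Like str.count(), this counts substrings rather than whole words, but it also counts overlapping matches, which str.count() does not. A minimal sketch of the same idea using `str.find`, assuming Python 3; `count_overlapping` is a hypothetical helper, and unlike the function above it returns 0 for an empty pattern.

```python
def count_overlapping(pat, text):
    """Count occurrences of pat in text, allowing overlapping matches."""
    if not pat:
        return 0
    count = 0
    start = text.find(pat)
    while start != -1:
        count += 1
        # Resume one character later so overlapping matches are found.
        start = text.find(pat, start + 1)
    return count


print(count_overlapping("aa", "aaaa"), "aaaa".count("aa"))
print(count_overlapping("am", "am ham"))
```

For whole-word counts, one of the Counter-based answers is usually a better fit.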