Python: how to ignore certain words when counting word frequency in a text?


python python-2.7 text pandas scikit-learn


How can I ignore some words, such as "a" and "the", when counting the frequency of words in a text?

import collections

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'phrase': pd.Series('The large distance between cities. The small distance. The')})
# Pass the cell's text itself; str(df['phrase']) would tokenize the Series repr (index, dtype, truncated text)
f = CountVectorizer().build_tokenizer()(df['phrase'].iloc[0])

result = collections.Counter(f).most_common(1)

print(result)

The result is "The", but I want to get "distance" as the most common word.

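It is best to avoid counting those entries in the first place, like this:

ignore = {'the', 'a', 'if', 'in', 'it', 'of', 'or'}
# build_tokenizer() keeps the original case, so lowercase before comparing
words = (x.lower() for x in f)
result = collections.Counter(w for w in words if w not in ignore).most_common(1)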
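The same filtering can also be sketched with only the standard library. This is a minimal illustration, not the method from the answer: the helper name most_common_words is hypothetical, and the regular expression is an assumption chosen to mimic CountVectorizer's default of tokens with two or more word characters.

```python
import collections
import re


# Hypothetical helper for illustration; not part of pandas or scikit-learn.
def most_common_words(text, ignore, n=1):
    """Return the n most frequent words in text, skipping those in ignore."""
    # Lowercase first so that 'The' and 'the' are treated as the same word.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = collections.Counter(t for t in tokens if t not in ignore)
    return counts.most_common(n)


text = 'The large distance between cities. The small distance. The'
ignore = {'the', 'a', 'if', 'in', 'it', 'of', 'or'}
result = most_common_words(text, ignore)
print(result)
```

Keeping the ignore list as a set makes each membership check constant time, which matters once the text grows beyond a single phrase.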

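Another option is to use the stop_words parameter of CountVectorizer. These are words you are not interested in, and the analyzer discards them.

f = CountVectorizer(stop_words=['the', 'a', 'if', 'in', 'it', 'of', 'or']).build_analyzer()(df['phrase'].iloc[0])
result = collections.Counter(f).most_common(1)
print(result)
[(u'distance', 2)]

Note that the tokenizer does not perform preprocessing (lowercasing, accent stripping) or remove stop words, so you need to use the analyzer here.

You can also use stop_words='english' to remove English stop words automatically (see sklearn.feature_extraction.text.ENGLISH_STOP_WORDS for the full list).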
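The stop_words='english' variant can be tried out as follows. This is a sketch that assumes scikit-learn is installed and that the text is passed to the analyzer as a plain string; the variable names are illustrative only.

```python
import collections

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

text = 'The large distance between cities. The small distance. The'

# build_analyzer() returns a callable that lowercases, tokenizes and
# drops stop words; 'english' selects scikit-learn's built-in list.
analyze = CountVectorizer(stop_words='english').build_analyzer()
tokens = analyze(text)

result = collections.Counter(tokens).most_common(1)
print(result)

# The built-in list can be inspected directly to see what gets dropped.
print('the' in ENGLISH_STOP_WORDS)
```

The built-in list is fairly broad, so it is worth checking whether it drops words that matter for your data before relying on it.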
