How to replicate the default sklearn CountVectorizer tokenization using only a regex in Python?

python scikit-learn


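I don't want to use CountVectorizer itself, but I am trying to reproduce its tokenization. I know it strips some special characters and converts the text to lowercase. I tried the regex r'[\W_]+' with ' ' as the separator, but I still can't replicate it. Any ideas?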

Use the regex '(?u)\\b\\w\\w+\\b' instead. To reproduce the CountVectorizer result:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

text = "This a simple example. And this is another."
text_transformed = cv.fit_transform([text])
vocab = sorted(cv.vocabulary_)  # terms in alphabetical order, matching the matrix columns
counts = text_transformed.toarray()
print(pd.DataFrame(counts, columns=vocab))

The same result can be obtained without CountVectorizer:

import re
from collections import Counter

import pandas as pd

# same pattern as CountVectorizer's default token_pattern
regex = re.compile(r'(?u)\b\w\w+\b')
tokens = regex.findall(text.lower())  # lowercasing mirrors the lowercase=True param
vocab = sorted(set(tokens))
counter = Counter(tokens)
counts = [counter[key] for key in vocab]
# one-liner alternative with asterisk unpacking:
# vocab, counts = list(zip(*sorted(counter.items())))
print(pd.DataFrame([counts], columns=vocab))

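Explanation: CountVectorizer uses the parameter token_pattern='(?u)\\b\\w\\w+\\b', which is the regex pattern used to extract tokens from the text. It matches runs of two or more word characters, so punctuation only acts as a separator and single-character tokens such as "a" are dropped: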

print(cv)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
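To see why splitting on r'[\W_]+' does not give the same tokens as the default token_pattern, here is a small comparison that uses only the standard library. The helper names split_tokens and findall_tokens and the second sample string are illustrative assumptions, not part of sklearn.

```python
import re

# Default token_pattern of CountVectorizer, written as a raw string
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def split_tokens(text):
    """Approach from the question: split on runs of non-word characters and underscores."""
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]

def findall_tokens(text):
    """Approach from the answer: extract runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "This a simple example. And this is another."
print(split_tokens(sample))    # still contains the one-letter token 'a'
print(findall_tokens(sample))  # 'a' is dropped

# Underscore counts as a word character for \w, so it is kept inside a token
print(split_tokens("snake_case x"))    # splits at the underscore, keeps 'x'
print(findall_tokens("snake_case x"))  # keeps 'snake_case', drops 'x'
```

The two differences visible here are the handling of single-character tokens and of underscores; both come from extracting matches of \w\w+ rather than splitting on separators.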

By supplying this regex to re.findall you get the same tokenization, and by counting the resulting tokens you get the output of CountVectorizer:

   and  another  example  is  simple  this
0    1        1        1   1       1     2
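The same idea extends to several documents. The following is a minimal sketch, assuming only the default settings shown above (lowercasing, unigrams, no stop words); count_matrix is a hypothetical helper and the second document is made up for illustration. It returns plain lists rather than the sparse matrix that CountVectorizer produces.

```python
import re
from collections import Counter

# Default token_pattern of CountVectorizer, written as a raw string
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_matrix(docs):
    """Return (vocab, rows): sorted vocabulary and one row of term counts per document."""
    counters = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in docs]
    vocab = sorted(set().union(*counters))  # union of all tokens, alphabetical
    rows = [[counter[term] for term in vocab] for counter in counters]  # missing terms count as 0
    return vocab, rows

docs = [
    "This a simple example. And this is another.",
    "Another simple one.",
]
vocab, rows = count_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

Sorting the vocabulary keeps the column order alphabetical, which is the order used in the table above.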
