Python: How do I use regular expressions in CountVectorizer's vocabulary?
Tags: python, scikit-learn, nlp, text-classification, countvectorizer


How can I make "the first word in the document is [target word]" a feature?

Consider the following two documents:

example = ["At the moment, my girlfriend is Jenny. She is working as an artist at the moment.",
       "My girlfriend is Susie. She is working as an accountant at the moment."]
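If I want to measure commitment in a relationship, I would like to treat the phrase "at the moment" as a feature, but only when it appears at the very beginning of the document.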

I would like to be able to use regular expressions in the vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

phrases = ["^at the moment", "work"]
vect = CountVectorizer(vocabulary=phrases, ngram_range=(1, 3), token_pattern=r'\w{1,}')
dtm = vect.fit_transform(example)
But this does not seem to work.
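A likely reason, as far as I can tell, is that CountVectorizer compares each vocabulary entry literally against the n-grams it extracts, so regex syntax such as `^` is never interpreted. The sketch below reuses the settings from the attempt above; the helper name `count_with_vocabulary` and the variables `regex_counts` and `literal_counts` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

example = [
    "At the moment, my girlfriend is Jenny. She is working as an artist at the moment.",
    "My girlfriend is Susie. She is working as an accountant at the moment.",
]

def count_with_vocabulary(phrases):
    # Illustrative helper: same settings as the attempt above, only the vocabulary changes.
    vect = CountVectorizer(vocabulary=phrases, ngram_range=(1, 3), token_pattern=r"\w{1,}")
    return vect.fit_transform(example).toarray().tolist()

# Regex syntax is not interpreted: no extracted n-gram is literally "^at the moment",
# and the token "working" is a different string from "work".
regex_counts = count_with_vocabulary(["^at the moment", "work"])

# Plain literal n-grams are counted, but wherever they occur in the document.
literal_counts = count_with_vocabulary(["at the moment", "working"])

print(regex_counts)    # [[0, 0], [0, 0]]
print(literal_counts)  # [[2, 1], [1, 1]]
```

The literal phrase is counted twice in the first document (start and end), which shows that a fixed vocabulary alone cannot express "only at the beginning".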

I also tried this, but got an "empty vocabulary" error:

CountVectorizer(token_pattern = r"(?u)^currently")
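The empty vocabulary can be reproduced without scikit-learn, on the assumption that the default analyzer lowercases each document and then collects tokens with `findall` on the compiled `token_pattern`. The variable names below are illustrative, and the third sentence is an invented document that does start with the target word.

```python
import re

example = [
    "At the moment, my girlfriend is Jenny. She is working as an artist at the moment.",
    "My girlfriend is Susie. She is working as an accountant at the moment.",
]

# The pattern from the attempt above: "^" anchors it to the start of the document.
token_pattern = re.compile(r"(?u)^currently")

# Neither document starts with "currently", so no tokens are produced at all,
# which leaves the vectorizer with nothing to build a vocabulary from.
tokens_per_doc = [token_pattern.findall(doc.lower()) for doc in example]

# A made-up document that does start with the word yields exactly one token.
tokens_when_present = token_pattern.findall("Currently, my girlfriend is Jenny.".lower())

print(tokens_per_doc)       # [[], []]
print(tokens_when_present)  # ['currently']
```

Even when the pattern matches, it would be the only token in the document, so `token_pattern` is not a good place to encode a positional feature.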

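What is the correct way to do this? Do I need a custom vectorizer? Is there a simple tutorial you could link me to? This is my first sklearn project, and I have been googling for hours. Any help is much appreciated.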

OK, I think I have found a way, based on the get_tweet_length() function in this tutorial...

I added this function:

import re

def first_words(text):
    # Return 1 if the text starts with "at the moment" (case-insensitive), else 0.
    matches_list = re.findall('^at the moment', text, re.I)
    if len(matches_list) > 0:
        return 1
    else:
        return 0
and used it with the basic sklearn_helper pipelinize_feature() function, which converts the output into the array format that sklearn functions require:

vect4 = pipelinize_feature(first_words, active=True)
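The helper itself is not part of scikit-learn and its source is not shown here. As a rough sketch of what such a wrapper might look like, the hypothetical `pipelinize_feature_sketch` below applies a per-document function to every document and reshapes the results into one column, using scikit-learn's `FunctionTransformer`. It is a stand-in written for this example, not the tutorial's actual code.

```python
import re

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def first_words(text):
    # 1 if the text starts with "at the moment" (case-insensitive), else 0.
    return 1 if re.match(r"at the moment", text, re.I) else 0

def pipelinize_feature_sketch(function, active=True):
    # Hypothetical stand-in for the tutorial helper: wrap a per-document
    # function so it behaves like a transformer returning an (n_docs, 1) array.
    def apply_to_each(documents):
        if not active:
            return np.zeros((len(documents), 1))
        return np.array([function(doc) for doc in documents]).reshape(-1, 1)

    # validate=False lets raw strings pass through without numeric checks.
    return FunctionTransformer(apply_to_each, validate=False)

example = [
    "At the moment, my girlfriend is Jenny. She is working as an artist at the moment.",
    "My girlfriend is Susie. She is working as an accountant at the moment.",
]

column = pipelinize_feature_sketch(first_words, active=True).fit_transform(example)
print(column.tolist())  # [[1], [0]]
```

Only the first document gets a 1, because the phrase at the end of either document is ignored.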
Then I can use it together with my regular CountVectorizers via FeatureUnion:

unionObj = FeatureUnion([
        ('vect1', vect1),
        ('vect2', vect2),
        ('vect4', vect4)
])
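For readers without the tutorial helper, here is a self-contained sketch of the same idea with a single plain CountVectorizer standing in for `vect1` and `vect2`, whose definitions are not shown. The step names and the `starts_with_phrase` function are assumptions made for this example.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

example = [
    "At the moment, my girlfriend is Jenny. She is working as an artist at the moment.",
    "My girlfriend is Susie. She is working as an accountant at the moment.",
]

def starts_with_phrase(documents):
    # One column: 1 if the document starts with "at the moment", else 0.
    return np.array(
        [[1 if re.match(r"at the moment", doc, re.I) else 0] for doc in documents]
    )

union = FeatureUnion([
    ("words", CountVectorizer()),  # ordinary bag-of-words counts
    ("starts_with_phrase", FunctionTransformer(starts_with_phrase, validate=False)),
])

# FeatureUnion stacks the outputs side by side, so the custom feature
# ends up as the last column of the combined matrix.
features = union.fit_transform(example)
last_column = features.toarray()[:, -1].tolist()
print(last_column)  # [1, 0]
```

The combined matrix can then be passed to a classifier like any other feature matrix.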