Python: problem with count vectorization of Hindi text

When I count-vectorize Hindi text, the feature names are automatically cut short, as if they had been stemmed:

from sklearn.feature_extraction.text import CountVectorizer
test = []
test.append("हमें फिल्म बहुत अच्छी लगी ।")
test.append("फिल्म में कुछ बेहतरीन गाने हैं ।")
cv = CountVectorizer().fit(test)
print(cv.get_feature_names())  # scikit-learn >= 1.0: use cv.get_feature_names_out()

Output: ['अच', 'बह', 'लग', 'हतर', 'हम']
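The truncated names can be reproduced without scikit-learn. The following is a minimal sketch, assuming the vectorizer still uses its documented default token_pattern, r"(?u)\b\w\w+\b", with the standard re module; the names DEFAULT_TOKEN_PATTERN and default_tokens are made up for this illustration.

```python
import re

# Default token_pattern documented for CountVectorizer (assumed unchanged here)
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

docs = [
    "हमें फिल्म बहुत अच्छी लगी ।",
    "फिल्म में कुछ बेहतरीन गाने हैं ।",
]

def default_tokens(text):
    # re's \w does not match Devanagari vowel signs or the virama,
    # so each word is cut wherever one of those marks appears
    return re.findall(DEFAULT_TOKEN_PATTERN, text)

# Collect the distinct tokens across both documents, sorted like a vocabulary
vocabulary = sorted({tok for doc in docs for tok in default_tokens(doc)})
print(vocabulary)
```

This should print the same five fragments as the output above, which suggests the truncation comes from the token pattern rather than from any stemming step.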

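The analyzer used by CountVectorizer() does not seem to handle this script well: its default token pattern relies on re, whose \w does not match Devanagari vowel signs, so words are cut at those characters. You can define a custom analyzer that controls how words are separated. To split the words correctly, you can use a regular expression: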

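import regex  # third-party module: pip install regex
from sklearn.feature_extraction.text import CountVectorizer

def custom_analyzer(text):
    # extract words of at least 2 characters
    words = regex.findall(r'\w{2,}', text)
    for w in words:
        yield w

test = []
test.append("हमें फिल्म बहुत अच्छी लगी ।")
test.append("फिल्म में कुछ बेहतरीन गाने हैं ।")
# pass the callable as the analyzer so it replaces the default tokenization
count_vect = CountVectorizer(analyzer=custom_analyzer)
xv = count_vect.fit_transform(test)
count_vect.get_feature_names()  # scikit-learn >= 1.0: use get_feature_names_out()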

I used regex because its \w covers more Unicode characters than the one in the re module (thanks for the explanation).
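If adding the third-party regex dependency is not an option, a standard-library analyzer may be enough for whitespace-separated text like this. The sketch below is not from the original answer: whitespace_analyzer is a hypothetical helper that splits on whitespace, trims punctuation such as the danda, and keeps tokens of at least two characters.

```python
import unicodedata

def whitespace_analyzer(text):
    """Hypothetical analyzer: split on whitespace, trim punctuation, keep tokens of 2+ characters."""
    for raw in text.split():
        start, end = 0, len(raw)
        # strip leading punctuation (Unicode categories starting with "P")
        while start < end and unicodedata.category(raw[start]).startswith("P"):
            start += 1
        # strip trailing punctuation, e.g. the danda "।"
        while end > start and unicodedata.category(raw[end - 1]).startswith("P"):
            end -= 1
        token = raw[start:end]
        if len(token) >= 2:
            yield token

tokens = list(whitespace_analyzer("हमें फिल्म बहुत अच्छी लगी ।"))
print(tokens)
```

Because the analyzer parameter accepts a callable, this function could be passed as CountVectorizer(analyzer=whitespace_analyzer) in the same way as custom_analyzer. It will not split words that are joined by punctuation in the middle, so it is a weaker substitute for the regex-based version.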

What is your question? What output are you trying to get?