
Python: how can I combine lexical, semantic, and bag-of-words features extracted from tweets in a classifier?


I want to combine several groups of features extracted from tweets (lexical, semantic, and bag-of-words features) into one classifier.

I am working on an authorship verification problem on Twitter.

Here is my code:

import numpy as np
import pandas as pd
import nltk
import scipy.sparse
from scipy.cluster.vq import whiten
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score
# word_tokenizer and sentence_tokenizer are tokenizer objects defined elsewhere in my script

train = pd.read_csv("./av/av1/train.csv")
test = pd.read_csv("./av/av1/test.csv")

num_chapters = len('train.csv')
fvs_lexical = np.zeros((len(train['text']), 3), np.float64)
fvs_punct = np.zeros((len(train['text']), 3), np.float64)
for e, ch_text in enumerate(train['text']):
    # note: the nltk.word_tokenize includes punctuation
    tokens = nltk.word_tokenize(ch_text.lower())
    words = word_tokenizer.tokenize(ch_text.lower())
    sentences = sentence_tokenizer.tokenize(ch_text)
    vocab = set(words)
    words_per_sentence = np.array([len(word_tokenizer.tokenize(s))
                                   for s in sentences])

    # average number of words per sentence
    fvs_lexical[e, 0] = words_per_sentence.mean()
    # sentence length variation
    fvs_lexical[e, 1] = words_per_sentence.std()
# apply whitening to decorrelate the features
fvs_lexical = whiten(fvs_lexical) 

# bag of words features
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english') 

vectorizer = FeatureUnion([  ("baw", bow_vectorizer), ("fvs_lexical",fvs_lexical)])
matrix = vectorizer.fit_transform(train['text'].values.astype('U'))
print "num of features: " , len(vectorizer.get_feature_names())


X = matrix.toarray()
y = np.asarray(train['label'].values.astype('U'))

model = LogisticRegression()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = cross_val_score(model,X_train,y_train,cv=3,
  scoring='f1_micro')
y_pred = model.fit(X_train, y_train).predict(X_test)

print 'F1 score:', f1_score(y_test, y_pred, average=None)  # calculating the F1 score of the predictions
But I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-87-1a69ca9a65a2> in <module>()
     24 bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
     25 
---> 26 vectorizer = FeatureUnion([  ("baw", bow_vectorizer), ("fvs_lexical",fvs_lexical)])
     27 matrix = vectorizer.fit_transform(train['text'].values.astype('U'))
     28 print "num of features: " , len(vectorizer.get_feature_names())

C:\Users\AsusPc\Anaconda2\lib\site-packages\sklearn\pipeline.pyc in __init__(self, transformer_list, n_jobs, transformer_weights)
    616         self.n_jobs = n_jobs
    617         self.transformer_weights = transformer_weights
--> 618         self._validate_transformers()
    619 
    620     def get_params(self, deep=True):

C:\Users\AsusPc\Anaconda2\lib\site-packages\sklearn\pipeline.pyc in _validate_transformers(self)
    660                 raise TypeError("All estimators should implement fit and "
    661                                 "transform. '%s' (type %s) doesn't" %
--> 662                                 (t, type(t)))
    663 
    664     def _iter(self):

TypeError: All estimators should implement fit and transform. '[[1.29995156 0.         0.        ]
 [5.38551361 0.         0.        ]
 [0.37141473 0.         0.        ]
 ...
 [0.92853683 0.         0.        ]
 [1.1142442  3.52964785 0.        ]
 [1.85707366 0.         0.        ]]' (type <type 'numpy.ndarray'>) doesn't
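As far as I understand the error, every element of a FeatureUnion has to be a transformer implementing fit and transform, while fvs_lexical is just a precomputed NumPy array. Below is a minimal sketch of the kind of wrapper I assume is expected; the class name LexicalFeatures is my own, and it simply recomputes the same three statistics inside transform so the union can call it on the raw texts:

from sklearn.base import BaseEstimator, TransformerMixin

class LexicalFeatures(BaseEstimator, TransformerMixin):
    # hypothetical wrapper: recomputes the lexical statistics so that
    # FeatureUnion sees an object with fit/transform instead of a raw array
    def fit(self, texts, y=None):
        return self

    def transform(self, texts):
        rows = []
        for ch_text in texts:
            words = word_tokenizer.tokenize(ch_text.lower())
            sentences = sentence_tokenizer.tokenize(ch_text)
            vocab = set(words)
            words_per_sentence = np.array([len(word_tokenizer.tokenize(s))
                                           for s in sentences])
            rows.append([words_per_sentence.mean(),
                         words_per_sentence.std(),
                         len(vocab) / float(len(words))])
        return np.array(rows)

vectorizer = FeatureUnion([("baw", bow_vectorizer),
                           ("fvs_lexical", LexicalFeatures())])

That at least avoids passing a raw array where an estimator is expected, but it does not explain the accuracy problem below.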

In fact, I have already tried the solution below. With only the TF-IDF + BoW features it gives an accuracy of 0.899029126214, but when I add the lexical features the accuracy drops to 0.7747572815533981. I used FeatureUnion to combine the two text feature matrices (TF-IDF + BoW), and then used hstack to stack the FeatureUnion output with the lexical vector, as follows:

    # average number of words per sentence
    fvs_lexical[e, 0] = words_per_sentence.mean()
    # sentence length variation
    fvs_lexical[e, 1] = words_per_sentence.std()
    # lexical diversity
    fvs_lexical[e, 2] = len(vocab) / float(len(words))
# apply whitening to decorrelate the features
fvs_lexical = whiten(fvs_lexical)
# bag of words features
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english') 
#tfidf 
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english') 
#vectorizer and fitting for the unified features 
vectorizer = FeatureUnion([("baw", bow_vectorizer), ("tfidf", tfidf_vectorizer)])
fvs_lexical_vector = CountVectorizer(fvs_lexical)
x1 = vectorizer.fit_transform(train['text'].values.astype('U'))
x2 = fvs_lexical_vector.fit_transform(train['text'].values.astype('U'))
x = scipy.sparse.hstack((x1, x2), format='csr')
y = np.asarray(train['label'].values.astype('U')) 

Then I ran the logistic regression, as sketched below.
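For completeness, that step is just a repeat of the earlier split and scoring, now on the stacked matrix (a sketch only, reusing x and y from the snippet above):

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = LogisticRegression()
scores = cross_val_score(model, X_train, y_train, cv=3, scoring='f1_micro')
y_pred = model.fit(X_train, y_train).predict(X_test)
print 'F1 score:', f1_score(y_test, y_pred, average=None)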

Can anyone tell me why the accuracy degraded?