Python 3.x: Dask client.persist raises AssertionError when I try to use HashingVectorizer
I am trying to vectorize a dask.dataframe with the Dask HashingVectorizer. I want the vectorized result to stay on the cluster (a distributed system), which is why I use client.persist when transforming the data. For some reason, however, I get the error below:
Traceback (most recent call last):
  File "/home/dodzilla/my_project/components_with_adapter/vectorizers/base_vectorizer.py", line 112, in hybrid_feature_vectorizer
    CLUSTERING_FEATURES=self.clustering_features)
  File "/home/dodzilla/my_project/components_with_adapter/vectorizers/text_vectorizer.py", line 143, in vectorize
    X = self.client.persist(fitted_vectorizer.transform, combined_data)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 2860, in persist
    assert all(map(dask.is_dask_collection, collections))
AssertionError
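The failing line in the traceback checks every object passed to persist with dask.is_dask_collection. The following minimal sketch shows that check on a plain function versus a lazy Dask object. The `transform` function here is a hypothetical stand-in for `fitted_vectorizer.transform`, not the real vectorizer.

```python
import dask


def transform(texts):
    # Hypothetical stand-in for fitted_vectorizer.transform.
    return [t.upper() for t in texts]


# Wrapping the call in dask.delayed produces a lazy Dask collection.
lazy_result = dask.delayed(transform)(["tom twenty", "nick twentyone"])

# A plain function (or bound method) is not a Dask collection,
# so an assertion like the one in the traceback fails for it.
function_is_collection = dask.is_dask_collection(transform)
lazy_is_collection = dask.is_dask_collection(lazy_result)

print(function_is_collection, lazy_is_collection)
```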
I can't share the data, but all the necessary information about it is as follows:
>>> type(combined_data)
<class 'dask.dataframe.core.Series'>
>>> type(combined_data.compute())
<class 'pandas.core.series.Series'>
>>> combined_data.compute().shape
12
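I hope this information is enough.
Important note: when I call client.compute instead, I don't get any error, but as far as I understand this does not run on the cluster of machines; it runs on the local machine. It returns a csr matrix rather than a lazily evaluated dask.array.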
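For comparison, here is a small sketch of how client.persist and client.compute behave when they are given an actual Dask collection. It assumes an in-process cluster started with `Client(processes=False)` rather than the `localhost:8786` scheduler from the question, and it uses a toy `dask.delayed` sum in place of the vectorizer.

```python
import dask
from dask.distributed import Client, Future

# Assumption: a throwaway in-process cluster, with the dashboard disabled.
client = Client(processes=False, dashboard_address=None)

# A lazy Dask collection (a Delayed object) standing in for real work.
lazy_total = dask.delayed(sum)([1, 2, 3])

# persist starts the computation on the workers and returns a collection
# of the same kind, so it can still be used in further lazy operations.
persisted = client.persist(lazy_total)

# compute also sends the work to the scheduler, but returns a Future
# that resolves to the concrete result.
future = client.compute(lazy_total)

persisted_is_collection = dask.is_dask_collection(persisted)
future_is_future = isinstance(future, Future)
total = future.result()
persisted_total = persisted.compute()

client.close()
```

Both calls accept only Dask collections as their first argument, which is why neither is the right tool for sending an arbitrary function call to the cluster.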
This is not how client.persist is supposed to be used. The functions I was looking for are client.submit and client.map. In my case, client.submit solved my problem.
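A minimal sketch of the client.submit approach might look like the following. It is not the exact code from my project: it assumes scikit-learn's HashingVectorizer on a plain list of strings instead of the dask-ml vectorizer on a Dask Series, an in-process cluster instead of `localhost:8786`, and a hypothetical helper named `vectorize_on_cluster`.

```python
from dask.distributed import Client
from sklearn.feature_extraction.text import HashingVectorizer


def vectorize_on_cluster(client, texts, n_features=1024):
    """Hypothetical helper: run transform on a worker and return a Future."""
    vectorizer = HashingVectorizer(n_features=n_features,
                                   alternate_sign=False, norm=None)
    # submit(func, *args) schedules func(*args) on the cluster; the result
    # stays on the worker until it is explicitly requested.
    return client.submit(vectorizer.transform, texts)


# Assumption: a throwaway in-process cluster, with the dashboard disabled.
client = Client(processes=False, dashboard_address=None)

future = vectorize_on_cluster(client, ["Tom twenty", "nick twentyone"])

# Calling result() pulls the sparse matrix back to the local process.
matrix = future.result()
print(matrix.shape)

client.close()
```

The Future can also be passed as an argument to later client.submit calls, so intermediate results do not have to leave the cluster.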
from stop_words import get_stop_words
from dask_ml.feature_extraction.text import HashingVectorizer as daskHashingVectorizer
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client
def convert_dataframe_to_single_text(documents):
    """
    Combine all of the columns into 1 column.
    """
    if isinstance(documents, dd.DataFrame):
        cols = documents.columns
        documents['combined'] = documents[cols].apply(func=(lambda row: ' '.join(row.values.astype(str))), axis=1,
                                                      meta=('str'))
        document_texts = documents.drop(cols, axis=1)
    else:
        raise TypeError('Wrong type of data. Expected a Dask DF but received ', type(documents))
    return document_texts
# Init the client.
client = Client('localhost:8786')
# Get stopwords
stopwords = get_stop_words(language="english")
# Create dask dataframe from pandas dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':["twenty", "twentyone", "nineteen", "eighteen"]}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=1)
# Init the vectorizer
vectorizer = daskHashingVectorizer(stop_words=stopwords, alternate_sign=False,
                                   norm=None, binary=False,
                                   n_features=10000)
# Combine all of the columns into 1 column.
combined_data = convert_dataframe_to_single_text(df)
# Fit the vectorizer.
fitted_vectorizer = client.persist(vectorizer.fit(combined_data))
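# Transform the data. This is the call that raises the AssertionError:
# client.persist expects Dask collections, not a function and its arguments.
X = client.persist(fitted_vectorizer.transform, combined_data)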