
Python: How to get one-hot encoding of specific words in a text in pandas?


Suppose I have a DataFrame and a list of words, i.e.

import pandas as pd

toxic = ['bad','horrible','disguisting']
df = pd.DataFrame({'text':['You look horrible','You are good','you are bad and disguisting']})

main = pd.concat([df,pd.DataFrame(columns=toxic)]).fillna(0)
samp = main['text'].str.split().apply(lambda x : [i for i in toxic if i in x])

for i,j in enumerate(samp):
    for k in j:
        main.loc[i,k] = 1
This results in:

   bad  disguisting  horrible                         text
0    0            0         1            You look horrible
1    0            0         0                 You are good
2    1            1         0  you are bad and disguisting
This is a bit faster than get_dummies, but for loops in pandas are not advisable when there is a large amount of data.
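For comparison, the per-cell .loc writes in the loop above are what dominate its cost. A minimal vectorized sketch (not from the original post) builds the whole indicator matrix in one pass and joins it on:

import numpy as np

# one row of 0/1 flags per text, one column per word in toxic
tokens = df['text'].str.split()
mat = np.array([[int(w in ws) for w in toxic] for ws in tokens])
out = df.join(pd.DataFrame(mat, columns=toxic, index=df.index))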

I tried using str.get_dummies, which one-hot encodes every word in the Series, and that makes it slower:

pd.concat([df, main['text'].str.get_dummies(' ')[toxic]], axis=1)

                          text  bad  horrible  disguisting
0            You look horrible    0         1            0
1                 You are good    0         0            0
2  you are bad and disguisting    1         0            1
If I try the same with scikit-learn:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(toxic)
main['text'].str.split().apply(le.transform)
this raises ValueError: y contains new labels. Is there a way to ignore that error in scikit-learn?
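For reference, LabelEncoder has no built-in option to ignore unseen labels, but the error can be avoided by filtering each token list against le.classes_ before transforming (a sketch, not from the original post):

# keep only tokens the encoder was fitted on before calling transform
known = set(le.classes_)
main['text'].str.split().apply(lambda words: le.transform([w for w in words if w in known]))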

How can I improve the speed here? Is there any other fast way to achieve the same result?

Use:
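A sketch of the approach, reconstructed from the cv calls and the sparse result shown below: scikit-learn's CountVectorizer restricted to the toxic vocabulary, fed into the (pre-1.0) pd.SparseDataFrame constructor:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=toxic)  # only count the words we care about
r = pd.SparseDataFrame(cv.fit_transform(df['text']),
                       df.index,
                       cv.get_feature_names(),
                       default_fill_value=0)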

Result:

In [127]: r
Out[127]:
   bad  horrible  disguisting
0    0         1            0
1    0         0            0
2    1         0            1

In [128]: type(r)
Out[128]: pandas.core.sparse.frame.SparseDataFrame

In [129]: r.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
bad            3 non-null int64
horrible       3 non-null int64
disguisting    3 non-null int64
dtypes: int64(3)
memory usage: 104.0 bytes

In [130]: r.memory_usage()
Out[130]:
Index          80
bad             8   #  <--- NOTE: it's using 8 bytes (1x int64) instead of 24 bytes for three values (3x8)
horrible        8
disguisting     8
dtype: int64

PS: in older Pandas versions, sparse columns lost their sparseness (became dense) after joining a SparseDataFrame with a regular DataFrame; now we can mix regular Series (columns) and SparseSeries, which is a very nice feature.

The accepted answer is deprecated; see the release notes:

SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide will help you migrate from previous versions.

Pandas 1.0.5 solution:

# cv is the CountVectorizer(vocabulary=toxic) from the accepted answer
r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(df['text']),
                                      df.index,
                                      cv.get_feature_names())
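Note that recent scikit-learn releases removed CountVectorizer.get_feature_names() in favor of get_feature_names_out(), so that call may need updating depending on your scikit-learn version.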

Is this really scalable on big datasets, and is it indeed faster? When toxic has length 10000 it is very slow. Do you have any suggestions? @Dark, well, I guess I would need a bigger sample dataset to play with... Is it much slower compared to the str.get_dummies() approach? Oh no it's not, str.get_dummies is slower. Maybe I need to clean up toxic and reduce the number of words. Thanks for this approach; I never had the chance to use sparse DataFrames, and I'm glad I can use them now.
In [137]: r2 = df.join(r)

In [138]: r2
Out[138]:
                          text  bad  horrible  disguisting
0            You look horrible    0         1            0
1                 You are good    0         0            0
2  you are bad and disguisting    1         0            1

In [139]: r2.memory_usage()
Out[139]:
Index          80
text           24
bad             8
horrible        8
disguisting     8
dtype: int64

In [140]: type(r2)
Out[140]: pandas.core.frame.DataFrame

In [141]: type(r2['horrible'])
Out[141]: pandas.core.sparse.series.SparseSeries

In [142]: type(r2['text'])
Out[142]: pandas.core.series.Series