Python pandas: converting a column of lists to dummies


I have a dataframe where one column is a list of the groups each of my users belongs to. Something like:

index groups  
0     ['a','b','c']
1     ['c']
2     ['b','c','e']
3     ['a','c']
4     ['b','e']

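What I would like to do is create a series of dummy columns that identify which groups each user belongs to, in order to run some analyses: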

index  a   b   c   d   e
0      1   1   1   0   0
1      0   0   1   0   0
2      0   1   1   0   1
3      1   0   1   0   0
4      0   1   0   0   1


pd.get_dummies(df['groups'])

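does not work, because that just returns one column for each distinct list in my column.

The solution needs to be efficient, as the dataframe will contain 500,000+ rows. Any advice would be appreciated!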

Using `s` for your `df['groups']`:

In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })

In [22]: s
Out[22]:
0    [a, b, c]
1          [c]
2    [b, c, e]
3       [a, c]
4       [b, e]
dtype: object

This is a possible solution:

In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1


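The logic of this is:

- `.apply(pd.Series)` converts the series of lists to a dataframe
- `.stack()` puts everything in one column again (creating a multi-level index)
- `pd.get_dummies()` creates the dummies
- `.sum(level=0)` re-merges the different rows that should be one row (by summing over the second level and keeping only the original level, `level=0`)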
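On pandas 2.0 and later, `.sum(level=0)` is no longer available, so the same four steps can be sketched with `groupby(level=0).sum()` instead. This is a minimal sketch assuming a recent pandas version; the `dtype=int` argument and the explicit `dropna()` are additions for illustration and are not part of the original answer.

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# 1. Expand each list into one row of a dataframe (short lists are padded with NaN).
wide = s.apply(pd.Series)

# 2. Put everything back into one column; the index becomes (original row, position).
#    dropna() is added so the padding never reaches get_dummies.
stacked = wide.stack().dropna()

# 3. One indicator column per distinct value; dtype=int asks for 0/1 instead of booleans.
indicators = pd.get_dummies(stacked, dtype=int)

# 4. Collapse the rows that share the same original index (stands in for .sum(level=0)).
dummies = indicators.groupby(level=0).sum()
print(dummies)
```

Each intermediate variable corresponds to one bullet above, which makes it easy to inspect what every step produces.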

A slight variation that is equivalent is

pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)
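The column-wise variant relies on `sum(level=0, axis=1)`, which recent pandas versions no longer accept. A minimal sketch of the same idea, assuming pandas 2.x, groups the duplicated column labels through a transpose; the transpose trick and `dtype=int` are assumptions added here, not part of the answer.

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

wide = s.apply(pd.Series)

# With an empty prefix the dummy columns are named after the values only,
# so the same letter shows up once per list position it occurred in.
indicators = pd.get_dummies(wide, prefix='', prefix_sep='', dtype=int)

# Sum the columns that share a label: transpose, group by label, sum, transpose back.
dummies = indicators.T.groupby(level=0).sum().T
print(dummies)
```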

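Whether this will be efficient enough, I don't know, but in any case, if performance matters, storing lists in a dataframe is not a very good idea.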

Even though this question has already been answered, I have a faster solution:

df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

If you have empty groups or `NaN`, you could just:

df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
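A minimal sketch of the empty/`NaN` case follows. The sample data is made up for illustration, the filter uses an explicit `isinstance` check rather than `.str.len()`, and `.fillna(0).astype(int)` plus the final `reindex` are assumptions added here so that filtered-out rows come back as all zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one empty list and one NaN among the groups.
df = pd.DataFrame({'groups': [['a', 'b'], [], np.nan, ['b', 'c']]})

# Keep only rows that hold a non-empty list.
has_groups = df['groups'].map(lambda g: isinstance(g, list) and len(g) > 0)

partial = (
    df.loc[has_groups, 'groups']
      .apply(lambda x: pd.Series(1, index=x))  # one Series of ones per row
      .fillna(0)                               # missing memberships become 0
      .astype(int)
)

# Restore the rows that were filtered out, filled with zeros.
dummies = partial.reindex(df.index, fill_value=0)
print(dummies)
```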

How it works: inside the lambda, `x` is your list, for example `['a', 'b', 'c']`. So the `pd.Series` will look as follows:

In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]: 
a    1
b    1
c    1
dtype: int64

When all the `pd.Series` come together, they become a `pd.DataFrame`: their index becomes the columns, and a missing index becomes a column with `NaN`, as you can see next:
In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]: 
     a    b    c    d
0  1.0  1.0  1.0  NaN
1  1.0  1.0  NaN  1.0

Now `fillna` fills those `NaN` with `0`:
In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]: 
     a    b    c    d
0  1.0  1.0  1.0  0.0
1  1.0  1.0  0.0  1.0

And `downcast='infer'` downcasts from `float` to `int`:
In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]: 
   a  b  c  d
0  1  1  1  0
1  1  1  0  1

PS: using `.fillna(0, downcast='infer')` is not required.
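Recent pandas releases deprecate the `downcast` argument of `fillna`, so a version-independent way to get integer columns is an explicit cast. A minimal sketch, reusing the two example series from above:

```python
import pandas as pd

a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])

# fillna(0) replaces the NaN left by missing labels; astype(int) undoes the float upcast.
filled = pd.DataFrame([a, b]).fillna(0).astype(int)
print(filled)
```

The explicit `astype(int)` is safe here only because every remaining value is a whole number after the fill.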
A very fast solution in case the dataframe is very large.
Using
Result:
    a   b   c   e
0   1   1   1   0
1   0   0   1   0
2   0   1   1   1
3   1   0   1   0
4   0   1   0   1
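The code for this answer is not shown above, so the method it used is unknown. As an illustration only, one vectorized way to build the same table with NumPy indexing is sketched below; this is an assumption on my part and not necessarily what the answer used.

```python
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# Row position of every list element, e.g. [0, 0, 0, 1, 2, 2, 2, ...].
lengths = df['groups'].map(len).to_numpy()
rows = np.repeat(np.arange(len(df)), lengths)

# Flatten all lists and map each distinct value to an integer column code.
flat = np.asarray(list(chain.from_iterable(df['groups'])), dtype=object)
codes, labels = pd.factorize(flat, sort=True)

# Fill a zero matrix with ones at (row, column) of every membership.
matrix = np.zeros((len(df), len(labels)), dtype=int)
matrix[rows, codes] = 1

dummies = pd.DataFrame(matrix, index=df.index, columns=labels)
print(dummies)
```

This avoids creating one `pd.Series` per row, which is usually where the time goes on large frames.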

This worked for me, and it has also been suggested that this is faster:
pd.get_dummies(df['groups'].explode()).sum(level=0)

It uses .explode() instead of .apply(pd.Series).stack().
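A minimal sketch of the `explode` approach for pandas 2.x, where `sum(level=0)` is replaced by `groupby(level=0).sum()`. The `all_groups` list is a hypothetical, user-supplied list of every possible group; it is only there to show how a column such as `d`, which no user belongs to, can be added.

```python
import pandas as pd

df = pd.DataFrame({'groups': [['a', 'b', 'c'], ['c'], ['b', 'c', 'e'],
                              ['a', 'c'], ['b', 'e']]})

# One row per (user, group) pair; the original index is repeated for each element.
exploded = df['groups'].explode()

# Indicator columns, then collapse back to one row per original index.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()

# Hypothetical list of all groups, including ones nobody belongs to.
all_groups = ['a', 'b', 'c', 'd', 'e']
dummies = dummies.reindex(columns=all_groups, fill_value=0)
print(dummies)
```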

Comparing with the other solutions:
import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
#           m1        m2        m3
# ms  5.586517  3.821662  2.547167

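What version of pandas are you using?

@joris you probably meant this: pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1), as your code outputs a series with sums rather than a dataframe.

Ah, sorry, the bracket was in the wrong place (the stack should be inside the get_dummies). I am using pandas 0.15.2. @Primer Yes, I wrote that first, but I found the version with stack a bit cleaner (shorter), and it gives exactly the same output.

@Alex, you started from a different input (a string formatted as a list, whereas I start from a list), but I am not sure what the OP wants. Apart from that, you did the get_dummies inside the apply (so for each row instead of once for all rows), which makes it slower than the approach above.

@joris True. Actually, the quotes around the characters in the OP's post make me think this may be the case... Undeleted.

I've tested your solution: it works like a charm. Would you mind explaining further how it works?

To add a prefix to the columns, use: dummies.columns = ['D_' + col_name for col_name in dummies.columns]

@Ufos, you could just use .add_prefix('D_')

@PauloAlves, ouch!

@PauloAlves I tried your solution because the other one is too slow for my dataset, but I keep getting the following error: "InvalidIndexError: Reindexing only valid with uniquely valued Index objects". Do you have any idea where that could come from? In case it comes from the index of the original dataframe, I already checked df.index.is_unique, which outputs True.

Works great! This saved my life.

This approach is definitely more elegant.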