Python: Convert one int into multiple bool columns in pandas

python performance pandas numpy

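Background: I have a DataFrame containing integers. Each integer encodes a set of features that are present or absent for that row.

I want these features as named columns in my DataFrame.

Problem: my current solution blows up in memory and is very slow. How can I make it more memory-efficient?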

import pandas as pd
df = pd.DataFrame({'some_int':range(5)})
df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).apply(pd.Series).rename(columns=dict(zip(range(4), ["f1", "f2", "f3", "f4"])))

  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0

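It seems that the .apply(pd.Series) step is what slows this down. Everything else is quite fast until I add it.

I cannot skip it, because a Series of plain lists does not make a DataFrame.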

I think you need:

a = pd.DataFrame(df['some_int'].astype(int)
                               .apply(bin)
                               .str[2:]
                               .str.zfill(4)
                               .apply(list).values.tolist(), columns=["f1","f2","f3","f4"])
print (a)
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0

Another solution, slightly changed:

a = pd.DataFrame([list('{:04b}'.format(x)) for x in df['some_int'].values], 
                  columns=['f1', 'f2', 'f3', 'f4'])
print (a)
  f1 f2 f3 f4
0  0  0  0  0
1  0  0  0  1
2  0  0  1  0
3  0  0  1  1
4  0  1  0  0
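Note that the list-based solutions above produce columns of one-character strings ('0' and '1'), not numbers or booleans. A minimal sketch of turning that result into real bool columns, assuming the same four feature names used in the question:

```python
import pandas as pd

df = pd.DataFrame({'some_int': range(5)})

# Build the string-valued frame exactly as in the list-comprehension solution.
a = pd.DataFrame([list('{:04b}'.format(x)) for x in df['some_int'].values],
                 columns=['f1', 'f2', 'f3', 'f4'])

# Each cell is the string '0' or '1'; comparing against '1' gives bool columns.
flags = a == '1'
```

This only fixes the dtype of the result. The intermediate lists of strings are still built, so it does not address the memory cost of the string-based approach.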

Timings:

df = pd.DataFrame({'some_int':range(100000)})

In [80]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(20).apply(list).values.tolist())
1 loop, best of 3: 231 ms per loop

In [81]: %timeit pd.DataFrame([list('{:020b}'.format(x)) for x in df['some_int'].values])
1 loop, best of 3: 232 ms per loop

In [82]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:020b}'.format(x))).values.tolist())
1 loop, best of 3: 222 ms per loop

In [83]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loop, best of 3: 343 ms per loop

In [84]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loop, best of 3: 16.4 s per loop

In [87]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 11.4 ms per loop

You can use the following:

or:

Here is a vectorized NumPy approach:

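import numpy as np
def num2bin(nums, width):
    # AND every number with each power of two (highest first); non-zero means the bit is set.
    return ((nums[:,None] & (1 << np.arange(width-1,-1,-1)))!=0).astype(int)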
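Since the question asks for named bool columns, the same idea can be wrapped so that it returns a bool DataFrame directly. This is a sketch only: num2bool_frame is a hypothetical helper name, not part of the answer, and it assumes non-negative integers and one name per bit.

```python
import numpy as np
import pandas as pd

def num2bool_frame(nums, names):
    """Return a bool DataFrame with one column per bit, most significant bit first."""
    nums = np.asarray(nums)
    width = len(names)
    powers = 1 << np.arange(width - 1, -1, -1)   # e.g. [8, 4, 2, 1] for width 4
    bits = (nums[:, None] & powers) != 0         # bool array, one byte per cell
    return pd.DataFrame(bits, columns=list(names))

df = pd.DataFrame({'some_int': [1, 5, 3, 8, 4]})
features = num2bool_frame(df['some_int'].values, ['f1', 'f2', 'f3', 'f4'])
df = df.join(features)
```

Skipping the final astype(int) keeps the result as NumPy bool, which takes one byte per value instead of the platform integer size.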

Explanation

1) Inputs: nums = np.array([1,5,3,8,4]) and width = 4.

2) Get the range of powers of 2:

In [100]: (1 << np.arange(width-1,-1,-1))
Out[100]: array([8, 4, 2, 1])

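3) Turn nums into a column with nums[:,None] and bit-AND it against those powers of 2. To understand the bit-ANDing, consider the number 5 from nums and its bit-AND against each of the powers of 2, [8, 4, 2, 1]: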

In [103]: 5 & 8    # 0101 & 1000
Out[103]: 0

In [104]: 5 & 4    # 0101 & 0100
Out[104]: 4

In [105]: 5 & 2    # 0101 & 0010
Out[105]: 0

In [106]: 5 & 1    # 0101 & 0001
Out[106]: 1

Thus, we see that there is no intersection with [8, 2], whereas for the others we get non-zero results.
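The four bit-AND results above can be reproduced in plain Python, without NumPy, as a quick check of the idea:

```python
# Bit-AND the number 5 (binary 0101) with each power of two, highest first.
powers = [8, 4, 2, 1]
results = [5 & p for p in powers]           # [0, 4, 0, 1]
bits = [int(r != 0) for r in results]       # [0, 1, 0, 1], i.e. binary 0101
```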

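4) In the final stage, look for the matches (non-zeros) by comparing against 0, which gives a boolean array, then convert to int dtype so that the matches become 1s and everything else becomes 0s: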

In [107]: matches = nums[:,None] & (1 << np.arange(width-1,-1,-1))

In [108]: matches!=0
Out[108]: 
array([[False, False, False,  True],
       [False,  True, False,  True],
       [False, False,  True,  True],
       [ True, False, False, False],
       [False,  True, False, False]], dtype=bool)

In [109]: (matches!=0).astype(int)
Out[109]: 
array([[0, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 0, 0, 0],
       [0, 1, 0, 0]])
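For comparison, NumPy also ships np.unpackbits, which expands bytes into bits. The sketch below is a different technique from the bit-AND approach in this answer, and num2bin_unpack is a hypothetical name. It assumes non-negative integers below 2**32 and a width between 1 and 32.

```python
import numpy as np

def num2bin_unpack(nums, width):
    """Bits of non-negative ints below 2**32, most significant bit first."""
    # Store each number as 4 big-endian bytes so the bytes run high to low.
    as_bytes = np.asarray(nums).astype('>u4').view(np.uint8).reshape(-1, 4)
    # unpackbits expands every byte into 8 bits (high bit first by default).
    all_bits = np.unpackbits(as_bytes, axis=1)      # shape (n, 32)
    return all_bits[:, -width:]                     # keep the lowest `width` bits

nums = np.array([1, 5, 3, 8, 4])
out = num2bin_unpack(nums, 4)
```

The result is a uint8 array of 0s and 1s, so it can be passed to pd.DataFrame in the same way as the output of num2bin.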

Slightly "simpler":

pd.DataFrame(df.some_int.apply('{:04b}'.format).apply(list).tolist(), columns=['f1', 'f2', 'f3', 'f4'])

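Trying it out on my dataset.

@JonClements Actually, the .values part makes it slightly faster.

@ayhan Ah okay... I was cutting out the astype, bin, slice and zfill and using str.format there... (I would be surprised if that were less efficient.)

Okay, it seems my example was too simple and there is a hidden problem: OverflowError: Python int too large to convert to C long. I have 22 features...

In my solution, the pd.Series instantiation is what costs the most. You are doing that in your solution as well. Is there a way around it? Also, the lambda function makes me suspicious.

If the second solution works, I will try it. The first one is still running, but it looks slow.

The second solution gives strings, not ints or bools, and is quite memory heavy.

@firelynx, I believe you want to use Divakar's vectorized solution ;-)

I'm busy right now, but I will wait until tomorrow to mark the correct solution. That gives everyone a chance to find the best one.

Holy macaroni, this one is fast...

And then along comes Divakar and everyone's jaw drops at how fast NumPy can be ;)

This solution is great, but a bit magic. I have only a vague idea of the inner workings of the num2bin function. I would like it to be expressed more clearly. (I am a clean code fanatic.)

@firelynx Added some explanation. Check it out!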
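One of the comments above reports an OverflowError ("Python int too large to convert to C long"). The thread does not diagnose it. If the cause is a platform whose default NumPy integer is only 32 bits wide, one possible hedge is to make the arithmetic explicitly 64-bit. The sketch below assumes non-negative inputs and width of at most 64; num2bin64 is a hypothetical name.

```python
import numpy as np

def num2bin64(nums, width):
    """Like num2bin, but with explicit unsigned 64-bit arithmetic (width <= 64)."""
    nums = np.asarray(nums, dtype=np.uint64)
    shifts = np.arange(width - 1, -1, -1, dtype=np.uint64)
    powers = np.uint64(1) << shifts          # stays uint64 on every platform
    return (nums[:, None] & powers) != 0     # bool array, MSB first

wide = num2bin64([2**39 + 1, 0], 40)
```

Whether this resolves the error in the commenter's setup is an assumption; it only removes any dependence on the platform's default integer width.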

In [70]: df
Out[70]: 
   some_int
0         1
1         5
2         3
3         8
4         4

In [71]: pd.DataFrame(num2bin(df.some_int.values, 4),
                      columns=["f1", "f2", "f3", "f4"])
Out[71]: 
   f1  f2  f3  f4
0   0   0   0   1
1   0   1   0   1
2   0   0   1   1
3   1   0   0   0
4   0   1   0   0

In [98]: nums = np.array([1,5,3,8,4])

In [99]: width = 4

In [100]: (1 << np.arange(width-1,-1,-1))
Out[100]: array([8, 4, 2, 1])

In [101]: nums[:,None]
Out[101]: 
array([[1],
       [5],
       [3],
       [8],
       [4]])

In [102]: nums[:,None] & (1 << np.arange(width-1,-1,-1))
Out[102]: 
array([[0, 0, 0, 1],
       [0, 4, 0, 1],
       [0, 0, 2, 1],
       [8, 0, 0, 0],
       [0, 4, 0, 0]])
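The step above relies on NumPy broadcasting: the column of numbers and the row of powers are stretched against each other so that every number meets every power. A small sketch that makes the shapes explicit, using the same sample values:

```python
import numpy as np

nums = np.array([1, 5, 3, 8, 4])
powers = 1 << np.arange(3, -1, -1)   # array([8, 4, 2, 1])

column = nums[:, None]               # shape (5, 1): one row per number
matches = column & powers            # broadcast (5, 1) with (4,) to (5, 4)
```

No Python-level loop runs here. The pairing of numbers and powers happens inside NumPy, which is where the speed-up over the per-row string solutions comes from.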

In [58]: df = pd.DataFrame({'some_int':range(100000)})

# @jezrael's soln-1
In [59]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).values.tolist())
1 loops, best of 3: 198 ms per loop

# @jezrael's soln-2
In [60]: %timeit pd.DataFrame([list('{:20b}'.format(x)) for x in df['some_int'].values])
10 loops, best of 3: 154 ms per loop

# @jezrael's soln-3
In [61]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:20b}'.format(x))).values.tolist())
10 loops, best of 3: 132 ms per loop

# @MaxU's soln-1
In [62]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loops, best of 3: 193 ms per loop

# @MaxU's soln-2
In [64]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loops, best of 3: 11.8 s per loop

# Proposed in this post
In [65]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 5.64 ms per loop
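The %timeit lines above need IPython. Outside IPython, a comparable measurement can be sketched with the standard timeit module. The helper names with_strings and with_numpy are made up for this sketch, and the absolute numbers will differ from the ones quoted above.

```python
import timeit
import numpy as np
import pandas as pd

def num2bin(nums, width):
    return ((nums[:, None] & (1 << np.arange(width - 1, -1, -1))) != 0).astype(int)

df = pd.DataFrame({'some_int': range(100000)})
vals = df['some_int'].values

def with_strings():
    # Zero-padded 20-digit binary string per value, split into characters.
    return pd.DataFrame([list('{:020b}'.format(x)) for x in vals])

def with_numpy():
    return pd.DataFrame(num2bin(vals, 20))

# Best of 3 runs each; absolute numbers depend on the machine and library versions.
t_str = min(timeit.repeat(with_strings, number=1, repeat=3))
t_np = min(timeit.repeat(with_numpy, number=1, repeat=3))
print('strings: {:.1f} ms, numpy: {:.1f} ms'.format(t_str * 1e3, t_np * 1e3))
```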