Python: count the groups of consecutive digits in a Series of strings

python pandas

Consider the pd.Series s:

import pandas as pd
import numpy as np

np.random.seed([3,1415])
p = (.35, .35, .1, .1, .1)
s = pd.DataFrame(np.random.choice(['', 1] + list('abc'), (10, 20), p=p)).sum(1)

s

0    11111bbaacbbca1
1    1bab111aaaaca1a
2    11aaa1b11a11a11
3     1ca11bb1b1a1b1
4        bb1111b1111
5       b1111c1aa111
6     1b1a111b11b1ab
7        1bc111ab1ba
8      a11b1b1b11111
9        1cc1ab1acc1
dtype: object

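I want to count the number of groups of consecutive digits in each element of s. Put another way: how many integers are there in each string?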

I'd like the result to look like this:

0    2
1    3
2    5
3    6
4    2
5    3
6    5
7    3
8    4
9    4
dtype: int64

I'm looking for efficiency, though elegance matters too.
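To pin down what is being counted, here is a minimal plain-Python sketch. The helper name count_digit_groups is made up for illustration, and the three strings are copied from the sample Series above.

```python
import re

# Three strings taken from the sample Series in the question.
samples = ["11111bbaacbbca1", "1bab111aaaaca1a", "bb1111b1111"]


def count_digit_groups(text):
    """Count maximal runs of consecutive digits in `text`."""
    # r"\d+" matches one whole run of digits at a time, so the number
    # of matches equals the number of groups.
    return len(re.findall(r"\d+", text))


counts = [count_digit_groups(x) for x in samples]
print(counts)  # [2, 3, 2]
```

Any vectorised answer should reproduce these numbers row by row.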

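UPDATE: first replace every group of consecutive digits with a single 1, then remove everything that is not a 1, and finally take the length of the resulting string: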

In [159]: s.replace(['\d+', '[^1]+'], ['1', ''], regex=True).str.len()
Out[159]:
0    2
1    3
2    5
3    6
4    2
5    3
6    5
7    3
8    4
9    4
dtype: int64
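The same three steps can be spelled out one at a time. This is only a sketch: it uses two chained str.replace calls with an explicit regex=True instead of the single Series.replace call above, on the assumption that this form behaves the same across pandas versions.

```python
import pandas as pd

# Three strings copied from the sample Series.
s = pd.Series(["11111bbaacbbca1", "1bab111aaaaca1a", "a11b1b1b11111"])

# Step 1: collapse every run of digits into a single marker character "1".
collapsed = s.str.replace(r"\d+", "1", regex=True)
# Step 2: drop everything that is not the marker.
markers = collapsed.str.replace(r"[^1]+", "", regex=True)
# Step 3: one marker is left per digit group, so the length is the count.
counts = markers.str.len()

print(markers.tolist())  # ['11', '111', '1111']
print(counts.tolist())   # [2, 3, 4]
```

The trick does not depend on the digits being 1: any run of digits is collapsed to the marker before the non-marker characters are removed.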

Timing against a 100K-row Series:

In [160]: %timeit big.replace(['\d+', '[^1]+'], ['1', ''], regex=True).str.len()
1 loop, best of 3: 1 s per loop

In [161]: %timeit big.apply(lambda x: len(re.sub('\D+', ' ', x).strip().split()))
1 loop, best of 3: 1.18 s per loop

In [162]: %timeit big.str.replace(r'\D+', ' ').str.strip().str.split().str.len()
1 loop, best of 3: 1.25 s per loop

In [163]: big.shape
Out[163]: (100000,)

Timing against a 1M-row Series:

In [164]: big = pd.concat([s] * 10**5, ignore_index=True)

In [165]: %timeit big.replace(['\d+', '[^1]+'], ['1', ''], regex=True).str.len()
1 loop, best of 3: 9.98 s per loop

In [166]: %timeit big.apply(lambda x: len(re.sub('\D+', ' ', x).strip().split()))
1 loop, best of 3: 11.7 s per loop

In [167]: %timeit big.str.replace(r'\D+', ' ').str.strip().str.split().str.len()
1 loop, best of 3: 12.6 s per loop

In [168]: big.shape
Out[168]: (1000000,)
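For readers without IPython's %timeit, a rough harness along these lines could reproduce the comparison. It is a sketch under assumptions: the function names and the repeats value are invented here, the first candidate uses two chained str.replace calls rather than the Series.replace call timed above, and the absolute numbers will differ by machine and pandas version.

```python
import re
import timeit

import pandas as pd

# Three rows copied from the sample Series; `repeats` is a hypothetical knob.
base = pd.Series(["11111bbaacbbca1", "1bab111aaaaca1a", "bb1111b1111"])
repeats = 2000
big = pd.concat([base] * repeats, ignore_index=True)


def two_replaces(ser):
    # Collapse digit runs to "1", delete the rest, measure the length.
    collapsed = ser.str.replace(r"\d+", "1", regex=True)
    return collapsed.str.replace(r"[^1]+", "", regex=True).str.len()


def apply_re_sub(ser):
    # Plain Python per element: non-digits become spaces, then split.
    return ser.apply(lambda x: len(re.sub(r"\D+", " ", x).strip().split()))


def chained_str(ser):
    # Same idea as apply_re_sub, written with vectorised str methods.
    return ser.str.replace(r"\D+", " ", regex=True).str.strip().str.split().str.len()


candidates = [two_replaces, apply_re_sub, chained_str]

# The candidates must agree before their timings are worth comparing.
results = [fn(big).tolist() for fn in candidates]
assert results[0] == results[1] == results[2]

timings = {}
for fn in candidates:
    # Best of three single runs, in seconds.
    timings[fn.__name__] = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```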

Explanation:

In [169]: s.replace(['\d+', '[^1]+'], ['1', ''], regex=True)
Out[169]:
0        11
1       111
2     11111
3    111111
4        11
5       111
6     11111
7       111
8      1111
9      1111
dtype: object

Old (slow) answer:

How about using s.str.extractall() and then counting the matches for each row?

Explanation:

In [131]: s.str.extractall('(\d+)')
Out[131]:
             0
  match
0 0      11111
  1          1
1 0          1
  1        111
  2          1
2 0         11
  1          1
  2         11
  3         11
  4         11
3 0          1
  1         11
  2          1
  3          1
  4          1
  5          1
4 0       1111
  1       1111
5 0       1111
  1          1
  2        111
6 0          1
  1          1
  2        111
  3         11
  4          1
7 0          1
  1        111
  2          1
8 0         11
  1          1
  2          1
  3      11111
9 0          1
  1          1
  2          1
  3          1
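The extractall output above has one row per match, indexed by (original row, match number). The source does not show the counting step, so the following is only a sketch of one way to finish the job, assuming a group-by on the outer index level.

```python
import pandas as pd

# Two strings from the sample Series plus one without digits.
s = pd.Series(["11111bbaacbbca1", "1bab111aaaaca1a", "abc"])

# One row per digit group; level 0 of the index is the original row label.
matches = s.str.extractall(r"(\d+)")

# Count rows per original label. Strings without any digits produce no
# rows at all, so reindex against s.index to give them an explicit 0.
counts = matches.groupby(level=0).size().reindex(s.index, fill_value=0)

print(counts.tolist())  # [2, 3, 0]
```

Building the intermediate DataFrame is a plausible reason for this approach being slower than the replace-based one.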

This was my solution:

s.str.replace(r'\D+', ' ', regex=True).str.strip().str.split().str.len()
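A related shortcut that does not appear in any of the answers here: Series.str.count counts the non-overlapping matches of a regular expression in each element, which is the quantity being asked for. A minimal sketch on the sample data:

```python
import pandas as pd

# The ten strings from the sample Series in the question.
s = pd.Series([
    "11111bbaacbbca1", "1bab111aaaaca1a", "11aaa1b11a11a11",
    "1ca11bb1b1a1b1", "bb1111b1111", "b1111c1aa111",
    "1b1a111b11b1ab", "1bc111ab1ba", "a11b1b1b11111", "1cc1ab1acc1",
])

# Each match of r"\d+" is one maximal run of digits.
counts = s.str.count(r"\d+")

print(counts.tolist())  # [2, 3, 5, 6, 2, 3, 5, 3, 4, 4]
```

Whether it is faster than the approaches timed in this thread was not measured here.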

100000 rows

The solutions from PiRSquared and MaxU are very nice.

However, I noticed that apply is usually a bit faster than chaining several string methods:

In [142]: %timeit s.apply(lambda x: len(re.sub('\D+', ' ', x).strip().split()))
1 loop, best of 3: 367 ms per loop

In [143]: %timeit s.str.replace(r'\D+', ' ').str.strip().str.split().str.len()
1 loop, best of 3: 403 ms per loop

In [145]: s.shape
Out[145]: (100000L,)
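If the apply route is taken, the pattern can be compiled once and reused. The sketch below is an illustration rather than part of the answer: the name DIGIT_RUN is invented, and findall replaces the sub/strip/split sequence timed above.

```python
import re

import pandas as pd

# Hypothetical module-level constant: compile the pattern a single time.
DIGIT_RUN = re.compile(r"\d+")

# Three strings copied from the sample Series.
s = pd.Series(["1ca11bb1b1a1b1", "b1111c1aa111", "1cc1ab1acc1"])

# apply calls the lambda once per element, in plain Python.
via_apply = s.apply(lambda x: len(DIGIT_RUN.findall(x)))

# The original formulation from the timing above, for comparison.
via_sub = s.apply(lambda x: len(re.sub(r"\D+", " ", x).strip().split()))

print(via_apply.tolist())  # [6, 3, 4]
```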

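Nice! It would be interesting to compare the timings on a larger Series…

@MaxU added timings.

You may also want to benchmark s.apply(lambda x: len(re.sub('\D+', ' ', x).strip().split())). There are two str methods, and apply is 10% faster.

@piRSquared, that's not fair! Why did you accept my answer, which is twice as slow?

@piRSquared, by the way, inspired by your solution, I was trying to get rid of the empty elements when using s.str.split('\D+')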
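The empty elements mentioned in that last comment come from strings that start or end with a non-digit: splitting on \D+ then yields an empty piece at that edge. One possible way to drop them, sketched with the standard re module (the helper name nonempty_pieces is hypothetical):

```python
import re

import pandas as pd

# Two strings from the sample Series plus one without digits.
s = pd.Series(["bb1111b1111", "1bab111aaaaca1a", "abc"])

# Splitting on runs of non-digits leaves '' at a non-digit edge.
print(re.split(r"\D+", "bb1111b1111"))  # ['', '1111', '1111']


def nonempty_pieces(text):
    """Split on non-digit runs and keep only the non-empty pieces."""
    return [piece for piece in re.split(r"\D+", text) if piece]


pieces = s.apply(nonempty_pieces)
counts = pieces.str.len()

print(counts.tolist())  # [2, 3, 0]
```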