Python pandas: create a new column from the number embedded in another column's string


python numpy pandas dataframe


I have a DataFrame named `table`, loaded like this:

import pandas as pd
import numpy as np
table = pd.read_csv(main_data, sep='\t')

It produces a result like this:

NAME    SYMBOL    STRING
A       blah       A34SA
B       foo        BS2812D
...

How can I create a new column in pandas so that I end up with the following:

NAME     SYMBOL      STRING    NUMBER
   A       blah       A34SA        34 
   B        foo     BS2812D      2812

So far I have:

table['NUMBER'] = table.STRING.str[int(filter(str.isdigit, table.STRING))]

But this does not work in this context.

Thanks, everyone!

You can try using a regular expression to extract the number from the string:

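import re

def extNumber(row):
    # Take the first run of digits in STRING and store it as an integer
    row['NUMBER'] = int(re.search(r"(\d+)", row.STRING).group(1))
    return row

# apply returns a new DataFrame, so assign the result back
df = df.apply(extNumber, axis=1)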

The following should work:

table['NUMBER'] = table.STRING.apply(lambda x: int(''.join(filter(str.isdigit, x))))
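
One caveat with joining every digit: if a string contains more than one run of digits, the runs are concatenated into a single number. A minimal sketch in plain Python that contrasts the two behaviours; the sample string 'X1Y22' is made up, and `join_all_digits` and `first_digit_run` are hypothetical helper names, not pandas functions:

```python
import re

def join_all_digits(s):
    # Keep every digit character, in order, and parse the result as one int.
    return int(''.join(filter(str.isdigit, s)))

def first_digit_run(s):
    # Take only the first contiguous run of digits; None if there is none.
    m = re.search(r'\d+', s)
    return int(m.group()) if m else None

for s in ['A34SA', 'BS2812D', 'X1Y22']:
    print(s, join_all_digits(s), first_digit_run(s))
```

For the two strings in the question both helpers agree; they only differ when digits appear in more than one place.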

You can use a regular expression:

import re

table['NUMBER'] = table['STRING'].apply(lambda x: re.sub(r'[^0-9]','',x))

I would do it this way:

In [22]: df['NUMBER'] = df.STRING.str.extract(r'(?P<NUMBER>\d+)', expand=True).astype(int)

In [23]: df
Out[23]:
  NAME SYMBOL   STRING  NUMBER
0    A   blah    A34SA      34
1    B    foo  BS2812D    2812

In [24]: df.dtypes
Out[24]:
NAME      object
SYMBOL    object
STRING    object
NUMBER     int32
dtype: object
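
If some rows might contain no digits at all, `astype(int)` fails on the resulting missing values. A sketch of one way to handle that, assuming a small made-up frame (the third row is invented for illustration) and pandas' nullable `Int64` dtype:

```python
import pandas as pd

df = pd.DataFrame({
    'NAME': ['A', 'B', 'C'],
    'SYMBOL': ['blah', 'foo', 'bar'],
    'STRING': ['A34SA', 'BS2812D', 'NODIGITS'],  # last row: made-up no-match case
})

# expand=False returns a Series; rows without a match become missing values.
digits = df['STRING'].str.extract(r'(\d+)', expand=False)

# to_numeric parses the digit strings; Int64 keeps integers alongside <NA>.
df['NUMBER'] = pd.to_numeric(digits).astype('Int64')
print(df)
```

Whether a missing value or some default such as 0 is the right result for a no-match row depends on the data, so treat this as one option rather than the answer's method.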


Timing against a 20M-row DF:

In [71]: df = pd.concat([df] * 10**7, ignore_index=True)

In [72]: df.shape
Out[72]: (20000000, 3)

In [73]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 3 columns):
NAME      object
SYMBOL    object
STRING    object
dtypes: object(3)
memory usage: 457.8+ MB

In [74]: %timeit df.STRING.str.replace(r'\D+', '').astype(int)
1 loop, best of 3: 507 ms per loop

In [75]: %timeit df.STRING.str.extract('(?P<NUMBER>\d+)', expand=True).astype(int)
1 loop, best of 3: 434 ms per loop

In [76]: %timeit df.STRING.apply(lambda x: int(''.join(filter(str.isdigit, x))))
1 loop, best of 3: 562 ms per loop

In [77]: %timeit df['STRING'].apply(lambda x: re.sub(r'[^0-9]','',x))
1 loop, best of 3: 552 ms per loop
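
A note for newer pandas versions: from pandas 2.0 the `regex` argument of `Series.str.replace` defaults to `False`, so the pattern used in `In [74]` would be treated as a literal string unless `regex=True` is passed explicitly. A small sketch on made-up data:

```python
import pandas as pd

s = pd.Series(['A34SA', 'BS2812D'])

# regex=True makes r'\D+' a pattern: strip every run of non-digit characters.
numbers = s.str.replace(r'\D+', '', regex=True).astype(int)
print(numbers.tolist())
```

Passing `regex=True` is also accepted by older versions that have the parameter, so spelling it out keeps the snippet portable.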
So you want to extract an integer from the string?

Exactly. I want to extract the number. It may sit between letters, or between symbols.