Python/Pandas - Performance improvement - splitting a column into parts and converting string sequences to lists

I have a dataframe called target. The dataframe has a column named "CNAE2". If I run
print(target.CNAE2)

I get the following output:
id
3 NaN
7 NaN
17 50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05
18 32.67-1-00
19 46.93-1-00, 49.40-0-00
20 NaN
The non-NaN values in the column are strings. They follow a hierarchical logic, and my intent is to:
a) turn them into lists;
b) split them into levels (which I call "pai", "vo" and "bisavo"), each in a separate column:
id CNAE2 CNAE2pai CNAE2vo CNAE2bisavo
3 NaN NaN NaN NaN
7 NaN NaN NaN NaN
17 [50.30-1-02, 52.32-0-00, 52.50-8-05] [50.30-1, 52.32-0, 52.50-8] [50.30, 52.32, 52.50] [50, 52, 52]
18 [32.67-1-00] [32.67-1] [32.67] [32]
19 [46.93-1-00, 46.40-0-00] [46.93-1, 46.40-0] [46.93, 46.40] [46, 46]
20 NaN NaN NaN NaN
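Each level in the table above is just a fixed-width prefix of the code: the first 7 characters give "pai", the first 5 give "vo", and the first 2 give "bisavo". A minimal sketch of that slicing for a single cell (the helper name split_levels is my own):

```python
def split_levels(cell):
    # Split a comma-separated CNAE string and take fixed-width prefixes.
    codes = cell.split(", ")
    return (
        codes,                    # full codes, e.g. "50.30-1-02"
        [c[:7] for c in codes],   # pai:    "50.30-1"
        [c[:5] for c in codes],   # vo:     "50.30"
        [c[:2] for c in codes],   # bisavo: "50"
    )

print(split_levels("50.30-1-02, 52.32-0-00"))
```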
I was able to achieve this result, but my code relies on a lot of loops, and since I am running it on a fairly large dataframe it takes far too long. That is not workable. I used the following code:
for i in target.index:
    cnaes = str(target['CNAE2'][i]).split(', ')
    target.CNAE2[i] = cnaes
    if cnaes == ['nan'] or cnaes == 'NaN' or cnaes == "":
        target.CNAE2[i] = 'NaN'
    else:
        target.CNAE2pai[i] = []
        target.CNAE2vo[i] = []
        target.CNAE2bisavo[i] = []
        for k in range(len(cnaes)):
            y = cnaes[k][:7]
            target['CNAE2pai'][i].append(y)
        for k in range(len(cnaes)):
            y = cnaes[k][:5]
            target['CNAE2vo'][i].append(y)
        for k in range(len(cnaes)):
            y = cnaes[k][:2]
            target['CNAE2bisavo'][i].append(y)
        target.CNAE2pai[i] = list(set(target.CNAE2pai[i]))
        target.CNAE2vo[i] = list(set(target.CNAE2vo[i]))
        target.CNAE2bisavo[i] = list(set(target.CNAE2bisavo[i]))
Can anyone suggest a more efficient way to achieve this result?

Haven't tried it, but it is best to avoid .append on a dataframe. Build a plain list first, append to that, and only feed it into the dataframe once the result is complete.

I used the apply function here, which should be faster than iterating over rows, a set for the NaN lookup, which is faster than chained or comparisons, and list comprehensions, which tend to be faster than nested loops. I haven't tested this, but hopefully it helps:
import pandas as pd

# Create dummy data and dataframe
d = {"3": "NaN", "7": "NaN", "17": "50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05",
     "18": "32.67-1-00", "19": "46.93-1-00, 49.40-0-00", "20": "NaN"}
target = pd.DataFrame([[k, d[k]] for k in d], columns=["id", "CNAE"])

# Loop across desired columns
nans = set(["nan", "NaN", ""])
for col in [("CNAE2pai", 7), ("CNAE2vo", 5), ("CNAE2bisavo", 2)]:
    target[col[0]] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i[:col[1]] for i in x.split(", ")])
target["CNAE2"] = target.CNAE.apply(lambda x: "NaN" if x in nans else x.split(", "))
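One difference from the original loop: the apply version keeps duplicate prefixes, while the loop finished with list(set(...)). If uniqueness matters, a sketch using dict.fromkeys, which de-duplicates while preserving first-seen order (the helper name unique_prefixes is my own):

```python
import pandas as pd

nans = {"nan", "NaN", ""}

def unique_prefixes(x, width):
    # dict.fromkeys removes duplicates but keeps first-seen order,
    # unlike list(set(...)) which scrambles it
    if x in nans:
        return "NaN"
    return list(dict.fromkeys(code[:width] for code in x.split(", ")))

d = {"17": "50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05", "20": "NaN"}
target = pd.DataFrame([[k, d[k]] for k in d], columns=["id", "CNAE"])
for name, width in [("CNAE2pai", 7), ("CNAE2vo", 5), ("CNAE2bisavo", 2)]:
    # Series.apply forwards extra keyword arguments to the function
    target[name] = target.CNAE.apply(unique_prefixes, width=width)
```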
Edit

On my system, the lambda function with list comprehensions produces faster results than the groupby approach:
d = {"3": "NaN", "7": "NaN", "17": "50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05",
     "18": "32.67-1-00", "19": "46.93-1-00, 49.40-0-00", "20": "NaN"}
target = pd.DataFrame([[k, d[k]] for k in d], columns=["id", "CNAE"])

def lambda_func(target):
    # Loop across desired columns
    nans = set(["nan", "NaN", ""])
    for col in [("CNAE2pai", 7), ("CNAE2vo", 5), ("CNAE2bisavo", 2)]:
        target[col[0]] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i[:col[1]] for i in x.split(", ")])
    target["CNAE2"] = target.CNAE.apply(lambda x: "NaN" if x in nans else x.split(", "))
    return target

def groupby_func(target):
    s = target.CNAE.str.split(', ', expand=True).stack()
    pai = s.str.rsplit('-', n=1).str[0].groupby(level=0).apply(list)
    vo = s.str.split('-', n=1).str[0].groupby(level=0).apply(list)
    bisavo = s.str.split('.').str[0].groupby(level=0).apply(list)
    base = s.groupby(level=0).apply(list)
    target = pd.concat(
        [base, pai, vo, bisavo], axis=1,
        keys=['', 'pai', 'vo', 'bisavo']
    ).add_prefix('CNAE2').reindex(target.index)
    return target
Results:

%timeit lambda_func(target)
1000 loops, best of 3: 930 µs per loop
%timeit groupby_func(target)
100 loops, best of 3: 6.3 ms per loop
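One caveat with both functions above: they detect missing values by matching the literal string "NaN". If the column holds real float NaN (as the print(target.CNAE2) output in the question suggests), the set membership test will not catch them. A hedged variant using pd.isna instead (the helper name prefixes is my own):

```python
import pandas as pd

def prefixes(x, width):
    # pd.isna covers float NaN, None and pd.NA, not just the string "NaN"
    if pd.isna(x) or x == "":
        return float("nan")
    return [code[:width] for code in x.split(", ")]

target = pd.DataFrame({"id": ["3", "17"],
                       "CNAE": [float("nan"), "50.30-1-02, 52.32-0-00"]})
for name, width in [("CNAE2pai", 7), ("CNAE2vo", 5), ("CNAE2bisavo", 2)]:
    target[name] = target.CNAE.apply(prefixes, width=width)
```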