Removing a for loop in Python when finding and replacing text
I have two DataFrames and want to do a find-and-replace between them. In the current_title column of the df_find DataFrame, I want to search every row for any value that occurs in the keyword column of the df_replace DataFrame and, if it is found, replace it with the corresponding value from the keywordLength column.

I have already removed the loop over df_find: instead of iterating over every row and calling replace on it, I use str.replace, which is the vectorized form of replace.

Performance matters in my case because both DataFrames run into gigabytes. So I would also like to get rid of the loop over df_replace and use some other, more efficient way to go through all rows of that DataFrame.
import pandas as pd

df_find = pd.read_csv("input_find.csv")
df_replace = pd.read_csv("input_replace.csv")

# Replace each keyword in turn, ignoring case; this is the loop I want to remove
for i, j in zip(df_replace.keyword, df_replace.keywordLength):
    df_find.current_title = df_find.current_title.str.replace(i, j, case=False)
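df_replace
This DataFrame contains the data needed for the find and replace.

keyword          keywordLength
IT Manager       ##10##
Sales Manager    ##13##
IT Analyst       ##12##
Store Manager    ##13##

df_find is where the transformation needs to happen.
Before running the find-and-replace code: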
current_title
I have been working here as a store manager since after I passed from college
I am sales manager and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a IT analyst and because of my sheer drive and dedication, I was promoted to IT manager position within 3 years
After running the find-and-replace code above:
current_title
I have been working here as a ##13## since after I passed from college
I am ##13## and primarily work in the ASEAN region. My primary rolw is to bring new customers.
I initially joined as a ##12## and because of my sheer drive and dedication, I was promoted to ##10## position within 3 years
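I would be forever grateful! Thanks.

If I understand correctly, you should be able to do a relatively simple merge of the two datasets (plus a few other lines) and get the result you want. Without your dataset, I just made one up. The code below could probably be more elegant, but it gets you where you need to be in four lines and, most importantly, without loops.

Setup: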
df_find = pd.DataFrame({
    'current_title': ['a', 'a', 'b', 'c', 'b', 'c', 'b', 'a'],
    'other': ['this', 'is', 'just', 'a', 'bunch', 'of', 'random', 'words']
})
df_replace = pd.DataFrame({'keyword': ['a', 'c'], 'keywordlength': ['x', 'z']})

Code:
# Keep the original row order so it can be restored at the end. Someone with more experience can probably bypass this step.
df_find['idx'] = df_find.index
# Merge the two data sets by matching "current_title" against "keyword"
# (drop takes columns= as a keyword; the positional axis argument was removed in pandas 2.0)
dfx = df_find.merge(df_replace, left_on='current_title', right_on='keyword', how='outer').drop(columns='keyword')
# Copy the non-null "keywordlength" values into "current_title"
matched = dfx['keywordlength'].notnull()
dfx.loc[matched, 'current_title'] = dfx.loc[matched, 'keywordlength']
# Clean up: restore the original order and drop the helper columns
df_find = dfx.sort_values('idx').drop(columns=['keywordlength', 'idx'])
Output:
  current_title   other
0             x    this
1             x      is
3             b    just
6             z       a
4             b   bunch
7             z      of
5             b  random
2             x   words
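For whole-cell exact matches like the made-up data above, the same result can probably be reached without a merge or a helper index column. The following is only a sketch under two assumptions that are not stated in the answer: every keyword appears once in df_replace, and a title is replaced only when the entire cell equals a keyword. It does not handle keywords embedded in longer sentences.

```python
import pandas as pd

df_find = pd.DataFrame({
    'current_title': ['a', 'a', 'b', 'c', 'b', 'c', 'b', 'a'],
    'other': ['this', 'is', 'just', 'a', 'bunch', 'of', 'random', 'words'],
})
df_replace = pd.DataFrame({'keyword': ['a', 'c'], 'keywordlength': ['x', 'z']})

# Build a keyword -> replacement lookup (assumes keywords are unique).
lookup = df_replace.set_index('keyword')['keywordlength']

# map() yields NaN where a title is not a keyword;
# fillna() then restores the original title for those rows.
df_find['current_title'] = (
    df_find['current_title'].map(lookup).fillna(df_find['current_title'])
)

print(df_find)
```

Because map works row by row on the existing index, the original row order and index are kept, so no re-sorting step is needed.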
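Comments:

Are the matched values exact matches, or only substring matches? And what if there are multiple matches? Do you replace only the first one?

All matches are replaced. Exact match.

Take a look at regular expressions and re.sub. You could read the file as text, replace what you need with a regex, and then open it as a CSV.

str.replace is a vectorized implementation of re.sub. It operates on the whole column rather than on a single row.

This is not what I am looking for. I have added a "before" and "after" example to make it clearer. Thanks.

OK, that is why I asked whether they were exact matches or substring matches... In any case, this highlights the importance of posting a dataset.

It is an exact match. We are replacing the whole keyword with its keywordLength value; that is what the code in the for loop does. I don't know whether you meant something else by "exact match", but I meant it literally. Is there anything you can do to help?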
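The regex idea raised in the comments can be taken one step further so that the loop over df_replace disappears: join all keywords into one alternation pattern and let a callback look up the replacement for whatever matched. This is a sketch, not something anyone in the thread posted. The names mapping, pattern and lookup_replacement are made up for illustration, and the data is copied from the question's example.

```python
import re

import pandas as pd

df_replace = pd.DataFrame({
    "keyword": ["IT Manager", "Sales Manager", "IT Analyst", "Store Manager"],
    "keywordLength": ["##10##", "##13##", "##12##", "##13##"],
})
df_find = pd.DataFrame({"current_title": [
    "I have been working here as a store manager since after I passed from college",
    "I am sales manager and primarily work in the ASEAN region. My primary rolw is to bring new customers.",
    "I initially joined as a IT analyst and because of my sheer drive and dedication, "
    "I was promoted to IT manager position within 3 years",
]})

# Lower-cased keyword -> replacement, so the lookup is case-insensitive.
mapping = dict(zip(df_replace["keyword"].str.lower(), df_replace["keywordLength"]))

# One alternation of all escaped keywords; longest first so that a longer
# keyword is preferred over a shorter one that is its prefix.
keywords = sorted(mapping, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)


def lookup_replacement(match):
    """Return the replacement for the keyword that was matched."""
    return mapping[match.group(0).lower()]


# A single vectorized pass over the column instead of one pass per keyword.
df_find["current_title"] = df_find["current_title"].str.replace(
    pattern, lookup_replacement, regex=True
)

print(df_find["current_title"].tolist())
```

One behavioural difference from the original loop: the text is scanned once, so a replacement value is never itself searched for further keywords. With several thousand keywords the alternation gets large, so it is worth timing this against the loop on a sample of the real data before relying on it.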