How to replace specific words throughout a CSV file in Python?

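I have a large CSV file containing many abbreviated words that I need to change to their full forms. I found a few similar posts here, but most of them either change the entire row or require doing the replacements manually, one at a time.

My CSV file looks like this: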

infoID               messages
 111     we need to fix the car mag but we can't
 113         we need a shf to perform eng change
 115                      gr is needed to change
 116                            bat needs change
 117                    car towed for ext change 
 118                              car ml is high
  .
  .
I have another file that contains the full word for every abbreviation, and I want to apply it to my document. It has the form:

shf:shaft
gr:gear
ml:mileage

It would be great if you could help with code that I can also run on my side. Thanks.

Read your text file in as a Series:

s

0    mag:magnitude
1        shf:shaft
2          gr:gear
3      bat:battery
4      ext:exhaust
5       ml:mileage
Name: 0, dtype: object
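
One way to obtain such a Series, as a minimal sketch: the file name `abbreviations.txt` is hypothetical, and an in-memory buffer stands in for it so the snippet runs on its own. Because the lines contain no commas, `pd.read_csv` with `header=None` reads each line into a single column labelled 0.

```python
import io
import pandas as pd

# Stand-in for the mapping file; in practice pass a path such as
# "abbreviations.txt" (hypothetical name) instead of this buffer.
lines = ["mag:magnitude", "shf:shaft", "gr:gear",
         "bat:battery", "ext:exhaust", "ml:mileage"]
mapping_file = io.StringIO("\n".join(lines))

# No header row and no commas, so every line lands in column 0.
s = pd.read_csv(mapping_file, header=None)[0]

# Split each "short:full" entry on the colon and build a dictionary.
abbreviations = dict(s.str.split(":").tolist())
```

This assumes every line has exactly one colon; a line with none or several would make the `dict(...)` call fail.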
Split on the colon and convert the Series into a dictionary that maps each key to its replacement:

dict(s.str.split(':').tolist())

# {'bat': 'battery',
#  'ext': 'exhaust',
#  'gr': 'gear',
#  'mag': 'magnitude',
#  'ml': 'mileage',
#  'shf': 'shaft'}
Use this to perform a replace with regex=True:

df['messages'].replace(dict(s.str.split(':').tolist()), regex=True)

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object

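Note that if these are strictly whole-word replacements, you can extend this solution by converting the key strings into regular expressions that use word boundaries. For good measure, the strings should also be escaped: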

import re

mapping = {fr'\b{re.escape(k)}\b': v for k, v in s.str.split(':').tolist()}
df['messages'].replace(mapping, regex=True)

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object
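
Since the question asks about the file as a whole, the steps can be put together roughly as follows. This is only a sketch: it assumes the data is comma-separated with an `infoID,messages` header, and it uses in-memory buffers in place of real paths so that it runs as-is.

```python
import io
import re
import pandas as pd

# Stand-in for the input CSV; real code would pass a file path instead.
messages_csv = io.StringIO(
    "infoID,messages\n"
    "111,we need to fix the car mag but we can't\n"
    "113,we need a shf to perform eng change\n"
    "115,gr is needed to change\n"
)
abbreviations = {"mag": "magnitude", "shf": "shaft", "gr": "gear"}

df = pd.read_csv(messages_csv)

# Wrap each escaped key in word boundaries so only whole words match.
mapping = {rf"\b{re.escape(k)}\b": v for k, v in abbreviations.items()}
df["messages"] = df["messages"].replace(mapping, regex=True)

# Write the result back out; a path can be passed instead of the buffer.
out = io.StringIO()
df.to_csv(out, index=False)
result = out.getvalue()
```

`DataFrame.replace` takes the same `mapping` and `regex=True` arguments, which may be more convenient if several text columns need the substitution.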


Another approach, using pd.Series.apply:

# d starts out as the raw text of the mapping file, one "short:full" pair per line
d = dict(i.split(':') for i in d.split('\n'))
#{'bat': 'battery',
# 'ext': 'exhaust',
# 'gr': 'gear',
# 'mag': 'magnitude',
# 'ml': 'mileage',
# 'shf': 'shaft'}

df['messages'].apply(lambda x: ' '.join(d.get(i, i) for i in x.split()))
Output:

0    we need to fix the car magnitude but we can't
1            we need a shaft to perform eng change
2                         gear is needed to change
3                             battery needs change
4                     car towed for exhaust change
5                              car mileage is high
Name: messages, dtype: object
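
Splitting on whitespace only replaces tokens that are exactly equal to a key, so an abbreviation followed by punctuation (for example `gr,`) is left alone, and runs of whitespace are collapsed by `' '.join`. A possible variant, sketched here with made-up sample messages, builds one alternation pattern from the keys and looks each match up in the dictionary:

```python
import re
import pandas as pd

d = {"gr": "gear", "ml": "mileage", "shf": "shaft"}
# Sample data invented for illustration.
messages = pd.Series(["gr, shf and ml checked", "great car ml is high"])

# A single pattern covering every key, limited to whole words.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, d)) + r")\b")

# The callable receives each match object and returns its replacement.
replaced = messages.str.replace(pattern, lambda m: d[m.group(0)], regex=True)
```

Here `replaced` holds "gear, shaft and mileage checked" and "great car mileage is high": the punctuation is preserved and "great" is untouched.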


Curious, why is regex=True needed? Isn't this quite error-prone if other words contain the dictionary's keys? For example, something like great would also be changed to geareat.

@razdi Without it, pandas looks for exact matches, so the entire cell content has to match the search text.

@Chris That's true, but without any further context this is the simplest solution. If whole-word replacement is needed, the solution can be extended with word boundaries.

@cs95 I see. I also like how concise your post is. Thanks for the reply :)
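
The three behaviours discussed in these comments can be compared side by side. This is a small sketch with invented sample data, not taken from the original posts:

```python
import pandas as pd

# Sample data invented for illustration.
messages = pd.Series(["gr", "great gr is needed"])
d = {"gr": "gear"}

# Without regex=True, only cells whose entire content equals a key change.
exact = messages.replace(d)

# With regex=True, every occurrence of "gr" changes, even inside "great".
substring = messages.replace(d, regex=True)

# Word boundaries restrict the match to the standalone word "gr".
whole_word = messages.replace({r"\bgr\b": "gear"}, regex=True)
```

The second element stays "great gr is needed" in `exact`, becomes "geareat gear is needed" in `substring`, and becomes "great gear is needed" in `whole_word`.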