Python: how to remove stray single letters from a text field in a DataFrame
I have the following data source:
import pandas as pd
df_Msg = pd.DataFrame({'Id': [1, 2, 3],
'Sentence': ['I like fictions', 'Thank s you', 'I need to by a new book']})
print(df_Msg)
Output:
Id Sentence
1 I like fictions
2 Thank s you
3 I need to by a new book
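I want to remove every letter of the alphabet that appears on its own. In this case, I need to remove the letters I, s and a.
To do this, I used replace(), as follows: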
df_Msg['Sentence'] = df_Msg['Sentence'].replace('I', '', regex=True)
df_Msg['Sentence'] = df_Msg['Sentence'].replace('s', '', regex=True)
df_Msg['Sentence'] = df_Msg['Sentence'].replace('a', '', regex=True)
Output:
Id Sentence
1 like fiction
2 Thnk you
3 need to by new book
However, I would like the output to be:
Id Sentence
1 like fictions
2 Thank you
3 need to by new book
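Thanks.
IIUC, use word boundaries (regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
print(df_Msg["Sentence"].str.replace(r"\b[A-Za-z]\b\s?", "", regex=True))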
0 like fictions
1 Thank you
2 need to by new book
Name: Sentence, dtype: object
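The word-boundary idea can also be tried on plain strings with the standard re module. The following is a minimal sketch, not the answerer's code: the helper name remove_single_letters is hypothetical, and it assumes that collapsing repeated whitespace afterwards is acceptable. It drops the trailing \s? from the pattern and normalizes whitespace in a separate step, which also covers a single letter at the end of the string.

```python
import re

# Hypothetical helper for illustration. The pattern is the word-boundary idea
# from the answer above, without the trailing \s?.
SINGLE_LETTER = re.compile(r"\b[A-Za-z]\b")


def remove_single_letters(text: str) -> str:
    """Delete standalone ASCII letters, then collapse leftover whitespace."""
    without_letters = SINGLE_LETTER.sub("", text)
    # split() with no arguments discards leading, trailing and repeated whitespace.
    return " ".join(without_letters.split())


for sentence in ["I like fictions", "Thank s you", "I need to by a new book"]:
    print(repr(remove_single_letters(sentence)))
```

On a DataFrame this could be applied with df_Msg['Sentence'].map(remove_single_letters). One caveat of \b: an apostrophe counts as a word boundary, so the t in "don't" would be removed as well.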
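To avoid the whitespace complications around back-to-back single-character words, split each sentence, drop the words that are 1 character long, then join the rest back together:
df_Msg = pd.DataFrame({'Id': [1, 2, 3, 4],
                       'Sentence': ['I like fictions', 'Thank s you', 'I need to by a new book',
                                    'a b foo b b bar baz b c']})
# One row per word, indexed by (original row, word position)
s = df_Msg['Sentence'].str.split(expand=True).stack()
# Keep words longer than 1 character, then rejoin them per original row
df_Msg['sentence'] = s[s.str.len().gt(1)].groupby(level=0).agg(' '.join)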
You can do it like this:
import pandas as pd

df_Msg = pd.DataFrame({'Id': [1, 2, 3],
                       'Sentence': ['I like fictions', 'Thank s you', 'I need to by a new book']})
for i in range(len(df_Msg)):
    words = df_Msg.loc[i, 'Sentence'].split()
    # Build a new list instead of removing items from the list being iterated over,
    # and keep words such as 'fictions' intact (only one-character words are dropped).
    kept = [word for word in words if len(word) > 1]
    # Write back with .loc; chained indexing (df['col'][i] = ...) may not update the frame.
    df_Msg.loc[i, 'Sentence'] = " ".join(kept)
Use the apply method on the Sentence column:
df_Msg['Sentence'] = df_Msg.Sentence.apply(lambda x :" ".join([i for i in x.split() if len(i)>1]))
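The length check above drops every one-character token, including digits and symbols. Since the question asks only about letters, a slightly stricter filter may be closer to the intent. This is a sketch under that assumption; the function name drop_lone_letters is hypothetical.

```python
def drop_lone_letters(sentence: str) -> str:
    """Remove whitespace-separated tokens that are exactly one alphabetic character."""
    kept = [
        token
        for token in sentence.split()
        # Single digits or symbols such as '5' and '&' are kept; only letters are dropped.
        if not (len(token) == 1 and token.isalpha())
    ]
    return " ".join(kept)


print(drop_lone_letters("I need 5 books & a pen"))  # -> need 5 books & pen
```

It could be plugged into the same call as df_Msg.Sentence.apply(drop_lone_letters). Note that str.isalpha() is also true for non-ASCII letters, so a lone "é" would be removed too.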
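This is a simple job for a regex replacement, although the challenge is deciding precisely how the surrounding whitespace should be handled when the letter sits at the start, in the middle, or at the end of the string.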
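One way to make that whitespace handling explicit is to match a letter only when it is delimited by whitespace or by the ends of the string, and to consume the spaces that follow it. The pattern and the name strip_lone_letters below are assumptions made for illustration, not code taken from this answer.

```python
import re

# (?<!\S) : the letter is at the start of the string or preceded by whitespace
# (?!\S)  : the letter is at the end of the string or followed by whitespace
# \s*     : also consume the whitespace that follows the letter
LONE_LETTER = re.compile(r"(?<!\S)[A-Za-z](?!\S)\s*")


def strip_lone_letters(text: str) -> str:
    """Remove whitespace-delimited single ASCII letters from text."""
    # strip() removes the space left behind when the last word was a lone letter.
    return LONE_LETTER.sub("", text).strip()


for sentence in ["I like fictions", "Thank s you", "plan b", "don't"]:
    print(repr(strip_lone_letters(sentence)))
```

Unlike a \b-based pattern, this leaves "don't" untouched, because the t is preceded by an apostrophe rather than whitespace. With pandas it could be applied through df_Msg['Sentence'].map(strip_lone_letters).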