Python: how to remove stray single letters from a text column in a DataFrame

python pandas dataframe

I have the following data source:

  import pandas as pd
  df_Msg = pd.DataFrame({'Id': [1, 2, 3], 
               'Sentence': ['I like fictions', 'Thank s you', 'I need to by a new book']})

  print(df_Msg)

Output:

Id   Sentence
1    I like fictions
2    Thank s you
3    I need to by a new book

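I want to remove every letter of the alphabet that appears on its own as a word. In this case, I need to remove the letters I, s and a. To do that, I used replace() as follows: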

df_Msg['Sentence'] = df_Msg['Sentence'].replace('I', '', regex=True)
df_Msg['Sentence'] = df_Msg['Sentence'].replace('s', '', regex=True)
df_Msg['Sentence'] = df_Msg['Sentence'].replace('a', '', regex=True)

Output:

Id  Sentence
1   like fiction
2   Thnk you
3   need to by new book

However, I would like the output to be:

Id   Sentence
1    like fictions
2    Thank you
3    need to by new book

Thanks

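IIUC, use word boundaries. Note that regex=True must be passed explicitly on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default:

print(df_Msg["Sentence"].str.replace(r"\b[A-Za-z]\b\s?", "", regex=True))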

0          like fictions
1              Thank you
2    need to by new book
Name: Sentence, dtype: object
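
The word-boundary pattern works for the three sentences in the question, but it has a couple of edge cases. The sketch below runs the same pattern through Python's `re` module on a few sample strings. The second and third strings are made up for illustration and are not from the question.

```python
import re

# Same pattern as in the answer above
PATTERN = re.compile(r"\b[A-Za-z]\b\s?")

samples = [
    "Thank s you",   # from the question
    "I need a",      # hypothetical: single letter at the end of the string
    "don't stop",    # hypothetical: apostrophe inside a word
]

results = [PATTERN.sub("", s) for s in samples]
for before, after in zip(samples, results):
    print(repr(before), "->", repr(after))
```

A single letter at the end of the string leaves a trailing space behind, and because an apostrophe counts as a word boundary, the `t` in `don't` is treated as a stand-alone letter. If the data can contain such cases, a whitespace-based approach may be a safer choice.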

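To avoid the complications with whitespace around back-to-back single-character words, split the sentence, drop the words that are 1 character long, and join the rest back together: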

df_Msg = pd.DataFrame({'Id': [1, 2, 3, 4], 
                      'Sentence': ['I like fictions', 'Thank s you', 'I need to by a new book', 
                                   'a b foo b b bar baz b c']})

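# One word per row, indexed by (original row, word position)
s = df_Msg['Sentence'].str.split(expand=True).stack()
# Keep words longer than 1 character and re-join them per original row
df_Msg['sentence'] = s[s.str.len().gt(1)].groupby(level=0).agg(' '.join)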
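
For reference, the same split, filter, and re-join idea can be sketched as a plain Python function, which makes it easy to check what each row should become. The function name `drop_single_char_words` and the last sample row are assumptions added for illustration.

```python
def drop_single_char_words(sentence: str) -> str:
    """Split on whitespace, drop 1-character tokens, re-join with single spaces."""
    return " ".join(word for word in sentence.split() if len(word) > 1)


sentences = [
    "I like fictions",
    "Thank s you",
    "I need to by a new book",
    "a b foo b b bar baz b c",
    "a b c",  # hypothetical row that consists of single letters only
]

cleaned = [drop_single_char_words(s) for s in sentences]
for before, after in zip(sentences, cleaned):
    print(repr(before), "->", repr(after))
```

Two things to keep in mind: a row made up only of single-character words becomes an empty string here, whereas in the stack/groupby version such a row drops out of the grouped result and ends up as NaN after assignment. Also, the length test drops any 1-character token, including digits and punctuation, not just letters.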

You can do it like this:

import pandas as pd
df_Msg = pd.DataFrame({'Id': [1, 2, 3], 
                   'Sentence': ['I like fictions', 'Thank s you', 'I need to by a new book']})

for i in range(len(df_Msg)):
    words = df_Msg.loc[i, 'Sentence'].split()
    # Build a new list instead of removing items from the list being iterated
    kept = [word for word in words if word not in ('I', 's', 'a')]
    # Assign through .loc to avoid chained assignment
    df_Msg.loc[i, 'Sentence'] = " ".join(kept)

Use the apply method on the Sentence column:

df_Msg['Sentence'] = df_Msg['Sentence'].apply(lambda x: " ".join([w for w in x.split() if len(w) > 1]))

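This is a simple job for a regex replacement, although the challenge lies in knowing exactly how the surrounding whitespace should be handled when the letter is at the start, in the middle, or at the end of the string.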
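
One possible way to handle the whitespace is sketched below, assuming a "stand-alone letter" means a single ASCII letter delimited by whitespace or by the ends of the string. The names `LONE_LETTER` and `remove_lone_letters` are made up for this example.

```python
import re

# A single ASCII letter with no non-whitespace character on either side
LONE_LETTER = re.compile(r"(?<!\S)[A-Za-z](?!\S)")


def remove_lone_letters(text: str) -> str:
    """Remove stand-alone letters, then normalise the whitespace left behind."""
    without_letters = LONE_LETTER.sub("", text)
    # Collapse runs of whitespace into one space and trim both ends
    return re.sub(r"\s+", " ", without_letters).strip()


for sample in ["I like fictions", "Thank s you", "a b foo b b bar baz b c"]:
    print(repr(sample), "->", repr(remove_lone_letters(sample)))
```

With pandas this could be applied as `df_Msg['Sentence'].map(remove_lone_letters)`. Because the lookarounds test for whitespace rather than word boundaries, a letter next to punctuation, such as the `t` in `don't`, is left alone.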