Removing URLs row by row from a large set of text in a Python pandas DataFrame

python regex pandas dataframe

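I have loaded my data into a DataFrame, as shown in the image.

As you can see, some rows contain URL links. I want to remove every URL and replace it with "". Row 4 has a URL, for example, and so do other rows. I want to go through every row in the status_message column, find any URL, and remove it. I have been looking at this, but I am not sure how to use it on a DataFrame. Row 4 should end up reading: vote for labour register.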

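I think you can do something simple like:

for index, row in data.iterrows():
    # Lowercase the message and split it into whitespace-separated words
    desc = row['status_message'].lower().split()
    # Print the message without the words that start with 'www.' or 'http'
    print(' '.join(word for word in desc if not word.startswith(('www.', 'http'))))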
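
The loop above only prints the cleaned text. Here is a minimal sketch of the same word-filter idea that writes the result back to the column. It assumes a DataFrame named data whose status_message column holds strings with no missing values. The sample rows and the helper name strip_url_words are made up for illustration. Unlike the loop, it lowercases each word only for the check, so the original casing is kept.

```python
import pandas as pd

# Hypothetical sample data standing in for the real status_message column.
data = pd.DataFrame({'status_message': [
    'Vote for labour register http://example.com/register',
    'No links here',
    'See www.example.com for details',
]})


def strip_url_words(text):
    """Drop whitespace-separated words that look like URLs (hypothetical helper)."""
    words = text.split()
    # Lowercase only for the comparison so the kept words retain their casing.
    kept = [w for w in words if not w.lower().startswith(('www.', 'http'))]
    return ' '.join(kept)


# apply() calls the helper once per row and returns a new Series.
data['status_message'] = data['status_message'].apply(strip_url_words)
print(data)
```

Because the text is split and re-joined on whitespace, this version also normalizes spacing, which may or may not be what you want.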

As long as the URLs start with "www." (the example below also handles ones starting with "http"), you can use a regex replace():

import pandas as pd

df = pd.DataFrame({'A':['Nice to meet you www.xy.com amazing','Wow https://www.goal.com','Amazing http://Goooooo.com']})
# Remove anything starting with http first, then anything starting with www
df['A'] = df['A'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

Output:

                           A
0  Nice to meet you  amazing
1                       Wow 
2                   Amazing 

You can use str.replace with the case=False parameter:

df = pd.DataFrame({'status_message':['a s sd Www.labour.com',
                                    'httP://lab.net dud ff a',
                                     'a ss HTTPS://dd.com ur o']})
print (df)
             status_message
0     a s sd Www.labour.com
1   httP://lab.net dud ff a
2  a ss HTTPS://dd.com ur o

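# Raw string avoids invalid-escape warnings; regex=True is needed on newer pandas, where str.replace is literal by default
df['status_message'] = df['status_message'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)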
print (df)
  status_message
0        a s sd 
1       dud ff a
2     a ss  ur o
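
The output above still has stray spaces where each URL was removed. One possible follow-up, sketched here on the same sample data, collapses runs of whitespace and trims the ends. The variable name cleaned is only for illustration, and the extra whitespace steps are an addition, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'status_message': ['a s sd Www.labour.com',
                                      'httP://lab.net dud ff a',
                                      'a ss HTTPS://dd.com ur o']})

cleaned = (
    df['status_message']
    # Remove URL-like tokens, ignoring case. regex=True is passed explicitly
    # because newer pandas versions treat the pattern as a literal by default.
    .str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
    # Collapse any run of whitespace into a single space.
    .str.replace(r'\s+', ' ', regex=True)
    # Trim leading and trailing spaces.
    .str.strip()
)
print(cleaned.tolist())
```

Chaining the calls keeps each step readable and leaves the original column untouched until you assign cleaned back to it.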

df.status_message = df.status_message.str.replace("www.", "")

Yes, very similar, with only one difference: case=False makes the match case-insensitive.

+1 for case=False
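
To make the effect of case=False concrete, here is a small sketch comparing the two behaviours on a made-up Series. The names s, pattern, case_sensitive, and case_insensitive are illustrative only.

```python
import pandas as pd

# Made-up sample: one URL with a capital W, one all lowercase.
s = pd.Series(['a s sd Www.labour.com', 'see www.labour.com'])
pattern = r'www\.\S+'

# Default matching is case-sensitive, so 'Www.labour.com' is left in place.
case_sensitive = s.str.replace(pattern, '', regex=True)

# case=False ignores case, so both URLs are removed.
case_insensitive = s.str.replace(pattern, '', case=False, regex=True)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```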