Python pandas: replace values that match a specific pattern


In my DataFrame:

df = pd.DataFrame(zip(datetimes, from_, message), columns=['timestamp', 'sender', 'message'])
df['timestamp'] = pd.to_datetime(df.timestamp, format='%d/%m/%Y, %I:%M %p')

there are some problematic values, which follow a clear pattern:

    timestamp                              sender                               message
    113381 2020-06-04 11:59:24              Jose                               bom te ver feliz\r\n
    113382 2020-06-04 11:59:29              Jose                                              ❤\r\n
    113383 2020-06-04 11:59:40              Maria                Estar bem com você me faz feliz\r\n
    113384 2020-06-04 12:00:57              Maria   Estava falando com uma amiga de infância aque...
    113385 2020-06-04 12:01:14              Maria           Ela teve uma briga feia com o marido\r\n
    113386 2020-06-04 12:01:24   Maria: ‎<attached        00113509-PHOTO-2020-06-04-12-01-25.jpg>\r\n
    113387 2020-06-04 12:02:54              Maria                       e assim leva-se a vida, um\n
    113388 2020-06-04 12:03:21              Maria                  Pelo menos ela riu isso ajuda\r\n
    113389 2020-06-04 13:06:39    Jose: ‎<attached        00113512-PHOTO-2020-06-04-13-06-40.jpg>\r\n

This should work:

df['sender'] = df['sender'].str.replace(u': \u200e<attached', '')
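As a minimal sketch of the replacement above, the snippet below applies it to a small hypothetical sample (the rows are an assumption modeled on the question's output, not the asker's real data). Passing regex=False is an addition here: it makes pandas treat the pattern as a literal string regardless of the default in the installed version.

```python
import pandas as pd

# Hypothetical sample mirroring the question's 'sender' column.
# "\u200e" is the invisible LEFT-TO-RIGHT MARK that precedes "<attached".
df = pd.DataFrame(
    {"sender": ["Jose", "Maria", "Maria: \u200e<attached", "Jose: \u200e<attached"]}
)

# regex=False: match the text literally instead of as a regular expression.
df["sender"] = df["sender"].str.replace(": \u200e<attached", "", regex=False)

print(df["sender"].tolist())  # ['Jose', 'Maria', 'Maria', 'Jose']
```

If the stored text does not contain the mark at exactly that position, this call leaves the values unchanged, which is why inspecting the raw strings first is worthwhile.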

@8-Bit Borges, your data probably contains a \u200e character. I ran into a similar problem: split did nothing because of a strange character like this one. This is how I solved it:
a = df['sender'].to_dict()

Then I could see what the actual value was once the column had been dumped to a dict. The value contained: \u200e
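The inspection step can be sketched as follows, assuming a small hypothetical column. Printing a dict shows each string's repr, which escapes non-printable characters, so the hidden mark becomes visible. The helper invisible_chars is an illustrative name, not part of pandas; it lists Unicode format characters (category Cf) with their names.

```python
import unicodedata

import pandas as pd

# Hypothetical sample; the second value hides a LEFT-TO-RIGHT MARK.
df = pd.DataFrame({"sender": ["Jose", "Maria: \u200e<attached"]})

# The dict's repr escapes the non-printable character as \u200e.
as_dict = df["sender"].to_dict()
print(as_dict)


def invisible_chars(text):
    """List (code point, Unicode name) for every format character (category Cf)."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if unicodedata.category(ch) == "Cf"
    ]


print(invisible_chars(as_dict[1]))  # [('U+200E', 'LEFT-TO-RIGHT MARK')]
```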
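Edited. Which version are you on? It works for me on three different PCs, including with my data sample. Maybe you could take a look and tell me how it differs from your sample; I can't reproduce your sample any other way, and the code works for me.

@wwnde I suspect the root of the problem becomes visible if 8-Bit Borges dumps the column to a dict: df = pd.DataFrame({'sender': ['Jose', 'Jose', 'Maria', 'Maria', 'Maria', 'Maria:\u200e ...

The \u200e is the root of your problem.

Odd, I just tested it on my side and it checked out. What is the dtype of the string column?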

df = pd.DataFrame({'sender': ['Jose','Jose','Maria','Maria','Maria','Maria: <attached','Maria','Maria','Jose: <attached']})

df.sender = df.sender.str.split(': <attached').str[0]

   sender
0   Jose
1   Jose
2   Maria
3   Maria
4   Maria
5   Maria
6   Maria
7   Maria
8   Jose


df['sender'] = df['sender'].str.split(': \u200e<attached').str[0]
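The split above only matches values that contain the mark. As a hedged sketch on hypothetical data, the snippet below compares it with a regex-based replace in which the mark is optional. The regex variant is an alternative suggested here, not code from the thread.

```python
import pandas as pd

# Hypothetical sample: one value has the hidden mark, one does not.
df = pd.DataFrame(
    {"sender": ["Jose", "Maria", "Maria: \u200e<attached", "Jose: <attached"]}
)

# Split from the thread: keeps the text before the marker, but only when
# the LEFT-TO-RIGHT MARK is present.
by_split = df["sender"].str.split(": \u200e<attached").str[0]

# Regex variant (an assumption, not from the thread): "\u200e?" makes the mark
# optional and "$" anchors the match to the end of the value.
by_regex = df["sender"].str.replace(r":\s*\u200e?<attached$", "", regex=True)

print(by_split.tolist())  # ['Jose', 'Maria', 'Maria', 'Jose: <attached']
print(by_regex.tolist())  # ['Jose', 'Maria', 'Maria', 'Jose']
```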