Python doesn't find any duplicate values, even after appending the same df

I have a dummy df like this and save it to csv:
date time open high low close volume
0 2021-05-06 04:08:00 9150090.0 9150090.0 9125001.0 9130000.0 9.015642
1 2021-05-06 04:09:00 9140000.0 9145000.0 9125012.0 9134068.0 3.121043
2 2021-05-06 04:10:00 9133882.0 9133882.0 9125002.0 9132999.0 5.536345
3 2021-05-06 04:11:00 9132999.0 9135013.0 9131000.0 9132999.0 5.880620
After that, I try to simulate appending a new data stream by appending the same csv, and to drop the duplicates, if any:
if os.path.isfile(filename):
    df_old = pd.read_csv(filename, encoding='UTF-8')
else:
    df_old = pd.DataFrame()

df_stream = df_old.append(df_new).drop_duplicates(subset=['time'])
df_stream.to_csv(filename, encoding='UTF-8', index=False)
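As an aside, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same flow needs pd.concat. A self-contained sketch of the append-and-dedupe step (the filename and the two-row sample here are made up for illustration; the poster's real data has more columns):

```python
import os
from io import StringIO

import pandas as pd

# Hypothetical incoming batch; stands in for the poster's df_new.
csv_chunk = """date,time,close
2021-05-06,04:08:00,9130000.0
2021-05-06,04:09:00,9134068.0"""
df_new = pd.read_csv(StringIO(csv_chunk))

filename = "stream.csv"  # illustrative path

# Load previously stored rows, or start empty on the first run.
if os.path.isfile(filename):
    df_old = pd.read_csv(filename, encoding="UTF-8")
else:
    df_old = pd.DataFrame()

# pd.concat replaces the removed DataFrame.append; keep the first
# row seen for each 'time' value.
df_stream = pd.concat([df_old, df_new], ignore_index=True)
df_stream = df_stream.drop_duplicates(subset=["time"])
df_stream.to_csv(filename, encoding="UTF-8", index=False)
```

Running the block a second time leaves the file unchanged, since the second batch's time values are dropped as duplicates.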
df_stream still returns the duplicated values:
print(df_stream)
date time open high low close volume
0 2021-05-06 04:08:00 9150090.0 9150090.0 9125001.0 9130000.0 9.015642
1 2021-05-06 04:09:00 9140000.0 9145000.0 9125012.0 9134068.0 3.121043
2 2021-05-06 04:10:00 9133882.0 9133882.0 9125002.0 9132999.0 5.536345
3 2021-05-06 04:11:00 9132999.0 9135013.0 9131000.0 9132999.0 5.880620
0 2021-05-06 04:08:00 9150090.0 9150090.0 9125001.0 9130000.0 9.015642
1 2021-05-06 04:09:00 9140000.0 9145000.0 9125012.0 9134068.0 3.121043
2 2021-05-06 04:10:00 9133882.0 9133882.0 9125002.0 9132999.0 5.536345
3 2021-05-06 04:11:00 9132999.0 9135013.0 9131000.0 9132999.0 5.880620
print(df_stream.duplicated())
0 False
1 False
2 False
3 False
0 False
1 False
2 False
3 False
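duplicated() returning False for every row means the rows are not cell-for-cell equal, however identical they print. Since the two halves came from separate CSV reads, one plausible culprit (an assumption, not confirmed by the output above) is stray whitespace in a string column; a minimal sketch of that failure mode with made-up values:

```python
import pandas as pd

# Two rows that print alike but differ by a trailing space in 'time'.
df = pd.DataFrame({
    "time": ["04:08:00", "04:08:00 "],
    "close": [9130000.0, 9130000.0],
})

print(df.duplicated())  # both False: the strings differ

# Normalizing the column exposes the duplicate.
df["time"] = df["time"].str.strip()
print(df.duplicated())  # index 1 is now flagged as a duplicate
```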
How can I solve this?
I tried using
df_stream[~df_stream.index.duplicated(keep='last')]
but the returned data is inconsistent (shuffled, earlier rows dropped, and so on) — presumably because after an append without reset_index the row labels repeat, so index.duplicated filters by index label rather than by row content.

After simulating the appending of a new data stream by appending the same csv and storing the data in a df, we can apply drop_duplicates on the df:
df = df.drop_duplicates()
Input, after reading the csv that contains the duplicates:
from io import StringIO

import pandas as pd

d = """date,time,open,high,low,close,volume
2021-05-06,04:08:00,9150090.0,9150090.0,9125001.0,9130000.0,9.015642
2021-05-06,04:09:00,9140000.0,9145000.0,9125012.0,9134068.0,3.121043
2021-05-06,04:10:00,9133882.0,9133882.0,9125002.0,9132999.0,5.536345
2021-05-06,04:11:00,9132999.0,9135013.0,9131000.0,9132999.0,5.880620
2021-05-06,04:08:00,9150090.0,9150090.0,9125001.0,9130000.0,9.015642
2021-05-06,04:09:00,9140000.0,9145000.0,9125012.0,9134068.0,3.121043
2021-05-06,04:10:00,9133882.0,9133882.0,9125002.0,9132999.0,5.536345
2021-05-06,04:11:00,9132999.0,9135013.0,9131000.0,9132999.0,5.880620"""
df = pd.read_csv(StringIO(d))
df
The input df:
date time open high low close volume
0 2021-05-06 04:08:00 9150090.0 9150090.0 9125001.0 9130000.0 9.015642
1 2021-05-06 04:09:00 9140000.0 9145000.0 9125012.0 9134068.0 3.121043
2 2021-05-06 04:10:00 9133882.0 9133882.0 9125002.0 9132999.0 5.536345
3 2021-05-06 04:11:00 9132999.0 9135013.0 9131000.0 9132999.0 5.880620
4 2021-05-06 04:08:00 9150090.0 9150090.0 9125001.0 9130000.0 9.015642
5 2021-05-06 04:09:00 9140000.0 9145000.0 9125012.0 9134068.0 3.121043
6 2021-05-06 04:10:00 9133882.0 9133882.0 9125002.0 9132999.0 5.536345
7 2021-05-06 04:11:00 9132999.0 9135013.0 9131000.0 9132999.0 5.880620
Output:
date time open high low close volume
0 2021-05-06 04:08:00 9150090.0 9150090.0 9125001.0 9130000.0 9.015642
1 2021-05-06 04:09:00 9140000.0 9145000.0 9125012.0 9134068.0 3.121043
2 2021-05-06 04:10:00 9133882.0 9133882.0 9125002.0 9132999.0 5.536345
3 2021-05-06 04:11:00 9132999.0 9135013.0 9131000.0 9132999.0 5.880620
It works if I enter the data manually. To clarify: I used two different csvs (each with 4 rows of data), loaded them into df_old and df_new, then did df_stream = df_old.append(df_new).reset_index(drop=True) to simulate the result. But df_stream = df_stream.drop_duplicates(subset=['time']) doesn't drop anything. I then tried changing the column to drop on, df_stream = df_stream.drop_duplicates(subset=['date']), and it returns only indices 0 and 4; you can check it here.

When you apply df_stream.drop_duplicates(subset=['time']), it should drop the rows duplicated in the time column, and per the example it should return the first 4 rows. When you run df_stream = df_stream.drop_duplicates(subset=['date']), it should give only the row at index 0, because every row in the date column holds a duplicated value. If you want to drop all duplicated rows of the df, then why are you using the subset arg?

I agree with you that df_stream.drop_duplicates(subset=['date']) should return only index 0, but why does it also return index 4? I use the subset arg because df_new will have data overlapping df_old, so I check for the duplicated values and drop them.