
Python doesn't find any duplicate values even when appending the same df


I have a dummy df like this and save it to a csv:

    date      time       open       high       low        close      volume
0  2021-05-06  04:08:00  9150090.0  9150090.0  9125001.0  9130000.0  9.015642
1  2021-05-06  04:09:00  9140000.0  9145000.0  9125012.0  9134068.0  3.121043
2  2021-05-06  04:10:00  9133882.0  9133882.0  9125002.0  9132999.0  5.536345
3  2021-05-06  04:11:00  9132999.0  9135013.0  9131000.0  9132999.0  5.880620
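For reference, a minimal sketch that builds this dummy frame and writes it out; the values are copied from the table above, and the filename stream.csv is an assumption since the original isn't shown:

import pandas as pd

# Dummy OHLCV frame matching the table above; 'stream.csv' is a
# hypothetical filename, the original one isn't given.
df_new = pd.DataFrame({
    'date':   ['2021-05-06'] * 4,
    'time':   ['04:08:00', '04:09:00', '04:10:00', '04:11:00'],
    'open':   [9150090.0, 9140000.0, 9133882.0, 9132999.0],
    'high':   [9150090.0, 9145000.0, 9133882.0, 9135013.0],
    'low':    [9125001.0, 9125012.0, 9125002.0, 9131000.0],
    'close':  [9130000.0, 9134068.0, 9132999.0, 9132999.0],
    'volume': [9.015642, 3.121043, 5.536345, 5.880620],
})
df_new.to_csv('stream.csv', encoding='UTF-8', index=False)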
After that, I try to simulate appending a new data stream by appending the same csv, and to drop duplicates if there are any:

import os
import pandas as pd

# Load the previously saved stream if it exists, else start empty
if os.path.isfile(filename):
    df_old = pd.read_csv(filename, encoding='UTF-8')
else:
    df_old = pd.DataFrame()
df_stream = df_old.append(df_new).drop_duplicates(subset=['time'])
df_stream.to_csv(filename, encoding='UTF-8', index=False)
df_stream still returns the duplicated values:

print(df_stream)
   date        time      open       high       low        close      volume
0  2021-05-06  04:08:00  9150090.0  9150090.0  9125001.0  9130000.0  9.015642
1  2021-05-06  04:09:00  9140000.0  9145000.0  9125012.0  9134068.0  3.121043
2  2021-05-06  04:10:00  9133882.0  9133882.0  9125002.0  9132999.0  5.536345
3  2021-05-06  04:11:00  9132999.0  9135013.0  9131000.0  9132999.0  5.880620
0  2021-05-06  04:08:00  9150090.0  9150090.0  9125001.0  9130000.0  9.015642
1  2021-05-06  04:09:00  9140000.0  9145000.0  9125012.0  9134068.0  3.121043
2  2021-05-06  04:10:00  9133882.0  9133882.0  9125002.0  9132999.0  5.536345
3  2021-05-06  04:11:00  9132999.0  9135013.0  9131000.0  9132999.0  5.880620

print(df_stream.duplicated())
0    False
1    False
2    False
3    False
0    False
1    False
2    False
3    False
How can I solve this?
I tried using

df_stream[~df_stream.index.duplicated(keep='last')]

but the returned data has no consistency (shuffled, earlier rows dropped, etc.).
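A quick way to narrow this down is to inspect what pandas actually holds in the two frames; a minimal diagnostic sketch, assuming the df_old and df_new from the code above:

# If one frame was read from csv and the other built in memory, the
# 'time' values can differ in dtype or carry stray whitespace even
# though both frames print identically.
print(df_old.dtypes)
print(df_new.dtypes)

# repr() exposes hidden whitespace or type differences that print() hides
print(df_old['time'].map(repr).tolist())
print(df_new['time'].map(repr).tolist())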

After simulating the appending of a new data stream by appending the same csv and storing the data in a df, we can apply drop duplicates on the df:

df = df.drop_duplicates()
Input, after reading the csv that has duplicates:

d="""date,time,open,high,low,close,volume
2021-05-06,04:08:00,9150090.0,9150090.0,9125001.0,9130000.0,9.015642
2021-05-06,04:09:00,9140000.0,9145000.0,9125012.0,9134068.0,3.121043
2021-05-06,04:10:00,9133882.0,9133882.0,9125002.0,9132999.0,5.536345
2021-05-06,04:11:00,9132999.0,9135013.0,9131000.0,9132999.0,5.880620
2021-05-06,04:08:00,9150090.0,9150090.0,9125001.0,9130000.0,9.015642
2021-05-06,04:09:00,9140000.0,9145000.0,9125012.0,9134068.0,3.121043
2021-05-06,04:10:00,9133882.0,9133882.0,9125002.0,9132999.0,5.536345
2021-05-06,04:11:00,9132999.0,9135013.0,9131000.0,9132999.0,5.880620"""
df = pd.read_csv(StringIO(d))
df
Input df:

    date    time    open    high    low close   volume
0   2021-05-06  04:08:00    9150090.0   9150090.0   9125001.0   9130000.0   9.015642
1   2021-05-06  04:09:00    9140000.0   9145000.0   9125012.0   9134068.0   3.121043
2   2021-05-06  04:10:00    9133882.0   9133882.0   9125002.0   9132999.0   5.536345
3   2021-05-06  04:11:00    9132999.0   9135013.0   9131000.0   9132999.0   5.880620
4   2021-05-06  04:08:00    9150090.0   9150090.0   9125001.0   9130000.0   9.015642
5   2021-05-06  04:09:00    9140000.0   9145000.0   9125012.0   9134068.0   3.121043
6   2021-05-06  04:10:00    9133882.0   9133882.0   9125002.0   9132999.0   5.536345
7   2021-05-06  04:11:00    9132999.0   9135013.0   9131000.0   9132999.0   5.880620
Output:

date    time    open    high    low close   volume
0   2021-05-06  04:08:00    9150090.0   9150090.0   9125001.0   9130000.0   9.015642
1   2021-05-06  04:09:00    9140000.0   9145000.0   9125012.0   9134068.0   3.121043
2   2021-05-06  04:10:00    9133882.0   9133882.0   9125002.0   9132999.0   5.536345
3   2021-05-06  04:11:00    9132999.0   9135013.0   9131000.0   9132999.0   5.880620
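For the streaming case in the question, the same idea works with the comparison restricted via subset; a short sketch (keep='last' is an assumption, retaining the most recent row per time value):

# Drop rows whose 'time' already appeared, keeping the newest one.
df_stream = df_stream.drop_duplicates(subset=['time'], keep='last')
df_stream.to_csv(filename, encoding='UTF-8', index=False)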

It works if I enter the data manually. To clarify: I used two different csvs (each with 4 rows of data), loaded them into df_old and df_new, then ran df_stream = df_old.append(df_new).reset_index(drop=True) to simulate the result. But df_stream = df_stream.drop_duplicates(subset=['time']) doesn't drop anything. I then tried changing the drop column, df_stream = df_stream.drop_duplicates(subset=['date']), and it returns only indexes 0 and 4; you can check it here.

When you run df_stream.drop_duplicates(subset=['time']), it should drop the rows with duplicated time values, so for the example it should return the first 4 rows. When you run df_stream = df_stream.drop_duplicates(subset=['date']), it should give only the row at index 0, because every row in the date column has a duplicated value. If you want to drop all fully duplicated rows of the df, why use the subset arg at all?

I agree with you that df_stream.drop_duplicates(subset=['date']) should return only index 0, but why does it also return index 4? I use the subset arg because df_new will have overlapping data with df_old, so I check for the duplicated values and drop them.
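One explanation consistent with this behaviour, offered here as an assumption since the two csv files aren't shown, is that the same-looking values in df_old and df_new differ in dtype or stray whitespace, so no two rows ever compare equal. Normalizing the key column before deduplicating would rule that out (pd.concat is used in place of the now-deprecated DataFrame.append):

# Assumption: 'time' differs only in type/whitespace between the frames.
# Coerce both sides to one canonical string form before comparing.
for frame in (df_old, df_new):
    frame['time'] = frame['time'].astype(str).str.strip()

df_stream = pd.concat([df_old, df_new], ignore_index=True)
df_stream = df_stream.drop_duplicates(subset=['time'], keep='last')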