Pandas to_sql: making the index unique

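I have read about pandas to_sql solutions for avoiding duplicate records in a database. I am working with CSV log files: each time I upload a new log file, I read the data and make some changes by creating a new DataFrame. Then I run to_sql('Logs', con=db.engine, if_exists='append', index=True). With the if_exists argument I make sure that each new DataFrame built from a new file is appended to the existing database. The problem is that it keeps adding duplicate values. I want to make sure that if a file that has already been uploaded is uploaded again by mistake, it is not appended to the database. I would like to do this directly when the database is created, without a workaround such as checking whether the file name has been used before.

I am working with Flask-SQLAlchemy.

Thanks.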

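Your best bet is to catch duplicates by setting up the index as a primary key, and then using try/except to catch uniqueness violations. You mentioned another post that suggested watching for IntegrityError exceptions, and I agree that is the best approach. You can combine it with a de-duplication function to make sure the table update runs smoothly.

Demonstrating the problem

Here is a toy example: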

import sqlite3

import pandas as pd
from sqlalchemy import create_engine

# make a database, 'test', and a table, 'foo'.
conn = sqlite3.connect("test.db")
c = conn.cursor()
# id is a primary key.  this will be the index column imported from to_sql().
c.execute('CREATE TABLE foo (id integer PRIMARY KEY, foo integer NOT NULL);')
# commit and release the raw connection before handing the file to sqlalchemy.
conn.commit()
conn.close()
# use the sqlalchemy engine.
engine = create_engine('sqlite:///test.db')

pd.read_sql("pragma table_info(foo)", con=engine)

   cid name     type  notnull dflt_value  pk
0    0   id  integer        0       None   1
1    1  foo  integer        1       None   0
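
Since the question asks about setting the primary key through SQLAlchemy, here is a minimal sketch of declaring the same table with SQLAlchemy Core instead of raw sqlite3. The in-memory engine and the variable names are assumptions for illustration. The point it tries to show is that to_sql with if_exists='append' reuses a table that already exists under the same name, so the primary key survives; if the name passed to to_sql differs from the model's table name, pandas would create a separate table without that key, which may be why the model-based attempt in the question did not work.

```python
import pandas as pd
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, inspect

# In-memory SQLite database, used here only for illustration.
engine = create_engine("sqlite://")
metadata = MetaData()

# Declare the table up front so that 'id' is a real primary key.
foo_table = Table(
    "foo",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("foo", Integer, nullable=False),
)
metadata.create_all(engine)

pk_before = inspect(engine).get_pk_constraint("foo")["constrained_columns"]

# Appending under the SAME table name reuses the existing definition.
df = pd.DataFrame({"foo": [1, 2, 3]})
df.to_sql("foo", con=engine, index=True, index_label="id", if_exists="append")

pk_after = inspect(engine).get_pk_constraint("foo")["constrained_columns"]
print(pk_before, pk_after)
```

With Flask-SQLAlchemy the equivalent would be to make sure the name given to to_sql matches the model's table name exactly, including case.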

Now, two example DataFrames, df and df2:

data = {'foo':[1,2,3]}
df = pd.DataFrame(data)
df
   foo
0    1
1    2
2    3

data2 = {'foo':[3,4,5]}
df2 = pd.DataFrame(data2, index=[2,3,4])
df2
   foo
2    3       # this row is a duplicate of df.iloc[2,:]
3    4
4    5

Move df into the table foo:

df.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')

pd.read_sql('foo', con=engine)
   id  foo
0   0    1
1   1    2
2   2    3

Now, when we try to append df2, we catch the IntegrityError:

try:
    df2.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
# use the generic Exception, both IntegrityError and sqlite3.IntegrityError caused trouble.
except Exception as e: 
    print("FAILURE TO APPEND: {}".format(e))

Output:

FAILURE TO APPEND: (sqlite3.IntegrityError) UNIQUE constraint failed: foo.id [SQL: 'INSERT INTO foo (id, foo) VALUES (?, ?)'] [parameters: ((2, 3), (3, 4), (4, 5))]
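
The error text above shows sqlite3.IntegrityError wrapped in parentheses, which suggests the exception that actually reaches the caller is SQLAlchemy's own wrapper. As a hedged sketch, assuming a recent pandas and SQLAlchemy, a narrower except clause could catch sqlalchemy.exc.IntegrityError instead of the generic Exception. The in-memory engine and the `appended` flag are illustrative assumptions.

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError

# In-memory SQLite database, used here only for illustration.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE foo (id INTEGER PRIMARY KEY, foo INTEGER NOT NULL)"))

df = pd.DataFrame({"foo": [1, 2, 3]})
df2 = pd.DataFrame({"foo": [3, 4, 5]}, index=[2, 3, 4])

df.to_sql("foo", con=engine, index=True, index_label="id", if_exists="append")

try:
    df2.to_sql("foo", con=engine, index=True, index_label="id", if_exists="append")
    appended = True
except IntegrityError as err:
    # err.orig holds the underlying DBAPI exception (sqlite3.IntegrityError here).
    appended = False
    print("FAILURE TO APPEND: {}".format(err.orig))

# The failed append ran inside a transaction, so no rows from df2 remain.
rows = pd.read_sql("SELECT id, foo FROM foo ORDER BY id", con=engine)
```

Catching only the wrapper keeps unrelated errors (a missing table, a bad connection) from being mistaken for duplicates.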

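Suggested solution

On IntegrityError, you can pull the existing table data, remove the duplicate entries from your new data, and then retry the append statement. Use apply() for this: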

def append_db(data):
    try:
        data.to_sql('foo', con=engine, index=True, index_label='id', if_exists='append')
        return 'Success'
    except Exception as e:
        print("Initial failure to append: {}\n".format(e))
        print("Attempting to rectify...")
        # pull the ids already in the table and keep only the rows that are new.
        existing = pd.read_sql('foo', con=engine)
        to_insert = data.reset_index().rename(columns={'index':'id'})
        mask = ~to_insert.id.isin(existing.id)
        try:
            to_insert.loc[mask].to_sql('foo', con=engine, index=False, if_exists='append')
            print("Successful deduplication.")
            return 'Success after dedupe'
        except Exception as e2:
            # report the second failure instead of silently claiming success.
            print("Could not rectify duplicate entries. \n{}".format(e2))
            return 'Failed'

df2.apply(append_db)

Output:

Initial failure to append: (sqlite3.IntegrityError) UNIQUE constraint failed: foo.id [SQL: 'INSERT INTO foo (id, foo) VALUES (?, ?)'] [parameters: ((2, 3), (3, 4), (4, 5))]

Attempting to rectify...
Successful deduplication.

foo    Success after dedupe
dtype: object
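
The approach above reads the whole table back before retrying. As an alternative sketch, not part of the original answer, duplicates can be skipped inside the database by passing a custom `method` callable to to_sql. This assumes SQLite 3.24 or newer (for ON CONFLICT), SQLAlchemy 1.4 or newer (for the SQLite-specific insert construct), and that the primary key column is named id. The helper name `insert_or_ignore` is hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.dialects.sqlite import insert


def insert_or_ignore(pd_table, conn, keys, data_iter):
    """to_sql `method` callable: skip rows whose id already exists."""
    rows = [dict(zip(keys, row)) for row in data_iter]
    if not rows:
        return 0
    # Emits INSERT ... ON CONFLICT (id) DO NOTHING.
    stmt = insert(pd_table.table).on_conflict_do_nothing(index_elements=["id"])
    result = conn.execute(stmt, rows)
    return result.rowcount


# In-memory SQLite database, used here only for illustration.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE foo (id INTEGER PRIMARY KEY, foo INTEGER NOT NULL)"))

df = pd.DataFrame({"foo": [1, 2, 3]})
df2 = pd.DataFrame({"foo": [3, 4, 5]}, index=[2, 3, 4])

# The second append overlaps on id=2; that row is skipped instead of raising.
for frame in (df, df2):
    frame.to_sql("foo", con=engine, index=True, index_label="id",
                 if_exists="append", method=insert_or_ignore)

result = pd.read_sql("SELECT id, foo FROM foo ORDER BY id", con=engine)
print(result)
```

This is tied to SQLite; other databases have their own conflict clauses, which is the database-agnosticism problem raised in the comments.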

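There was a recent discussion about adding upsert support to pandas. TL;DR: it is currently considered out of scope for pandas, because staying database-agnostic gets tricky. (Replacing entries with duplicates is a kind of upsert.)

Is there a way to not replace entries, but simply ignore the DataFrame when it is a duplicate? The log files are generated monthly. Really, all I care about is not re-appending a DataFrame that has already been added to the database, in case someone uploads the same file twice by mistake. I saw in another post that a possible solution is to use sqlite3.IntegrityError, but that did not work for me.

For future readers: the solution I have used for a few years, which is slow but works well, is to iterate over the DataFrame (yes, I know...) and try to insert each row with to_sql. In the except block, test whether '1062' is in the error output, since that indicates a duplicate.

Thanks for your reply, but the post mentioned IntegrityError as a solution that does not need any extra steps. After all, I would really like to avoid creating a temporary database. I am using Flask-SQLAlchemy, and at first I thought that defining a model and setting the index as the primary key there would work, but it did not (I guess the table in the model and the table wrapped by pandas are different after all). Is there a way to set my primary key directly with pandas, or a solution using SQLAlchemy?

While you can specify an existing schema with the schema argument, you cannot set schema details with pandas. You did not say why catching IntegrityError failed, which is why I demonstrated a solution. This solution does use SQLAlchemy... I'm afraid I am a bit unclear about what exactly your problem is. Please consider updating your original post accordingly. Please see the guide: the sample code you posted is not minimal, complete, or verifiable. If the primary key is specified correctly in the table, you will not get duplicate primary keys, but you will get an error when using to_sql. My form of the solution does not create a temporary database, but it does check existing entries for duplicates. I am not sure you can get around that step.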
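
The row-by-row approach mentioned in the comments could look roughly like the sketch below. It is adapted to SQLite as an assumption: the comment tests for '1062' in the error text, which is the MySQL duplicate-entry error code, whereas this version catches sqlalchemy.exc.IntegrityError instead. The function name `append_row_by_row` and the in-memory engine are hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError


def append_row_by_row(frame, table, engine):
    """Insert one row at a time; count rows skipped as duplicates."""
    inserted, skipped = 0, 0
    for i in range(len(frame)):
        row = frame.iloc[[i]]  # one-row DataFrame that keeps its index
        try:
            row.to_sql(table, con=engine, index=True, index_label="id",
                       if_exists="append")
            inserted += 1
        except IntegrityError:
            # With MySQL, the comment's check would be: '1062' in str(error).
            skipped += 1
    return inserted, skipped


# In-memory SQLite database, used here only for illustration.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE foo (id INTEGER PRIMARY KEY, foo INTEGER NOT NULL)"))

df = pd.DataFrame({"foo": [1, 2, 3]})
df2 = pd.DataFrame({"foo": [3, 4, 5]}, index=[2, 3, 4])

first = append_row_by_row(df, "foo", engine)
second = append_row_by_row(df2, "foo", engine)
print(first, second)
```

As the comment says, this is slow for large frames because every row is its own to_sql call, but a duplicate only costs the one row it belongs to.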