Python: efficient design of generators for SQLite3 executemany() inserts on multiple tables

python database performance sqlite

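I am parsing a large number of XML files into an sqlite3 database with Python. As far as I can tell (though I am very open to, and looking for, better-performing options), the most performant option is sqlite3's executemany() function for inserts.

The gist of what I am currently doing is as follows: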

document_dir = '/documents'

from collections import namedtuple

Document = namedtuple('Document', 'doc_id doc_title doc_mentioned_people ... etc')
People = namedtuple('People', 'doc_id first_name last_name ... ')

class DocumentXML(object):
    """
    ... there's some stuff here, but you get the idea

    """

    def parse_document(path):
        """
        This object keeps track of the current 'document' type element from a cElementTree.iterparse() elsewhere

        I've simplified things here, but you can get the idea that this is providing a named tuple for a generator
        """
        doc_id = _current_element.findall(xpath = '../id')[0].text
        doc_title = _current_element.findall(xpath = '../title')[0].text

        # parse lists of people here

        doc_mentioned_people = People(first_name, last_name, ..., person_id)
        #etc...
        return Document(doc_id, doc_title, doc_mentioned_people, ..., etc)

def doc_generator():
    documents = parse_document(document_dir)
    for doc in documents:
        yield doc.id, doc.title, ..., doc.date



# Import into Table 1
with cursor(True) as c:
    c.executemany("INSERT INTO Document VALUES (?,?,?,?,?,?,?,?,?,?,?);", doc_generator())



def people_generator():
    documents = parse_document(document_dir)
    for doc in documents:
        people = doc.people
        yield people.firstname, people.lastname, ..., people.eyecolor


# Import into Table 2
with cursor(True) as c:
    c.executemany("INSERT INTO People VALUES (?,?,?,?,?,?,?);", people_generator())


# This goes on for several tables...

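As you can see, this is very inefficient: every XML file is parsed again and again, once for each table in the database.

I would like to parse the XML only once (since I can produce all the relevant information in a single named tuple), but keep the structure as a generator so that memory requirements do not blow up to infeasible levels.

Is there a good way to do this?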

My attempts have revolved around using executemany with a multi-insert type of statement, such as:

c.executemany("""
    INSERT INTO Document VALUES (?,?,?,?,?,?,?,?,?,?,?);
    INSERT INTO People VALUES (?,?,?,?,?,?,?);
    INSERT INTO Companies VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?);
    INSERT INTO Oils VALUES (?,?,?,?,?,?,?);
    INSERT INTO Physics VALUES (?,?,?,?,?,?,?,?,?,?,?)""",
    complete_data_generator())

where complete_data_generator() yields all of the relevant structured information; however, I know this probably won't work.

Is there a better way to structure this for performance?

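If you only have a few small documents, you can load everything into memory and stop worrying about re-parsing the documents.

If you only have one table to feed, the generator approach is fine.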
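
For the single-table case, a minimal sketch might look like the following. The `doc_rows` generator and the two-column `Document` table are hypothetical stand-ins for the real parser and schema, used only to show that `executemany` can consume a generator row by row.

```python
import sqlite3

def doc_rows(n):
    """Hypothetical stand-in for the XML parser: yields one row tuple at a time."""
    for i in range(n):
        yield (i, f"title {i}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Document (doc_id INTEGER PRIMARY KEY, doc_title TEXT)")

# executemany pulls rows from the generator as it goes, so the full list
# of rows is never built in memory.
with conn:  # commits on success, rolls back on error
    conn.executemany("INSERT INTO Document VALUES (?, ?)", doc_rows(1000))

row_count = conn.execute("SELECT COUNT(*) FROM Document").fetchone()[0]
```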

If neither of these approaches fits, I would try an intermediate approach:

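Parse a batch of XML files and accumulate a good number of `doc` elements.
Once a reasonable number of documents is available, pause parsing and feed the database tables with `executemany` for that batch of documents.
After inserting that batch, you can optionally release the SQLite journal file (for example by committing), then continue parsing.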
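
The steps above could be sketched roughly as follows. Everything here is illustrative: `parse_documents`, `batches`, `load`, the reduced two-table schema, and the batch size of 500 are assumptions standing in for the real parser and tables, not part of the original code.

```python
import sqlite3
from collections import namedtuple
from itertools import islice

# Hypothetical record shapes; the real ones carry more fields.
Person = namedtuple("Person", "doc_id first_name last_name")
Document = namedtuple("Document", "doc_id doc_title people")

def parse_documents(n):
    """Stand-in for the single XML parsing pass: yields one Document at a time."""
    for i in range(n):
        people = [Person(i, f"first{i}", f"last{i}")]
        yield Document(i, f"title {i}", people)

def batches(iterable, size):
    """Yield lists of up to `size` items, pulling from `iterable` lazily."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def load(conn, documents, batch_size=500):
    """Insert each batch into every table, committing once per batch."""
    for batch in batches(documents, batch_size):
        with conn:  # one transaction (and one commit) per batch
            conn.executemany(
                "INSERT INTO Document VALUES (?, ?)",
                ((d.doc_id, d.doc_title) for d in batch),
            )
            conn.executemany(
                "INSERT INTO People VALUES (?, ?, ?)",
                (tuple(p) for d in batch for p in d.people),
            )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Document (doc_id INTEGER, doc_title TEXT)")
conn.execute("CREATE TABLE People (doc_id INTEGER, first_name TEXT, last_name TEXT)")
load(conn, parse_documents(1200), batch_size=500)

doc_count = conn.execute("SELECT COUNT(*) FROM Document").fetchone()[0]
people_count = conn.execute("SELECT COUNT(*) FROM People").fetchone()[0]
```

Only one batch of parsed documents is held in memory at a time, so the batch size is the knob that trades memory against the number of `executemany` calls and commits.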

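Pros:

Files are parsed only once.
The load on the SQLite database can be controlled through intermediate commits.
You still use `executemany`.

Cons:

Multiple calls to `executemany`, depending on the amount of data.
Each commit takes some time.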
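
As for the multi-statement `executemany` attempt in the question: as far as I know, the `sqlite3` module accepts only one SQL statement per `execute` or `executemany` call, so that route is closed. A quick check along these lines should confirm it; the one-column tables are hypothetical and exist only for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Document (doc_id INTEGER)")
conn.execute("CREATE TABLE People (doc_id INTEGER)")

rejected = False
try:
    # Two INSERT statements in a single call.
    conn.executemany(
        "INSERT INTO Document VALUES (?); INSERT INTO People VALUES (?);",
        [(1, 1), (2, 2)],
    )
except (sqlite3.Warning, sqlite3.ProgrammingError):
    # The exception class has differed between Python versions, so both are caught.
    rejected = True
```

This is why the batch approach issues one `executemany` call per table rather than a single combined statement.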
