Python: unable to pass an lxml etree object to a separate process


I am working on a project that parses multiple XML files concurrently in Python using lxml. When initializing a process, I want my main class to do some processing on the XML before handing the etree object to the process. What I find, however, is that when the etree object arrives in the new process the class itself is still there, but the XML has vanished from the object and getroot() returns nothing.

I know I can only pass picklable data through a queue, but does the same apply to the data I pass to the process in the 'args' field?
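It does: the `args` handed to `apply_async` are pickled for the worker just like items put on a `Queue`, so a quick round trip shows whether an object will even survive the trip. A minimal stdlib sketch of such a check (note that an object can also pickle without error yet lose state on the way, which is what the etree object here appears to do):

```python
import pickle

def survives_pickle(obj):
    """Return True if obj can be pickled and unpickled without raising."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False

print(survives_pickle({"a": 1}))         # True
print(survives_pickle(lambda x: x + 1))  # False: lambdas cannot be pickled
```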

Here is my code:

import multiprocessing, multiprocessing.pool, time
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree)) # Returns <class 'lxml.etree._ElementTree'>
    print(id(tree)) # Returns new ID 44637320 as expected
    print(tree.getroot()) # Returns None

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(id(tree)) # Returns object ID 43998536
        print(tree.getroot()) #Returns <Element SymCLI_ML at 0x29f5dc8>

        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()
Any help is much appreciated.

Update / Answer

Based on the answer below, I modified it slightly so that it works with a lower memory footprint, without string IO. The tostring method returns a byte string, which can be pickled; to unpickle, the byte string can simply be parsed again by etree.

import multiprocessing, multiprocessing.pool, time, copyreg
from io import BytesIO  # needed by elementtree_unpickler below
from lxml import etree

def compute(tree):
    print("Start Process")
    print(type(tree)) # Returns <class 'lxml.etree._ElementTree'>
    print(tree.getroot()) # Returns <Element SymCLI_ML at 0x29f5dc8>. Success!

def pool_init(queue):
    # see http://stackoverflow.com/a/3843313/852994
    compute.queue = queue

def elementtree_unpickler(data):
    return etree.parse(BytesIO(data))

def elementtree_pickler(tree):
    return elementtree_unpickler, (etree.tostring(tree),)

copyreg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)

class Main():
    def __init__(self):
        pass

    def main(self):
        tree = etree.parse('test.xml')
        print(tree.getroot()) #Returns <Element SymCLI_ML at 0x29f5dc8>

        self.queue = multiprocessing.Queue()
        self.pool = multiprocessing.Pool(processes=1, initializer=pool_init, initargs=(self.queue,))
        self.pool.apply_async(func=compute, args=(tree,))
        time.sleep(10)

if __name__ == '__main__':
    Main().main()
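As a sanity check, the same copyreg round trip can be sketched with the stdlib `xml.etree.ElementTree` parser, so it runs even without lxml installed; only the class name and the serialize/parse calls differ from the lxml version above:

```python
import copyreg
import io
import pickle
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

def tree_unpickler(data):
    # Rebuild the tree by re-parsing the serialized bytes.
    return ET.parse(io.BytesIO(data))

def tree_pickler(tree):
    # Serialize the whole tree to a byte string that pickle can handle.
    buf = io.BytesIO()
    tree.write(buf)
    return tree_unpickler, (buf.getvalue(),)

copyreg.pickle(ET.ElementTree, tree_pickler, tree_unpickler)

tree = ET.ElementTree(ET.fromstring("<root><child/></root>"))
restored = pickle.loads(pickle.dumps(tree))
print(restored.getroot().tag)  # root
```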
Update 2


After doing some memory benchmarking, I found that passing large objects causes them to become uncollectable by garbage collection in the main process. This may not matter at small scale, but my etree objects were on the order of several hundred MB in memory. Once an async task has been called with an XML object in its arguments, that object cannot be cleared from memory even if it is deleted from the main process, and even with a manual garbage-collection call. So I have reverted to closing out the XML in the main process and passing the file name to the sub-process instead.
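The file-name approach can be sketched as follows (a hedged example: the stdlib parser stands in for lxml so the snippet is self-contained, and the temporary file with a `SymCLI_ML` root is purely illustrative):

```python
import multiprocessing
import tempfile
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

def compute(filename):
    # Parse inside the worker, so the parent never holds the large tree
    # and only a short string crosses the process boundary.
    tree = ET.parse(filename)
    return tree.getroot().tag

# Create a small illustrative XML file to hand to the worker.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as f:
    f.write("<SymCLI_ML><item/></SymCLI_ML>")
    path = f.name

if __name__ == "__main__":
    with multiprocessing.Pool(processes=1) as pool:
        print(pool.apply_async(compute, args=(path,)).get())  # SymCLI_ML
```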

Use the following code to register simple picklers/unpicklers for lxml Element/ElementTree objects. I have used this in the past with lxml and zmq.

import copy_reg
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
from lxml import etree

def element_unpickler(data):
    return etree.fromstring(data)

def element_pickler(element):
    data = etree.tostring(element)
    return element_unpickler, (data,)

copy_reg.pickle(etree._Element, element_pickler, element_unpickler)

def elementtree_unpickler(data):
    data = StringIO(data)
    return etree.parse(data)

def elementtree_pickler(tree):
    data = StringIO()
    tree.write(data)
    return elementtree_unpickler, (data.getvalue(),)

copy_reg.pickle(etree._ElementTree, elementtree_pickler, elementtree_unpickler)

I added this (Python 3.4, so copy_reg is copyreg and the StringIO import is 'from io import StringIO'), but on the line that starts the process I get 'I/O error: write error'. The full modified code is in the initial question. Would it be possible to put the etree object in shared memory and pass a reference to the shared-memory location to the sub-process?
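On the closing question: the parsed tree itself cannot live in shared memory, but its serialized bytes can. A hedged sketch of that idea using `multiprocessing.shared_memory` (Python 3.8+, stdlib parser as a stand-in for lxml; in real use the segment name and data length would be passed to the child, which attaches and re-parses locally):

```python
from multiprocessing import shared_memory
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

# Serialize the tree once and copy the bytes into a shared segment.
data = ET.tostring(ET.fromstring("<root><child/></root>"))
shm = shared_memory.SharedMemory(create=True, size=len(data))
shm.buf[:len(data)] = data

# A child process would attach to the segment by name and re-parse;
# shown here in-process for illustration.
child = shared_memory.SharedMemory(name=shm.name)
root = ET.fromstring(bytes(child.buf[:len(data)]))
print(root.tag)  # root

child.close()
shm.close()
shm.unlink()
```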