Spark Python RDD of Element objects?

Tags: python, apache-spark, xml-parsing, lxml

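I want to do some interactive exploration of a set of XML documents. I tried parsing the documents with lxml and querying them with the find, findall, and xpath methods. However, PySpark chokes when I try to create an RDD of Element objects: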

from lxml import etree
from lxml.etree import XMLSyntaxError
def get_root(xml):
  xml_bytes = bytes(bytearray(xml, encoding = 'utf-8'))
  try:
    return [etree.XML(xml_bytes)]
  except XMLSyntaxError:
    return []

docs = [
    "<doc><tag name='hoo'>hah</tag><tag name='wah'>zoo</tag></doc>"
  , "<doc><tag name='hoo'>yah</tag><tag name='wah'>woo</tag></doc>"
]
roots = [get_root(x)[0] for x in docs]
roots
  [<Element doc at 0x3b2280>, <Element doc at 0x3b2140>]
docs_rdd = sc.parallelize(docs)
roots_rdd = docs_rdd.flatMap(lambda d: get_root(d))
roots_rdd.count()
  2
roots_rdd.first()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "lxml.etree.pyx", line 1033, in lxml.etree._Element.__repr__ (src/lxml/lxml.etree.c:42268)
    File "lxml.etree.pyx", line 881, in lxml.etree._Element.tag.__get__ (src/lxml/lxml.etree.c:40855)
    File "apihelpers.pxi", line 15, in lxml.etree._assertValidNode  (src/lxml/lxml.etree.c:12875)
  AssertionError: invalid Element proxy at 62728864
Can someone help me understand what is going on here?

Python 2.7.x or 3.5.x, Spark 1.6.x, lxml installed with pip or pip3.

Thanks in advance.

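lxml objects are not serializable, so they cannot be passed between the executors and the driver, and they cannot be shuffled. This is easy to reproduce without Spark: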

from lxml import etree
import pickle

pickle.loads(pickle.dumps(etree.XML("<doc>foo</doc>")))
You can still use lxml to parse the documents and extract serializable Python objects:

from operator import attrgetter

docs_rdd.flatMap(get_root).flatMap(lambda x: x).map(attrgetter("text")).collect()
['hah', 'zoo', 'yah', 'woo']
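
The same idea can be taken a step further by returning plain dicts instead of only the text, so that attributes travel with each record. The following is a minimal sketch, not a definitive implementation. It makes two assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (both offer fromstring, findall, get, and text) so that it runs without lxml or a Spark cluster, and the helper name extract_tags is hypothetical. The pickle round trip imitates what PySpark does when it moves records between the executors and the driver.

```python
import pickle
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def extract_tags(xml):
    """Parse one document and return one plain dict per <tag> element.

    Returns an empty list for malformed XML, mirroring get_root above.
    (extract_tags is a hypothetical helper, not part of the original post.)
    """
    try:
        root = ET.fromstring(xml)
    except ET.ParseError:
        return []
    return [
        {"name": tag.get("name"), "text": tag.text}
        for tag in root.findall("tag")
    ]


docs = [
    "<doc><tag name='hoo'>hah</tag><tag name='wah'>zoo</tag></doc>",
    "<doc><tag name='hoo'>yah</tag><tag name='wah'>woo</tag></doc>",
    "<doc><tag>broken",  # malformed on purpose
]

# Local equivalent of docs_rdd.flatMap(extract_tags).collect()
records = [rec for doc in docs for rec in extract_tags(doc)]

# Plain dicts survive a pickle round trip unchanged, unlike Element proxies.
restored = pickle.loads(pickle.dumps(records))
assert restored == records

print([rec["text"] for rec in restored])
```

With a live SparkContext, the list comprehension would become docs_rdd.flatMap(extract_tags); parsing then happens on the executors, and only dicts of strings cross process boundaries.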