Google Cloud Dataflow: how do I debug msgpack serialization issues in a Google Cloud Dataflow job?

Tags: google-cloud-dataflow, apache-beam, spacy, msgpack

I have a Google Cloud Dataflow job in which I want to extract named entities from text using a specific spacy model, neural coref.

When I run the extraction without beam I can extract entities, but when I try to run it with the DirectRunner the job fails due to a serialization error from msgpack. I have no idea how to proceed with debugging this issue.

My requirements are very bare bones and only include:

apache-beam[gcp]==2.4
spacy==2.0.12
ujson==1.35
The issue probably has to do with how spacy and beam interact, because the stacktrace shows spacy spewing out some of its methods, which it shouldn't be doing.

The weird spacy logging behavior from the stacktrace:

T4: <class 'entity.extract_entities.EntityExtraction'>
# T4
D2: <dict object at 0x1126c0398>
T4: <class 'spacy.lang.en.English'>
# T4
D2: <dict object at 0x1126b54b0>
D2: <dict object at 0x1126d1168>
F2: <function is_alpha at 0x11266d320>
# F2
F2: <function is_ascii at 0x112327c08>
# F2
F2: <function is_digit at 0x11266d398>
# F2
F2: <function is_lower at 0x11266d410>
# F2
F2: <function is_punct at 0x112327b90>
# F2
F2: <function is_space at 0x11266d488>
# F2
F2: <function is_title at 0x11266d500>
# F2
F2: <function is_upper at 0x11266d578>
# F2
F2: <function like_url at 0x11266d050>
# F2
F2: <function like_num at 0x110d55140>
# F2
F2: <function like_email at 0x112327f50>
# F2
Fu: <functools.partial object at 0x11266c628>
F2: <function _create_ftype at 0x1070af500>
# F2
T1: <type 'functools.partial'>
F2: <function _load_type at 0x1070af398>
# F2
# T1
F2: <function is_stop at 0x11266d5f0>
# F2
D2: <dict object at 0x1126b7168>
T4: <type 'set'>
# T4
# D2
# Fu
F2: <function is_oov at 0x11266d668>
# F2
F2: <function is_bracket at 0x112327cf8>
# F2
F2: <function is_quote at 0x112327d70>
# F2
F2: <function is_left_punct at 0x112327de8>
# F2
F2: <function is_right_punct at 0x112327e60>
# F2
F2: <function is_currency at 0x112327ed8>
# F2
Fu: <functools.partial object at 0x110d49ba8>
F2: <function _get_attr_unless_lookup at 0x1106e26e0>
# F2
F2: <function lower at 0x11266d140>
# F2
D2: <dict object at 0x112317c58>
# D2
D2: <dict object at 0x110e38168>
# D2
D2: <dict object at 0x112669c58>
# D2
# Fu
F2: <function word_shape at 0x11266d0c8>
# F2
F2: <function prefix at 0x11266d1b8>
# F2
F2: <function suffix at 0x11266d230>
# F2
F2: <function get_prob at 0x11266d6e0>
# F2
F2: <function cluster at 0x11266d2a8>
# F2
F2: <function _return_en at 0x11266f0c8>
# F2
# D2
B2: <built-in function unpickle_vocab>
# B2
T4: <type 'spacy.strings.StringStore'>
# T4
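
As an aside, those T4 / F2 / D2 / B2 markers look like dill's pickling trace rather than spacy's own logging. A minimal sketch, assuming the old dill 0.2.x API bundled with this Beam version and that EntityExtraction can be constructed with no arguments, of reproducing the same trace outside Beam:

import dill
import dill.detect

from entity.extract_entities import EntityExtraction

# Turn on dill's pickling trace; with dill 0.2.x this prints the same
# T4/F2/D2/B2 markers as above while dumps() walks the object graph.
dill.detect.trace(True)

# Round-trip the DoFn the same way Beam's pickler does, outside Beam.
dill.dumps(EntityExtraction())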

I have no idea how to debug this issue with beam. To reproduce the whole issue, I have set up a repo with instructions on how to set everything up:

Are you able to run the same code from a regular Python program (not a Beam DoFn)?


If not, check whether you are storing any non-serializable state in your Beam DoFn (or in any other function that will be serialized by Beam). That prevents Beam runners from serializing those functions (to send them to workers), so it should be avoided.
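
To make that concrete, here is a minimal sketch of a DoFn that defers loading the spaCy model until it is running on the worker. The model name and method bodies are illustrative, not taken from the question's repo; the point is that only the model name is stored on the instance, so Beam's pickler never has to serialize the spaCy vocab/vectors:

import apache_beam as beam

class EntityExtraction(beam.DoFn):
    def __init__(self, model_name='en_coref_md'):  # illustrative model name
        # Keep only small, picklable configuration on the DoFn.
        self.model_name = model_name
        self._nlp = None

    def start_bundle(self):
        # Load the model lazily on the worker, after the DoFn has been
        # deserialized, so it is never part of the pickled state.
        if self._nlp is None:
            import spacy
            self._nlp = spacy.load(self.model_name)

    def process(self, text):
        doc = self._nlp(text)
        for ent in doc.ents:
            yield (ent.text, ent.label_)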

In the end I resolved the issue above by changing the versions of the installed packages. I do think debugging the beam setup process is quite painful, although my approach was just manually trying different permutations of packages.
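
For reference, the full traceback below ends in msgpack._cmsgpack frames, which indicate an msgpack 0.6.x install, and the failure is msgpack refusing to pack a Python 2 buffer object handed to it by spacy. So "changing the versions of installed packages" most likely comes down to constraining msgpack / msgpack-numpy. The pins below are an assumption for illustration, not the actual combination that fixed the job:

apache-beam[gcp]==2.4
spacy==2.0.12
ujson==1.35
# assumed pins: force an msgpack 0.5.x / msgpack-numpy pair that predates
# the _cmsgpack module seen in the traceback
msgpack==0.5.6
msgpack-numpy==0.4.3.2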

The code runs when I invoke it directly; it only fails when the code is executed through beam. For example, it fails when I invoke it through beam's DirectRunner (a minimal sketch of the step that triggers the failure follows the traceback below).
/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/msgpack_numpy.py:183: DeprecationWarning: encoding is deprecated, Use raw=False instead.
  return _unpackb(packed, **kwargs)
/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/msgpack_numpy.py:132: DeprecationWarning: encoding is deprecated.
  use_bin_type=use_bin_type)
T4: <class 'entity.extract_entities.EntityExtraction'>
# T4
D2: <dict object at 0x1126c0398>
T4: <class 'spacy.lang.en.English'>
# T4
D2: <dict object at 0x1126b54b0>
D2: <dict object at 0x1126d1168>
F2: <function is_alpha at 0x11266d320>
# F2
F2: <function is_ascii at 0x112327c08>
# F2
F2: <function is_digit at 0x11266d398>
# F2
F2: <function is_lower at 0x11266d410>
# F2
F2: <function is_punct at 0x112327b90>
# F2
F2: <function is_space at 0x11266d488>
# F2
F2: <function is_title at 0x11266d500>
# F2
F2: <function is_upper at 0x11266d578>
# F2
F2: <function like_url at 0x11266d050>
# F2
F2: <function like_num at 0x110d55140>
# F2
F2: <function like_email at 0x112327f50>
# F2
Fu: <functools.partial object at 0x11266c628>
F2: <function _create_ftype at 0x1070af500>
# F2
T1: <type 'functools.partial'>
F2: <function _load_type at 0x1070af398>
# F2
# T1
F2: <function is_stop at 0x11266d5f0>
# F2
D2: <dict object at 0x1126b7168>
T4: <type 'set'>
# T4
# D2
# Fu
F2: <function is_oov at 0x11266d668>
# F2
F2: <function is_bracket at 0x112327cf8>
# F2
F2: <function is_quote at 0x112327d70>
# F2
F2: <function is_left_punct at 0x112327de8>
# F2
F2: <function is_right_punct at 0x112327e60>
# F2
F2: <function is_currency at 0x112327ed8>
# F2
Fu: <functools.partial object at 0x110d49ba8>
F2: <function _get_attr_unless_lookup at 0x1106e26e0>
# F2
F2: <function lower at 0x11266d140>
# F2
D2: <dict object at 0x112317c58>
# D2
D2: <dict object at 0x110e38168>
# D2
D2: <dict object at 0x112669c58>
# D2
# Fu
F2: <function word_shape at 0x11266d0c8>
# F2
F2: <function prefix at 0x11266d1b8>
# F2
F2: <function suffix at 0x11266d230>
# F2
F2: <function get_prob at 0x11266d6e0>
# F2
F2: <function cluster at 0x11266d2a8>
# F2
F2: <function _return_en at 0x11266f0c8>
# F2
# D2
B2: <built-in function unpickle_vocab>
# B2
T4: <type 'spacy.strings.StringStore'>
# T4
Traceback (most recent call last):
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/chris/coref_entity_extraction/main.py", line 29, in <module>
    run()
  File "/Users/chris/coref_entity_extraction/main.py", line 24, in run
    entities = records | 'ExtractEntities' >> beam.ParDo(EntityExtraction())
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/transforms/core.py", line 784, in __init__
    super(ParDo, self).__init__(fn, *args, **kwargs)
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 638, in __init__
    self.fn = pickler.loads(pickler.dumps(self.fn))
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py", line 204, in dumps
    s = dill.dumps(o)
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/dill/dill.py", line 259, in dumps
    dump(obj, file, protocol, byref, fmode, recurse)#, strictio)
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/dill/dill.py", line 252, in dump
    pik.dump(obj)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 425, in save_reduce
    save(state)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py", line 172, in new_save_module_dict
    return old_save_module_dict(pickler, obj)
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/dill/dill.py", line 841, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 692, in _batch_setitems
    save(v)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 425, in save_reduce
    save(state)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py", line 172, in new_save_module_dict
    return old_save_module_dict(pickler, obj)
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/dill/dill.py", line 841, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 401, in save_reduce
    save(args)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 568, in save_tuple
    save(element)
  File "/Users/chris/.pyenv/versions/2.7.14/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "vectors.pyx", line 108, in spacy.vectors.Vectors.__reduce__
  File "vectors.pyx", line 409, in spacy.vectors.Vectors.to_bytes
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/spacy/util.py", line 485, in to_bytes
    serialized[key] = getter()
  File "vectors.pyx", line 404, in spacy.vectors.Vectors.to_bytes.serialize_weights
  File "/Users/chris/coref_entity_extraction/venv/lib/python2.7/site-packages/msgpack_numpy.py", line 165, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 282, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 288, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 285, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 232, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 279, in msgpack._cmsgpack.Packer._pack
TypeError: can not serialize 'buffer' object
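
What the traceback shows is that the job fails before any data is processed: ParDo.__init__ round-trips the DoFn through Beam's dill-based pickler (pickler.loads(pickler.dumps(self.fn))), and because the DoFn holds a loaded spaCy pipeline, the pickling path runs through spacy.vectors.Vectors.to_bytes and into msgpack. A minimal sketch of that step in isolation, assuming (hypothetically) that the model is loaded eagerly in __init__; the model name is illustrative:

import dill
import spacy

class EntityExtraction(object):
    def __init__(self):
        # Eagerly loading the model stores the whole pipeline on the instance.
        self.nlp = spacy.load('en_coref_md')  # illustrative model name

# This is essentially what ParDo.__init__ does before the pipeline runs;
# with the package versions above it raises the same
# "TypeError: can not serialize 'buffer' object" from msgpack.
dill.loads(dill.dumps(EntityExtraction()))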