Opening Apache Thrift binary files in Python


python machine-learning


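I have 5 GB of data serialized with Apache Thrift, along with a .thrift file that describes the data format. I have tried both thriftpy and the official thrift package, but I can't figure out how to open these files.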

The data comes from the expanded dataset.

A description of the data format can be found here.

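The Scala setup can be found in the following file. Because naming is largely consistent across the Thrift libraries, you can more or less model your Python code on the Scala example: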

package edu.umass.cs.iesl.wikilink.expanded.process

import org.apache.thrift.protocol.TBinaryProtocol
import org.apache.thrift.transport.TIOStreamTransport
import java.io.File
import java.io.BufferedOutputStream
import java.io.FileOutputStream
import java.io.BufferedInputStream
import java.io.FileInputStream
import java.util.zip.{GZIPOutputStream, GZIPInputStream}

 object ThriftSerializerFactory {

   def getWriter(f: File) = {
      val stream = new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(f)), 2048)
      val protocol= new TBinaryProtocol(new TIOStreamTransport(stream))
      (stream, protocol)
   }

   def getReader(f: File) = {
      val stream = new BufferedInputStream(new GZIPInputStream(new FileInputStream(f)), 2048)
      val protocol = new TBinaryProtocol(new TIOStreamTransport(stream))
      (stream, protocol)
   }
 }

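You basically set up a stream transport and the binary protocol on top of it. If your data is gzip-compressed, you have to add the gzip piece to the puzzle, but once the data has been decompressed it is no longer needed.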
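A rough Python counterpart of getReader might look like the sketch below. It assumes the official thrift package is installed and relies on its TFileObjectTransport and TBinaryProtocol classes. The helper names open_maybe_gzip and get_reader are made up for this illustration, and checking the gzip magic bytes is just one way to handle both compressed and already decompressed files.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Return a binary file object, decompressing transparently if gzipped."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")


def get_reader(path):
    """Hypothetical Python analogue of getReader: returns (stream, protocol)."""
    # Imported lazily so that open_maybe_gzip works without the thrift package.
    from thrift.protocol import TBinaryProtocol
    from thrift.transport import TTransport

    stream = open_maybe_gzip(path)
    # Wrap the plain file object in a transport, then layer the protocol on top.
    transport = TTransport.TFileObjectTransport(stream)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    return stream, protocol
```

As in the Scala version, the stream is returned alongside the protocol so the caller can close it once reading is finished.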

The following code shows how to read a data file using the factory class above:

class PerFileWebpageIterator(f: File) extends Iterator[WikiLinkItem] {
    var done = false
    val (stream, proto) = ThriftSerializerFactory.getReader(f)
    private var _next: Option[WikiLinkItem] = getNext()

    private def getNext(): Option[WikiLinkItem] = try {
        Some(WikiLinkItem.decode(proto))
    } catch {case _: TTransportException => {done = true; stream.close(); None}}

    def hasNext(): Boolean = !done && (_next != None || {_next = getNext(); _next != None})

    def next(): WikiLinkItem = if (hasNext()) _next match {
        case Some(wli) => {_next = None; wli}
        case None => {throw new Exception("Next on empty iterator.")}
    } else throw new Exception("Next on empty iterator.")
}

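Implementation steps:

1. Implement a Thrift protocol stack factory like the one above (a recommended pattern, by the way).

2. Instantiate the root element of each record, in our case a WikiLinkItem.

3. Call instance.read(proto) to read one data record.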
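In Python, steps 2 and 3 can be sketched as a small generator. This is only an illustration under a few assumptions: the record class comes from code generated with thrift --gen py (so it has a read(protocol) method), and the Python transports signal the end of the file with EOFError, whereas the Scala code catches TTransportException. If your version behaves differently, pass the appropriate exception types via eof_errors. The name iter_records is invented for this example.

```python
def iter_records(record_cls, protocol, eof_errors=(EOFError,)):
    """Yield records decoded from `protocol` until the input is exhausted."""
    while True:
        record = record_cls()      # step 2: a fresh root element per record
        try:
            record.read(protocol)  # step 3: decode exactly one record
        except eof_errors:
            return                 # no more data: stop iterating
        yield record


# Hypothetical usage; the module path depends on the namespace declared in
# the .thrift file, and "data.gz" is a placeholder file name:
#
#   import gzip
#   from thrift.protocol import TBinaryProtocol
#   from thrift.transport import TTransport
#   from wiki_link.ttypes import WikiLinkItem  # generated code (assumed path)
#
#   with gzip.open("data.gz", "rb") as stream:
#       transport = TTransport.TFileObjectTransport(stream)
#       proto = TBinaryProtocol.TBinaryProtocol(transport)
#       for item in iter_records(WikiLinkItem, proto):
#           print(item)
```

A generator keeps only one record in memory at a time, which matters for a 5 GB input.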