Pig to Cassandra: passing list objects with a Python UDF and CqlStorage

Tags: cassandra, user-defined-functions, etl, apache-pig

I am working on a dataflow that includes some aggregation steps in Pig and stores the results into Cassandra. I have been able to pass relatively simple data types such as integers, longs, or dates, but I cannot find out how to pass some sort of list, set, or tuple from Pig to Cassandra using CqlStorage.

I am using Pig 0.9.2, so I cannot use the FLATTEN method.

Question: how do I fill a Cassandra table containing complex data types such as sets or lists from Pig 0.9.2?

An overview of my specific application:
  • I created the corresponding Cassandra table, following this description:

    CREATE TABLE mycassandracf (
        my_id int,
        date timestamp,
        my_count bigint,
        grouped_ids list<bigint>,
        PRIMARY KEY (my_id, date));
    
  • From the GROUP BY relation, I GENERATE a relation in a cql-friendly format (i.e. tuples) that I want to store into Cassandra:

    CassandraAggregate = FOREACH GroupedRelation
        GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
        TOTUPLE('date', ISOToUnix($0.createdAt))),
        TOTUPLE(COUNT($1), $1.grouped_id);
    
    DUMP CassandraAggregate;
    
    (((my_id,30021),(date,1357084800000)),(2,{(60128490006325819),(62726281032786005)}))
    (((my_id,30165),(date,1357084800000)),(1,{(60128411174143024)}))
    (((my_id,30376),(date,1357084800000)),(4,{(60128411146211875),(63645100121476995),(60128411146211875),(63645100121476995)}))
    
Not surprisingly, using the STORE directive on this relation raises an exception:

java.lang.ClassCastException: org.apache.pig.data.DefaultDataBag cannot be cast to org.apache.pig.data.DataByteArray

So I added a UDF written in Python to apply some flattening on the grouped_id bag:

@outputSchema("flat_bag:bag{}")
def flattenBag(bag):
    return tuple([long(item) for tup in bag for item in tup])
I used a tuple because using Python sets and Python lists ends up in casting errors.
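
As a quick illustration of what this comprehension does, here is a standalone Python 2 sketch (the sample IDs are reused from the dump above; a Pig bag reaches a Jython UDF as an iterable of tuples):

    # python -- standalone illustration, not part of the original pipeline
    bag = [(60128490006325819,), (62726281032786005,)]
    print(tuple([long(item) for tup in bag for item in tup]))
    # prints: (60128490006325819L, 62726281032786005L)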

Adding this to my pipeline, I have:

CassandraAggregate = FOREACH GroupedRelation
    GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
    TOTUPLE('date', ISOToUnix($0.createdAt))),
    TOTUPLE(COUNT($1), py_f.flattenBag($1.grouped_id));

DUMP CassandraAggregate;

(((my_id,30021),(date,1357084800000)),(2,(60128490006325819,62726281032786005)))
(((my_id,31120),(date,1357084800000)),(1,(60128411174143024)))
(((my_id,31120),(date,1357084800000)),(1,(60128411146211875,63645100121476995,60128411146211875,63645100121476995)))

Using the STORE directive on this last relation raises an exception with the following error stack:

java.io.IOException: java.io.IOException: org.apache.thrift.transport.TTransportException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:428)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:408)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:262)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: java.io.IOException: org.apache.thrift.transport.TTransportException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:248)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1724)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1709)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:232)

I tested the exact same workflow with simple data types and it works perfectly well. What I am really looking for is a way to fill a Cassandra table with complex types such as sets or lists from Pig.


Many thanks!

After further investigation, I found the solution here:

Basically, CqlStorage does support complex types. For that, the type should be represented by a tuple within a tuple, holding the data type as a string as its first element. For a list, this is how it is done:

# python
@outputSchema("flat_bag:bag{}")
def flattenBag(bag):
    # prefix the flattened values with the CQL collection type name
    return ('list',) + tuple([long(item) for tup in bag for item in tup])

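Presumably the same convention covers the other CQL collection types, with the type name as the first tuple element. Here is a hypothetical, untested variant for a set<bigint> column (the function name is made up; double-check the exact tuple shape against the CqlStorage documentation):

# python -- hypothetical sketch, not from the original post
@outputSchema("flat_bag:bag{}")
def flattenBagToSet(bag):
    # same convention with 'set' as the leading type name; deduplicate,
    # since a CQL set holds no duplicates
    return ('set',) + tuple(set(long(item) for tup in bag for item in tup))
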
Hence, in the grunt shell:

# pig
CassandraAggregate = FOREACH GroupedRelation
    GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
    TOTUPLE('date', ISOToUnix($0.createdAt))),
    TOTUPLE(COUNT($1), py_f.flattenBag($1.grouped_id));

DUMP CassandraAggregate;

(((my_id,30021),(date,1357084800000)),(2,(list, 60128490006325819,62726281032786005)))
(((my_id,31120),(date,1357084800000)),(1,(list, 60128411174143024)))
(((my_id,31120),(date,1357084800000)),(1,(list, 60128411146211875,63645100121476995,60128411146211875,63645100121476995)))

This is then stored into Cassandra using the classically encoded prepared statement.
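
The STORE statement itself does not appear in the post; it would look roughly like the sketch below. The keyspace name 'mykeyspace' is hypothetical, and CqlStorage expects the prepared UPDATE to be URL-encoded in the output_query parameter ('=' as %3D, '?' as %3F, ',' as %2C, spaces as '+'):

# pig
-- sketch only: keyspace name is hypothetical
STORE CassandraAggregate INTO
    'cql://mykeyspace/mycassandracf?output_query=UPDATE+mykeyspace.mycassandracf+SET+my_count+%3D+%3F%2C+grouped_ids+%3D+%3F'
    USING org.apache.cassandra.hadoop.cql3.CqlStorage();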


Hope this helps.

Comments:

  • I am using DSE 3.2.4, which ships Pig 0.9.2, and I do not have ISOToUnix() in Pig. So what is the best way to load dates into Cassandra? My date format is 'yyyy/MM/dd'. – sudheer
  • @sudheer ISOToUnix is actually a piggybank UDF, not a built-in method. Here is what you need to add to access it (see also the date-conversion sketch after this thread):

        REGISTER /path/to/your/jar/piggybank.jar;
        DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

  • Sorry for not being clear: I cannot find piggybank.jar, which is what I was trying to convey in my previous message. I am using DSE 3.2.4, and I would usually find the jar in Cloudera's contrib folder. Am I missing something in my installation, or does DataStax not ship piggybank? – sudheer
  • I do not know where DSE puts its Pig lib folder, but if you can find it, look for the jar in :/path-to-dse-pig-libs/contrib/piggybank/java/piggybank.jar. I use a manually installed hadoop/pig and can find my pig libs in /usr/local/lib/pig.
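
On the 'yyyy/MM/dd' date question above: piggybank also ships CustomFormatToISO, which can be chained with ISOToUnix. A sketch, assuming piggybank is registered as shown in the thread (the relation and field names SomeRelation and my_date are made up):

# pig
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

-- 'yyyy/MM/dd' string -> ISO 8601 string -> Unix timestamp in milliseconds
WithTimestamp = FOREACH SomeRelation
    GENERATE ISOToUnix(CustomFormatToISO(my_date, 'yyyy/MM/dd')) AS date_ms;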