Python: saving a PySpark RDD to HBase raises AttributeError
I am trying to write a Spark RDD to an HBase table using PySpark. The RDD looks fine when I inspect it with the print rdd.take(rdd.count()) command, but I hit an error when I try to write it to the HBase table with the function SaveRecord:
def SaveRecord(tx_fee_rdd):
    host = 'localhost'  #sys.argv[1]
    table = 'tx_fee_table'  #needs to be created beforehand in the hbase shell
    conf = {"hbase.zookeeper.quorum": host,
            "hbase.mapred.outputtable": table,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    #row key id, id, cfamily=tx_fee_col, column_name=tx_fee, column_value=x
    datamap = tx_fee_rdd.map(lambda x: ("tx_fee_col", "tx_fee", x))
    datamap.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)

tx_fee_rdd.foreach(SaveRecord)
I get the following error:
AttributeError: 'Decimal' object has no attribute 'map'
Traceback (most recent call last):
File "/home/ubuntu/unix_practice/bcrpc/bitcoin-inspector-webserver/bitcoin/bctxfee_text3.py", line 66, in <module>
SaveRecord(tx_fee_rdd)
File "/home/ubuntu/unix_practice/bcrpc/bitcoin-inspector-webserver/bitcoin/bctxfee_text3.py", line 29, in SaveRecord
datamap.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1348, in saveAsNewAPIHadoopDataset
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: org.apache.spark.SparkException: RDD element of type [Ljava.lang.Object; cannot be used
at org.apache.spark.api.python.SerDeUtil$.pythonToPairRDD(SerDeUtil.scala:237)
at org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:801)
at org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
How can I deal with this?

Following @zero323's suggestion, I get the following error:
AttributeError: 'Decimal' object has no attribute 'map'
Traceback (most recent call last):
File "/home/ubuntu/unix_practice/bcrpc/bitcoin-inspector-webserver/bitcoin/bctxfee_text3.py", line 66, in <module>
SaveRecord(tx_fee_rdd)
File "/home/ubuntu/unix_practice/bcrpc/bitcoin-inspector-webserver/bitcoin/bctxfee_text3.py", line 29, in SaveRecord
datamap.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1348, in saveAsNewAPIHadoopDataset
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: org.apache.spark.SparkException: RDD element of type [Ljava.lang.Object; cannot be used
at org.apache.spark.api.python.SerDeUtil$.pythonToPairRDD(SerDeUtil.scala:237)
at org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:801)
at org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
foreach operates on individual records, so it receives Decimal objects, not RDDs. You cannot map over those, let alone call the saveAsNewAPIHadoopDataset method on them.

If you want to use saveAsNewAPIHadoopDataset, the function should operate directly on the RDD:

SaveRecord(tx_fee_rdd)

Another possible problem is this part:

datamap = tx_fee_rdd.map(lambda x: ("tx_fee_col","tx_fee",x ) )

saveAsNewAPIHadoopDataset expects pairs, not triplets. It may also not work with Decimal objects. See the linked examples for details.

@zero323, your statement about keys and values is correct, but the one about foreach is not, as you can see below. Here is working code that writes to HBase:
from pyspark import SparkContext
from jsonrpc.authproxy import AuthServiceProxy
from pyspark.streaming import StreamingContext
import json

# Create a local StreamingContext with * working threads and a 0.5 second batch interval
sc = SparkContext("local[*]", "txcount")
ssc = StreamingContext(sc, 0.5)  #0.001 did 9710 blocks in 12 minutes

#function SaveRecord: saves tx_fee for a block to the hbase database
def SaveRecord(tx_fee_rdd):
    host = 'localhost'  #sys.argv[1]
    table = 'transaction_fee_table'  #needs to be created beforehand in the hbase shell
    conf = {"hbase.zookeeper.quorum": host,
            "hbase.mapred.outputtable": table,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    #row key=id, cfamily=tx_fee_col, column_name=tx_fee, column_value=x
    #datamap = tx_fee_rdd.map(lambda x: ("tx_fee",x) )
    #( rowkey , [ row key , column family , column name , value ] )
    datamap = tx_fee_rdd.map(lambda x: (str(x[0]),
                                        [str(x[0]), "tx_fee_col", "tx_fee", str(x[1])]))
    datamap.saveAsNewAPIHadoopDataset(conf=conf,
                                      keyConverter=keyConv,
                                      valueConverter=valueConv)

lines = ssc.socketTextStream("localhost", 8888)
dump_rdd = lines.map(lambda x: json.dumps(x))
load_rdd = dump_rdd.map(lambda x: json.loads(x)).map(lambda x: x.decode('unicode_escape').encode('ascii', 'ignore'))
split_blk_rdd = load_rdd.map(lambda x: x.split(":"))
tx_fee_rdd = split_blk_rdd.map(lambda x: (x[14][1:7], x[15][1:-15]))  #this gets the transaction fee
tx_fee_rdd.foreachRDD(SaveRecord)  #function call
#tx_fee_rdd.saveAsTextFiles("hdfs://ec2-52-21-47-235.compute-1.amazonaws.com:9000/bitcoin/","txt")

ssc.start()  # Start the computation
#ssc.awaitTermination()  # Wait for the computation to terminate
ssc.awaitTerminationOrTimeout(15000)  #time out in seconds
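The essential fix in the working code is the record shape: StringListToPutConverter expects each element as a (rowKey, [rowKey, columnFamily, qualifier, value]) pair with every field as a string. A small helper (hypothetical, not part of the original script) makes that contract explicit and shows that Decimal values pass through cleanly once converted with str():

```python
from decimal import Decimal

def to_hbase_record(row_key, column_family, qualifier, value):
    """Build the pair shape expected by StringListToPutConverter:
    all fields are strings, and the row key appears both as the pair
    key and as the first element of the value list."""
    key = str(row_key)
    return (key, [key, column_family, qualifier, str(value)])

# Decimal fee values serialize via str() before being handed to the converter:
record = to_hbase_record("000001", "tx_fee_col", "tx_fee", Decimal("0.0001"))
# record == ("000001", ["000001", "tx_fee_col", "tx_fee", "0.0001"])
```

In the streaming job above, the equivalent of this helper is the lambda passed to tx_fee_rdd.map before saveAsNewAPIHadoopDataset is called.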
One should not downvote without a valid reason. Not everyone speaks English perfectly, and not everyone can phrase their question well. Let's be a welcoming community.

If I may ask, how do you submit this script? Do you need to download a specific jar?

foreachRDD is not foreach, a DStream is not an RDD, and the streaming code you pasted matches neither the code nor the error message shown in the question.

@zero323, thanks for the comment. Could you explain the difference between foreachRDD and foreach, and between a DStream and an RDD? Thanks again.

A stream is a continuous sequence of RDDs used in Spark Streaming. foreachRDD performs (surprise ;) an operation on each RDD in the stream and has type RDD[T] => Unit. An RDD (Resilient Distributed Dataset) is the unit of parallelism in Spark (not only in streaming). foreach is an operation on an RDD with type T => Unit that performs some action (a side effect) on each element of the RDD.

The link is broken, please check.