Apache Spark: reading gz.parquet files

Tags: apache-spark, hive, apache-kafka, parquet, flume-twitter

Hi, I need to read the data from the gz.parquet files but don't know how to. I tried Impala, but I get the same result as `parquet-tools cat`: the raw content without any table structure.

P.S.: Any suggestions for improving the Spark code are most welcome.

I have the following gz.parquet files, produced by a data pipeline of twitter => flume => kafka => spark streaming => hive/gz.parquet files. For the Flume agent I am using:

    agent1.sources.twitter-data.type = org.apache.flume.source.twitter.TwitterSource
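
For context, a full agent definition wiring this source through a channel to a Kafka sink might look roughly like the sketch below; the channel and sink names, topic, broker address, and credential placeholders are assumptions for illustration, not taken from the question:

    agent1.sources = twitter-data
    agent1.channels = mem-channel
    agent1.sinks = kafka-sink

    agent1.sources.twitter-data.type = org.apache.flume.source.twitter.TwitterSource
    agent1.sources.twitter-data.consumerKey = <your-consumer-key>
    agent1.sources.twitter-data.consumerSecret = <your-consumer-secret>
    agent1.sources.twitter-data.accessToken = <your-access-token>
    agent1.sources.twitter-data.accessTokenSecret = <your-access-token-secret>
    agent1.sources.twitter-data.channels = mem-channel

    agent1.channels.mem-channel.type = memory

    # Flume 1.6-style Kafka sink (property names assumed for that version)
    agent1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
    agent1.sinks.kafka-sink.brokerList = localhost:9092
    agent1.sinks.kafka-sink.topic = tweets
    agent1.sinks.kafka-sink.channel = mem-channel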

The Spark code de-queues the data from Kafka and stores it in Hive as follows:

    val sparkConf = new SparkConf().setAppName("KafkaTweet2Hive")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(2))
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) // new org.apache.spark.sql.SQLContext(sc)

    // Create a direct Kafka stream with the brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

    // Get the data (tweets) from Kafka
    val tweets = messages.map(_._2)

    // Append the tweets to Hive
    tweets.foreachRDD { rdd =>
      val hiveContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      val tweetsDF = rdd.toDF()
      tweetsDF.write.mode("append").saveAsTable("tweets")
    }
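
For completeness, the streaming application presumably also starts the context once the foreachRDD is wired up; a minimal sketch of that step (not shown in the snippet above):

    // Start the streaming computation and block until it terminates
    ssc.start()
    ssc.awaitTermination()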
When I run the spark streaming application, it stores the data as gz.parquet files in the hdfs /user/hive/warehouse directory, as follows:

    [root@quickstart /]# hdfs dfs -ls /user/hive/warehouse/tweets
    Found 469 items
    -rw-r--r--   1 root supergroup          0 2016-03-30 08:36 /user/hive/warehouse/tweets/_SUCCESS
    -rw-r--r--   1 root supergroup        241 2016-03-30 08:36 /user/hive/warehouse/tweets/_common_metadata
    -rw-r--r--   1 root supergroup      35750 2016-03-30 08:36 /user/hive/warehouse/tweets/_metadata
    -rw-r--r--   1 root supergroup      23518 2016-03-30 08:33 /user/hive/warehouse/tweets/part-r-00000-0133fcd1-f529-4dd1-9371-36bf5c3e5df3.gz.parquet
    -rw-r--r--   1 root supergroup       9552 2016-03-30 08:33 /user/hive/warehouse/tweets/part-r-00000-02c44f98-bfc3-47e3-a8e7-62486a1a45e7.gz.parquet
    -rw-r--r--   1 root supergroup      19228 2016-03-30 08:25 /user/hive/warehouse/tweets/part-r-00000-0321ce99-9d2b-4c52-82ab-a9ed5f7d5036.gz.parquet
    -rw-r--r--   1 root supergroup        241 2016-03-30 08:25 /user/hive/warehouse/tweets/part-r-00000-03415df3-c719-4a3a-90c6-462c43cfef54.gz.parquet

The schema from the `_metadata` file is as follows:

    [root@quickstart /]# parquet-tools meta hdfs://quickstart.cloudera:8020/user/hive/warehouse/tweets/_metadata
    creator:       parquet-mr version 1.5.0-cdh5.5.0 (build ${buildNumber}) 
    extra:         org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"tweet","type":"string","nullable":true,"metadata":{}}]} 
    
    file schema:   root 
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    tweet:         OPTIONAL BINARY O:UTF8 R:0 D:1
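
For comparison, the same schema can also be printed from Spark; a minimal sketch, assuming the warehouse path shown above:

    // Print the Parquet schema as Spark SQL sees it
    sqlContext.read.parquet("/user/hive/warehouse/tweets").printSchema()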

Furthermore, if I load the data into a DataFrame in Spark, I get the following output from `df.show`:

    +--------------------+
    |               tweet|
    +--------------------+
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |ڕObjavro.sch...|
    |��Objavro.sc...|
    |ֲObjavro.sch...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |֕Objavro.sch...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    |��Objavro.sc...|
    +--------------------+
    only showing top 20 rows

However, I would like to see the tweets as plain text. How can I do that?

For reference, the DataFrame above is produced by:

    sqlContext.read.parquet("/user/hive/warehouse/tweets").show
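
The `Objavro.sc...` prefix visible in `df.show` looks like the header of an Avro object-container file, which suggests that each Kafka message written by Flume's TwitterSource is an Avro-encoded batch rather than plain text. One possible direction, shown only as a rough sketch, is to deserialize the Avro payload inside the streaming job before writing to Hive; reading the Kafka value as raw bytes and the field name `text` are assumptions about the TwitterSource schema, not taken from the question:

    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import org.apache.avro.file.{DataFileReader, SeekableByteArrayInput}
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Read the Kafka values as raw bytes instead of strings, so the Avro
    // container data is not mangled by UTF-8 decoding.
    val rawMessages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      ssc, kafkaParams, topicsSet)

    // Each value is assumed to be a complete Avro object-container file written
    // by Flume's TwitterSource; pull out the (assumed) "text" field per record.
    val tweetTexts = rawMessages.flatMap { case (_, bytes) =>
      val reader = new DataFileReader[GenericRecord](
        new SeekableByteArrayInput(bytes), new GenericDatumReader[GenericRecord]())
      val texts = scala.collection.mutable.ArrayBuffer[String]()
      while (reader.hasNext) {
        texts += String.valueOf(reader.next().get("text"))
      }
      reader.close()
      texts
    }

`tweetTexts` could then be converted to a DataFrame and appended to the Hive table in the same way as the current `tweets` stream. If the goal is only to inspect the existing string column without truncation, `df.show(false)` prints the full cell contents, although for these rows that is still the raw Avro bytes.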