Apache Spark: saving data from Spark to HDFS as a text file


I am using pySpark and sqlContext to process data with the following query:

(sqlContext.sql("select LastUpdate,Count(1) as Count from temp_t group by LastUpdate")
           .rdd.coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count"))

It is stored in the following format:

Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)
Row(LastUpdate=u'2016-02-18 11:56:54.613', Count=1)
Row(LastUpdate=u'2016-04-13 13:53:32.697', Count=1)
Row(LastUpdate=u'2016-02-22 17:43:37.257', Count=5)

But I want to store the data in a Hive table, like this:

LastUpdate                           Count

2016-03-14 12:27:55.01                   1
.                                        .
.                                        .

Here is how I created the table in Hive:

CREATE TABLE Data_Count(LastUpdate string, Count int )
ROW FORMAT DELIMITED fields terminated by '|';

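I have tried many options without success. Please help me with this.

You have created a table; now you need to populate it with the generated data.

I believe this can be run through the Spark HiveContext: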

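LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count;

Alternatively, you may want to build a table on top of the data:

CREATE EXTERNAL TABLE IF NOT EXISTS Data_Count(
    LastUpdate DATE,
    Count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/Count';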
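Either statement only yields two populated columns if the files under that location are really pipe-delimited, and saveAsTextFile writes the string form of each element, which for a Row is the Row(...) text shown in the question. A minimal sketch of one way to format each row first, assuming a helper named to_pipe_line (hypothetical, not from the thread) and using a namedtuple as a stand-in for pyspark.sql.Row so it runs without a cluster:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark;
# a real Row is also a tuple, so the same function should apply to it.
Row = namedtuple("Row", ["LastUpdate", "Count"])


def to_pipe_line(row, delimiter="|"):
    """Join the values of one row into a single delimited text line."""
    return delimiter.join(str(value) for value in row)


rows = [
    Row("2016-03-14 12:27:55.01", 1),
    Row("2016-02-22 17:43:37.257", 5),
]
lines = [to_pipe_line(r) for r in rows]
print(lines)

# Hypothetical use in the original pipeline (not run here), where
# `result` is the DataFrame returned by sqlContext.sql(...):
# result.rdd.map(to_pipe_line).coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count")
```

Mapping to plain strings before saving keeps the delimiter in the file under your control, so it can be matched to the FIELDS TERMINATED BY clause of the table.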

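Why not load the data into Hive itself, rather than going through the process of saving a file and then loading it into Hive?

from datetime import datetime, date, time, timedelta
from pyspark.sql import HiveContext, Row

# sc is the SparkContext provided by the pyspark shell
hiveCtx = HiveContext(sc)

# Create sample data
currTime = datetime.now()
currRow = Row(LastUpdate=currTime)
delta = timedelta(days=1)
futureTime = currTime + delta
futureRow = Row(LastUpdate=futureTime)
lst = [currRow, currRow, futureRow, futureRow, futureRow]

# Parallelize the list and convert to a dataframe
myRdd = sc.parallelize(lst)
df = myRdd.toDF()
df.registerTempTable("temp_t")
aggRDD = hiveCtx.sql("select LastUpdate,Count(1) as Count from temp_t group by LastUpdate")
# On Spark 1.4 and later, aggRDD.write.saveAsTable("Data_Count") is the preferred form
aggRDD.saveAsTable("Data_Count")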

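I stored the result in a variable (ex: result) in Spark. After running the above query, when I do result.show() the data is displayed in two columns, with a pipe as the delimiter. Yes, I did load the data from the path '/apps/hive/warehouse/Count' into the table DATA_Count, but the result shows everything under the single column LastUpdate, while the other column, Count, shows as NULL.

When an RDD is displayed, it is formatted with pipes. That does not mean the data is saved to the text file with pipes. You can cat the HDFS file to check the actual delimiter. You get NULL in the second column because everything is pushed into the first column.

Hello, checked.. its delimiter is , and when I changed your query accordingly and executed it, I now get NULL in both columns.

Thanks, I used the same approach.. but the data is stored as 2016-03-14 12:27:55.01 1 2016-02-18 11:56:54.613 1, not in a table format with column names, and I cannot query it the way I would a table, ex: dl commands.

Can you post an example of a dl command that does not work here?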
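The NULL columns described in these comments can be reproduced with a rough model of how a delimited text table is read: the line is split on the declared delimiter, a missing field becomes NULL, and a field that cannot be converted to the column type becomes NULL. The sketch below is only an illustration of that behaviour, not Hive's actual implementation; parse_delimited and its casters argument are hypothetical names.

```python
def parse_delimited(line, casters, delimiter="|"):
    """Simplified model of reading one delimited text line into typed columns.

    None plays the role of NULL: it is used when a field is missing
    or when the text cannot be converted to the column's type.
    """
    fields = line.split(delimiter)
    row = []
    for i, cast in enumerate(casters):
        if i >= len(fields):
            row.append(None)      # fewer fields than columns
            continue
        try:
            row.append(cast(fields[i]))
        except ValueError:
            row.append(None)      # field does not fit the column type
    return row


# A line as written by saveAsTextFile on an RDD of Row objects
raw = "Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)"
# A line in the format the table definition expects
clean = "2016-03-14 12:27:55.01|1"

as_stored = parse_delimited(raw, [str, int])
as_wanted = parse_delimited(clean, [str, int])
print(as_stored)
print(as_wanted)
```

With the raw Row text there is no pipe to split on, so the whole line lands in the first column and the second stays empty, which matches the symptom reported above.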