Apache Spark: How to include double quotes in Spark SQL concat?
I am trying to concatenate two columns, with a double quote as prefix and suffix around each of them. The code works, but it gives me extra double quotes.

Input:
campaign_file_name_1, campaign_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, 1
campaign_file_name_1, campaign_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, 2
Expected output:
campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, "campaign_name_1"="1", 2017-06-06 17:09:31
campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, "campaign_name_1"="2", 2017-06-06 17:09:31
Actual output from the code:
campaign_file_name_1, shagdhsjagdhjsagdhrSqpaKa5saoaus89, """campaign_name_1""=""1""", 2017-06-06 17:09:31
campaign_file_name_1, sagdhsagdhasjkjkasihdklas872hjsdjk, """campaign_name_1""=""2""", 2017-06-06 17:09:31
Spark code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame
import org.slf4j.LoggerFactory

object campaignResultsMergerETL extends BaseETL {
  val now = ApplicationUtil.getCurrentTimeStamp()
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val log = LoggerFactory.getLogger(this.getClass.getName)

  def main(args: Array[String]): Unit = {
    //---------------------
    // code for sqlContext initialization (omitted in the question)
    //---------------------
    val campaignResultsDF = sqlContext.read.format("com.databricks.spark.avro").load(campaignResultsLoc)
    campaignResultsDF.registerTempTable("campaign_results")

    // Aggregate the measure per file, campaign and tracker
    val campaignGroupedDF = sqlContext.sql(
      """
        |SELECT campaign_file_name,
        |campaign_name,
        |tracker_id,
        |SUM(campaign_measure) AS campaign_measure
        |FROM campaign_results
        |GROUP BY campaign_file_name,campaign_name,tracker_id
      """.stripMargin)
    campaignGroupedDF.registerTempTable("campaign_results_full")

    // Build the "name"="measure" column and add the audit timestamp
    val campaignMergedDF = sqlContext.sql(
      s"""
        |SELECT campaign_file_name,
        |tracker_id,
        |CONCAT('\"',campaign_name, '\"','=','\"',campaign_measure,'\"'),
        |"$now" AS audit_timestamp
        |FROM campaign_results_full
      """.stripMargin)
    saveAsCSVFiles(campaignMergedDF, campaignResultsExportLoc, numPartitions)
  }

  def saveAsCSVFiles(campaignMeasureDF: DataFrame, hdfs_output_loc: String, numPartitions: Int): Unit = {
    log.info("saveAsCSVFile method started")
    // Remove any previous export before writing
    if (fs.exists(new Path(hdfs_output_loc))) {
      fs.delete(new Path(hdfs_output_loc), true)
    }
    campaignMeasureDF.repartition(numPartitions).write.format("com.databricks.spark.csv").save(hdfs_output_loc)
    log.info("saveAsCSVFile method ended")
  }
}
Can anyone help me solve this issue?

It looks like you have mistakenly included = in the CONCAT arguments. Try:
|CONCAT('\"',campaign_name, '\"','=','\"',campaign_measure,'\"'),
[UPDATE]
Perhaps your Spark version is different from mine; it seems to work fine for me:
val df = Seq(("x", "y")).toDF("a", "b")
df.createOrReplaceTempView("df")
val df2 = spark.sqlContext.sql("""SELECT a, b, CONCAT('"', a, '"="', b, '"') as a_eq_b FROM df""")
df2.show
+---+---+-------+
| a| b| a_eq_b|
+---+---+-------+
| x| y|"x"="y"|
+---+---+-------+
df2.coalesce(1).write.option("header", "true").csv("/path/to/df2.csv")
/path/to/df2.csv content:
a,b,a_eq_b
x,y,"\"x\"=\"y\""
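The extra characters in the file come from the CSV writer, not from CONCAT: a field that itself contains the quote character gets wrapped in quotes, and the inner quotes are escaped. The following is a minimal sketch in plain Scala (no Spark involved) of that quoting rule. The object name CsvQuotingSketch and the function quoteField are hypothetical names used only for illustration, and the two escaping styles are assumptions modelled on the two outputs shown in this thread (backslash escaping as in the file above, quote doubling as in the question's output), not a copy of any library's implementation.

```scala
object CsvQuotingSketch {
  // Quote a field the way a typical CSV writer does: only when it contains
  // the delimiter, the quote character, or a line break.
  def quoteField(field: String, escapeWithBackslash: Boolean): String = {
    val needsQuoting = field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r')
    if (!needsQuoting) field
    else {
      val escaped =
        if (escapeWithBackslash) field.replace("\"", "\\\"") // each " becomes \"
        else field.replace("\"", "\"\"")                      // each " becomes ""
      "\"" + escaped + "\""
    }
  }

  def main(args: Array[String]): Unit = {
    val value = "\"x\"=\"y\"" // the string "x"="y", as produced by CONCAT
    println(quoteField(value, escapeWithBackslash = true))  // backslash style
    println(quoteField(value, escapeWithBackslash = false)) // doubled-quote style
  }
}
```

In both styles the value held in the DataFrame is still "x"="y"; only its serialized form in the file changes, which is why changing the writer's quote setting is what removes the extra characters.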
Now, to avoid the extra quoting, you can optionally set the quote option to the null character, as follows:
df2.coalesce(1).write.option("header", "true").option("quote", "\u0000").csv("/path/to/df2null.csv")
/path/to/df2null.csv content:
a,b,a_eq_b
x,y,"x"="y"
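Note, however, that if you need to re-read the CSV in Spark, make sure to include the same quote option.

@Leo: I tried the same approach, but the output is still incorrect.

@Surender Raja, please see my expanded answer.

This is a perfect answer. Thanks a lot.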
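
For the re-read case mentioned in the answer, a minimal sketch might look like the following. It assumes Spark 2.x or later with the built-in CSV source, a local SparkSession created only for illustration, and the /path/to/df2null.csv location used in the answer; the object name ReadBackSketch is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ReadBackSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only to keep the sketch self-contained
    val spark = SparkSession.builder()
      .appName("read-back-sketch")
      .master("local[*]")
      .getOrCreate()

    // Use the same quote setting as the writer so the literal double quotes
    // in a_eq_b are treated as data rather than as field delimiters
    val df = spark.read
      .option("header", "true")
      .option("quote", "\u0000")
      .csv("/path/to/df2null.csv")

    df.show(false)
    spark.stop()
  }
}
```

With quoting disabled on both sides, a value that contains a comma would no longer be protected, so this approach is only suitable when the data is known not to contain the delimiter.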