Apache Spark / PySpark: selecting columns using an alias

I am trying to do a simple select from an alias using SQLContext.sql in Spark 1.6.

from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType

sqlCtx = SQLContext(sc)  # sc is the SparkContext provided by the pyspark shell
## Import CSV File
header = (sc.textFile("data.csv")
          .map(lambda line: [x for x in line.split(",")]))

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'desc'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.desc from headerTab as d").show()
I noticed this seems to work in Spark 2.0, but for now I am limited to 1.6.
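
For comparison, here is a minimal sketch of what the Spark 2.x equivalent would look like, where the same aliased query is accepted; it assumes the same data.csv layout and uses the SparkSession entry point and createOrReplaceTempView, which replace SQLContext and registerTempTable in 2.x:

## Minimal Spark 2.x sketch (same data.csv layout assumed)
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("alias-select").getOrCreate()

header = (spark.sparkContext.textFile("data.csv")
          .map(lambda line: line.split(",")))

headerDF = header.toDF(['header', 'adj', 'desc'])
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

## createOrReplaceTempView is the Spark 2.x replacement for registerTempTable
headerDF.createOrReplaceTempView("headerTab")

## The 2.x parser accepts desc as a column name here, so the aliased select works
spark.sql("select d.desc from headerTab as d").show()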

Here is the error message I see. For a simple select I could drop the alias, but eventually I will be trying joins across multiple tables that have the same column names.

Spark 1.6 error

Traceback (most recent call last):
  File "/home/temp/text_import.py", line 49, in <module>
    head = sqlCtx.sql("select d.desc from headerTab as d").show()
  File "/home/pricing/spark-1.6.1/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql
  File "/home/pricing/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/home/pricing/spark-1.6.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/home/pricing/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.sql.
: java.lang.RuntimeException: [1.10] failure: ``*'' expected but `desc' found

As noted in the comments below the question, using desc is not appropriate because it is a keyword. Changing the column name resolves the issue:

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'descTmp'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.descTmp from headerTab as d").show()

+-----------+
|    descTmp|
+-----------+
|       data|
|       data|
|       data|
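
As an aside, the reserved word only trips up the SQL parser; a minimal sketch of the same select done through the DataFrame API, assuming a DataFrame that still has the original desc column, sidesteps the issue without renaming anything:

## Sketch: bracket notation bypasses the SQL parser, so the keyword is not a problem
## (assumes headerDF still carries its original 'desc' column)
headerDF.select(headerDF['desc']).show()

## Renaming on the fly is also possible if a non-keyword name is needed downstream
headerDF.select(headerDF['desc'].alias('description')).show()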

Basically, you are using the keyword desc as a column name, which is not appropriate. You can solve this in two ways: change the column name, or wrap the keyword desc in backticks (`).

Way 1:

sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
          .map(lambda line: [x for x in line.split(",")]))

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'description'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.description from headerTab as d").show()
Way 2:

sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
          .map(lambda line: [x for x in line.split(",")]))

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'desc'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.`desc` from headerTab as d").show()

Your problem is with desc, because desc is a keyword used for ordering. So try head = sqlCtx.sql("select d.`desc` from headerTab as d").show()

Comments: I have a hard time seeing how your suggestion is different, but I do understand your point about the keyword. I changed the column name to descTmp and it works as designed. Thanks for the help, I will post the answer. — Great, 2 upvotes for you; you have the better answer, accepting it. — Thanks @JestonBlu.

The full example:
sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
          .map(lambda line: [x for x in line.split(",")]))

## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'desc'])

## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))

headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.`desc` from headerTab as d").show()
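
Since the original goal was joining multiple tables that share column names, here is a minimal sketch of how the backtick quoting carries over to a join; the second table, its columns, and its rows are made-up placeholders rather than anything from the original post:

## Hypothetical second table that also has a 'desc' column, for illustration only
otherDF = sqlCtx.createDataFrame(
    [("a", "left data"), ("b", "right data")],
    ['header', 'desc'])
otherDF.registerTempTable("otherTab")

## Table aliases disambiguate the shared column names; backticks keep the
## Spark 1.6 parser from reading desc as the ordering keyword
joined = sqlCtx.sql("""
    select a.`desc` as desc_a, b.`desc` as desc_b
    from headerTab as a
    join otherTab as b on a.header = b.header
""")
joined.show()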