Apache Spark: in Spark 2.1.1, DataFrame returns the wrong column when using the select() method

I create a DataFrame through Spark's Data Source API using the following schema:

StructType(Seq(StructField("name", StringType, true), 
                        StructField("age", IntegerType, true),
                        StructField("livesIn", StringType, true),
                        StructField("bornIn", StringType, true)))

I hard-code the data in the buildScan() method of PrunedFilteredScan.

When I create the DataFrame like this:

val dfPruned = sqlContext.read.format(dsPackage).load().select("livesIn")
dfPruned.show
dfPruned.printSchema

the output shows the data of the name column under the header livesIn. If I am missing something, or if this is a bug in Spark 2.1.1, please help.

When you have a schema and have converted the rdd into Rows, you should create the DataFrame with:

sqlContext.createDataFrame(rows, schema)

Then, when you run:
val dfPruned = sqlContext.createDataFrame(rows, schema).select("livesIn")
dfPruned.show
dfPruned.printSchema

the output you should get is:
+---------+
|  livesIn|
+---------+
| Universe|
|   Mysore|
|Hyderabad|
|      Blr|
|      Chn|
|      Del|
+---------+

root
 |-- livesIn: string (nullable = true)

Edited:
If you want to use the Data Source API, it is even simpler:
sqlContext.read.format("csv").option("delimiter", " ").schema(schema).load("path to your file ").select("livesIn")

That should do the job.
Note: the input file I used is as follows:
KBN 1000000 Universe Parangipettai
Sreedhar 38 Mysore Adoni
Siva 8 Hyderabad Hyderabad
Rishi 23 Blr Hyd
Ram 45 Chn Hyd
Abey 12 Del Hyd

If you are trying to apply a schema to an rdd, you can use the createDataFrame function as shown here:
    import org.apache.spark.sql.Row

    // create a Row from each line by splitting on " "
    val rows = rdd.map( value => {
      val data = value.split(" ")
      // Row.fromSeq(data) would also work, but the second field must be converted to Int
      Row(data(0), data(1).toInt, data(2), data(3))
    })

    // create a DataFrame from the rows and the schema
    val df = sqlContext.createDataFrame(rows, schema)

    // select only the livesIn column and print it
    df.select("livesIn").show()

Output:
+---------+
|  livesIn|
+---------+
| Universe|
|   Mysore|
|Hyderabad|
|      Blr|
|      Chn|
|      Del|
+---------+ 

Hope this helps.
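For anyone who wants to run the createDataFrame approach end to end, here is a minimal self-contained sketch. It assumes a local SparkSession (the Spark 2.x entry point) and uses the six sample records as in-memory strings instead of a file; the object name CreateDataFrameSketch and the helper build are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical object name; the sample records are held in memory for illustration.
object CreateDataFrameSketch {
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, true),
    StructField("age", IntegerType, true),
    StructField("livesIn", StringType, true),
    StructField("bornIn", StringType, true)))

  val lines: Seq[String] = Seq(
    "KBN 1000000 Universe Parangipettai",
    "Sreedhar 38 Mysore Adoni",
    "Siva 8 Hyderabad Hyderabad",
    "Rishi 23 Blr Hyd",
    "Ram 45 Chn Hyd",
    "Abey 12 Del Hyd")

  // Builds the full four-column DataFrame from space-separated lines.
  def build(spark: SparkSession): DataFrame = {
    val rdd = spark.sparkContext.parallelize(lines)
    val rows = rdd.map { line =>
      val data = line.split(" ")
      // The second field is converted to Int to match IntegerType in the schema.
      Row(data(0), data(1).toInt, data(2), data(3))
    }
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("createDataFrame-sketch")
      .getOrCreate()
    build(spark).select("livesIn").show()
    spark.stop()
  }
}
```

Because the DataFrame is built from complete rows, select("livesIn") is resolved by Spark itself and no custom pruning logic is involved.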
Thanks, Shankar. However, I need to implement this by extending Spark's Data Source API, without using the createDataFrame() method.
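A possible explanation if you stay with the Data Source API: the buildScan() code is not shown in the question, so this is an assumption, but the symptom matches a buildScan that ignores its requiredColumns argument. With PrunedFilteredScan, Spark expects each returned Row to contain only the requested columns, in the requested order. If full rows are returned for select("livesIn"), Spark reads position 0 of each row, which holds name. The sketch below shows one way to honor requiredColumns; the class names PeopleRelation and DefaultSource and the hard-coded records are hypothetical stand-ins for your own code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan, RelationProvider}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical relation with hard-coded data, for illustration only.
class PeopleRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override val schema: StructType = StructType(Seq(
    StructField("name", StringType, true),
    StructField("age", IntegerType, true),
    StructField("livesIn", StringType, true),
    StructField("bornIn", StringType, true)))

  // Full records, stored in schema order.
  private val data: Seq[Seq[Any]] = Seq(
    Seq("KBN", 1000000, "Universe", "Parangipettai"),
    Seq("Sreedhar", 38, "Mysore", "Adoni"),
    Seq("Siva", 8, "Hyderabad", "Hyderabad"),
    Seq("Rishi", 23, "Blr", "Hyd"),
    Seq("Ram", 45, "Chn", "Hyd"),
    Seq("Abey", 12, "Del", "Hyd"))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Position of each requested column within the full schema.
    val indices = requiredColumns.map(name => schema.fieldIndex(name))
    // Emit only the requested values, in the requested order.
    val pruned = data.map(record => Row.fromSeq(indices.map(i => record(i)).toSeq))
    // Filters are ignored here; by default Spark re-applies them to the result.
    sqlContext.sparkContext.parallelize(pruned)
  }
}

// Hypothetical provider so that read.format(<package name>).load() can find the relation.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new PeopleRelation(sqlContext)
}
```

With a relation written this way, sqlContext.read.format(dsPackage).load().select("livesIn") should show the livesIn values under the livesIn header, since each Row then holds exactly the one requested column.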