
Scala: Error when creating a Dataset in Spark


Here is my code and the error it produces:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

case class Drug(S_No: int,Name: string,Drug_Name: string,Gender: string,Drug_Value: int)

scala> val ds=spark.read.csv("file:///home/xxx/drug_detail.csv").as[Drug]
org.apache.spark.sql.AnalysisException: cannot resolve '`S_No`' given input columns: [_c1, _c2, _c3, _c4, _c0];
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
This is my test data:

1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633
3,Zia Underwood,paracetamol,male,980
4,Austin Mayer,paracetamol,female,338
5,Mara Higgins,avil,female,153
6,Sybill Crosby,avil,male,193
7,Tyler Rosales,paracetamol,male,778
8,Ivan Hale,avil,female,454
9,Alika Gilmore,paracetamol,female,833
10,Len Burgess,metacin,male,325

If your csv file contains a header, you may need to include option("header", "true").

e.g.:
spark.read.option("header", "true").csv("...").as[Drug]
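
A minimal sketch of that approach, assuming the file has been given a header row; the SparkSession boilerplate is illustrative, and inferSchema is added so the numeric columns match the case class (whose types must be Scala's capitalized Int/String, since lowercase int/string will not compile):

import org.apache.spark.sql.SparkSession

case class Drug(S_No: Int, Name: String, Drug_Name: String, Gender: String, Drug_Value: Int)

val spark = SparkSession.builder().appName("drug-example").getOrCreate()
import spark.implicits._  // product encoder for the Drug case class

// "header" resolves the column names; "inferSchema" types S_No and
// Drug_Value as integers so .as[Drug] can bind them.
val ds = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:///home/xxx/drug_detail.csv")
  .as[Drug]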
Use Encoders to generate the StructType schema, then pass that schema while reading the csv file, and define the types in the case class as Int, String rather than the lowercase int, string.
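
A short sketch of that approach, deriving the StructType from the case class with Encoders.product (the path comes from the question; variable names are illustrative):

import org.apache.spark.sql.{Encoders, SparkSession}

case class Drug(S_No: Int, Name: String, Drug_Name: String, Gender: String, Drug_Value: Int)

val spark = SparkSession.builder().appName("drug-example").getOrCreate()
import spark.implicits._

// StructType generated from the case class, instead of hand-writing it.
val schema = Encoders.product[Drug].schema

// An explicit schema gives the columns proper names and types even for a
// header-less file, so .as[Drug] can resolve S_No and the other fields.
val ds = spark.read
  .schema(schema)
  .csv("file:///home/xxx/drug_detail.csv")
  .as[Drug]

This avoids depending on a header row entirely, which matches the test data in the question.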

Example:

Sample data:

cat drug_detail.csv
1,foo,bar,M,2
2,foo1,bar1,F,3

Spark-shell:

val ds = spark.read.option("header", "true").csv("file:///home/xxx/drug_detail.csv").as[Drug]

Should I specify the column names in the first line of the file?

Yep, like:
S_No,Name,Drug_Name,Gender,Drug_Value
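
With that header line prepended, the file from the question would start like this:

S_No,Name,Drug_Name,Gender,Drug_Value
1,Brandon Buckner,avil,female,525
2,Veda Hopkins,avil,male,633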
@KarthikaDavid Could you provide the first 10 records from the drug_detail.csv file?

Thanks, Shu. Could you tell me why the schema variable is added (line 2)?

@KarthikaDavid, because the column names don't match, we need to pass a schema to create the Dataset. If your csv file has a header, you can do this instead:
spark.read.option("header", true).option("inferSchema", true).csv("").as[Drug].show()
This will only work if the inferred schema matches the types defined in the case class.

For your future questions, please try to put a description in the body even if you think the title is self-explanatory, and add the test data to the question itself in a formatted block rather than in the comments.

Thanks a lot.