Spark fails when calling a Scala class method to split a comma-separated string

Tags: scala, apache-spark

I have the following class in the Scala shell in Spark:

import org.apache.spark.rdd.RDD   // needed for the RDD type in the shell

class StringSplit(val query: String)
{
  def getStrSplit(rdd: RDD[String]): RDD[String] = {
    rdd.map(x => x.split(query))
  }
}
I am trying to call the method in this class like so:

val inputRDD = sc.parallelize(List("one", "two", "three"))
val strSplit = new StringSplit(",")
strSplit.getStrSplit(inputRDD)
-> This step fails with the error: getStrSplit is not a member of StringSplit


Can you tell me what is wrong here?

This seems like a reasonable thing to try, but:

  • the result type of getStrSplit is wrong, because .split returns Array[String] rather than String (see the sketch after this list)
  • parallelize(List("one","two","three")) stores "one", "two" and "three" as three separate records, so none of them contains a comma to split on
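To make the first point concrete, here is what .split actually returns (a quick check in plain Scala, no Spark needed):

scala> val parts = "1,2,3,4".split(",")
parts: Array[String] = Array(1, 2, 3, 4)

So mapping split over an RDD[String] produces an RDD[Array[String]], which is why the declared result type RDD[String] does not match.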
An alternative:

val input = sc.parallelize(List("1,2,3,4","5,6,7,8"))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>
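With that input, the split can be applied directly, without any class (the res number below is illustrative):

scala> input.map(_.split(",")).collect()
res0: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))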
As for the original question, doing the same operation inside a class, I tried:

class StringSplit(query: String) {
  def get(rdd: RDD[String]) = rdd.map(_.split(query))
}

val ss = new StringSplit(",")

ss.get(input)
--->  org.apache.spark.SparkException: Task not serializable
I guess this happens because the class is not serialized out to each worker: Spark tries to ship the split function, but the closure drags in the class instance in order to read query, and that instance cannot be sent.
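One common workaround, my addition rather than part of the original answer, is to copy the constructor parameter into a local val inside the method; the closure then captures only that string, not the class instance:

class StringSplit(query: String) {
  def get(rdd: RDD[String]) = {
    val q = query          // local copy: the closure captures q, not `this`
    rdd.map(_.split(q))
  }
}

With the separator hard-coded instead, there is no parameter to ship at all: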

scala> class commaSplitter {
     |   def get(rdd: RDD[String]) = rdd.map(_.split(","))
     | }
defined class commaSplitter

scala> val cs = new commaSplitter;
cs: commaSplitter = $iwC$$iwC$commaSplitter@262f1580

scala> cs.get(input);
res29: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[23] at map at <console>:10

scala> cs.get(input).collect()
res30: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))

This works: the closure _.split(",") refers to no field of the class, so the instance itself never needs to be serialized.

In your class definition, is there a def or val keyword in front of getStrSplit?

Awesome, Paul... thanks for taking the time to explain the details... you made my day. Good luck.
Going back to the parameterized class, declaring it Serializable fixes the problem:

scala> class stringSplitter(s: String) extends Serializable {
     |   def get(rdd: RDD[String]) = rdd.map(_.split(s))
     | }
defined class stringSplitter

scala> val ss = new stringSplitter(",");
ss: stringSplitter = $iwC$$iwC$stringSplitter@2a33abcd

scala> ss.get(input)
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[25] at map at <console>:10

scala> ss.get(input).collect()
res34: Array[Array[String]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))
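As a side note, and my addition rather than part of the original answer: a case class avoids the explicit extends Serializable, because case classes are serializable by default.

case class StringSplitter(sep: String) {
  def get(rdd: RDD[String]) = rdd.map(_.split(sep))
}

val cs = StringSplitter(",")   // case classes provide apply, so no `new` needed
cs.get(input).collect()        // Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))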