Scala: Task not serializable in community.cloud.databricks

Tags: scala, apache-spark, databricks

Databricks community cloud is throwing an
org.apache.spark.SparkException: Task not serializable
exception, while my local machine does not throw it when executing the same code.

The code comes from the book Spark in Action. It reads a JSON file containing GitHub activity data, then reads a file of employee usernames from a fictional company, and finally ranks the employees by their number of pushes.

To avoid extra shuffling, the variable holding the employee list is broadcast; however, when it is time to return the ranking, Databricks community cloud throws the exception.

import org.apache.spark.sql.SparkSession
import scala.io.Source.fromURL

val spark = SparkSession.builder()
.appName("GitHub push counter")
.master("local[*]")
.getOrCreate()

val sc = spark.sparkContext

val inputPath = "/FileStore/tables/2015_03_01_0-a829c.json"
val pushes = spark.read.json(inputPath).filter("type = 'PushEvent'")
val grouped = pushes.groupBy("actor.login").count
val ordered = grouped.orderBy(grouped("count").desc)

val empPath = "https://raw.githubusercontent.com/spark-in-action/first-edition/master/ch03/ghEmployees.txt"
val employees = Set() ++ (for { line <- fromURL(empPath).getLines} yield line.trim)

val bcEmployees = sc.broadcast(employees)

import spark.implicits._
val isEmp = (user: String) => bcEmployees.value.contains(user)
val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
val filtered = ordered.filter(isEmployee($"login"))
filtered.show()
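This exception usually comes from closure capture rather than from the DataFrame operations themselves: in a notebook, top-level vals are fields of a non-serializable wrapper object, and a lambda that reads such a field drags the whole wrapper into the serialized task. Here is a plain-Scala sketch of the effect; the `Notebook` class and `isJavaSerializable` helper are hypothetical illustrations, not part of the original code:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for a notebook's wrapper object: it is NOT Serializable.
class Notebook {
  val employees = Set("alice", "bob")

  // Reads the field `employees`, so the lambda captures `this` (the Notebook).
  def badUdf: String => Boolean = user => employees.contains(user)

  // Copies the field into a local val first, so only the Set is captured.
  def goodUdf: String => Boolean = {
    val emp = employees
    user => emp.contains(user)
  }
}

// Roughly the check Spark performs before shipping a closure to executors.
def isJavaSerializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }
```

Applied to the code above, the same trick would mean copying the broadcast handle into a local val (e.g. `val bc = bcEmployees`) immediately before defining `isEmp`, so the UDF closure captures only the handle.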

If you have a custom class such as Employee or User, it needs to implement the Serializable interface. When Spark processes custom user objects, it must serialize them.
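A minimal sketch of what this answer describes (the `Employee` and `User` names are hypothetical): any custom type that a Spark task captures or ships must be Java-serializable, which in Scala means extending `Serializable`; case classes get this for free.

```scala
// Hypothetical custom class: extending Serializable lets Spark ship
// instances of it to executors inside tasks.
class Employee(val login: String) extends Serializable {
  def matches(user: String): Boolean = login == user
}

// Case classes are serializable by default, so they work in closures as-is.
case class User(login: String)
```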


Is there no error outside of Databricks? I believe there isn't, so you should change the title and tags.