Spark SQL UDF: Task not serializable (scala, apache-spark, apache-spark-sql, datastax)

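Cassandra & DataStax community, I have a question that I hope someone wise can help me with.

We are migrating our analytics code from Hadoop to Spark running on top of Cassandra (via DataStax Enterprise). DSE 4.7 is in production, but 4.8 is in development.

Java 7 is in production, Java 7/8 in development.

We need a couple of DataFrame transformations, and we thought that writing a UDF via the Spark SQLContext against an in-memory DataFrame would do the job. The main ones are:

Every text value in our data is prefixed and suffixed with a double quote ("), i.e. "some data". This is very annoying, so we want to strip every one of them.

We want to add a column containing a hashed key composed of a number of other columns.

Our code is below. It runs fine when the SQL statement contains no UDF calls, but as soon as the UDF calls are added we get a "Task not serializable" error:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable

I have tried adding "implements Serializable" to this class (and many other classes). That moves the error to the next class up the chain, but it eventually fails on the Exception class not being serializable, which probably means I am heading in the wrong direction.

I have also tried implementing the UDFs as lambdas, which results in the same error.

If anyone can point out what I am doing wrong, it would be greatly appreciated.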

public class entities implements Serializable{
    private spark_context m_spx = null;
    private DataFrame m_entities = null;
    private String m_timekey = null;

    public entities(spark_context _spx, String _timekey){
        m_spx = _spx;
        m_timekey = _timekey;
    }


    public DataFrame get_dimension(){
        if(m_entities == null) {

            DataFrame df = m_spx.get_flat_data(m_timekey).select("event", "url");

            //UDF to generate hashed ids
            UDF2 get_hashed_id = new UDF2<String, String, String>() {
                public String call(String o, String o2) throws Exception {
                    return o.concat(o2);
                }
            };


            //UDF to clean the " from strings
            UDF1 clean_string = new UDF1<String, String>() {
                public String call(String o) throws Exception {
                    return o.replace("\"","");
                }
            };


            //Get the Spark SQL Context from SC.
            SQLContext sqlContext = new SQLContext(m_spx.sc());


            //Register the UDFs
            sqlContext.udf().register("getid", get_hashed_id, DataTypes.StringType);
            sqlContext.udf().register("clean_string", clean_string, DataTypes.StringType);


            //Register the DF as a table.
            sqlContext.registerDataFrameAsTable(df, "entities");
            m_entities = sqlContext.sql("SELECT getid(event, url) as event_key, clean_string(event) as event_cleaned, clean_string(url) as url_cleaned FROM entities");
        }

        return m_entities;
    }
}

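Your entities class contains a SparkContext member, so it cannot be serialized. SparkContexts are intentionally not serializable, and you are not supposed to serialize them.

Because entities is not serializable, none of its non-static methods, members, or anonymous inner classes are serializable either: serializing any of them also tries to serialize the entities instance that encloses them.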
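The capture behaviour can be reproduced without Spark. The following is a minimal sketch using plain java.io serialization; `CaptureDemo`, `StringFn`, `FakeContext`, and `Holder` are hypothetical stand-ins (for the UDF interface, the SparkContext, and the entities class respectively) and are not part of any Spark API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {

    /** Hypothetical stand-in for a serializable function interface such as UDF1. */
    interface StringFn extends Serializable {
        String call(String s);
    }

    /** Hypothetical stand-in for a non-serializable resource such as SparkContext. */
    static class FakeContext { }

    /** Mirrors the question's class: marked Serializable, but holds a non-serializable field. */
    static class Holder implements Serializable {
        private final FakeContext ctx = new FakeContext();

        // Created in a static context: there is no enclosing Holder instance to capture.
        static final StringFn STATIC_CLEAN = new StringFn() {
            public String call(String s) { return s.replace("\"", ""); }
        };

        // Created in an instance method: the anonymous class keeps a hidden
        // reference to the enclosing Holder, which drags ctx along with it.
        StringFn instanceClean() {
            return new StringFn() {
                public String call(String s) { return s.replace("\"", ""); }
            };
        }
    }

    /** Returns true if Java serialization of the object succeeds. */
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("static UDF:   " + canSerialize(Holder.STATIC_CLEAN));
        System.out.println("instance UDF: " + canSerialize(new Holder().instanceClean()));
    }
}
```

In this sketch the statically defined function serializes, while the one created inside an instance method fails with a NotSerializableException for `FakeContext`, even though its body never touches the enclosing object.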

In this case, the best fix is to extract the anonymous UDFs into static members of the class:

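    // Static members: created without an enclosing entities instance,
    // so Spark can serialize them on their own.

    //UDF to generate hashed ids
    private final static UDF2<String, String, String> get_hashed_id = new UDF2<String, String, String>() {
        public String call(String o, String o2) throws Exception {
            return o.concat(o2);
        }
    };

    //UDF to clean the " from strings
    private final static UDF1<String, String> clean_string = new UDF1<String, String>() {
        public String call(String o) throws Exception {
            return o.replace("\"","");
        }
    };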

You can then use them in get_dimension.