Scala: creating an RDD with too many fields => a case class for the RDD

Tags: scala, apache-spark, rdd

I have an intrusion detection dataset that I want to use to test different supervised machine learning techniques.

This is part of my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object parser_dataset {

    val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("kdd")
        .set("spark.executor.memory", "8g")
    conf.registerKryoClasses(Array(
        classOf[Array[Any]],
        classOf[Array[scala.Tuple3[Int, Int, Int]]],
        classOf[String],
        classOf[Any]
    ))
    val context = new SparkContext(conf)

    // Parses one CSV line into one tuple. This is the part that fails:
    // the tuple below has 42 elements, and Scala tuples stop at Tuple22.
    def load(file: String): RDD[(Int, String, String, String, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Double, Double, Double, Double, Double, Double, Double, Int, Int, Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
        val data = context.textFile(file)
        val res = data.map { x =>
            val s = x.split(",")
            (s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
        }.persist(StorageLevel.MEMORY_AND_DISK)
        res
    }

    def main(args: Array[String]): Unit = {
        val data = this.load("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected")
        data.collect().foreach(println)   // was data1, which is undefined
        data.distinct()                   // note: the distinct RDD is discarded here
    }
}

This is not my code; it was given to me, and I only modified some parts (in particular the RDD and the splitting). I am new to Scala and Spark.

EDIT: So I added case classes above the load function, like this:

case class BasicFeatures(duration:Int, protocol_type:String, service:String, flag:String, src_bytes:Int, dst_bytes:Int, land:Int, wrong_fragment:Int, urgent:Int) 

case class ContentFeatures(hot:Int, num_failed_logins:Int, logged_in:Int, num_compromised:Int, root_shell:Int, su_attempted:Int, num_root:Int, num_file_creations:Int, num_shells:Int, num_access_files:Int, num_outbound_cmds:Int, is_host_login:Int, is_guest_login:Int)

case class TrafficFeatures(count:Int, srv_count:Int, serror_rate:Double, srv_error_rate:Double, rerror_rate:Double, srv_rerror_rate:Double, same_srv_rate:Double, diff_srv_rate:Double, srv_diff_host_rate:Double, dst_host_count:Int, dst_host_srv_count:Int, dst_host_same_srv_rate:Double, dst_host_diff_srv_rate:Double, dst_host_same_src_port_rate:Double, dst_host_srv_diff_host_rate:Double, dst_host_serror_rate:Double, dst_host_srv_serror_rate:Double, dst_host_rerror_rate:Double, dst_host_srv_rerror_rate:Double, attack_type:String )
But now I am confused: how can I use these case classes to solve my problem? I still need an RDD with one feature = one field. Here is one line of the file I want to parse:

0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0.00,0.00,0.00,0.00,1.00,0.00,0.00,150,25,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal,20

The maximum tuple size supported by Scala is 22, and Scala functions are likewise limited to 22 parameters. So you cannot create a tuple with more than 22 elements.
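For illustration, a minimal sketch of where the limit bites (the names ok and tooBig are made up; nothing here is specific to the question's dataset):

object TupleLimit {
    // Compiles: scala.Tuple22 is the largest tuple type the standard library defines.
    val ok = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
              12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)

    // Does not compile: there is no scala.Tuple23 (or anything larger),
    // which is exactly why the 42-element tuple in load() above is rejected.
    // val tooBig = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
    //               12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
}

Since Scala 2.11, a case class is not bound by this limit: it may declare more than 22 fields (at the cost of the auto-generated unapply and tupled), so grouping the record into case classes is the usual way out.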

Honestly, using a tuple of 22 fields is really bad practice; in my opinion, such a tuple doesn't describe anything. This issue aside, consider writing your own class, which carries some meaning. A year from now, when you have to modify this code, you'll say "thank you" :) – T.Gawęda

I agree with @T.Gawęda. Possible duplicate of an earlier question on the same limit, and worth reading up on.

It is indeed a duplicate; I suspected it was because there are too many fields, but I couldn't find a topic on it :/ So @T.Gawęda, how would you write something "similar" that I can use in Spark? – LaureD

@LaureD You can create case classes. They will also be large, but the fields will have some meaning: later you will know what each field means without reading the entire code. – T.Gawęda
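To make that suggestion concrete, here is a minimal sketch (my own, not from the thread) of how load() could return the three case classes from the question's edit instead of one giant tuple, keeping one feature per field. It assumes BasicFeatures, ContentFeatures, TrafficFeatures and the SparkContext named context are in scope, as in the question:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Drop-in replacement for load() inside parser_dataset.
def load(file: String): RDD[(BasicFeatures, ContentFeatures, TrafficFeatures)] = {
    context.textFile(file).map { line =>
        val s = line.split(",")
        // fields 0-8: basic connection features
        val basic = BasicFeatures(s(0).toInt, s(1), s(2), s(3), s(4).toInt,
            s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt)
        // fields 9-21: content features
        val content = ContentFeatures(s(9).toInt, s(10).toInt, s(11).toInt,
            s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt,
            s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt)
        // fields 22-41: traffic features, ending with the attack label
        val traffic = TrafficFeatures(s(22).toInt, s(23).toInt, s(24).toDouble,
            s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble,
            s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt,
            s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble,
            s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble,
            s(41))
        (basic, content, traffic)
    }.persist(StorageLevel.MEMORY_AND_DISK)
}

A Tuple3 of case classes stays well under the 22-element limit, and each feature is now reached by name, e.g. record._3.attack_type, instead of an anonymous tuple position.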