How do I process this large dataset in Scala with IntelliJ?

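A few days ago I started learning Scala in IntelliJ, and I am teaching myself, so please bear with my beginner mistakes. I have a CSV file with more than 10,000 rows and 13 columns.

The column headers are:

Category|Rating|Reviews|Size|Installs|Type|Price|Content Rating|Genres|Last Updated|Current Version|Android Version

I managed to read and print the CSV file with the following code: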

import scala.io.Source


object task {
  def main(args: Array[String]): Unit = {
    for(line <- Source.fromFile("D:/data.csv"))
    {
      println(line)
    }
  }
}
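Source gives you an iterator over characters by default, which is why the loop above prints one character at a time. To iterate over lines, use .getLines: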

 Source.fromFile(fileName)
   .getLines
   .foreach(println)
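To make the difference concrete, here is a minimal sketch that uses Source.fromString on a made-up two-line sample instead of a real file. The object name LinesVsChars and the sample text are assumptions for illustration only.

```scala
import scala.io.Source

object LinesVsChars {
  // Made-up sample standing in for the contents of a file.
  val sample: String = "a|1\nb|2"

  // Iterating the Source itself yields individual characters,
  // including the newline between the two lines.
  def chars: List[Char] = Source.fromString(sample).toList

  // getLines() yields one String per line, without the line terminator.
  def lines: List[String] = Source.fromString(sample).getLines().toList
}
```

With a real file, Source.fromFile behaves the same way, and the source should be closed once you are done reading it.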
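To split each line into an array, use split (assuming the column values do not contain the separator). Because split takes a regular expression, the | separator must be escaped as "\\|" or passed as the character '|'.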
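The splitting step might look like the sketch below. The sample line and the object name SplitDemo are made up for illustration; the point is only to show how an unescaped "|" behaves compared with the escaped or character forms.

```scala
object SplitDemo {
  // Made-up sample line with three '|'-separated fields.
  val line: String = "GAME|4|100"

  // An unescaped "|" is an empty regex alternation, so it matches between
  // every character and the fields are not recovered.
  def naive: Array[String] = line.split("|")

  // Escaping the pipe makes the regex match the literal character.
  def escaped: Array[String] = line.split("\\|")

  // The Char overload from Scala's string extensions avoids regexes altogether.
  def byChar: Array[String] = line.split('|')
}
```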

But it is better to avoid raw arrays. Creating a case class gives you more readable code:

   case class AppData(
     category: String,
     rating: Int,
     reviews: Int,
     size: Int,
     installs: Int,
     `type`: String,
     price: Double,
     contentRating: Int,
     genres: Seq[String],
     lastUpdated: Long,
     version: String,
     androidVersion: String
  ) {
     // Weighted score from 0 to 100: 40% relative rating, 60% relative review count
     def priority(maxRating: Int, maxReviews: Int): Double =
       if (maxRating == 0 || maxReviews == 0) 0.0 else
         (rating * 0.4 / maxRating + reviews * 0.6 / maxReviews) * 100
  }

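  object AppData {
    // Parse one '|'-separated line. The explicit result type is required
    // because apply is overloaded with the generated 12-argument version.
    def apply(str: String): AppData = {
       // split takes a regex, so the pipe is escaped; -1 keeps trailing empty fields
       val fields = str.split("\\|", -1)
       assert(fields.length == 12)
       AppData(
         fields(0),
         fields(1).toInt,
         fields(2).toInt,
         fields(3).toInt,
         fields(4).toInt,
         fields(5),
         fields(6).toDouble,
         fields(7).toInt,
         fields(8).split(",").toSeq,
         fields(9).toLong,
         fields(10),
         fields(11)
       )
    }
  }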
Now you can do what you want fairly cleanly:

  // Read the data, parse it and group by category
  // This gives you a map of categories to a list of apps
  val byCategory = Source.fromFile(fileName)
    .getLines()
    .drop(1) // skip the header row; remove this if the file has none
    .map(line => AppData(line))
    .toList // an Iterator has no groupBy, so materialize it first
    .groupBy(_.category)

  // Now, find out max ratings and reviews for each category
  // This could be done even nicer with another case class and
  // a monoid, but tuple/fold will do too
  // It is tempting to use `.mapValues` here, but that's not a good idea
  // because .mapValues is LAZY, it will recompute the max every time
  // the value is accessed!
  val maxes = byCategory.map { case (cat, data) =>
     cat ->
        data.foldLeft(0 -> 0) { case ((maxRating, maxReviews), in) =>
          (maxRating max in.rating, maxReviews max in.reviews)
        }
  }.withDefault( _ => (0,0))

  // And finally go through your categories, and find best for each,
  // that's it!
  val bestByCategory = byCategory.map { case (cat, apps) =>
    val (maxRating, maxReviews) = maxes(cat)
    cat -> apps.maxBy(_.priority(maxRating, maxReviews))
  }
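
To try the grouping and ranking steps without the real file, here is a self-contained sketch on a made-up in-memory sample. The reduced App record (three fields only), the sample rows and the object name BestByCategoryDemo are all assumptions for illustration; only the priority formula is taken from the answer above.

```scala
import scala.io.Source

object BestByCategoryDemo {
  // Hypothetical reduced record: only the fields the ranking needs.
  final case class App(category: String, rating: Int, reviews: Int) {
    // Same weighting as above: 40% relative rating, 60% relative reviews.
    def priority(maxRating: Int, maxReviews: Int): Double =
      if (maxRating == 0 || maxReviews == 0) 0.0
      else (rating * 0.4 / maxRating + reviews * 0.6 / maxReviews) * 100
  }

  // Made-up rows in "category|rating|reviews" form.
  val sample: String =
    List("GAME|4|100", "GAME|5|10", "TOOLS|3|50").mkString("\n")

  // Parse one line; assumes exactly three well-formed fields.
  def parse(line: String): App = {
    val f = line.split('|')
    App(f(0), f(1).toInt, f(2).toInt)
  }

  // Group by category, then pick the app with the highest priority
  // relative to the maxima of its own category.
  def bestByCategory(text: String): Map[String, App] = {
    val byCategory =
      Source.fromString(text).getLines().map(parse).toList.groupBy(_.category)
    byCategory.map { case (cat, apps) =>
      val maxRating = apps.map(_.rating).max
      val maxReviews = apps.map(_.reviews).max
      cat -> apps.maxBy(_.priority(maxRating, maxReviews))
    }
  }
}
```

In this sample the GAME app with 100 reviews wins its category, because the review count carries more weight than the rating.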

Thank you. This helped a lot. Sorry, I can't upvote your answer because I'm a new user.