How do I process this large data set in Scala with IntelliJ?
I started learning Scala in IntelliJ a few days ago, and I am teaching myself, so please bear with my beginner mistakes. I have a CSV file with more than 10,000 rows and 13 columns. The column headers are:
Category|Rating|Reviews|Size|Installs|Type|Price|Content Rating|Genres|Last Updated|Current Version|Android Version
I managed to read and display the CSV file with the following code:
import scala.io.Source

object task {
  def main(args: Array[String]): Unit = {
    for (line <- Source.fromFile("D:/data.csv")) {
      println(line)
    }
  }
}
Source gives you an iterator over characters by default, which is why your loop prints one character per line. To iterate over lines, use .getLines:
Source.fromFile(fileName)
.getLines
.foreach(println)
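To see the difference without touching a file, here is a small sketch that uses Source.fromString on a made-up two-line sample (the sample text and the object name are assumptions for illustration only):

```scala
import scala.io.Source

object SourceIterationDemo {
  // Made-up two-line sample standing in for the CSV file
  val sample = "GAME|4|100\nTOOLS|5|20"

  // Iterating a Source directly walks it one Char at a time
  val chars: List[Char] = Source.fromString(sample).toList

  // getLines() yields one String per line, without the line terminator
  val lines: List[String] = Source.fromString(sample).getLines().toList

  def main(args: Array[String]): Unit = {
    println(s"characters: ${chars.length}, lines: ${lines.length}")
    lines.foreach(println)
  }
}
```

With a real file you would swap Source.fromString(sample) for Source.fromFile(...) and close the source once you are done with it.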
To split a line into an array, use split (assuming the column values do not contain the delimiter).
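A minimal sketch of that split, assuming the pipe delimiter from the header in the question. The sample row and the object name are made up. One thing to watch for: split takes a regular expression, so the pipe has to be escaped.

```scala
object SplitDemo {
  // Hypothetical data row in the column order given in the question
  val line = "GAME|4|100|19|10000|Free|0.0|3|Action,Arcade|1530000000|1.2.3|4.1"

  // String.split takes a regular expression. An unescaped "|" means
  // "or" and would not split on the pipe character, so escape it.
  val fields: Array[String] = line.split("\\|")

  def main(args: Array[String]): Unit =
    fields.zipWithIndex.foreach { case (field, i) => println(s"$i: $field") }
}
```

If the last columns of a row can be empty, split("\\|", -1) keeps the trailing empty fields instead of dropping them.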
But it is better to avoid raw arrays. A case class gives you much more readable code:
case class AppData(
  category: String,
  rating: Int,
  reviews: Int,
  size: Int,
  installs: Int,
  `type`: String,
  price: Double,
  contentRating: Int,
  genres: Seq[String],
  lastUpdated: Long,
  version: String,
  androidVersion: String
) {
  // Weighted score (40% rating, 60% reviews) relative to the
  // category maxima, scaled to 0..100
  def priority(maxRating: Int, maxReview: Int): Double =
    if (maxRating == 0 || maxReview == 0) 0.0
    else (rating * 0.4 / maxRating + reviews * 0.6 / maxReview) * 100
}

object AppData {
  // Parse one pipe-delimited line. split takes a regex, so the pipe is escaped.
  // The explicit result type is required because apply is overloaded.
  def apply(str: String): AppData = {
    val fields = str.split("\\|")
    assert(fields.length == 12)
    AppData(
      fields(0),
      fields(1).toInt,
      fields(2).toInt,
      fields(3).toInt,
      fields(4).toInt,
      fields(5),
      fields(6).toDouble,
      fields(7).toInt,
      fields(8).split(",").toSeq,
      fields(9).toLong,
      fields(10),
      fields(11)
    )
  }
}
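The parser above throws on the first row that has the wrong number of fields or a non-numeric value. If you would rather skip such rows, one option is to wrap the conversion in Try. This is only a sketch under assumptions: Row is a hypothetical, trimmed-down record with three columns, and the sample lines are made up.

```scala
import scala.util.Try

object SafeParseDemo {
  // Hypothetical, trimmed-down record: only three columns
  final case class Row(category: String, rating: Int, reviews: Int)

  // Returns None instead of throwing when a line has the wrong number
  // of fields or a numeric column does not parse
  def parse(line: String): Option[Row] = {
    val fields = line.split("\\|")
    if (fields.length != 3) None
    else Try(Row(fields(0), fields(1).trim.toInt, fields(2).trim.toInt)).toOption
  }

  // Made-up input: one good row, one bad number, one malformed line
  val lines = List("GAME|4|100", "TOOLS|n/a|20", "broken line")

  // flatMap keeps only the rows that parsed
  val rows: List[Row] = lines.flatMap(l => parse(l))

  def main(args: Array[String]): Unit = rows.foreach(println)
}
```

Whether silently dropping rows is acceptable depends on your data. Counting or logging the rejected lines is usually a good idea.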
Now you can do what you want cleanly:
// Read the data, parse it and group by category
// This gives you a map of categories to a list of apps
val byCategory = Source.fromFile(fileName)
  .getLines()        // iterate over lines, not characters
  .drop(1)           // skip the header row
  .map(AppData(_))
  .toList            // Iterator has no groupBy, so materialize first
  .groupBy(_.category)

// Now, find out max ratings and reviews for each category
// This could be done even nicer with another case class and
// a monoid, but tuple/fold will do too
// It is tempting to use `.mapValues` here, but that's not a good idea
// because .mapValues is LAZY, it will recompute the max every time
// the value is accessed!
val maxes = byCategory.map { case (cat, data) =>
  cat ->
    data.foldLeft(0 -> 0) { case ((maxRatings, maxReviews), in) =>
      (maxRatings max in.rating, maxReviews max in.reviews)
    }
}.withDefault(_ => (0, 0))

// And finally go through your categories, and find best for each,
// that's it!
val bestByCategory = byCategory.map { case (cat, apps) =>
  val (maxRating, maxReviews) = maxes(cat)
  cat -> apps.maxBy(_.priority(maxRating, maxReviews))
}
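Here is the same group, fold and maxBy pipeline as a small self-contained sketch you can run without a file. App is a hypothetical, trimmed-down record (the name field is an assumption, since no such column appears in the header you listed), and the three sample apps are made up.

```scala
object BestByCategoryDemo {
  // Hypothetical, trimmed-down record with only the fields the ranking needs
  final case class App(name: String, category: String, rating: Int, reviews: Int) {
    // Same weighting as in the answer: 40% rating, 60% reviews
    def priority(maxRating: Int, maxReviews: Int): Double =
      if (maxRating == 0 || maxReviews == 0) 0.0
      else (rating * 0.4 / maxRating + reviews * 0.6 / maxReviews) * 100
  }

  // Made-up sample data
  val apps = List(
    App("A", "GAME", 5, 100),
    App("B", "GAME", 4, 1000),
    App("C", "TOOLS", 3, 50)
  )

  // Category -> all apps in that category
  val byCategory: Map[String, List[App]] = apps.groupBy(_.category)

  // Category -> (highest rating, highest review count)
  val maxes: Map[String, (Int, Int)] = byCategory.map { case (cat, data) =>
    cat -> data.foldLeft(0 -> 0) { case ((maxRating, maxReviews), app) =>
      (maxRating max app.rating, maxReviews max app.reviews)
    }
  }

  // Category -> app with the highest priority within that category
  val best: Map[String, App] = byCategory.map { case (cat, data) =>
    val (maxRating, maxReviews) = maxes(cat)
    cat -> data.maxBy(_.priority(maxRating, maxReviews))
  }

  def main(args: Array[String]): Unit =
    best.foreach { case (cat, app) => println(s"$cat -> ${app.name}") }
}
```

In the GAME category, B wins even though A has the higher rating, because reviews carry the larger weight.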
Thanks, this was a big help. Sorry, I can't upvote your answer because I'm a new user.