How do I find outliers using avg and STDEV in Java?
I am filtering a dataset for outliers. (Tags: java, apache-spark, apache-spark-sql)

You asked about Java, which I am not using at the moment, so here is a Scala version; I hope it helps you work out the corresponding Java code. How about the solution below?
// preparing the dataset
val input = spark.
  read.
  text("input.txt").
  as[String].
  filter(line => !line.startsWith("Name")).
  map(_.split("\\W+")).
  withColumn("name", $"value"(0)).
  withColumn("size", $"value"(1) cast "int").
  withColumn("volumes", $"value"(2) cast "int").
  select("name", "size", "volumes")
scala> input.show
+------+----+-------+
| name|size|volumes|
+------+----+-------+
| File1|1030| 107529|
| File2| 997| 106006|
| File3|1546| 112426|
| File4|2235| 117335|
| File5|2061| 115363|
| File6|1875| 114015|
| File7|1237| 110002|
| File8|1546| 112289|
| File9|1030| 107154|
|File10|1339| 110276|
+------+----+-------+
// the final computation
import org.apache.spark.sql.functions._
val (sizeAvg, sizeStddev, volumesAvg, volumesStddev) = input.
  groupBy().
  agg(avg("size"), stddev("size"), avg("volumes"), stddev("volumes")).
  as[(Double, Double, Double, Double)].
  head
val sizeLessThanStddev = col("size") < (sizeAvg - 2 * sizeStddev)
input.filter(sizeLessThanStddev)
That is only the first of the four filter conditions; the remaining three are left as an exercise.

Thanks everyone for the comments. The solution below is for Spark's Java API; if you want a Scala implementation, see Jacek Laskowski's post above. Solution:
//first convert the columns Size and Records to a List<Double>
List<Double> sizeList = dataFrame.select("Size").javaRDD().map(f -> f.getDouble(0)).collect();
List<Double> recordsList = dataFrame.select("Records").javaRDD().map(f -> f.getDouble(0)).collect();
//then convert the lists into JavaDoubleRDD
JavaDoubleRDD size = sparkContext.parallelizeDoubles(sizeList);
JavaDoubleRDD records = sparkContext.parallelizeDoubles(recordsList);
//calculate the mean and stddev using the built in functions:
double sizeMean = size.mean();
double sizeStdev = size.stdev();
double recordsMean = records.mean();
double recordsStdev = records.stdev();
After that, I can finally use these values in the column comparisons.
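Outside Spark, the two statistics can be sketched in plain Java (the sample values below are hypothetical). One caveat worth noting: JavaDoubleRDD.stdev() returns the population standard deviation (divide by n), while Spark SQL's stddev() is the sample standard deviation (divide by n - 1), so the two APIs give slightly different thresholds.

import java.util.Arrays;
import java.util.List;

public class Stats {
    // Arithmetic mean of the values.
    static double mean(List<Double> xs) {
        return xs.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Population standard deviation (divide by n), matching JavaDoubleRDD.stdev();
    // use n - 1 instead if you want Spark SQL's stddev() semantics.
    static double stdev(List<Double> xs) {
        double m = mean(xs);
        double variance = xs.stream()
                .mapToDouble(x -> (x - m) * (x - m))
                .sum() / xs.size();
        return Math.sqrt(variance);
    }

    public static void main(String[] args) {
        // hypothetical stand-ins for the collected "Size" column
        List<Double> size = Arrays.asList(1030.0, 997.0, 1546.0, 2235.0, 2061.0);
        System.out.printf("mean=%.1f stdev=%.1f%n", mean(size), stdev(size));
    }
}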
I hope it is clear what I did. So what is the error? I think it is because AVG and STDEV are aggregate functions, and in SQL you cannot use an aggregate function in a WHERE filter. The only workaround is to put them in a nested query, but I do not know how to do that here. The failure is:

Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(cast(input[1, int, false] as bigint))

How about running each nested query as a standalone query, fetching the result and passing it into the original query? This is what I mean:

val result = session.load("SELECT AVG(size) - 2 * STDEV(size) FROM DATASET")
val query = s"SELECT * FROM DATASET WHERE size < ${result}"
val finalResult = session.load(query)
// the original attempt: building thresholds from aggregate Columns fails inside where()
Column meanSize = functions.mean(size);
Column meanRecords = functions.mean(records);
Column stdSize = functions.stddev(size);
Column stdRecords = functions.stddev(records);
Column lowerSizeThreshold = size.lt((meanSize.minus(stdSize).minus(stdSize)));
Column upperSizeThreshold = size.gt(meanSize.plus(stdSize).plus(stdSize));
Column lowerRecordsThreshold = records.lt(meanRecords.minus(stdRecords).minus(stdRecords));
Column upperRecordsThreshold = records.gt(meanRecords.plus(stdRecords).plus(stdRecords));
Dataset<Row> outliers = dataFrame.where(lowerSizeThreshold.or(upperSizeThreshold).or(lowerRecordsThreshold).or(upperRecordsThreshold));
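The fix for the exception above is to evaluate the aggregates down to plain doubles first, and only then compare, instead of embedding aggregate Columns in where(). The same lower/upper-threshold logic can be sketched in plain Java without Spark (the FileRow type and sample values here are made up for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class OutlierFilter {
    // Hypothetical stand-in for a Spark Row with the two measured columns.
    record FileRow(String name, double size, double records) {}

    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    static double stdev(double[] xs) {
        double m = mean(xs), v = 0;
        for (double x : xs) v += (x - m) * (x - m);
        return Math.sqrt(v / xs.length);
    }

    // Keep rows where either column falls outside mean +/- 2 * stdev,
    // mirroring the OR of the four threshold conditions in the Spark version.
    static List<FileRow> outliers(List<FileRow> rows) {
        double[] sizes = rows.stream().mapToDouble(FileRow::size).toArray();
        double[] recs  = rows.stream().mapToDouble(FileRow::records).toArray();
        double sMean = mean(sizes), sStd = stdev(sizes);
        double rMean = mean(recs),  rStd = stdev(recs);
        return rows.stream()
            .filter(r -> r.size()    < sMean - 2 * sStd || r.size()    > sMean + 2 * sStd
                      || r.records() < rMean - 2 * rStd || r.records() > rMean + 2 * rStd)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileRow> rows = new ArrayList<>();
        for (int i = 1; i <= 10; i++) rows.add(new FileRow("File" + i, 1000, 100000));
        rows.add(new FileRow("Huge", 90000, 100000)); // deliberately extreme size
        for (FileRow r : outliers(rows)) System.out.println(r.name()); // prints only "Huge"
    }
}

In Spark terms this corresponds to what the Scala answer does: compute agg(avg(...), stddev(...)) once, pull the four numbers to the driver with head/first(), and use the resulting literal doubles in the filter condition.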