How to find outliers using avg and STDEV in Java?


I'm filtering a dataset for outliers.

You asked about Java, which I don't use at all at the moment, so here is a Scala version; I hope it helps you work out the corresponding Java version.

How about the solution below?

// preparing the dataset
val input = spark.
  read.
  text("input.txt").
  as[String].
  filter(line => !line.startsWith("Name")).
  map(_.split("\\W+")).
  withColumn("name", $"value"(0)).
  withColumn("size", $"value"(1) cast "int").
  withColumn("volumes", $"value"(2) cast "int").
  select("name", "size", "volumes")
scala> input.show
+------+----+-------+
|  name|size|volumes|
+------+----+-------+
| File1|1030| 107529|
| File2| 997| 106006|
| File3|1546| 112426|
| File4|2235| 117335|
| File5|2061| 115363|
| File6|1875| 114015|
| File7|1237| 110002|
| File8|1546| 112289|
| File9|1030| 107154|
|File10|1339| 110276|
+------+----+-------+

// the final computation
import org.apache.spark.sql.functions._
val (sizeAvg, sizeStddev, volumesAvg, volumesStddev) = input.
  groupBy().
  agg(avg("size"), stddev("size"), avg("volumes"), stddev("volumes")).
  as[(Double, Double, Double, Double)].
  head

val sizeLessThanStddev = col("size") < (sizeAvg - 2 * sizeStddev)
input.filter(sizeLessThanStddev)

This is only the first of the four parts of the filtering operator; the rest is left as a home exercise.
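Since the question asks for Java, here is a minimal sketch of what the complete four-part filter (lower and upper bound on both size and volumes) might look like with the Java Dataset API. It assumes a Dataset<Row> called input with the same columns as the Scala snippet above; the 2-sigma cutoff and all names are illustrative, not taken from the original answer.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Compute all four aggregates in a single pass and pull them out as plain doubles.
Row stats = input.agg(
        avg("size"), stddev("size"),
        avg("volumes"), stddev("volumes"))
    .head();
double sizeAvg = stats.getDouble(0);
double sizeStddev = stats.getDouble(1);
double volumesAvg = stats.getDouble(2);
double volumesStddev = stats.getDouble(3);

// A row is an outlier when either column falls outside avg +/- 2 * stddev.
Column sizeTooSmall = col("size").lt(sizeAvg - 2 * sizeStddev);
Column sizeTooBig = col("size").gt(sizeAvg + 2 * sizeStddev);
Column volumesTooSmall = col("volumes").lt(volumesAvg - 2 * volumesStddev);
Column volumesTooBig = col("volumes").gt(volumesAvg + 2 * volumesStddev);

Dataset<Row> outliers =
    input.filter(sizeTooSmall.or(sizeTooBig).or(volumesTooSmall).or(volumesTooBig));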

Thank you all for the comments.

So this solution is for Spark's Java API. If you want a Scala implementation, see Jacek Laskowski's post above.

Solution:

//first convert the columns Size and Records to a List<Double>
List<Double> sizeList = dataFrame.select("Size").javaRDD().map(f -> f.getDouble(0)).collect();
List<Double> recordsList = dataFrame.select("Records").javaRDD().map(f -> f.getDouble(0)).collect();

//then convert the lists into JavaDoubleRDD
JavaDoubleRDD size = sparkContext.parallelizeDoubles(sizeList);
JavaDoubleRDD records = sparkContext.parallelizeDoubles(recordsList);

//calculate the mean and stddev using the built in functions:
double sizeMean = size.mean();
double sizeStdev = size.stdev();
double recordsMean = records.mean();
double recordsStdev = records.stdev();
After that, I can finally use these values in the column-comparison functions.

I hope it is clear what I did.
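A minimal sketch of that last step, reusing the four doubles computed above; the column names Size and Records and the 2-sigma cutoff are taken from the surrounding thread, the rest is assumed for illustration.

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The thresholds are now ordinary doubles, so Spark no longer has to
// evaluate an aggregate per row inside where().
Dataset<Row> outliers = dataFrame.where(
    col("Size").lt(sizeMean - 2 * sizeStdev)
        .or(col("Size").gt(sizeMean + 2 * sizeStdev))
        .or(col("Records").lt(recordsMean - 2 * recordsStdev))
        .or(col("Records").gt(recordsMean + 2 * recordsStdev)));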

What is the error? I think it happens because AVG and STDEV are aggregate functions, and in SQL you cannot use aggregate functions in a WHERE filter. The only workaround is to use them in a nested query, but I don't know how to do that here. The error is:

Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: avg(cast(input[1, int, false] as bigint))

How about running all the nested queries as standalone queries, fetching the data, and passing it to the original query? This is what I mean:

val result = session.load("SELECT AVG(size) - 2 * STDEV(size) FROM DATASET")
val query = s"SELECT * FROM DATASET WHERE size < ${result}"
val finalResult = session.load(query)

The code that raised the exception was:
// The failing attempt: size and records are Column references to the Size and
// Records columns, so mean/stddev here are unevaluated aggregate expressions.
Column meanSize = functions.mean(size);
Column meanRecords = functions.mean(records);
Column stdSize = functions.stddev(size);
Column stdRecords = functions.stddev(records);

// avg - 2 * stddev and avg + 2 * stddev as lower/upper bounds.
Column lowerSizeThreshold = size.lt((meanSize.minus(stdSize).minus(stdSize)));
Column upperSizeThreshold = size.gt(meanSize.plus(stdSize).plus(stdSize));
Column lowerRecordsThreshold = records.lt((meanRecords.minus(stdRecords).minus(stdRecords)));
Column upperRecordsThreshold = records.gt(meanRecords.plus(stdRecords).plus(stdRecords));

// Spark cannot evaluate the aggregates per row inside where(), which is what
// triggers the UnsupportedOperationException above.
Dataset<Row> outliers = dataFrame.where(lowerSizeThreshold.or(upperSizeThreshold).or(lowerRecordsThreshold).or(upperRecordsThreshold));
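The workaround suggested in the comments (run the aggregate as a standalone query, then substitute the resulting scalar into the outer query) might look roughly like this in Java. The view name DATASET and the SparkSession variable spark are assumptions for illustration; note that Spark SQL's standard-deviation function is spelled STDDEV, not STDEV.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Register the DataFrame as a temporary view so it can be queried with SQL.
dataFrame.createOrReplaceTempView("DATASET");

// Step 1: evaluate the aggregate on its own; the result is a plain double.
double lowerBound = spark
    .sql("SELECT AVG(Size) - 2 * STDDEV(Size) FROM DATASET")
    .head()
    .getDouble(0);

// Step 2: substitute the scalar into the outer query's WHERE clause.
Dataset<Row> belowLower =
    spark.sql("SELECT * FROM DATASET WHERE Size < " + lowerBound);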