Java: getting a distinct count from a dataframe with Apache Spark


I have data like this:

+--------------+---------+-------+---------+
|       dataOne|OtherData|dataTwo|dataThree|
+--------------+---------+-------+---------+
|          Best|     tree|      5|      533|
|            OK|     bush|      e|     3535|
|           MEH|      cow|      -|     3353|
|           MEH|      oak|   none|       12|
+--------------+---------+-------+---------+
I am trying to transform it into:

+--------------+---------+
|       dataOne|    Count|
+--------------+---------+
|          Best|        1|
|            OK|        1|
|           MEH|        2|
+--------------+---------+
I pulled dataOne into a dataframe by itself and showed its contents, to make sure I was grabbing only the dataOne column; no problem there. However, I can't seem to find the correct syntax to turn a SQL query into the data I need. I tried creating the following dataframe from a temporary view created over the whole dataset:

Dataset<Row> dataOneCount = spark.sql("select dataOne, count(*) from dataFrame group by dataOne");
dataOneCount.show();
I also tried applying the countDistinct method from functions():

Column countNum = countDistinct(dataFrame.col("dataOne"));
Dataset<Row> result = dataOneDataFrame.withColumn("count",countNum);
result.show();
But it returns an AnalysisException. I'm still new to Spark, so I'm not sure whether I have something wrong about how/when the countDistinct method is evaluated.
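The AnalysisException is expected here: countDistinct is an aggregate function, so it can only appear inside an aggregation (df.agg(...), df.select(countDistinct(...)), or after a groupBy), not in withColumn. For the desired output table, the per-key count df.groupBy("dataOne").count() is all that is needed. What that aggregation computes can be sketched in dependency-free Java (the list below is just the dataOne column of the sample data; class and variable names are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupCountSketch {
    public static void main(String[] args) {
        // The dataOne column from the sample data
        List<String> dataOne = Arrays.asList("Best", "OK", "MEH", "MEH");

        // What df.groupBy(col("dataOne")).count() computes:
        // one count per distinct key, case preserved
        Map<String, Long> counts = dataOne.stream()
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));

        // Per-key counts: MEH -> 2, Best -> 1, OK -> 1 (map iteration order may vary)
        System.out.println(counts);
    }
}
```

Note that groupBy(...).count() gives a count per key, whereas countDistinct would give the single number of distinct dataOne values (3 here); the question's target table wants the former.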

Edit: to clarify, the first table shown is the result of the dataframe I created by reading a text file and applying a custom schema to it (the fields are all still strings).


Using Scala syntax for convenience. It is very similar to the Java syntax:

// Input data
val df = {
  import org.apache.spark.sql._
  import org.apache.spark.sql.types._
  import scala.collection.JavaConverters._

  val simpleSchema = StructType(
    StructField("dataOne", StringType) ::
    StructField("OtherData", StringType) ::
    StructField("dataTwo", StringType) ::
    StructField("dataThree", IntegerType) :: Nil)

  val data = List(
    Row("Best", "tree", "5", 533),
    Row("OK", "bush", "e", 3535),
    Row("MEH", "cow", "-", 3353),
    Row("MEH", "oak", "none", 12)
  )

  spark.createDataFrame(data.asJava, simpleSchema)
}

df.show
I can submit the Java code from the question as shown below, against the four-line data file on S3, and it runs fine:

$SPARK_HOME/bin/spark-submit \
  --class sparktest.FromStackOverflow \
  --packages "org.apache.hadoop:hadoop-aws:2.7.3" \
  target/scala-2.11/sparktest_2.11-1.0.0-SNAPSHOT.jar "s3a://my-bucket-name/sample.txt"

Comments:
- I've tried this approach in my Java program and it returns java.lang.ArrayIndexOutOfBoundsException: 11
- Can you reproduce it with a small self-contained example like the one I showed? What version of Spark are you using? Have you tried other versions?
- Added the complete code. I'm using Spark 2.1 and would rather not fall back to an earlier Spark version.
- I used your Java code. I had to change "command" to "dataOne", since there is no column named "command" in the dataframe. Other than that, it works perfectly against a local Spark 2.1.1 installation.
- Sorry, I should have been more explicit: I'm trying to deploy this code to a local cluster, pulling a file stored in the Hadoop file system. Maybe an environment issue is causing the problem, though it puzzles me because simple SQL commands, such as select * from some tempview, do run.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class LogFileReader {  // class name illustrative

    public static void main(String[] args) {

        SparkSession spark = SparkSession
                .builder()
                .appName("Log File Reader")
                .getOrCreate();

        // args[0] is the text file location
        JavaRDD<String> logsRDD = spark.sparkContext()
                .textFile(args[0], 1)
                .toJavaRDD();

        String schemaString = "dataOne OtherData dataTwo dataThree";

        List<StructField> fields = new ArrayList<>();
        String[] fieldName = schemaString.split(" ");

        for (String field : fieldName) {
            fields.add(DataTypes.createStructField(field, DataTypes.StringType, true));
        }
        StructType schema = DataTypes.createStructType(fields);

        JavaRDD<Row> rowRDD = logsRDD.map((Function<String, Row>) record -> {
            String[] attributes = record.split(" ");
            return RowFactory.create(attributes[0], attributes[1], attributes[2], attributes[3]);
        });

        Dataset<Row> dF = spark.createDataFrame(rowRDD, schema);

        // first attempt
        dF.groupBy(col("dataOne")).count().show();

        // trying with a sql statement
        // (the original query said "command" here, but there is no such column;
        //  it has to be "dataOne")
        dF.createOrReplaceTempView("view");
        dF.sparkSession().sql("select dataOne, count(*) from view group by dataOne").show();
    }
}
The four-line data file (sample.txt):

Best tree 5 533
OK bush e 3535
MEH cow - 3353
MEH oak none 12
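One plausible cause of the java.lang.ArrayIndexOutOfBoundsException reported in the comments (an assumption, since the failing file on the cluster isn't shown) is the record.split(" ") call: it splits only on single spaces, so lines padded into aligned columns yield extra empty tokens, and lines with fewer than four tokens make attributes[3] throw. A small demonstration:

```java
public class SplitPitfall {
    public static void main(String[] args) {
        // A line padded with runs of spaces, as in a column-aligned file
        String padded = "Best   tree  5  533";

        // split(" ") keeps empty tokens between consecutive spaces...
        String[] naive = padded.split(" ");
        System.out.println(naive.length);   // prints 8, not 4

        // ...and a short or blank line yields fewer than four tokens,
        // so indexing attributes[3] throws ArrayIndexOutOfBoundsException:
        String[] shortLine = "Best tree".split(" ");
        System.out.println(shortLine.length);   // prints 2

        // Splitting on a whitespace run is more robust:
        String[] robust = padded.split("\\s+");
        System.out.println(robust.length);  // prints 4
    }
}
```

If the cluster file is whitespace-padded while the local test file uses single spaces, that would explain why the same code works locally but fails on the cluster.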
df.show
+-------+---------+-------+---------+
|dataOne|OtherData|dataTwo|dataThree|
+-------+---------+-------+---------+
|   Best|     tree|      5|      533|
|     OK|     bush|      e|     3535|
|    MEH|      cow|      -|     3353|
|    MEH|      oak|   none|       12|
+-------+---------+-------+---------+
df.groupBy(col("dataOne")).count().show()
+-------+-----+
|dataOne|count|
+-------+-----+
|    MEH|    2|
|   Best|    1|
|     OK|    1|
+-------+-----+
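As a cluster-free sanity check, tallying the first field of the four sample lines in plain Java reproduces the counts above (names are illustrative; this is not Spark code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CountCheck {
    public static void main(String[] args) {
        String[] lines = {
            "Best tree 5 533",
            "OK bush e 3535",
            "MEH cow - 3353",
            "MEH oak none 12"
        };

        // Tally the first whitespace-separated field of each line,
        // mirroring df.groupBy(col("dataOne")).count()
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String key = line.split("\\s+")[0];
            counts.merge(key, 1, Integer::sum);
        }

        System.out.println(counts);   // prints {Best=1, OK=1, MEH=2}
    }
}
```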