Java: How do I group data by a field in Spark? (java, apache-spark, cassandra)


I want to read two columns from a database, group them by the first column, and then insert the result into another table using Spark. My program is written in Java. I tried the following:

public static void aggregateSessionEvents(org.apache.spark.SparkContext sparkContext) {
    com.datastax.spark.connector.japi.rdd.CassandraJavaPairRDD<String, String> logs = javaFunctions(sparkContext)
            .cassandraTable("dove", "event_log", mapColumnTo(String.class), mapColumnTo(String.class))
            .select("session_id", "event");
    logs.groupByKey();
    com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions(logs).writerBuilder("dove", "event_aggregation", null).saveToCassandra();
    sparkContext.stop();
}
This gives me the error:

The method cassandraTable(String, String, RowReaderFactory<T>) in the type SparkContextJavaFunctions is not applicable for the arguments (String, String, RowReaderFactory<String>, mapColumnTo(String.class))
My dependencies are:

<dependencies>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>2.0.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>2.0.1</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.6.2</version>
</dependency>
</dependencies>
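One thing worth checking as an aside (not raised in the original answer): spark-cassandra-connector 1.6.x targets Spark 1.6, while spark-core here is 2.0.1. If runtime problems appear after the compile error is fixed, the connector's 2.0.x line is the series matched to Spark 2.0; a possible pom fragment, with the exact version to be verified against the connector's compatibility table:

```xml
<!-- Hypothetical alternative: connector series matched to Spark 2.0.x.
     Verify the exact version against the connector's compatibility table. -->
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>2.0.0</version>
</dependency>
```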

How can I fix this?

Change this:

.cassandraTable("dove", "event_log", mapColumnTo(String.class), mapColumnTo(String.class))
to:

.cassandraTable("dove", "event_log", mapColumnTo(String.class))
You are passing an extra argument.

To group data by a field, follow these steps:

  • Retrieve the data into a JavaRDD for that table
  • Extract the required columns into a pair, with the grouping field as the key and the remaining data as the value
  • Use reduceByKey to aggregate the values as required
  • The result can then be inserted into another table or used for further processing

    public static void aggregateSessionEvents(SparkContext sparkContext) {
        // Read the table into a JavaRDD of Data objects
        JavaRDD<Data> datas = javaFunctions(sparkContext).cassandraTable("test", "data",
                mapRowTo(Data.class));
        // Build (key, value) pairs from the columns of interest
        JavaPairRDD<String, String> pairDatas = datas
                .mapToPair(data -> new Tuple2<>(data.getKey(), data.getValue()));
        // reduceByKey returns a new RDD; the result must be captured to be used
        JavaPairRDD<String, String> aggregated = pairDatas
                .reduceByKey((value1, value2) -> value1 + "," + value2);
        sparkContext.stop();
    }
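For intuition, the per-key behavior of reduceByKey with a comma-joining combiner can be sketched in plain Java, without Spark; the session/event sample data below is made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReduceByKeySketch {
    // Mimics reduceByKey((v1, v2) -> v1 + "," + v2) on a local list of pairs
    static Map<String, String> aggregate(String[][] pairs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] p : pairs) {
            // merge applies the combiner only when the key is already present,
            // just as reduceByKey combines values that share a key
            out.merge(p[0], p[1], (v1, v2) -> v1 + "," + v2);
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] events = {
            {"session1", "login"}, {"session1", "click"}, {"session2", "login"}
        };
        System.out.println(aggregate(events));
        // {session1=login,click, session2=login}
    }
}
```

The real reduceByKey additionally combines values in parallel across partitions, which is why the combiner function must be associative.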
    
    
    This gives me an error:
    Type mismatch: cannot convert from CassandraTableScanJavaRDD to CassandraJavaPairRDD