Java Spark - word count with sorting (not sorting)

I am learning Spark and tried to extend the WordCount example to sort the words by their number of occurrences. The problem is that after running the code I get unsorted results:

(708,word1)
(46,word2)
(65,word3)
So the sorting seems to fail for some reason. The wordSortedByCount.first() command has the same effect, and it also limits the execution to only one thread.

import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

public class JavaWordCount2 {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCountAndSort");
        int numOfKernels = 8;
        sparkConf.setMaster("local[" + numOfKernels + "]");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

        JavaRDD<String> lines = ctx.textFile("data.csv", 1);
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line
                .split("[,; :\\.]")));
        words = words.flatMap(line -> Arrays.asList(line.replaceAll("[\"\\(\\)]", "").toLowerCase()));

        // sum words
        JavaPairRDD<String, Integer> counts = words.mapToPair(
                w -> new Tuple2<String, Integer>(w, 1)).reduceByKey(
                (x, y) -> x + y);

        // minimum 5 occurrences
        // counts = counts.filter(s -> s._2 > 5);
        counts = counts.filter(new Function<Tuple2<String,Integer>, Boolean>() {
            @Override
            public Boolean call(Tuple2<String, Integer> v1) throws Exception {
                return v1._2 > 5;
            }
        });

        // to enable sorting by value (count) and not key -> value-to-key conversion pattern
        // setting value to null, since it won't be used anymore
        JavaPairRDD<Tuple2<Integer, String>, Integer> countInKey = counts.mapToPair(a -> new Tuple2(new Tuple2<Integer, String>(a._2, a._1), null));

        // sort by num of occurrences
        JavaPairRDD<Tuple2<Integer, String>, Integer> wordSortedByCount = countInKey.sortByKey(new TupleComparator(), true);

        // print result    
        List<Tuple2<Tuple2<Integer, String>, Integer>> output = wordSortedByCount.take(10);
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1());
        }
        ctx.stop();
    }
}

The comparator class:

import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;
public class TupleComparator implements Comparator<Tuple2<Integer, String>>,
        Serializable {
    @Override
    public int compare(Tuple2<Integer, String> tuple1,
            Tuple2<Integer, String> tuple2) {
        return tuple1._1 < tuple2._1 ? 0 : 1;
    }
}

Can anyone tell me what is wrong with the code?

The first problem in your code is in the comparator: you return either 0 or 1, while the compare method should return a negative value whenever the first element should come before the second. So change it to:

@Override
public int compare(Tuple2<Integer, String> tuple1,
        Tuple2<Integer, String> tuple2) {
    return tuple1._1 - tuple2._1;
}
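As an aside, the subtraction idiom is safe here because word counts are non-negative, but the generally recommended way to express the same ordering is Integer.compare, which also behaves correctly for negative inputs:

@Override
public int compare(Tuple2<Integer, String> tuple1,
        Tuple2<Integer, String> tuple2) {
    // Integer.compare returns a negative, zero, or positive value,
    // exactly what Comparator requires, with no overflow risk
    return Integer.compare(tuple1._1, tuple2._1);
}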
Also, you should set the second argument of sortByKey to false, otherwise you will get an ascending ordering, i.e. from the lowest to the highest, which I believe is the opposite of what you want.
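
Putting both fixes together, the sorting step from the question would look like this (a minimal sketch reusing the countInKey variable and the corrected TupleComparator from above):

// comparator fixed to return negative/positive values,
// ascending set to false for a descending (highest count first) order
JavaPairRDD<Tuple2<Integer, String>, Integer> wordSortedByCount =
        countInKey.sortByKey(new TupleComparator(), false);

// take(10) now returns the ten most frequent words
List<Tuple2<Tuple2<Integer, String>, Integer>> output = wordSortedByCount.take(10);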