Map-reduce with Java/Spark: java.lang.StringIndexOutOfBoundsException: String index out of range: 0

I am trying to write a Spark application that outputs the number of words starting with each letter. I am getting a string-index-out-of-range error. Any suggestions, or am I not approaching this the right way?

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class Main {
    public static void main(String[] args) throws Exception{

        //Tell spark to access a cluster
        SparkConf conf = new SparkConf().setAppName("App").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.printf("%d lines\n", sc.textFile("pg100.txt").count());


        //MARK: Mapping
        //Read target file into an Resilient Distributed Dataset(RDD)
        JavaRDD<String> lines = sc.textFile("pg100.txt");

        //Split lines into individual words by converting each line into an array of words
        //Treat all words as lowercase
        //Ignore non-alphabetic characters
        JavaRDD<String> words = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .map(word -> word.replaceAll("[^a-zA-Z0-9_-]", "")
                                 .replaceAll("\\.", "")
                                 .toLowerCase());

        //MARK: Sorting
        //Count the total number of words that start with each letter
        JavaPairRDD<Character, Integer> letters = words.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));

        //MARK: Reducing
        //Get count of number of instances of each word
        JavaPairRDD<Character, Integer> counts = letters.reduceByKey((n1,n2) -> n1 + n2);

        counts.saveAsTextFile("result");
        sc.stop();

    }
}

I suspect that some words consist only of characters that are stripped out by this line:

JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator()).map(line -> line.replaceAll("[^a-zA-Z0-9_-]","").replaceAll("\\.", "").toLowerCase());
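
That suspicion is easy to confirm without Spark: split(" ") emits an empty token for every run of consecutive spaces, and the replaceAll cleanup reduces punctuation-only tokens to "", at which point charAt(0) raises exactly the reported exception. A minimal sketch (the sample line is made up, not taken from pg100.txt):

```java
public class EmptyTokenDemo {
    public static void main(String[] args) {
        // Empty tokens come from consecutive spaces; punctuation-only
        // tokens are reduced to "" by the cleanup regex:
        String line = "to be;  &c.";
        for (String token : line.split(" ")) {
            String cleaned = token.replaceAll("[^a-zA-Z0-9_-]", "").toLowerCase();
            System.out.println("[" + cleaned + "]");
        }
        // prints: [to] [be] [] [c]

        // charAt(0) on such an empty string is the reported error:
        try {
            "".charAt(0);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("caught: " + e);
        }
    }
}
```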

Comment: Your snippet contains 31 lines, so it is not clear what "line 33" refers to. You have also posted Java code but tagged the question scala; please update your question.

Reply: You're right. When I print the RDD, it does contain empty strings, but I can't seem to filter them out with the filter method:
words.filter(line -> !line.equals(""));
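
A likely reason the filter appears to have no effect: RDD transformations do not modify the RDD in place; filter returns a new RDD, and the call above discards it. The Spark-side fix (a sketch, assuming the rest of the pipeline is unchanged) is to capture the result, e.g. `words = words.filter(w -> !w.isEmpty());` before the mapToPair. The same capture-the-result rule can be demonstrated without a Spark dependency using plain Java streams:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterDemo {
    public static void main(String[] args) {
        List<String> words = List.of("the", "", "quick", "");

        // Calling filter() and discarding the result, as in the snippet
        // above, changes nothing -- the source is not modified in place:
        words.stream().filter(w -> !w.isEmpty());
        System.out.println(words.size()); // 4

        // Capturing the returned value is what actually drops the empties:
        List<String> nonEmpty = words.stream()
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
        System.out.println(nonEmpty.size()); // 2
    }
}
```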