Map reduce with Java: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
Tags: java, apache-spark, iterator, stringindexoutofbounds

I am trying to write a Spark application that outputs the number of words that begin with each letter. I am getting a string index out of range error. Any suggestions, or am I not approaching this problem the right way?
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class Main {
    public static void main(String[] args) throws Exception {
        // Configure Spark (the "local" master runs the job on this machine)
        SparkConf conf = new SparkConf().setAppName("App").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.printf("%d lines\n", sc.textFile("pg100.txt").count());

        // MARK: Mapping
        // Read the target file into a Resilient Distributed Dataset (RDD)
        JavaRDD<String> lines = sc.textFile("pg100.txt");
        // Split lines into individual words by converting each line into an array of words
        // Treat all words as lowercase
        // Ignore non-alphabetic characters
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator()).map(line -> line.replaceAll("[^a-zA-Z0-9_-]","").replaceAll("\\.", "").toLowerCase());

        // MARK: Sorting
        // Pair each word with its first letter (this is where the exception is thrown)
        JavaPairRDD<Character, Integer> letters = words.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));

        // MARK: Reducing
        // Sum the counts for each starting letter
        JavaPairRDD<Character, Integer> counts = letters.reduceByKey((n1,n2) -> n1 + n2);
        counts.saveAsTextFile("result");
        sc.stop();
    }
}
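I suspect that some words consist entirely of characters that are stripped out by the following line, which leaves empty strings behind: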
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator()).map(line -> line.replaceAll("[^a-zA-Z0-9_-]","").replaceAll("\\.", "").toLowerCase());
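To illustrate where the empty strings can come from, here is a minimal plain-Java sketch (no Spark required) that applies the same split and replaceAll steps to a sample line. The class name EmptyTokenDemo, the helper methods, and the sample sentence are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class EmptyTokenDemo {
    // Same cleaning step as the map() call in the question.
    static String clean(String token) {
        return token.replaceAll("[^a-zA-Z0-9_-]", "").replaceAll("\\.", "").toLowerCase();
    }

    // Splits a line on single spaces and cleans every token, mirroring flatMap + map.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String t : line.split(" ")) {
            out.add(clean(t));
        }
        return out;
    }

    public static void main(String[] args) {
        // A double space, "..." and "!" each end up as an empty token.
        System.out.println(tokens("To be,  or not ... !"));
        // A blank line also yields one empty token, because "".split(" ") returns [""].
        System.out.println(tokens("").size());
    }
}
```

Calling charAt(0) on any of those empty tokens throws StringIndexOutOfBoundsException with index 0, which matches the error in the question. Blank lines and runs of spaces are enough to trigger it, even before punctuation-only tokens are considered.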
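Your code snippet contains 31 lines, so it is unclear what "line 33" refers to. You also posted Java code with the scala tag. Please update your question.
You are right. When I output the RDD, it contains empty strings. However, I have not been able to filter out the empty strings with the filter method: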
words.filter(line -> !line.equals(""));
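One likely reason the filter appears to do nothing: RDDs are immutable, so filter does not change words. It returns a new RDD, and the statement above discards that result. The sketch below shows the intended logic in plain Java so it can be run without Spark. The class name FirstLetterCount and the sample data are invented for illustration, and the Spark lines in the comment are an untested assumption about how the fix would look in the original job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FirstLetterCount {
    // Counts words by first letter, skipping empty strings before calling charAt(0).
    static Map<Character, Integer> countByFirstLetter(List<String> words) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // the guard that the original pipeline is missing
            }
            // Same role as reduceByKey((n1, n2) -> n1 + n2)
            counts.merge(w.charAt(0), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "", "or", "not", "", "to");
        System.out.println(countByFirstLetter(words));
    }
}

// Assumed Spark equivalent (not run here): keep the RDD that filter returns
// and build the pairs from it instead of from words.
//   JavaRDD<String> nonEmpty = words.filter(w -> !w.isEmpty());
//   JavaPairRDD<Character, Integer> letters =
//       nonEmpty.mapToPair(w -> new Tuple2<>(w.charAt(0), 1));
```

The key point is that the filtered result has to be assigned or chained into the next step; calling filter on its own line and then continuing to use words leaves the empty strings in place.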