Reading columns from a CSV with Java Spark
I am trying to read a CSV file with Java and Spark. This is what I do at the moment:
String master = "local[2]";
String csvInput = "/home/username/Downloads/countrylist.csv";
String csvOutput = "/home/username/Downloads/countrylist";

JavaSparkContext sc = new JavaSparkContext(master, "loadwholecsv",
        System.getenv("SPARK_HOME"), System.getenv("JARS"));
JavaRDD<String> csvData = sc.textFile(csvInput, 1);

// Split every line on commas (with optional surrounding whitespace)
JavaRDD<List<String>> lines = csvData.map(new Function<String, List<String>>() {
    @Override
    public List<String> call(String s) {
        return new ArrayList<String>(Arrays.asList(s.split("\\s*,\\s*")));
    }
});
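As a side note, splitting on "\\s*,\\s*" works for simple input like the file above, but it is not a full CSV parser: a quoted field that contains a comma gets split apart. A small plain-Java check (no Spark needed; the class name is mine, not from the post):

```java
import java.util.Arrays;
import java.util.List;

public class SplitCheck {
    public static void main(String[] args) {
        // Same regex as in the map function above: comma with optional surrounding whitespace
        List<String> fields = Arrays.asList("one, two, three".split("\\s*,\\s*"));
        System.out.println(fields);  // [one, two, three]

        // A quoted field with an embedded comma is split apart,
        // so input like this needs a real CSV parser instead
        List<String> broken = Arrays.asList("\"a, b\", c".split("\\s*,\\s*"));
        System.out.println(broken.size());  // 3, not 2
    }
}
```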
The lines of my CSV file look like this:

one, two, three
four, five, six
seven, eight, nine

and the resulting RDD of lists looks like this:

[one, two, three]
[four, five, six]
[seven, eight, nine]
But what I actually want is this:
[one, four, seven]
[two, five, eight]
[three, six, nine]
To perform the matrix transposition (which is essentially what is being asked here) in a MapReduce fashion, you can proceed through the steps shown below.
Comments on the question:

- So, what is the expected type of the resulting RDD? RDD<List<String>>?
- It is RDD<List<String>>, which is what you already have. What needs to change?
- As I said, I want the columns in the RDD, not the rows. The original data is "1,1,uno\n2,2,dos\n3,3,tres", my current RDD is ["1","1","uno"], ["2","2","dos"], ["3","3","tres"], and I want an RDD like ["1","2","3"], ["1","2","3"], ["uno","dos","tres"]. Basically a transposed RDD.
- How can I do that for the lists inside the RDD? Using zipWithIndex I get a tuple like (list, index). I am a bit confused about how to get [(element, index), (element, index), ...].
- @progNewFag How would you turn the list (A, B, C) into the list ((A, 1), (B, 2), (C, 3))? Hint: the first two steps are pure Java, not Spark.
- But I don't know how to do the grouping across the lists afterwards. Can you help me?
- @progNewFag I am a bit confused, because I don't have a JavaPairRDD. I only have a JavaRDD, so how do I group there?
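The hint from the comments, turning List(A, B, C) into List((A, 1), (B, 2), (C, 3)) in plain Java, can be sketched like this (class and method names are illustrative, not from the post):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ZipWithIndex {
    // Pair each element with its (1-based) position in the list
    static <T> List<Map.Entry<T, Integer>> zipWithIndex(List<T> list) {
        List<Map.Entry<T, Integer>> zipped = new ArrayList<>();
        for (int i = 0; i < list.size(); i++) {
            zipped.add(new SimpleEntry<>(list.get(i), i + 1));
        }
        return zipped;
    }

    public static void main(String[] args) {
        System.out.println(zipWithIndex(Arrays.asList("A", "B", "C")));
        // [A=1, B=2, C=3]
    }
}
```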
The transposition, step by step. Starting from the row-wise RDD, the goal is:

[one, four, seven]
[two, five, eight]
[three, six, nine]

Step 1: tag every element with its row and column index, producing one (rowIndex, columnIndex, value) triple per element:

[(1,1,one), (1,2,two), (1,3,three)]
[(2,1,four), (2,2,five), (2,3,six)]
[(3,1,seven), (3,2,eight), (3,3,nine)]

Step 2: re-key each triple by its column index:

[(1,(1,1,one)), (2,(1,2,two)), (3,(1,3,three))]
[(1,(2,1,four)), (2,(2,2,five)), (3,(2,3,six))]
[(1,(3,1,seven)), (2,(3,2,eight)), (3,(3,3,nine))]

Step 3: group by that key; note that the triples inside each group arrive in no particular order:

[(1,[(3,1,seven), (1,1,one), (2,1,four)])]
[(2,[(1,2,two), (3,2,eight), (2,2,five)])]
[(3,[(2,3,six), (1,3,three), (3,3,nine)])]

Step 4: sort each group by row index and strip the indices:

[ one, four, seven ]
[ two, five, eight ]
[ three, six, nine ]
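Put together, the four steps can be tried out without a Spark cluster at all. The following plain-Java sketch (class and method names are my own, not from the post) mirrors the same index, re-key, group, and sort pipeline on an in-memory list of rows:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class TransposeSketch {
    // Mirrors the Spark steps: tag each value with (row, column),
    // group by column index, then sort each group by row index.
    static List<List<String>> transpose(List<List<String>> rows) {
        // column index -> (row index -> value); TreeMaps keep both sorted
        TreeMap<Integer, TreeMap<Integer, String>> byColumn = new TreeMap<>();
        for (int r = 0; r < rows.size(); r++) {            // step 1: row index
            List<String> row = rows.get(r);
            for (int c = 0; c < row.size(); c++) {         // step 1: column index
                byColumn.computeIfAbsent(c, k -> new TreeMap<>())  // steps 2+3: key and group by column
                        .put(r, row.get(c));
            }
        }
        // step 4: emit each column's values in row order, dropping the indices
        List<List<String>> columns = new ArrayList<>();
        for (TreeMap<Integer, String> col : byColumn.values()) {
            columns.add(new ArrayList<>(col.values()));
        }
        return columns;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("one", "two", "three"),
                Arrays.asList("four", "five", "six"),
                Arrays.asList("seven", "eight", "nine"));
        System.out.println(transpose(rows));
        // [[one, four, seven], [two, five, eight], [three, six, nine]]
    }
}
```

In Spark the same shape falls out of mapToPair (or flatMapToPair) keyed by column index, groupByKey, and a per-group sort; the in-memory version above only illustrates the logic.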
Alternatively, with the Dataset API you can read the CSV file directly:

SparkSession spark = SparkSession.builder()
        .appName("csvReader")
        .master("local[2]")
        .config("com.databricks.spark.csv", "some-value")
        .getOrCreate();

String path = "C://Users//U6048715//Desktop//om.csv";

Dataset<org.apache.spark.sql.Row> df = spark.read().csv(path);
df.show();