Reading columns from a CSV with Java Spark


I am trying to read a CSV file with Java and Spark.

This is what I do at the moment:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    String master = "local[2]";
    String csvInput = "/home/username/Downloads/countrylist.csv";
    String csvOutput = "/home/username/Downloads/countrylist";

    JavaSparkContext sc = new JavaSparkContext(master, "loadwholecsv", System.getenv("SPARK_HOME"), System.getenv("JARS"));

    // Read the file into a single partition and split every line on commas,
    // trimming whitespace around the separators.
    JavaRDD<String> csvData = sc.textFile(csvInput, 1);
    JavaRDD<List<String>> lines = csvData.map(new Function<String, List<String>>() {
        @Override
        public List<String> call(String s) {
            return new ArrayList<String>(Arrays.asList(s.split("\\s*,\\s*")));
        }
    });
So I have all the lines of the CSV in my RDD. For a CSV file like

one, two, three
four, five, six
seven, eight, nine

the rows of my RDD look like this:

[one, two, three]
[four, five, six]
[seven, eight, nine]
But this is what I want:

[one, four, seven]
[two, five, eight]
[three, six, nine]

To perform a map-reduce-style matrix transpose (which is essentially what is being asked for here), you can work through the following steps; a Java sketch follows the list:

  • Convert the rows into indexed tuples (hint: use zipWithIndex and map)

  • Add the column index as a key to each tuple (hint: use map)

  • Group by key

  • Re-sort the values into row order and strip the index artifacts (hint: use map)
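
A minimal sketch of those four steps, assuming `lines` is the JavaRDD<List<String>> built in the question and the Spark 2.x Java API (where flatMapToPair returns an Iterator); the sortByKey call is an extra touch to keep the columns in left-to-right order and is not part of the four steps above:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    JavaRDD<List<String>> columns = lines
            // Step 1: attach a row index to every row: ([one, two, three], 0)
            .zipWithIndex()
            // Step 2: emit one (columnIndex, (rowIndex, value)) pair per cell
            .flatMapToPair(rowWithIndex -> {
                List<String> row = rowWithIndex._1();
                long rowIndex = rowWithIndex._2();
                List<Tuple2<Integer, Tuple2<Long, String>>> cells = new ArrayList<>();
                for (int col = 0; col < row.size(); col++) {
                    cells.add(new Tuple2<>(col, new Tuple2<>(rowIndex, row.get(col))));
                }
                return cells.iterator();
            })
            // Step 3: group all cells that share a column index
            .groupByKey()
            // Keep the columns in their original left-to-right order
            .sortByKey()
            // Step 4: restore row order inside each column, then drop the indices
            .map(column -> {
                List<Tuple2<Long, String>> cells = new ArrayList<>();
                column._2().forEach(cells::add);
                cells.sort(Comparator.comparingLong(c -> c._1()));
                return cells.stream().map(Tuple2::_2).collect(Collectors.toList());
            });

On the example matrix, collecting `columns` yields [one, four, seven], [two, five, eight], [three, six, nine].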

    
From the comments under the question:

  • So what is the type of the expected RDD? RDD<List<String>>?

  • It is RDD<List<String>>.

  • So that is what you already have. What needs to change?

  • As I said, I want the columns in the RDD, not the rows.

  • Because your original data is "1,1,uno", "2,2,dos", "3,3,tres" (one record per line), your current RDD is ["1","1","uno"], ["2","2","dos"], ["3","3","tres"], and you want an RDD ["1","2","3"], ["1","2","3"], ["uno","dos","tres"], so basically a transposed RDD?

  • How can I do that on the lists inside the RDD? By using zipWithIndex I get a tuple like (list, index). I am a bit confused about how to get ([(element, index), (element, index)]…

  • @progNewFag how would you turn List(A, B, C) into List((A,1), (B,2), (C,3))? Hint: the first two steps are pure Java, not Spark. (A tiny pure-Java sketch follows these comments.)

  • But I don't know how to do the grouping over the lists. Can you help me?

  • @progNewFag I am a bit confused, because I don't have a JavaPairRDD; I only have a JavaRDD. How can I group there?
Worked through on the example matrix, the four steps look like this. The target columns:

    [one, four, seven]
    [two, five, eight]
    [three, six, nine]

After step 1, every cell carries its (row, column) index:

    [(1,1,one), (1,2,two), (1,3,three)]
    [(2,1,four), (2,2,five), (2,3,six)]
    [(3,1,seven), (3,2,eight), (3,3,nine)]

After step 2, the column index becomes the key of each tuple:

    [(1,(1,1,one)), (2,(1,2,two)), (3,(1,3,three))]
    [(1,(2,1,four)), (2,(2,2,five)), (3,(2,3,six))]
    [(1,(3,1,seven)), (2,(3,2,eight)), (3,(3,3,nine))]

After step 3, grouping by key collects each column (in no particular order):

    [(1,[(3,1,seven), (1,1,one), (2,1,four)])]
    [(2,[(1,2,two), (3,2,eight), (2,2,five)])]
    [(3,[(2,3,six), (1,3,three), (3,3,nine)])]

After step 4, re-sorting by row index and stripping the indices yields the columns:

    [ one, four, seven ]
    [ two, five, eight ]
    [ three, six, nine ]
    
An alternative that uses the Spark 2.x SparkSession API:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Spark 2.x reads CSV natively; no external csv package is required.
    SparkSession spark = SparkSession.builder().appName("csvReader").master("local[2]").getOrCreate();

    String path = "C://Users//U6048715//Desktop//om.csv";

    Dataset<Row> df = spark.read().csv(path);
    df.show();
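
df.show() prints the parsed file as a table; for a header-less CSV, Spark assigns the default column names _c0, _c1, _c2, and so on. Since the question is about reading columns, a single column can then be selected directly. A small sketch (the column name _c0 is the Spark default; the variable name is made up):

    import java.util.List;

    import org.apache.spark.sql.Encoders;

    // Show only the first column of the CSV
    df.select("_c0").show();

    // Or bring one column back to the driver as a plain Java list
    List<String> firstColumn = df.select("_c0")
            .as(Encoders.STRING())
            .collectAsList();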