Apache Spark with a custom InputFormat for HadoopRDD


hadoop apache-spark


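I am currently working with Apache Spark. I have implemented a custom InputFormat for Apache Hadoop that reads key-value records through TCP sockets. I would like to port this code to Apache Spark and use it with the hadoopRDD() function. My Apache Spark code is as follows: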

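import java.util.regex.Pattern;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

import scala.Tuple2;

// CustomConf and CustomInputFormat are my own classes (assumed to be in the same package).
public final class SparkParallelDataLoad {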

    public static void main(String[] args) {
        int iterations = 100;
        String dbNodesLocations = "";
        if(args.length < 3) {
            System.err.printf("Usage ParallelLoad <coordinator-IP> <coordinator-port> <numberOfSplits>\n");
            System.exit(1);
        }
        JobConf jobConf = new JobConf();
        jobConf.set(CustomConf.confCoordinatorIP, args[0]);
        jobConf.set(CustomConf.confCoordinatorPort, args[1]);
        jobConf.set(CustomConf.confDBNodesLocations, dbNodesLocations);

        int numOfSplits = Integer.parseInt(args[2]);

        CustomInputFormat.setCoordinatorIp(args[0]);
        CustomInputFormat.setCoordinatorPort(Integer.parseInt(args[1]));

        SparkConf sparkConf = new SparkConf().setAppName("SparkParallelDataLoad");

        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        JavaPairRDD<LongWritable, Text> records = sc.hadoopRDD(jobConf, 
                CustomInputFormat.class, LongWritable.class, Text.class, 
                numOfSplits);

        JavaRDD<LabeledPoint> points = records.map(new Function<Tuple2<LongWritable, Text>, LabeledPoint>() {

            private final Log log = LogFactory.getLog(Function.class);
            /**
             * 
             */
            private static final long serialVersionUID = -1771348263117622186L;

            private final Pattern SPACE = Pattern.compile(" ");
            @Override
            public LabeledPoint call(Tuple2<LongWritable, Text> tuple)
                    throws Exception {
                if(tuple == null || tuple._1() == null || tuple._2() == null)
                    return null;
                double y = Double.parseDouble(Long.toString(tuple._1.get()));
                String[] tok = SPACE.split(tuple._2.toString());
                double[] x = new double[tok.length];
                for (int i = 0; i < tok.length; ++i) {
                    if (!tok[i].isEmpty())
                        x[i] = Double.parseDouble(tok[i]);
                }
                return new LabeledPoint(y, Vectors.dense(x));
            }

        });

        System.out.println("Number of records: " + points.count());
        LinearRegressionModel model = LinearRegressionWithSGD.train(points.rdd(), iterations);
        System.out.println("Model weights: " + model.weights());

        sc.stop();
    }
}


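In my project I also have to decide which Spark worker connects to which data source (a kind of "matchmaking" process with a 1:1 relation). Therefore, I create as many InputSplits as there are data sources, so that my data is sent to the SparkContext in parallel. My questions are the following:

1. Does the result of the method InputSplit.getLength() affect how many records a RecordReader returns? In detail, I saw in a test run that the job ended after returning only one record, apparently only because my CustomInputSplit.getLength() function returns 0.

2. In the Apache Spark context, is the number of workers equal to the number of InputSplits produced by my InputFormat, at least while the records.map() function call is executed?

The answer to question 2 above is really important for my project.

Thank you,
Nick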
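Yes. Spark's sc.hadoopRDD creates an RDD with as many partitions as InputFormat.getSplits reports.

The last argument of hadoopRDD, named minPartitions (numOfSplits in your code), is passed to InputFormat.getSplits as a hint. However, the number of splits that getSplits returns is respected whether it is larger or smaller than that hint. See the HadoopRDD source code for details.

I think that in many places in your question you wrote InputSplit.getLength where you meant InputFormat.getSplits. Apologies if I have misunderstood.

I had to do something similar, and I have a sample project on GitHub that you can check.

Thank you for your answer. Do you also know whether reading the data from the InputSplits happens in parallel? To phrase the previous question differently: will the RDD be created in parallel by retrieving data from the data sources?

One task is created for each split. Each worker CPU core picks up one task. If you have enough worker cores, all splits are processed in parallel. Otherwise, each core processes one split and then picks up another when it finishes, until all tasks are done.

OK, I understand. Thank you for your time and help.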
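For illustration, the scheduling described above can be mimicked with a plain Java thread pool. This is only a toy model under stated assumptions, not Spark's scheduler: the class SplitSchedulingSketch, the methods getSplits and runTasks, and the node names are all hypothetical, a String stands in for an InputSplit, and pool threads stand in for worker cores. It shows the two points from the answer: the number of tasks equals the number of splits returned (the hint is ignored here), and the number of tasks running at the same time is bounded by the number of cores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: no Spark or Hadoop API is used here.
public class SplitSchedulingSketch {

    /** Hypothetical stand-in for InputFormat.getSplits: one split per data source, hint ignored. */
    public static List<String> getSplits(List<String> dataSources, int minPartitionsHint) {
        return new ArrayList<>(dataSources);
    }

    /** Runs one task per split on `cores` threads and returns the peak number of concurrent tasks. */
    public static int runTasks(List<String> splits, int cores) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> tasks = new ArrayList<>();
        try {
            for (String split : splits) {
                tasks.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(50); // pretend to read records from this split
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait until every split has been processed
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> sources = Arrays.asList("node-a", "node-b", "node-c", "node-d", "node-e");
        List<String> splits = getSplits(sources, 2); // the hint of 2 does not limit the result
        System.out.println("tasks: " + splits.size());
        System.out.println("peak parallelism with 2 cores: " + runTasks(splits, 2));
        System.out.println("peak parallelism with 8 cores: " + runTasks(splits, 8));
    }
}
```

With five splits and two cores, the tasks run in waves of at most two. With eight cores, all five can run at once and the remaining threads stay idle. For the 1:1 pairing in the question, this suggests that you need at least as many worker cores as data sources if every source has to be read at the same time.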