Java: How to prevent a Hadoop stream from closing?

java multithreading hadoop

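I built a basic web parser that uses Hadoop to hand URLs off to multiple threads. It works quite well until I reach the end of my input file: Hadoop declares itself finished while threads are still running. This results in the error org.apache.hadoop.fs.FSError: java.io.IOException: Stream Closed. Is there any way to keep the stream open long enough for the threads to finish? (I can predict with reasonable accuracy the maximum time a thread will spend on a single URL.)

Here is how I execute the threads: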

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();
    private URLPile pile = new URLPile();
    private MSLiteThread[] Threads = new MSLiteThread[16];
    private boolean once = true;

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) {

        String url = value.toString();
        StringTokenizer urls = new StringTokenizer(url);
        Config.LoggerProvider = LoggerProvider.DISABLED;
        System.out.println("In Mapper");

        // Start the worker threads on the first call to map() only.
        if (once) {
            for (MSLiteThread thread : Threads) {
                System.out.println("created thread");
                thread = new MSLiteThread(pile);
                thread.start();
            }
            once = false;
        }

        // Hand every URL in this input record to the shared pile.
        while (urls.hasMoreTokens()) {
            try {
                word.set(urls.nextToken());
                String currenturl = word.toString();
                pile.addUrl(currenturl, output);
            } catch (Exception e) {
                e.printStackTrace();
                continue;
            }
        }
    }
}

And here are the relevant methods in URLPile:

public synchronized void addUrl(String url,OutputCollector<Text, Text> output) throws InterruptedException {
        while(queue.size()>16){
            System.out.println("queue full");
            wait();
        }
        finishedParcing--;
        queue.add(new MSLiteURL(output,url));
        notifyAll();
    }

    private Queue<MSLiteURL> queue = new LinkedList<MSLiteURL>();
    private int sent = 0;
    private int finishedParcing = 0;
    public synchronized MSLiteURL getNextURL() throws InterruptedException {

        notifyAll();
        sent++;
        //System.out.println(queue.peek());
        return queue.remove();

    }

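As I inferred from the comments below, you could probably do this within each call to map() to keep things simple. I see that you do the following to pre-create some idle threads. You can move the following code

if (once) {
    for (MSLiteThread thread : Threads) {
        System.out.println("created thread");
        thread = new MSLiteThread(pile);
        thread.start();
    }
    once = false;
}

to the configure() method of your mapper (see the Map class with configure() listed after the comments), so that the threads are created once and the once check is no longer needed.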

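Possible problem in your code:

At pile.addUrl(currenturl, output), when you add a new URL, all 16 threads may get the update at the same time (I am not quite sure), because the same pile object is passed to all 16 threads. There is a chance that your URLs get re-processed, or you may see some other side effects (I am not very sure).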
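
The concern above can be checked in isolation. The following is a minimal sketch, not the asker's URLPile: it assumes a java.util.concurrent.BlockingQueue as a stand-in for the shared pile, and the names SharedQueueDemo, run, and the POISON sentinel are made up for illustration. It suggests that when removal from the shared queue is atomic, each URL is handed to exactly one worker, and that take() blocks on an empty queue instead of throwing.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class SharedQueueDemo {
    // Hypothetical sentinel telling a worker to exit.
    static final String POISON = "<stop>";

    /** Runs `workers` consumers over `urls` items and returns how many items were handled. */
    static int run(int workers, int urls) throws InterruptedException {
        // One queue shared by all workers, like the single pile passed to 16 threads.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Set<String> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger handled = new AtomicInteger();

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take(); // blocks while the queue is empty
                        if (url.equals(POISON)) {
                            return;
                        }
                        // Fails loudly if two workers ever received the same URL.
                        if (!seen.add(url)) {
                            throw new IllegalStateException("duplicate: " + url);
                        }
                        handled.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }

        for (int i = 0; i < urls; i++) {
            queue.put("http://example.com/" + i);
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one sentinel per worker
        }
        for (Thread t : threads) {
            t.join(); // wait until every worker has drained its share
        }
        return handled.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(16, 1000)); // prints 1000
    }
}
```

Whether the original URLPile behaves the same way depends on code not shown in the question. In particular, the getNextURL() shown there calls queue.remove() without waiting for the queue to be non-empty.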
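Other suggestions:
Additionally, you may want to increase the map task timeout using
mapred.task.timeout
(default = 600000 ms = 10 minutes)
Description: the number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string.
You can add or override this property in mapred-site.xml.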
Comments:

This will declare the task failed if it does time out, which isn't exactly what I want, but it seems to be on the right track.

Ah! I might have missed some details in the question. Are you saying that you have threads running from a map task, and Hadoop exits when the map finishes processing its input?

More or less. The threads take a while to process each input, which is why I have multiple threads. But once Hadoop declares the map task finished, the threads have nowhere to put their output.

Can you show how you execute the threads in the map() function? I might be able to answer better that way.

Thank you very much for your help! Hopefully this will solve the problem I've been having.

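Thread creation moved into configure():

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, Text> {
    private URLPile pile = new URLPile();
    private MSLiteThread[] Threads = new MSLiteThread[16];

    // configure() runs once, before the first map() call,
    // so the worker threads are started once and no "once" flag is needed.
    @Override
    public void configure(JobConf job) {
        for (int i = 0; i < Threads.length; i++) {
            System.out.println("created thread");
            // Store the thread in the array so it can be referenced later.
            Threads[i] = new MSLiteThread(pile);
            Threads[i].start();
        }
    }

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) {
        // tokenize the input and add the URLs to the pile, as before
    }
}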
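
One detail in the thread-creation loop is worth a small illustration. In Java, assigning to the variable of a for-each loop only rebinds that local variable and leaves the array untouched. In the loop from the question the threads are still started, but the Threads array stays full of nulls, which matters if you later want to join() them. A minimal sketch follows; the class and method names are hypothetical.

```java
class ForEachAssignDemo {
    /** Tries to fill the array through the for-each variable; returns the number of non-null slots. */
    static int fillWithForEach(Thread[] slots) {
        for (Thread t : slots) {
            t = new Thread(() -> { }); // rebinds the local variable only
        }
        return countNonNull(slots);
    }

    /** Fills the array through an index; returns the number of non-null slots. */
    static int fillWithIndex(Thread[] slots) {
        for (int i = 0; i < slots.length; i++) {
            slots[i] = new Thread(() -> { }); // writes into the array itself
        }
        return countNonNull(slots);
    }

    static int countNonNull(Object[] items) {
        int count = 0;
        for (Object item : items) {
            if (item != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(fillWithForEach(new Thread[16])); // prints 0
        System.out.println(fillWithIndex(new Thread[16]));   // prints 16
    }
}
```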

Blocking in map() with a CountDownLatch until the URL threads have finished:

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        StringTokenizer urls = new StringTokenizer(value.toString());
        Config.LoggerProvider = LoggerProvider.DISABLED;

        // One latch count per URL in this record; map() blocks until all are processed.
        final CountDownLatch latch = new CountDownLatch(urls.countTokens());
        while (urls.hasMoreTokens()) {
            String currentUrl = urls.nextToken();
            // create a thread for the current URL and start it
            new Thread(new URLProcessingThread(currentUrl, latch)).start();
        }

        try {
            latch.await(); // wait for every thread started above to finish
            // sleep here for some time if you wish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for URL threads", e);
        }
    }
}

public class URLProcessingThread implements Runnable {
    private final CountDownLatch latch;
    private final String url;

    public URLProcessingThread(String url, CountDownLatch latch) {
        this.latch = latch;
        this.url = url;
    }

    @Override
    public void run() {
        try {
            // process url here
        } finally {
            // Always decrement the latch, even if processing fails,
            // so that map() is never left waiting forever.
            latch.countDown();
        }
    }
}
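
The latch pattern above can be tried outside Hadoop. The sketch below uses plain JDK classes only. The class name LatchBatchDemo, the method processAll, and the "parsed:" placeholder are assumptions for illustration, not part of the original code. It also uses the timed await(long, TimeUnit) overload, which may suit the case where the maximum time per URL is known.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchBatchDemo {
    /**
     * Processes each URL on its own thread and returns only after all of them
     * have finished, or fails if the timeout expires first.
     */
    static List<String> processAll(List<String> urls, long timeoutSeconds)
            throws InterruptedException {
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        CountDownLatch latch = new CountDownLatch(urls.size());

        for (String url : urls) {
            new Thread(() -> {
                try {
                    results.add("parsed:" + url); // placeholder for real fetching and parsing
                } finally {
                    latch.countDown(); // count down even if processing throws
                }
            }).start();
        }

        // Block the caller (map() in the Hadoop case) until every worker is done.
        boolean finished = latch.await(timeoutSeconds, TimeUnit.SECONDS);
        if (!finished) {
            throw new IllegalStateException("workers did not finish within the timeout");
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> out = processAll(Arrays.asList("a", "b", "c"), 30);
        System.out.println(out.size()); // prints 3
    }
}
```

Because the caller does not return until the latch reaches zero, the output destination is still open whenever a worker writes to it. Starting one thread per URL is the simplest form; a fixed-size thread pool would be a reasonable alternative if the records contain many URLs.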