
Cascading S3 sink tap is not deleted with SinkMode.REPLACE


We are running Cascading with a sink tap configured to store into Amazon S3, and are facing occasional FileAlreadyExistsException errors (see [1]). This only happens from time to time (roughly 1 run in 100) and is not reproducible.

Digging into the Cascading code, we found that BaseFlow.deleteSinksIfNotUpdate() calls Hfs.deleteResource(). Incidentally, we were intrigued by the silently swallowed NPE (the comment there reads "hack to get around NPE thrown when fs reaches the root directory").

Based on that, we extended the Hfs tap with our own tap, adding more behavior in the deleteResource() method (see [2]): a retry mechanism that calls getFileSystem(conf).delete directly.

The retry mechanism seems to bring an improvement, but we still sometimes face failures (see the example in [3]): HDFS returns isDeleted=true, yet when we ask directly afterwards whether the folder exists, we receive exists=true, which should not happen. When the flow succeeds, the logs also show isDeleted as true or false seemingly at random, which suggests the returned value is irrelevant or cannot be trusted.

Can anyone share their own S3 experience with behavior like "the folder should be deleted, but it isn't"? We suspect an S3 issue, but could it also lie in Cascading or HDFS?

We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev.

[1]

[2]

[3]


First, double-check the Cascading compatibility page for the supported distributions.

Note that Amazon EMR is listed there, as they run the compatibility tests periodically and report the results.

Second, S3 is an eventually consistent file system; HDFS is not. So assumptions about HDFS behavior do not carry over to data stored on S3. For example, a rename is really a copy and a delete, and the copy can take hours. Amazon has patched its internal distribution to accommodate many of these differences.
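To make the consequence of eventual consistency concrete, here is a toy model, not any real S3 client API, in which a delete succeeds at the "primary" but a stale replica keeps answering exists() == true for a few more reads. This is exactly the shape of the failure in [3]: isDeleted is true, yet the immediately following existence check still sees the data, so the only safe strategy is to poll until the delete becomes visible.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model of an eventually consistent store. All names here are
// illustrative stand-ins, not a real S3 interface.
public class EventualConsistencyDemo {
  static class StaleStore {
    private final ConcurrentMap<String, String> primary = new ConcurrentHashMap<>();
    private int staleReadsLeft;

    StaleStore(int staleReads) { this.staleReadsLeft = staleReads; }

    void put(String key, String value) { primary.put(key, value); }

    // The delete succeeds immediately and truthfully reports success...
    boolean delete(String key) { return primary.remove(key) != null; }

    // ...but exists() consults a "replica" that lags behind the primary.
    synchronized boolean exists(String key) {
      if (!primary.containsKey(key) && staleReadsLeft > 0) {
        staleReadsLeft--;
        return true; // stale answer: the delete has not propagated yet
      }
      return primary.containsKey(key);
    }
  }

  // Poll until the delete becomes visible, like the retry loop in [2].
  static int pollUntilGone(StaleStore store, String key, int maxPolls) {
    int polls = 0;
    while (store.exists(key) && polls < maxPolls) {
      polls++;
    }
    return polls;
  }

  public static void main(String[] args) {
    StaleStore store = new StaleStore(3);
    store.put("sink/part-00000", "data");
    boolean deleted = store.delete("sink/part-00000");
    // delete() reports success, but exists() stays true for 3 more reads
    System.out.println(deleted + " " + pollUntilGone(store, "sink/part-00000", 10));
  }
}
```

The point of the sketch is that a true return from delete() and a subsequent true from exists() are not contradictory under eventual consistency; only repeated polling settles the question.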

Third, there are no directories in S3. Directories are a hack, and different S3 interfaces (jets3t vs s3cmd vs ...) support them to different degrees. Considering the previous point, this is bound to be problematic.
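A short illustration of that point: an S3 bucket is a flat, sorted key space, and "directories" are only a naming convention around '/' in the keys. Listing a "directory" is really a prefix query, and "deleting a directory" means deleting every key under the prefix individually, so a partially propagated delete can leave a directory that half exists. The keys and helper names below are made up for illustration.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

// A flat, sorted key space standing in for an S3 bucket.
public class FlatKeySpace {
  // Listing a "directory" is a prefix scan over the sorted keys.
  static List<String> listPrefix(TreeSet<String> keys, String prefix) {
    return keys.tailSet(prefix).stream()
        .filter(k -> k.startsWith(prefix))
        .collect(Collectors.toList());
  }

  // There is no single "rmdir": each key under the prefix is removed
  // separately, and each removal propagates independently.
  static int deletePrefix(TreeSet<String> keys, String prefix) {
    List<String> toDelete = listPrefix(keys, prefix);
    keys.removeAll(toDelete);
    return toDelete.size();
  }

  public static void main(String[] args) {
    TreeSet<String> bucket = new TreeSet<>();
    bucket.add("sink/part-00000");
    bucket.add("sink/part-00001");
    bucket.add("sink/_SUCCESS");
    bucket.add("other/file");

    System.out.println(listPrefix(bucket, "sink/"));   // the "directory" contents
    System.out.println(deletePrefix(bucket, "sink/")); // 3 keys removed
    System.out.println(bucket.contains("other/file")); // untouched
  }
}
```

Because each tool decides for itself how to fake directory semantics on top of this (marker objects, trailing slashes, and so on), existence checks on a "directory" can disagree between clients even before eventual consistency is taken into account.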

Fourth, network latency and reliability matter, especially when talking to S3. Historically, I have found the Amazon network to behave better when using EMR with standard EC2 instances against very large data sets on S3. I also believe there is a patch in EMR that improves matters here as well.


So I suggest trying the EMR Apache Hadoop distribution to see whether your problems clear up.

When running any job on Hadoop that uses files in S3, the nuances of eventual consistency must be kept in mind.

I have helped troubleshoot many applications whose root problem was a similar delete race condition, whether they were written in Cascading, Hadoop Streaming, or directly in Java.

There was at one point a discussion about having S3 emit a notification once a given key/value pair has been fully deleted. I have not kept up with that feature. Otherwise, it is probably best to design systems, whether in Cascading or any other application using S3, so that the data a batch workflow consumes or produces is managed in HDFS or HBase or a key/value framework (we have used Redis for this, for example). S3 is then used for durable storage, but not for intermediate data.
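That layout can be sketched as follows. The two maps are stand-ins for a consistent working store (HDFS/HBase/Redis) and for S3; the stage names and keys are invented for illustration. Intermediate deletes and overwrites happen only on the consistent store, where they behave predictably, and a single write to S3 happens at the end.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the suggested workflow layout: intermediate data lives on a
// consistent store, S3 holds only the final, durable result.
public class WorkflowLayout {
  public static void main(String[] args) {
    Map<String, String> hdfs = new HashMap<>(); // consistent working store
    Map<String, String> s3 = new HashMap<>();   // durable final store

    // Intermediate stages stay on the consistent store, so the
    // delete-and-recreate pattern between stages is safe.
    hdfs.put("tmp/stage1", "parsed");
    hdfs.put("tmp/stage2", hdfs.remove("tmp/stage1") + "+joined");

    // Only the finished output is persisted to S3, once, at the end.
    s3.put("final/output", hdfs.remove("tmp/stage2"));

    System.out.println(hdfs.isEmpty());        // no intermediate data left behind
    System.out.println(s3.get("final/output"));
  }
}
```

With this shape, the sink tap that SinkMode.REPLACE has to delete never lives on S3, so the eventual-consistency race from the question does not arise in the first place.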

org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
    at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
    at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
    at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
    at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
    at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.j
  @Override
  public boolean deleteResource(JobConf conf) throws IOException {
    LOGGER.info("Deleting resource {}", getIdentifier());

    boolean isDeleted = super.deleteResource(conf);
    LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted, getIdentifier());

    Path path = new Path(getIdentifier());

    int retryCount = 0;
    int cumulativeSleepTime = 0;
    int sleepTime = 1000;

    while (getFileSystem(conf).exists(path)) {
      LOGGER.info("Resource {} still exists, it should not... - I will continue to wait patiently...",
          getIdentifier());
      try {
        LOGGER.info("Now I will sleep {} seconds while trying to delete {} - attempt: {}",
            sleepTime / 1000, getIdentifier(), retryCount + 1);
        Thread.sleep(sleepTime);
        cumulativeSleepTime += sleepTime;
        sleepTime *= 2; // exponential back-off between existence checks
      } catch (InterruptedException e) {
        e.printStackTrace();
        LOGGER.error("Interrupted while sleeping trying to delete {} with message {}...",
            getIdentifier(), e.getMessage());
        throw new RuntimeException(e);
      }

      // Re-issue the delete directly, on the first retry only
      if (retryCount == 0) {
        getFileSystem(conf).delete(getPath(), true);
      }

      retryCount++;

      if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) {
        break;
      }
    }

    if (getFileSystem(conf).exists(path)) {
      LOGGER.error("We didn't succeed to delete the resource {}. Throwing now a runtime exception.",
          getIdentifier());
      throw new RuntimeException(
          "Although we waited to delete the resource for "
              + getIdentifier() + ' ' + retryCount
              + " iterations, it still exists - This must be an issue in the underlying storage system.");
    }

    return isDeleted;
  }
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete
 INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
 INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1
 INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://...
 INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://...
 ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception.
 WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ...
 java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system.
    at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179)
    at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40)
    at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971)
    at cascading.flow.BaseFlow.prepare(BaseFlow.java:733)
    at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761)
    at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)