Java: How to persist data to HDFS using Storm


java hadoop apache-storm


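I have a simple bolt that reads data from a Kafka spout and then writes it to an HDFS directory. The problem is that the bolt does not write anything until the cluster is stopped. How can I make sure that when the bolt reads a tuple from the Kafka spout, it writes it to HDFS immediately, or at least after every "n" entries? (I am using CDH 4.4 and Hadoop 2.0.)

The Java code for the bolt: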

public class PrinterBolt10 extends BaseRichBolt{  
    private OutputCollector collector;
    private String values;
    Configuration configuration = null;
    FileSystem hdfs = null;
    FSDataOutputStream outputStream=null;
    BufferedWriter br = null; 
    List<String> valList;
    String machineValue;
    int upTime;
    int downTime;
    int idleTime; 

    public void prepare(Map config, TopologyContext context,OutputCollector collector) {
        upTime=0;
        downTime=0;
        idleTime=0;
        this.collector = collector;
        String timeStamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime());
        try{
            configuration = new Configuration();
            configuration.set("fs.defaultFS", "hdfs://localhost.localdomain:8020");
            hdfs =FileSystem.get(configuration);
            outputStream = hdfs.create(new Path("/tmp/storm/StormHdfs/machine10_"+timeStamp+".txt"));
            br = new BufferedWriter( new OutputStreamWriter( outputStream , "UTF-8" ) );
            br.flush(); 
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    public void execute(Tuple tuple) {  
        values = tuple.toString();
        int start = values.indexOf('[');
        int end = values.indexOf(']'); 
        machineValue=values.substring(start+1,end); 
        String machine=machineValue.substring(0,machineValue.indexOf(','));
        String code = machineValue.substring(machineValue.indexOf(',')+1);
        int codeInt = Integer.parseInt(code);
        if(codeInt==0) idleTime+=30;
        else if(codeInt==1) upTime+=30;
        else downTime+=30; 
        String finalMessage = machine + " "+ "upTime(s) :" + upTime+" "+ "idleTime(s): "+idleTime+" "+"downTime: "+downTime;  
        try {
            br.write(finalMessage);  // *This is the writing part into HDFS*
            br.write('\n'); 
            br.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt does not emit anything
    }

    public void cleanup() {}
}


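EDIT: I completely changed my answer.

You need to use HdfsBolt rather than writing to the file yourself. Using HdfsBolt removes all the complexity of working out when to flush the file, opening buffered streams, and so on. See the documentation for the details; the bits you are interested in are: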

// Use pipe as record boundary
RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");

//Synchronize data buffer with the filesystem every 1000 tuples
SyncPolicy syncPolicy = new CountSyncPolicy(1000);

// Rotate data files when they reach five MB
FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB);

// Use default, Storm-generated file names
FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/foo");

// Instantiate the HdfsBolt
HdfsBolt bolt = new HdfsBolt()
     .withFsURL("hdfs://localhost:54310")
     .withFileNameFormat(fileNameFormat)
     .withRecordFormat(format)
     .withRotationPolicy(rotationPolicy)
     .withSyncPolicy(syncPolicy);

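Then simply pass the data from your current bolt to this bolt.

You should use HdfsBolt to insert the data into HDFS, with the configuration the author described. For testing, do not set the SyncPolicy count to 1000; set it to a small value (e.g. 10-20) instead, because this number is how many tuples emitted by the spout must arrive before they are written to HDFS. For example, if you configure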

SyncPolicy syncPolicy = new CountSyncPolicy(10);

then you will be able to see the data you inserted into Kafka after 10 messages.
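To make the effect of the count concrete, here is a minimal sketch of the counting idea behind a count-based sync policy, written against a plain in-memory stream so that it runs without Storm or Hadoop. It is not HdfsBolt's implementation: the class `CountSyncWriter` and the parameter `syncEvery` are hypothetical names introduced only for illustration. On a real HDFS stream, flushing the writer is probably not enough on its own; the `FSDataOutputStream` would also need `hflush()` or `hsync()` before readers can see the data.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Storm): buffers records and pushes them
// to the underlying stream once every `syncEvery` records.
class CountSyncWriter {
    private final BufferedWriter writer;
    private final int syncEvery;
    private int pending = 0; // records written since the last flush

    CountSyncWriter(OutputStream out, int syncEvery) {
        if (syncEvery < 1) {
            throw new IllegalArgumentException("syncEvery must be >= 1");
        }
        this.writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        this.syncEvery = syncEvery;
    }

    /** Writes one record plus a newline; returns true if this call flushed. */
    boolean write(String record) {
        try {
            writer.write(record);
            writer.write('\n');
            pending++;
            if (pending >= syncEvery) {
                writer.flush(); // with HDFS, hflush()/hsync() would also go here
                pending = 0;
                return true;
            }
            return false;
        } catch (IOException e) {
            // A bolt's execute() cannot throw checked exceptions, so rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CountSyncWriter w = new CountSyncWriter(sink, 10);
        for (int i = 1; i <= 25; i++) {
            if (w.write("record-" + i)) {
                System.out.println("flushed after record " + i
                        + ", sink now holds " + sink.size() + " bytes");
            }
        }
        // The last 5 records stay in the buffer until another flush happens.
    }
}
```

With a count of 10, nothing reaches the sink for the first nine records. This is the same reason a small count is convenient while testing and a larger one is cheaper in production.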

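Yes, my intention is to force it to write the output to the desired destination.

I looked at this again and realized I was wrong, so I changed my answer.

I have seen the HDFS bolt, but it imports a specific class, "org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag", which I believe is not available in the CDH 4.4 hadoop-hdfs.

@benwatsondata I have seen the example given in your answer, but I am curious how exactly this fits in. Do I need a separate bolt that extends the HdfsBolt class and then put this in its execute() method? The code makes sense, I just don't know where it goes.

Although this is not a direct answer to your question, you could also consider writing the Storm output back to Kafka (you are already reading from Kafka, so the infrastructure exists) and then using a tool such as LinkedIn Camus to take care of (batch) loading the data from Kafka into HDFS. This approach may also be safer, because writing directly from Storm to HDFS can result in duplicate data.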