Java: Convert Avro to Parquet in memory
I am receiving Avro records from Kafka, and I want to convert these records into Parquet files. I am following this blog post. So far, the code looks roughly like this:
// Reconstructed as a method; fileName, record, and avroData arrive as parameters.
void openWriter(final String fileName, final SinkRecord record, final AvroData avroData) throws IOException {
    final Schema avroSchema = avroData.fromConnectSchema(record.valueSchema());
    CompressionCodecName compressionCodecName = CompressionCodecName.SNAPPY;

    int blockSize = 256 * 1024 * 1024; // Parquet row-group size
    int pageSize = 64 * 1024;          // Parquet page size

    Path path = new Path(fileName);    // org.apache.hadoop.fs.Path, i.e. a file on disk
    writer = new AvroParquetWriter<>(path, avroSchema, compressionCodecName, blockSize, pageSize);
}
Now, this does the Avro-to-Parquet conversion, but it writes the Parquet file to disk. Is there an easier way to just keep the file in memory, so that I don't have to manage temporary files on disk? Thank you.
"but it will write the Parquet file to the disk"
"if there was an easier way to just keep the file in memory"
From your question, I understand that you don't want to write partial files as Parquet. If you want the complete file written to disk in Parquet format while keeping the temporary data in memory, you can use a combination of a memory-mapped file and the Parquet format:
write your data into the memory-mapped file, and once you have finished writing, convert the bytes to Parquet format and store them on disk, as sketched below.
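A minimal sketch of that last step with plain java.nio, assuming parquetBytes already holds the fully serialized Parquet file (for example, produced by the in-memory OutputFile shown further down); the method name here is hypothetical, not from the answer:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Copy an already-built Parquet byte array to disk through a memory-mapped region.
static void dumpViaMemoryMappedFile(byte[] parquetBytes, Path target) throws IOException {
    try (FileChannel channel = FileChannel.open(target,
            StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, parquetBytes.length);
        buffer.put(parquetBytes); // bulk copy into the mapping
        buffer.force();           // flush the mapped region to the file
    }
}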
Take a look at my blog post, and translate it into English if necessary:
package yanbin.blog;

import org.apache.parquet.io.DelegatingPositionOutputStream;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// An OutputFile backed by a ByteArrayOutputStream, so Parquet writes land in memory.
public class InMemoryOutputFile implements OutputFile {

    private final ByteArrayOutputStream baos = new ByteArrayOutputStream();

    @Override
    public PositionOutputStream create(long blockSizeHint) throws IOException { // Mode.CREATE calls this method
        return new InMemoryPositionOutputStream(baos);
    }

    @Override
    public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
        // Mode.OVERWRITE would call this; delegate to create() rather than returning null.
        return create(blockSizeHint);
    }

    @Override
    public boolean supportsBlockSize() {
        return false; // no HDFS-style block size, so defaultBlockSize() is unused
    }

    @Override
    public long defaultBlockSize() {
        return 0;
    }

    public byte[] toArray() {
        return baos.toByteArray();
    }

    private static class InMemoryPositionOutputStream extends DelegatingPositionOutputStream {

        public InMemoryPositionOutputStream(OutputStream outputStream) {
            super(outputStream);
        }

        @Override
        public long getPos() throws IOException {
            // The current position is simply how many bytes have been written so far.
            return ((ByteArrayOutputStream) this.getStream()).size();
        }
    }
}
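If you also want to read those bytes back without touching disk, an InputFile counterpart can wrap the same array. The following is a hypothetical sketch (not part of the quoted answer), built on parquet-common's DelegatingSeekableInputStream:

import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical counterpart to InMemoryOutputFile: an InputFile over an in-memory byte array.
public class InMemoryInputFile implements InputFile {

    private final byte[] data;

    public InMemoryInputFile(byte[] data) {
        this.data = data;
    }

    @Override
    public long getLength() {
        return data.length;
    }

    @Override
    public SeekableInputStream newStream() {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        return new DelegatingSeekableInputStream(in) {
            @Override
            public long getPos() {
                // Bytes consumed so far = total length minus what is still available.
                return data.length - in.available();
            }

            @Override
            public void seek(long newPos) throws IOException {
                if (newPos < 0 || newPos > data.length) {
                    throw new EOFException("Cannot seek to " + newPos);
                }
                in.reset();      // back to the start (mark defaults to position 0)
                in.skip(newPos); // then skip forward to the requested position
            }
        };
    }
}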
With the InMemoryOutputFile in place, a standard AvroParquetWriter can write straight into it:

import org.apache.avro.Schema;
import org.apache.avro.data.TimeConversions;
import org.apache.avro.generic.GenericData;
import org.apache.avro.specific.SpecificRecordBase;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public static <T extends SpecificRecordBase> void writeToParquet(List<T> avroObjects) throws IOException {
    Schema avroSchema = avroObjects.get(0).getSchema();
    GenericData genericData = GenericData.get();
    genericData.addLogicalTypeConversion(new TimeConversions.DateConversion());

    InMemoryOutputFile outputFile = new InMemoryOutputFile();
    try (ParquetWriter<Object> writer = AvroParquetWriter.builder(outputFile)
            .withDataModel(genericData)
            .withSchema(avroSchema)
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withWriteMode(ParquetFileWriter.Mode.CREATE)
            .build()) {
        avroObjects.forEach(r -> {
            try {
                writer.write(r);
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        });
    } catch (IOException e) {
        e.printStackTrace();
    }

    // dump memory data to file for testing
    Files.write(Paths.get("./users-memory.parquet"), outputFile.toArray());
}
The dumped file can then be inspected with parquet-tools:

$ parquet-tools cat --json users-memory.parquet
$ parquet-tools schema users-memory.parquet
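Alternatively, to stay entirely in memory, the bytes can be read back directly with AvroParquetReader (recent parquet-avro versions accept an InputFile), using the hypothetical InMemoryInputFile sketched above; outputFile here is the InMemoryOutputFile from writeToParquet:

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

// Read the in-memory Parquet bytes back without a temporary file on disk.
try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new InMemoryInputFile(outputFile.toArray()))
                .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
}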