
How to read a Parquet file from S3 without Spark? (Java)

Tags: java, apache-spark, hadoop, amazon-s3, parquet

Currently I'm using Apache's ParquetReader to read local Parquet files; the code looks like this:

ParquetReader<GenericData.Record> reader = null;
Path path = new Path("userdata1.parquet");
try {
    reader = AvroParquetReader.<GenericData.Record>builder(path).withConf(new Configuration()).build();
    // Read the file record by record via the Avro binding
    GenericData.Record record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
} finally {
    if (reader != null) reader.close();
}

However, I want to access the Parquet file on S3 without downloading it first. Is there a way to parse an InputStream directly with the Parquet reader?

Yes, the latest versions of Hadoop include support for the S3 filesystem. Use the s3a client from the hadoop-aws library to access the S3 filesystem directly.

The HadoopInputFile path should be constructed as s3a://bucket-name/prefix/key, with the authentication credentials access_key and secret_key configured using the properties:

  • fs.s3a.access.key
  • fs.s3a.secret.key
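
A minimal sketch of that wiring (bucket, key, and credential values are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

// Point the s3a connector at the object and hand Parquet a seekable InputFile
Configuration conf = new Configuration();
conf.set("fs.s3a.access.key", "<ACCESS_KEY>");   // placeholder
conf.set("fs.s3a.secret.key", "<SECRET_KEY>");   // placeholder
Path path = new Path("s3a://bucket-name/prefix/key.parquet");
InputFile file = HadoopInputFile.fromPath(path, conf);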
You will also need these libraries on the classpath:

  • the hadoop-common JAR
  • the aws-java-sdk-bundle JAR
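
In Gradle terms that might look like the following (versions are illustrative; match them to your Hadoop line):

compile 'org.apache.hadoop:hadoop-common:3.3.0'
compile 'org.apache.hadoop:hadoop-aws:3.3.0'
compile 'com.amazonaws:aws-java-sdk-bundle:1.11.563'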


Just to add to @franklinsijo's answer: for anyone just starting with S3, note that the access key and secret key are set on the Hadoop Configuration. Here is a snippet that may be useful:

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

public static void main(String[] args) throws IOException {
    String PATH_SCHEMA = "s3a://xxx/xxxx/userdata1.parquet";
    Path path = new Path(PATH_SCHEMA);
    // Credentials go on the Hadoop Configuration used to open the file
    Configuration conf = new Configuration();
    conf.set("fs.s3a.access.key", "xxxxx");
    conf.set("fs.s3a.secret.key", "xxxxx");
    InputFile file = HadoopInputFile.fromPath(path, conf);
    ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build();
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record.toString());
    }
    reader.close();
}
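
As a side note, the keys don't have to be hard-coded: the s3a connector can also pull credentials from the standard AWS provider chain. A small sketch, using the SDK v1 provider class that hadoop-aws links against:

// Instead of embedding keys, delegate to the environment
// (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
conf.set("fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.EnvironmentVariableCredentialsProvider");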

I got it working with the following dependencies:

compile 'org.slf4j:slf4j-api:1.7.5'
compile 'org.slf4j:slf4j-log4j12:1.7.5'
compile 'org.apache.parquet:parquet-avro:1.12.0'
compile 'org.apache.avro:avro:1.10.2'
compile 'com.google.guava:guava:11.0.2'
compile 'org.apache.hadoop:hadoop-client:2.4.0'
compile 'org.apache.hadoop:hadoop-aws:3.3.0'   
compile 'org.apache.hadoop:hadoop-common:3.3.0'      
compile 'com.amazonaws:aws-java-sdk-core:1.11.563'
compile 'com.amazonaws:aws-java-sdk-s3:1.11.563'
Example (path-style access is handy for S3-compatible endpoints, and READ_INT96_AS_FIXED lets parquet-avro read Spark-written INT96 timestamps as fixed 12-byte values instead of erroring):

Path path = new Path("s3a://yours3path");
Configuration conf = new Configuration();
conf.set("fs.s3a.access.key", "KEY");
conf.set("fs.s3a.secret.key", "SECRET");
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
conf.setBoolean("fs.s3a.path.style.access", true);
conf.setBoolean(org.apache.parquet.avro.AvroReadSupport.READ_INT96_AS_FIXED, true);

InputFile file = HadoopInputFile.fromPath(path, conf);
ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build();
GenericRecord record;
while ((record = reader.read()) != null) {
  System.out.println(record);
}
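
One small improvement over the snippets above: ParquetReader implements Closeable, so try-with-resources guarantees the underlying S3 stream is released:

try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(file).build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
}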

You should use the aws-java-sdk-bundle JAR rather than the individual SDK artifacts; that avoids Jackson and httpclient classpath problems. There is also some code in Hadoop 3.3.0 that only links against the shaded JAR (fixed in an upcoming release).
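
Concretely, that means replacing the two aws-java-sdk-* lines in the dependency list above with the single shaded bundle (version illustrative):

// instead of:
// compile 'com.amazonaws:aws-java-sdk-core:1.11.563'
// compile 'com.amazonaws:aws-java-sdk-s3:1.11.563'
compile 'com.amazonaws:aws-java-sdk-bundle:1.11.563'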