Impala: How to query multiple Parquet files with different schemas

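In Spark 2.1 I often use

df = spark.read.parquet("/path/to/my/files/*.parquet")

to load a folder of Parquet files, even when the files have different schemas. I then run SQL queries against the DataFrame using SparkSQL.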
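To make "different schemas" concrete, here is a minimal pure-Python sketch of what schema merging amounts to: take the union of the columns seen across files and reject conflicting types. This is only an illustration, not Spark's actual implementation. The `merge_schemas` helper and the dict-based schema representation are hypothetical.

```python
from typing import Dict, Iterable

# Simplified, hypothetical schema representation: column name -> type name
Schema = Dict[str, str]


def merge_schemas(schemas: Iterable[Schema]) -> Schema:
    """Union the columns of several schemas, rejecting conflicting types."""
    merged: Schema = {}
    for schema in schemas:
        for column, dtype in schema.items():
            # Keep the first type seen for a column; later files must agree.
            existing = merged.setdefault(column, dtype)
            if existing != dtype:
                raise ValueError(
                    f"column {column!r} has conflicting types: {existing} vs {dtype}"
                )
    return merged


# Two made-up file schemas that share only the "id" column
file_a = {"id": "bigint", "name": "string"}
file_b = {"id": "bigint", "price": "double"}
merged = merge_schemas([file_a, file_b])
print(merged)
```

In Spark itself this behaviour is controlled by the `mergeSchema` read option (or the `spark.sql.parquet.mergeSchema` setting), so it is worth checking whether it is actually enabled for a given read.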

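Now I want to try Impala, because I read an article about it that contains sentences like these:

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop [...].

Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet.

So it sounds like it could also fit my use case (and might run faster).

But when I try something like the following: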

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET
'/path/to/my/files/*.parquet'
STORED AS PARQUET
LOCATION '/tmp';

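I get an exception:

AnalysisException: Cannot infer schema, path is not a file

So now my questions: Is it even possible to read a folder containing multiple Parquet files with Impala? Will Impala perform schema merging like Spark does? What query do I need to do this? I could not find any information about it on Google (always a bad sign...).

Thanks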

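From what I understand, you have some Parquet files and you want to see them through Impala tables? Below is my explanation.

You can create an external table and set its location to the directory containing the Parquet files, as shown below: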

CREATE EXTERNAL TABLE ingest_parquet_files (col1 STRING, col2 STRING) STORED AS PARQUET LOCATION '/path/to/my/files/';

After creating the table, you also have the option of loading Parquet files into it:

LOAD DATA INPATH "Your/HDFS/PATH" INTO TABLE schema.ingest_parquet_files;

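What you are trying should also work, but you have to remove the wildcard, because it expects a path after LIKE PARQUET and looks for the files at that location: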

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET 
'/path/to/my/files/'
STORED AS PARQUET
LOCATION '/tmp';
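The error message reported in the question ("path is not a file") is consistent with how the Impala documentation describes this clause: the path after LIKE PARQUET points to one specific Parquet data file, from which the column definitions are inferred, while LOCATION names the directory that holds the table's data. The sketch below assembles such a statement under that assumption. The helper `build_like_parquet_ddl` and the file name `part-00000.parquet` are hypothetical and only serve as an illustration.

```python
def build_like_parquet_ddl(table: str, sample_file: str, data_dir: str) -> str:
    """Assemble an Impala CREATE EXTERNAL TABLE ... LIKE PARQUET statement.

    sample_file: one concrete Parquet file used for schema inference.
    data_dir:    directory that holds all Parquet files of the table.
    """
    # Reject wildcards and directory-style paths, since the clause
    # is assumed to require a single data file.
    if "*" in sample_file or sample_file.endswith("/"):
        raise ValueError("LIKE PARQUET needs a single file, not a wildcard or directory")
    return (
        f"CREATE EXTERNAL TABLE {table} LIKE PARQUET '{sample_file}'\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{data_dir}';"
    )


ddl = build_like_parquet_ddl(
    "ingest_parquet_files",
    "/path/to/my/files/part-00000.parquet",  # hypothetical file name
    "/path/to/my/files/",
)
print(ddl)
```

Because the schema is inferred from that one file, columns that appear only in other files of the directory would not become part of the table definition, so this is not equivalent to Spark's schema merging.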

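For the full syntax, you can refer to the CREATE TABLE template in the Cloudera Impala documentation.

Note that the user you are running as should have read and write access to any path you give to Impala. You can achieve this with the following steps: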

-- Log in as the Hive superuser to perform the steps below
create role <role_name_x>;

-- Grant access to the database (<database_name> is a placeholder)
grant all on database <database_name> to role <role_name_x>;

-- Grant access to the HDFS path
grant all on URI '/hdfs/path' to role <role_name_x>;

-- Grant the role to the group of the user who will run the Impala job
grant role <role_name_x> to group <your_user_name>;

-- After performing the steps above, you can validate them with the commands below.
-- The role's grants should list the URI or database access:
show grant role <role_name_x>;

-- Now validate that the group has been granted the role:
show role grant group <your_user_name>;

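The documentation covers how roles and permissions are set up in more detail.

Personally, I would try Drill before Impala, simply because the installation is not simple if you are not already using Cloudera CDH. Even Hive or Pig would be quicker to try.

This may also help with some related questions.

If I remove the wildcard and run the snippet above, I still get an AnalysisException... "Cannot infer schema, path is not a file"

Did you grant the role access to the URI you are trying to read the files from?

@Fabian I have added the role grant section to my answer. Please check it and let me know if you run into any issues.

It seems I have an old version. I have no chance to test it with a newer version at the moment, so I will accept your answer and try it when I get the chance. Thanks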
