Hadoop: marking a Hive table as replicated/small


Is it possible to tell Hive that a certain table is "small", i.e. that it should be replicated to all nodes and operated on in RAM?

Try the following hint:

/*+ MAPJOIN(small_table) */  
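Update: by the way, there are other options, such as the sort-merge-bucket join. However, they require changes to the input tables, which must be bucketed on the same columns.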
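To illustrate why a sort-merge-bucket join needs both inputs bucketed on the same columns, here is a minimal Python sketch of the idea. It is not Hive's implementation: the function names (`bucket_and_sort`, `merge_join`, `smb_join`), the bucket count, and the sample tables are all hypothetical, and it assumes both tables are hashed into the same number of buckets and sorted on the join key.

```python
def bucket_and_sort(rows, key, num_buckets):
    """Hash each row into a bucket by join key, then sort every bucket on that key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda r: r[key])
    return buckets


def merge_join(left, right, key):
    """Inner-join two lists that are already sorted on `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left-hand row with that key against the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


def smb_join(left_rows, right_rows, key, num_buckets=4):
    """Join bucket i of one table only with bucket i of the other."""
    left_buckets = bucket_and_sort(left_rows, key, num_buckets)
    right_buckets = bucket_and_sort(right_rows, key, num_buckets)
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(merge_join(lb, rb, key))
    return out


# Hypothetical sample data.
orders = [
    {"user_id": 3, "item": "pen"},
    {"user_id": 1, "item": "ink"},
    {"user_id": 1, "item": "pad"},
]
users = [
    {"user_id": 1, "name": "ann"},
    {"user_id": 2, "name": "bob"},
    {"user_id": 3, "name": "cy"},
]
result = smb_join(orders, users, "user_id")
```

Because matching keys always land in the same bucket number on both sides, neither table has to fit in memory as a whole; that only holds when both are bucketed on the join column.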

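Here is some information from the Hortonworks documentation about the limitations and capabilities of map joins.

For convenience, here is an excerpt about MAPJOINs: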

MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through.

Local work:
    read records via standard table scan (includes filters and projections) from source on local machine
    build hashtable in memory
    write hashtable to local disk
    upload hashtable to dfs
    add hashtable to distributed cache
Map task:
    read hashtable from local disk (distributed cache) into memory
    match records' keys against hashtable
    combine matches and write to output
No reduce task

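The steps above can be sketched in a few lines of Python. This is only an illustration of the map-side hash join idea, not Hive's code: the function names (`build_hashtable`, `map_join`) and the sample tables are hypothetical, and the disk, DFS, and distributed-cache steps are left out.

```python
from collections import defaultdict


def build_hashtable(small_rows, key):
    """Local work: scan the small table and index its rows by join key."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table


def map_join(big_rows, hashtable, key):
    """Map task: stream the big table and probe the in-memory hash table."""
    for row in big_rows:
        for match in hashtable.get(row[key], []):
            # Combine the matching pair into one output record.
            yield {**match, **row}


# Hypothetical sample data.
small_table = [{"dept_id": 1, "dept": "eng"}, {"dept_id": 2, "dept": "ops"}]
big_table = [
    {"emp": "ann", "dept_id": 1},
    {"emp": "bob", "dept_id": 2},
    {"emp": "cy", "dept_id": 3},  # no matching key, so an inner join drops it
]

hashtable = build_hashtable(small_table, "dept_id")
joined = list(map_join(big_table, hashtable, "dept_id"))
```

Only the small table is held in memory; the large table is read once, row by row, which is why no reduce phase is needed.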
Limitations of Current Implementation

The current MAPJOIN implementation has the following limitations:

The mapjoin operator can only handle one key at a time; that is, it can perform a multi-table join, but only if all the tables are joined on the same key. (Typical star schema joins do not fall into this category.)
Hints are cumbersome for users to apply correctly and auto conversion doesn't have enough logic to consistently predict if a MAPJOIN will fit into memory or not.
A chain of MAPJOINs is not coalesced into a single map-only job, unless the query is written as a cascading sequence of mapjoin(table, subquery(mapjoin(table, subquery....). Auto conversion will never produce a single map-only job.
The hashtable for the mapjoin operator has to be generated for each run of the query, which involves downloading all the data to the Hive client machine as well as uploading the generated hashtable files.

What do you mean by "operated on in RAM"? Map join usage?
@dimamah: Loaded into RAM as a hash table.