
Hadoop Hive Join Optimization


I have two datasets, both stored in an S3 bucket, that I need to process in Hive, storing the output back to S3. A sample row from each dataset looks like this:

DataSet 1: {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126"}

DataSet 2: {"requestId":"TADS6152JHGJH5435","userAgent":"Mozilla"}
I need to join these two datasets on requestId and output a combined row like this:

Output:  {"requestId":"TADS6152JHGJH5435", "customerId":"ASJHAGSJH","sessionId":"172356126","userAgent":"Mozilla"}
The requestIds in DataSet 1 are a proper subset of the requestIds in DataSet 2. I am using a LEFT OUTER JOIN to get the output. Here is a simplified version of my Hive script:

CREATE EXTERNAL TABLE dataset1 (
     requestId string,
     customerId string,
     sessionId string
 )
LOCATION 's3://path_to_dataset1/';

CREATE EXTERNAL TABLE dataset2 (
     requestId string,
     userAgent string
 )
LOCATION 's3://path_to_dataset2/';

CREATE EXTERNAL TABLE output (
     requestId string,
     customerId string,
     sessionId string,
     userAgent string
 )
LOCATION 's3://path_to_output/';

INSERT OVERWRITE TABLE output
  SELECT d1.requestId, d1.customerId, d1.sessionId, d2.userAgent
  FROM dataset1 d1 LEFT OUTER JOIN dataset2 d2
  ON (d1.requestId=d2.requestId);
My questions are:

Are there opportunities to optimize this join? Can I use partitioning/bucketing of the tables to make the join run faster? I have already set hive.auto.convert.join to true in my script. What other Hive properties should I set to get better performance for the above query?

1. Optimize Joins
Join performance can be improved by enabling auto map joins and by optimizing skew joins.

Auto Map Joins
The auto map join is a very useful feature when joining a big table with a small table. If it is enabled, the small table is kept in the local cache of each node and joined with the big table in the map phase. Enabling auto map joins has two advantages. First, loading a small table into the cache saves read time on each data node. Second, it avoids skew joins in Hive queries, since the join is already completed in the map phase for each block of data.
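As a sketch, the session-level settings involved are shown below; hive.mapjoin.smalltable.filesize is the size threshold, in bytes, below which a table is considered small enough to broadcast (25000000 is the usual default, shown here for illustration):

hive> set hive.auto.convert.join=true;
hive> set hive.mapjoin.smalltable.filesize=25000000;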

Skew Joins
We can enable the optimization of skew joins, i.e. imbalanced joins, by setting the hive.optimize.skewjoin property to true, either with the SET command in the hive shell or in the hive-site.xml file:

  <property>
    <name>hive.optimize.skewjoin</name>
    <value>true</value>
    <description>
      Whether to enable skew join optimization. 
      The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of
      processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce
      job, process those skewed keys. The same key need not be skewed for all the tables, and so,
      the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a
      map-join.
    </description>
  </property>
  <property>
    <name>hive.skewjoin.key</name>
    <value>100000</value>
    <description>
      Determine if we get a skew key in join. If we see more than the specified number of rows with the same key in join operator,
      we think the key as a skew join key. 
    </description>
  </property>
  <property>
    <name>hive.skewjoin.mapjoin.map.tasks</name>
    <value>10000</value>
    <description>
      Determine the number of map task used in the follow up map join job for a skew join.
      It should be used together with hive.skewjoin.mapjoin.min.split to perform a fine grained control.
    </description>
  </property>
  <property>
    <name>hive.skewjoin.mapjoin.min.split</name>
    <value>33554432</value>
    <description>
      Determine the number of map task at most used in the follow up map join job for a skew join by specifying 
      the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks to perform a fine grained control.
    </description>
  </property>
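The same skew join settings can also be applied per session from the hive shell; the values below mirror the defaults in the properties above:

hive> set hive.optimize.skewjoin=true;
hive> set hive.skewjoin.key=100000;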

2. Enable Bucketed Map Joins
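The settings for this section appear to be missing from the original page. As a hedged sketch: a bucket map join requires both tables to be bucketed on the join key (with the bucket count of one table a multiple of the other's) and hive.optimize.bucketmapjoin enabled. Using the question's tables, with an illustrative bucket count and path:

CREATE EXTERNAL TABLE dataset1_bucketed (
     requestId string,
     customerId string,
     sessionId string
 )
CLUSTERED BY (requestId) INTO 32 BUCKETS
LOCATION 's3://path_to_dataset1_bucketed/';

hive> set hive.enforce.bucketing=true;      -- honor CLUSTERED BY when populating the table (pre-Hive 2.0)
hive> set hive.optimize.bucketmapjoin=true; -- use bucket-level map joins where possible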
Instead of running Hive queries on the venerable MapReduce engine, we can improve their performance by at least 100% to 300% by running them on the Tez execution engine. We can enable Tez from the hive shell with the following property:

hive> set hive.execution.engine=tez;

Hive converts a query into one or more stages. Stages could be a MapReduce stage, a sampling stage, a merge stage, or a limit stage. By default, Hive executes these stages one at a time. A particular job, however, may consist of some stages that do not depend on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly. Parallel execution can be enabled by setting the following properties:

  <property>
    <name>hive.exec.parallel</name>
    <value>true</value>
    <description>Whether to execute jobs in parallel</description>
  </property>
  <property>
    <name>hive.exec.parallel.thread.number</name>
    <value>8</value>
    <description>How many jobs at most can be executed in parallel</description>
  </property>
The vectorization feature was first introduced in Hive 0.13. With vectorized query execution, operations like scans, aggregations, filters, and joins are processed in batches of 1024 rows at a time instead of one row at a time, improving their performance.

We can enable vectorized query execution by setting the following three properties, either in the hive shell or in the hive-site.xml file:

hive> set hive.vectorized.execution.enabled = true;
hive> set hive.vectorized.execution.reduce.enabled = true;
hive> set hive.vectorized.execution.reduce.groupby.enabled = true;

Recent Hive releases provide cost-based optimization (CBO), which can apply further optimizations based on query cost, leading to potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and so on.

Cost-based optimization can be enabled by setting the following properties in the hive-site.xml file:

  <property>
    <name>hive.cbo.enable</name>
    <value>true</value>
    <description>Flag to control enabling Cost Based Optimizations using Calcite framework.</description>
  </property>
  <property>
    <name>hive.compute.query.using.stats</name>
    <value>true</value>
    <description>
      When set to true Hive will answer a few queries like count(1) purely using stats
      stored in metastore. For basic stats collection turn on the config hive.stats.autogather to true.
      For more advanced stats collection need to run analyze table queries.
    </description>
  </property>
  <property>
    <name>hive.stats.fetch.partition.stats</name>
    <value>true</value>
    <description>
      Annotation of operator tree with statistics information requires partition level basic
      statistics like number of rows, data size and file size. Partition statistics are fetched from
      metastore. Fetching partition statistics for each needed partition can be expensive when the
      number of partitions is high. This flag can be used to disable fetching of partition statistics
      from metastore. When this flag is disabled, Hive will make calls to filesystem to get file sizes
      and will estimate the number of rows from row schema.
    </description>
  </property>
  <property>
    <name>hive.stats.fetch.column.stats</name>
    <value>true</value>
    <description>
      Annotation of operator tree with statistics information requires column statistics.
      Column statistics are fetched from metastore. Fetching column statistics for each needed column
      can be expensive when the number of columns is high. This flag can be used to disable fetching
      of column statistics from metastore.
    </description>
  </property>
  <property>
    <name>hive.stats.autogather</name>
    <value>true</value>
    <description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
  </property>
  <property>
    <name>hive.stats.dbclass</name>
    <value>fs</value>
    <description>
      Expects one of the pattern in [jdbc(:.*), hbase, counter, custom, fs].
      The storage that stores temporary Hive statistics. In filesystem based statistics collection ('fs'), 
      each task writes statistics it has collected in a file on the filesystem, which will be aggregated 
      after the job has finished. Supported values are fs (filesystem), jdbc:database (where database 
      can be derby, mysql, etc.), hbase, counter, and custom as defined in StatsSetupConst.java.
    </description>
  </property>
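Note that hive.stats.fetch.column.stats only helps if column statistics actually exist; they can be collected with ANALYZE TABLE (table name taken from the question):

ANALYZE TABLE dataset1 COMPUTE STATISTICS;
ANALYZE TABLE dataset1 COMPUTE STATISTICS FOR COLUMNS;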


Great explanation.