Hive: computing column statistics in parallel

hive pyspark statistics

I am running aggregations (distinct, min, max) on a table, as follows:

# One pass per statistic per column: three Spark jobs for every column
for colName in df.columns:
    dt = df[[colName]].distinct().count()
    mx = df.agg({colName: "max"}).collect()[0][0]
    mn = df.agg({colName: "min"}).collect()[0][0]
    print(colName, dt, mx, mn)

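This can be done easily by computing statistics. The statistics from Hive and Spark differ:

Hive gives - distinct, max, min, nulls, length, version
Spark gives - count, mean, stddev, min, max

It looks like quite a few statistics are computed. How can I get all of them for all columns with a single command?

However, I have thousands of columns, and doing this serially is very slow. Suppose I want to compute some other function, such as the standard deviation of each column. How can that be done in parallel?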

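You can put as many expressions as you like into a single agg call, and they will all be computed at once. The result is a single row containing all the values. Here is an example: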

from pyspark.sql.functions import min, max, countDistinct

r = df.agg(
  min(df.col1).alias("minCol1"),
  max(df.col1).alias("maxCol1"),
  (max(df.col1) - min(df.col1)).alias("diffMinMax"),
  countDistinct(df.col2).alias("distinctItemsInCol2"))
r.printSchema()
# root
#  |-- minCol1: long (nullable = true)
#  |-- maxCol1: long (nullable = true)
#  |-- diffMinMax: long (nullable = true)
#  |-- distinctItemsInCol2: long (nullable = false)

row = r.collect()[0]
print(row.distinctItemsInCol2, row.diffMinMax)
# 10 9

You can also use the dictionary syntax here, but it is harder to manage for anything more complex.
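For comparison, here is a minimal sketch of the dictionary syntax mentioned above. The helper name `max_spec` is hypothetical, and `df` is assumed to be an existing DataFrame, so the Spark call itself is only shown as a comment. Because a Python dict holds each key once, this form allows only one aggregate per column, which is probably the main reason it gets awkward for more complex cases.

```python
def max_spec(columns):
    """Build a {column: aggregate} dict that could be passed to df.agg()."""
    return {c: "max" for c in columns}


spec = max_spec(["col1", "col2"])
print(spec)  # {'col1': 'max', 'col2': 'max'}

# A dict literal keeps only the last value for a repeated key, so asking
# for both min and max of col1 silently drops the min.
collapsed = {"col1": "min", "col1": "max"}
print(collapsed)  # {'col1': 'max'}

# With a live DataFrame (assumed, not created here) the call would be:
#     df.agg(spec).show()
```

The expression form, `df.agg(min(...), max(...))`, has no such restriction, since it takes a list of columns rather than a mapping.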

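You can use describe() to get aggregate statistics such as count, mean, min, max, and standard deviation for all columns to which they apply. (If you pass no arguments, statistics for all columns are returned by default.)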

df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "a"), (4, None), (None, "c")], ["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary|                id|name|
#+-------+------------------+----+
#|  count|                 4|   4|
#|   mean|               2.5|null|
#| stddev|1.2909944487358056|null|
#|    min|                 1|   a|
#|    max|                 4|   c|
#+-------+------------------+----+

As you can see, these statistics ignore any null values.

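If you are using Spark 2.3 or later, there is also summary(), which supports the following aggregates:

count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g. 75%)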

df.summary("count", "min", "max").show()
#+-------+---+----+
#|summary| id|name|
#+-------+---+----+
#|  count|  4|   4|
#|    min|  1|   a|
#|    max|  4|   c|
#+-------+---+----+
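The percentile option can be sketched as follows. The helper `summary_labels` is a hypothetical convenience, not part of PySpark. It only builds the list of statistic names, and the Spark call is left as a comment because it needs a live DataFrame.

```python
def summary_labels(percentiles, base=("count", "min", "max")):
    """Return the statistic names to pass to DataFrame.summary()."""
    return list(base) + ["{}%".format(p) for p in percentiles]


labels = summary_labels([25, 50, 75])
print(labels)  # ['count', 'min', 'max', '25%', '50%', '75%']

# With the df defined above (assumes Spark 2.3 or later):
#     df.summary(*labels).show()
```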

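If you want other aggregate statistics for all columns, you can also use a list comprehension. For example, to replicate what you say Hive provides (distinct, max, min, and nulls; I am not sure what length and version mean):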

import pyspark.sql.functions as f
from itertools import chain

# One list of aggregate expressions per statistic, one expression per column
agg_distinct = [f.countDistinct(c).alias("distinct_" + c) for c in df.columns]
agg_max = [f.max(c).alias("max_" + c) for c in df.columns]
agg_min = [f.min(c).alias("min_" + c) for c in df.columns]
# count() skips nulls, so emit a non-null marker (1) for each null value
agg_nulls = [f.count(f.when(f.isnull(c), 1)).alias("nulls_" + c) for c in df.columns]

# All expressions are evaluated together in a single aggregation
df.agg(
    *(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|          4|            3|     4|       c|     1|       a|       1|         1|
#+-----------+-------------+------+--------+------+--------+--------+----------+

Note, however, that this method returns a single row, rather than one row per statistic as describe() and summary() do.
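If a per-column layout is preferred, the single row can be regrouped in plain Python after collecting it. The sketch below is one possible approach, not part of the original answer. The function name `reshape_stats` is hypothetical, and it assumes the `<stat>_<column>` alias convention used above. The input dict is typed in by hand from the sample values in this answer, so the snippet runs without Spark.

```python
def reshape_stats(row_dict, columns, stats):
    """Turn {'max_id': 4, ...} into {'id': {'max': 4, ...}, ...}."""
    table = {c: {} for c in columns}
    for stat in stats:
        for c in columns:
            # Aliases follow the "<stat>_<column>" naming convention.
            table[c][stat] = row_dict[stat + "_" + c]
    return table


# Values copied from the single-row result for the sample id/name data.
row_dict = {
    "distinct_id": 4, "distinct_name": 3,
    "max_id": 4, "max_name": "c",
    "min_id": 1, "min_name": "a",
    "nulls_id": 1, "nulls_name": 1,
}
stats = ["distinct", "max", "min", "nulls"]
table = reshape_stats(row_dict, ["id", "name"], stats)
for col, values in table.items():
    print(col, values)
# id {'distinct': 4, 'max': 4, 'min': 1, 'nulls': 1}
# name {'distinct': 3, 'max': 'c', 'min': 'a', 'nulls': 1}

# With Spark, row_dict would come from: df.agg(...).collect()[0].asDict()
```

Looking up keys by known statistic and column names, rather than splitting the alias on the underscore, keeps this working for column names that themselves contain underscores.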
