Apache Spark: aggregation sum function fails to count elements correctly
My job parses HTTP log requests. The final statement looks at a field called controller_type to see whether it matches certain LIKE criteria, and then checks whether it isNotNull. If both hold, it assigns a value of 1, otherwise 0, and then builds a column summing those 1s and 0s. The problem is that my job counts rows that match the controller_type criteria without really honoring the isNotNull part. Do I have a logic or syntax error, or am I doing something wrong in how I construct this expression?
df = df.groupby(
fn.trunc(df['request_timestamp'], 'mon').alias(
'request_timestamp'),
df['account_id'],
df['account_guid'],
df['cluster_id'],
df['shard_id'],
df['unique_id'],
df['context_id'],
df['controller_type'],
df['controller_context_id'],
df['concat_user_id'],
df['user_id']) \
.agg(
fn.count(df['account_id']).alias('num_page_views'),
fn.sum(
fn.when(
((df['controller_type'].like('pages%')) &
(df['controller_context_id'].isNotNull())),
fn.lit(1))
.otherwise(fn.lit(0))
).alias('num_page_views_pages'),
fn.sum(
fn.when(
((df['controller_type'].like('files%')) &
(df['controller_context_id'].isNotNull())),
fn.lit(1))
.otherwise(fn.lit(0))
).alias('num_page_views_files'),
fn.sum(
fn.when(
((df['controller_type'].like('modules%')) &
(df['controller_context_id'].isNotNull())),
fn.lit(1))
.otherwise(fn.lit(0))
).alias('num_page_views_modules'),
fn.sum(
fn.when(
((df['controller_type'].like('assignments%')) &
(df['controller_context_id'].isNotNull())),
fn.lit(1))
.otherwise(fn.lit(0))
).alias('num_page_views_assignments'),
fn.sum(
fn.when(
((df['controller_type'].like('quizzes%')) &
(df['controller_context_id'].isNotNull())),
fn.lit(1))
.otherwise(fn.lit(0))
).alias('num_page_views_quizzes'),
fn.sum(
fn.when(
((df['controller_type'].like('discussion_topics%')) &
(df['controller_context_id'].isNotNull())),
fn.lit(1))
.otherwise(fn.lit(0))
).alias('num_page_views_discussion_topics'),
fn.sum(
fn.when(
((df['controller_type'].like('outcome%')) &
(df['controller_context_id'].isNotNull())),
fn.lit(1))
.otherwise(fn.lit(0))
).alias('num_page_views_outcomes'),
fn.countDistinct(df['user_id']).alias('num_distinct_user_logins'),
fn.countDistinct(df['session_id']).alias('num_sessions')
)
Here is the equivalent SQL statement:
SELECT
TRUNC(request_timestamp, 'month') AS request_timestamp,
account_id,
account_guid,
cluster_id,
shard_id,
unique_id,
context_id,
controller_type,
controller_context_id,
concat_user_id,
user_id,
COUNT(account_id) AS num_page_views,
SUM(CASE
WHEN controller_type LIKE 'pages%' AND
controller_context_id <> '' AND
controller_context_id IS NOT NULL
THEN 1
ELSE 0 END) AS num_page_views_pages,
SUM(CASE
WHEN controller_type LIKE 'files%' AND
controller_context_id <> '' AND
controller_context_id IS NOT NULL
THEN 1
ELSE 0 END) AS num_page_views_files,
SUM(CASE
WHEN controller_type LIKE 'modules%' AND
controller_context_id <> '' AND
controller_context_id IS NOT NULL
THEN 1
ELSE 0 END) AS num_page_views_modules,
SUM(CASE
WHEN controller_type LIKE 'assignments%' AND
controller_context_id <> '' AND
controller_context_id IS NOT NULL
THEN 1
ELSE 0 END) AS num_page_views_assignments,
SUM(CASE
WHEN controller_type LIKE 'quizzes%' AND
controller_context_id <> '' AND
controller_context_id IS NOT NULL
THEN 1
ELSE 0 END) AS num_page_views_quizzes,
SUM(CASE
WHEN controller_type LIKE 'discussion_topics%' AND
controller_context_id <> '' AND
controller_context_id IS NOT NULL
THEN 1
ELSE 0 END) AS num_page_views_discussion_topics,
SUM(CASE
WHEN controller_type LIKE 'outcome%' AND
controller_context_id <> '' AND
controller_context_id IS NOT NULL
THEN 1
ELSE 0 END) AS num_page_views_outcomes,
COUNT(DISTINCT session_id) AS num_sessions
FROM requests
GROUP BY
TRUNC(request_timestamp, 'month'),
account_id,
account_guid,
cluster_id,
shard_id,
unique_id,
context_id,
controller_type,
controller_context_id,
concat_user_id,
user_id
My result is:
+---------+---------+---------+------------------------------------------------------+
| a| b| c|sum(CASE WHEN isnotnull(a) THEN 1 ELSE 0 AS sum_col#3)|
+---------+---------+---------+------------------------------------------------------+
|something|something|something| 1|
|something| null|something| 1|
| null|something|something| 1|
+---------+---------+---------+------------------------------------------------------+
Your approach doesn't work on your toy data because the string "null" is not NULL, so isNotNull cannot filter it out. If you want to check whether the field contains the string "null", use an equality comparison (==). Let's illustrate this with a simple example:
df = sc.parallelize([
    (1, "null", ),
    (2, None, ),
    (3, "foo", )
]).toDF(["id", "x"])

df.select(
    "*",
    fn.col("x").isNull(),     # check whether the value IS NULL - OK
    fn.col("x") == "null",    # check whether the value equals the string 'null'
    fn.col("x") == None,      # check value = NULL - wrong - always NULL!
    ## fn.col("x") is None    # check whether the column is None - wrong!
).show()

## +---+----+---------+----------+----------+
## | id|   x|isnull(x)|(x = null)|(x = NULL)|
## +---+----+---------+----------+----------+
## |  1|null|    false|      true|      null|   <- the string "null", not NULL
## |  2|null|     true|      null|      null|   <- NULL is NULL, but != 'null'
## |  3| foo|    false|     false|      null|   <- neither NULL nor 'null'
## +---+----+---------+----------+----------+
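Given the above, if the field can contain real NULLs as well as the literal string 'null' (and, as in the SQL version, empty strings), each case has to be excluded explicitly; a minimal sketch, assuming `fn` is `pyspark.sql.functions` and a local SparkSession:

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [(1, "null"), (2, None), (3, "foo")], ["id", "x"]
)

# A real NULL check plus explicit comparisons against the string
# values we want to exclude ('null' and the empty string '').
cleaned = df.where(
    fn.col("x").isNotNull() &
    (fn.col("x") != "null") &
    (fn.col("x") != "")
)
# Only the 'foo' row survives: the real NULL fails isNotNull(),
# and the literal string "null" fails the equality test.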
Moreover, you can easily simplify all of these conditions to something like this:
checks = [
    ('pages%', 'num_page_views_pages'),
    ('quizzes%', 'num_page_views_quizzes'),
    ...
]

def cnt(pattern, label):
    cond = (
        fn.col('controller_type').like(pattern) &
        fn.col('controller_context_id').isNotNull()
    )
    # count only counts non-NULL values, so the otherwise clause
    # can be omitted, and any non-null value would do in place of 1.
    return fn.count(fn.when(cond, 1)).alias(label)

(df
    .groupBy(...)
    .agg(*[cnt(p, l) for p, l in checks]))
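To see this helper pattern in action, here is a self-contained sketch on toy data (the values and the `local[1]` session are made up for illustration; column names follow the question's schema):

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame(
    [
        ("acct1", "pages_controller", "ctx1"),
        ("acct1", "pages_controller", None),    # dropped by isNotNull
        ("acct1", "quizzes_controller", "ctx2"),
    ],
    ["account_id", "controller_type", "controller_context_id"],
)

def cnt(pattern, label):
    # Both conditions must hold: the LIKE pattern and a non-NULL context id.
    cond = (
        fn.col("controller_type").like(pattern) &
        fn.col("controller_context_id").isNotNull()
    )
    # count skips NULLs, so when() without otherwise() is enough.
    return fn.count(fn.when(cond, 1)).alias(label)

checks = [
    ("pages%", "num_page_views_pages"),
    ("quizzes%", "num_page_views_quizzes"),
]

result = df.groupBy("account_id").agg(*[cnt(p, l) for p, l in checks])
# The pages row with a NULL controller_context_id is not counted,
# so both counts come out as 1.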
The main piece you were missing is that the string 'null' is not NULL. On the Python side you should use None.

@zero323 So should I not be using .isNotNull/.isNull? Or would you mind clarifying where to use None?

Python None is converted to SQL NULL; the string null is just a string like any other. And no, you cannot drop them: you still have to use isNull. A check like col("x") is not None will always be true, because is checks Python object identity.