Apache Spark: trouble getting aggregate and sum functions to count elements correctly

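My job parses HTTP log requests. Its last statement looks at a field called controller_type to see whether it matches certain patterns, and then checks isNotNull (applied to controller_context_id in the code below). If both conditions hold, the row gets a value of 1, otherwise 0, and those 1s and 0s are summed into a new column. The problem is that the job counts rows as soon as they meet the controller_type criteria and does not really pay attention to the isNotNull part. Do I have a logic or syntax error, or am I constructing this expression the wrong way?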

from pyspark.sql import functions as fn

df = df.groupby(
            fn.trunc(df['request_timestamp'], 'mon').alias(
                'request_timestamp'),
            df['account_id'],
            df['account_guid'],
            df['cluster_id'],
            df['shard_id'],
            df['unique_id'],
            df['context_id'],
            df['controller_type'],
            df['controller_context_id'],
            df['concat_user_id'],
            df['user_id']) \
            .agg(
            fn.count(df['account_id']).alias('num_page_views'),
            fn.sum(
                fn.when(
                    ((df['controller_type'].like('pages%')) &
                    (df['controller_context_id'].isNotNull())),
                    fn.lit(1))
                    .otherwise(fn.lit(0))
            ).alias('num_page_views_pages'),
            fn.sum(
                fn.when(
                    ((df['controller_type'].like('files%')) &
                    (df['controller_context_id'].isNotNull())),
                    fn.lit(1))
                    .otherwise(fn.lit(0))
            ).alias('num_page_views_files'),
            fn.sum(
                fn.when(
                    ((df['controller_type'].like('modules%')) &
                    (df['controller_context_id'].isNotNull())),
                    fn.lit(1))
                    .otherwise(fn.lit(0))
            ).alias('num_page_views_modules'),
            fn.sum(
                fn.when(
                    ((df['controller_type'].like('assignments%')) &
                    (df['controller_context_id'].isNotNull())),
                    fn.lit(1))
                    .otherwise(fn.lit(0))
            ).alias('num_page_views_assignments'),
            fn.sum(
                fn.when(
                    ((df['controller_type'].like('quizzes%')) &
                    (df['controller_context_id'].isNotNull())),
                    fn.lit(1))
                    .otherwise(fn.lit(0))
            ).alias('num_page_views_quizzes'),
            fn.sum(
                fn.when(
                    ((df['controller_type'].like('discussion_topics%')) &
                    (df['controller_context_id'].isNotNull())),
                    fn.lit(1))
                    .otherwise(fn.lit(0))
            ).alias('num_page_views_discussion_topics'),
            fn.sum(
                fn.when(
                    ((df['controller_type'].like('outcome%')) &
                    (df['controller_context_id'].isNotNull())),
                    fn.lit(1))
                    .otherwise(fn.lit(0))
            ).alias('num_page_views_outcomes'),
            fn.countDistinct(df['user_id']).alias('num_distinct_user_logins'),
            fn.countDistinct(df['session_id']).alias('num_sessions')
        )

Here is the equivalent SQL statement:

SELECT
            TRUNC(request_timestamp, 'month') AS request_timestamp,
            account_id,
            account_guid,
            cluster_id,
            shard_id,
            unique_id,
            context_id,
            controller_type,
            controller_context_id,
            concat_user_id,
            user_id,
            COUNT(account_id) AS num_page_views,
            SUM(CASE
                    WHEN controller_type LIKE 'pages%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_pages,
            SUM(CASE
                    WHEN controller_type LIKE 'files%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_files,
            SUM(CASE
                    WHEN controller_type LIKE 'modules%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_modules,
            SUM(CASE
                    WHEN controller_type LIKE 'assignments%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_assignments,
            SUM(CASE
                    WHEN controller_type LIKE 'quizzes%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_quizzes,
            SUM(CASE
                    WHEN controller_type LIKE 'discussion_topics%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_discussion_topics,
            SUM(CASE
                    WHEN controller_type LIKE 'outcome%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_outcomes,
            COUNT(DISTINCT session_id) AS num_sessions
        FROM requests
        GROUP BY
          TRUNC(request_timestamp, 'month'),
          account_id,
          account_guid,
          cluster_id,
          shard_id,
          unique_id,
          context_id,
          controller_type,
          controller_context_id,
          concat_user_id,
          user_id

My result is:

+---------+---------+---------+------------------------------------------------------+
|        a|        b|        c|sum(CASE WHEN isnotnull(a) THEN 1 ELSE 0 AS sum_col#3)|
+---------+---------+---------+------------------------------------------------------+
|something|something|something|                                                     1|
|something|     null|something|                                                     1|
|     null|something|something|                                                     1|
+---------+---------+---------+------------------------------------------------------+

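Your approach does not work on the toy data because the string "null" is not NULL, so isNotNull cannot filter it out. If you want to check whether a field contains the text "null", use equality (==). Let's illustrate this with a simple example: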

from pyspark.sql import functions as fn

# Assumes an active SparkContext named sc, as in the pyspark shell.
df = sc.parallelize([
    (1, "null", ),
    (2, None, ),
    (3, "foo", )
]).toDF(["id", "x"])

df.select("*",
    fn.col("x").isNull(),      # Check if value is NULL - OK
    fn.col("x") == "null",     # Check if value is the string "null" - not useful here
    fn.col("x") == None        # Check if value = NULL - wrong - always NULL!
    ## fn.col("x") is None     # Check if the column object is None - wrong!
).show()

## +---+----+---------+----------+----------+
## | id|   x|isnull(x)|(x = null)|(x = null)|
## +---+----+---------+----------+----------+
## |  1|null|    false|      true|      null|  string "null", but IS NOT NULL
## |  2|null|     true|      null|      null|  null IS NULL, but != 'null'
## |  3| foo|    false|     false|      null|  NOT NULL
## +---+----+---------+----------+----------+
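
The same distinction can be reproduced in plain SQL. The sketch below is an illustration only: it uses Python's built-in sqlite3 module instead of Spark (an assumption made so that it runs without a cluster), and the table name t is made up. SQLite reports booleans as 0/1 and SQL NULL as Python None, but the three-valued logic matches the output above.

```python
import sqlite3

# Toy table mirroring the example above: one literal string "null",
# one real NULL (Python None), and one ordinary string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, x TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "null"), (2, None), (3, "foo")],
)

rows = conn.execute(
    """
    SELECT id,
           x IS NULL  AS is_null,      -- true only for the real NULL
           x = 'null' AS equals_text,  -- true only for the string 'null'
           x = NULL   AS equals_null   -- NULL for every row
    FROM t
    ORDER BY id
    """
).fetchall()

# The SUM(CASE ...) pattern from the question: the string 'null' is counted,
# because it is an ordinary non-NULL value.
not_null_count = conn.execute(
    "SELECT SUM(CASE WHEN x IS NOT NULL THEN 1 ELSE 0 END) FROM t"
).fetchone()[0]

for row in rows:
    print(row)
print(not_null_count)
```

Only row 2 holds a real NULL, so the IS NOT NULL sum counts the other two rows, including the one whose text happens to be "null".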

Also, you can easily simplify all of the conditions to something like this:

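checks = [
    ('pages%', 'num_page_views_pages'),
    ('quizzes%', 'num_page_views_quizzes'),
    ...
]

def count(pattern, label):
    cond = (
        fn.col('controller_type').like(pattern) &
        fn.col('controller_context_id').isNotNull()
    )
    # count only counts non-NULL values, so we can omit otherwise
    # and choose an arbitrary value for the matching rows
    return fn.count(fn.when(cond, 1)).alias(label)

(df
    .groupBy(...)
    .agg(*[count(p, l) for p, l in checks]))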
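
To see why otherwise can be dropped, here is a small sketch that compares COUNT over a CASE without ELSE against the SUM of 1s and 0s from the question. It again runs on sqlite3 instead of Spark, and the sample rows and the helper name both_counts are made up for illustration; only the column names and the two patterns come from the code above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (controller_type TEXT, controller_context_id TEXT)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [
        ("pages/show", "10"),
        ("pages/edit", None),    # real NULL: must not be counted
        ("quizzes/take", "7"),
        ("pages/index", "11"),
    ],
)

# (pattern, label) pairs, in the spirit of the checks list above.
checks = [
    ("pages%", "num_page_views_pages"),
    ("quizzes%", "num_page_views_quizzes"),
]


def both_counts(pattern):
    """Return (COUNT-based, SUM-based) results for one LIKE pattern."""
    cond = "controller_type LIKE ? AND controller_context_id IS NOT NULL"
    sql = (
        # CASE without ELSE yields NULL for non-matching rows; COUNT skips NULLs.
        f"SELECT COUNT(CASE WHEN {cond} THEN 1 END), "
        # The original formulation: add up explicit 1s and 0s.
        f"SUM(CASE WHEN {cond} THEN 1 ELSE 0 END) FROM requests"
    )
    return conn.execute(sql, (pattern, pattern)).fetchone()


results = {label: both_counts(pattern) for pattern, label in checks}
print(results)
```

Both formulations give the same number for each pattern, so driving the aggregation from a list of pairs changes only how much code has to be written, not the result.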

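The main part you are missing is that 'null' is not NULL. You should use None on the Python side.

@zero323 So I shouldn't use .isNotNull / isNull? Or would you mind clarifying where to use None?

Then simplify the rest using Python. None translates to SQL NULL. The string null is just a string like any other.

No, you cannot. You still have to use isNull. An "is not None" check will always be true, because it tests object identity.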
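
The identity point can be shown without Spark at all. The class below is a hypothetical stand-in written for this illustration, not pyspark's Column; it only mimics the general idea that == is overloaded to build an expression while the is operator cannot be overloaded.

```python
class ToyColumn:
    """Hypothetical stand-in for a column expression; not pyspark's Column."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Overloaded: returns a description of the comparison instead of a
        # bool, the way expression objects commonly do.
        rhs = "NULL" if other is None else repr(other)
        return f"({self.name} = {rhs})"

    def isNull(self):
        # An explicit NULL test is the only form that expresses "IS NULL".
        return f"({self.name} IS NULL)"


x = ToyColumn("x")

identity_check = x is not None   # plain Python identity test on the object
equality_expr = x == None        # deliberately ==, so it goes through __eq__
null_expr = x.isNull()

print(identity_check, equality_expr, null_expr)
```

The identity test never looks at any data: it only asks whether the Python object x is the None singleton, which it is not, so it is always True regardless of what the column contains.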