SQL: Finding a user's journey in Google BigQuery

I want to find users' journeys on a specific website. My dataset has the same schema as the Google Merchandise Store sample dataset.

Starting from the Google BigQuery Cookbook, I implemented and modified the SQL code provided there to get the sequence of hits for each customer:

SELECT
  fullVisitorId AS id,
  visitId AS visitid,
  visitNumber AS visitnumber,
  h.hitNumber AS hitNumber,
  CASE
    WHEN h.eventInfo.eventAction = "Lead" THEN "Lead"
    WHEN h.eventInfo.eventAction = "Homepage" THEN "Homepage"
    WHEN h.eventInfo.eventAction = "Search" THEN "Search"
    WHEN h.eventInfo.eventAction = "High Intent Use" THEN "High Intent Use"
    WHEN h.eventInfo.eventAction = "Listing Page" THEN "Listing Page"
  END AS journey
FROM
  `dataset`,
  UNNEST(hits) AS h
WHERE
  h.type="PAGE"
  OR h.type="EVENT"
ORDER BY
  fullVisitorId,
  visitId,
  visitNumber,
  hitNumber
A fragment of the result I get looks like this:

fullVisitorId visitId visitNumber hitnumber  journey
    001        1001       1           1      Homepage
    001        1001       1           2      Search
    001        1001       1           3      null
    001        1001       1           4      Search
    001        1001       1           5      Listing Page
    001        1001       1           6      Lead
    001        1001       1           2      Search
    001        1001       1           7      Lead
    002        1002       1           1      Search
    ...
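What I need is another column that shows each visitor's journey up to the first lead while ignoring repeats. For example, if a visitor views 5 search pages back to back, the journey should show Search only once. That is, for visitor 001 in visit 1001, the column would show: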

Homepage -> Search -> Listing Page -> Lead
I hope the question is clear. Thanks for your help!

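I would suggest using STRING_AGG to build the sequence of journey steps. Adding DISTINCT inside it will show each individual journey step only once per user.

For example:

STRING_AGG(DISTINCT journey, '->')

You could then use a regular expression to trim everything after the first "Lead", unless someone can suggest a better way to do this within the original string aggregation.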
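
A minimal sketch of that trimming step, assuming the journey has already been aggregated into one string per visit. The paths CTE and its rows are made-up sample data, and path_to_first_lead is a hypothetical column name. Be aware that the pattern also matches any step whose name merely contains "Lead".

```sql
#standardSQL
-- Made-up sample input: one already-aggregated journey string per visitor.
WITH paths AS (
  SELECT '001' AS fullVisitorId,
         'Homepage -> Search -> Listing Page -> Lead -> Search -> Lead' AS journey_path
  UNION ALL
  SELECT '002', 'Search -> Listing Page'
)
SELECT
  fullVisitorId,
  -- The non-greedy .*? keeps everything up to and including the first "Lead".
  -- REGEXP_EXTRACT returns NULL when nothing matches, so fall back to the full path.
  IFNULL(REGEXP_EXTRACT(journey_path, r'^(.*?Lead)'), journey_path) AS path_to_first_lead
FROM paths
```

For the two sample rows this should return "Homepage -> Search -> Listing Page -> Lead" for visitor 001 and the unchanged "Search -> Listing Page" for visitor 002.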

The following is for BigQuery Standard SQL and applies extra logic on top of your existing/current query:

#standardSQL
SELECT 
  fullVisitorId, visitId, 
  STRING_AGG(journey, ' -> ' ORDER BY visitNumber, hitnumber) journey_path
FROM (
  SELECT 
    fullVisitorId, visitId, 
    MIN(visitNumber) visitNumber, MIN(hitnumber) hitnumber, journey
  FROM (
    SELECT *, COUNTIF(journey = 'Lead') OVER(win) grp
    FROM `your_current_query`
    WINDOW win AS (
      PARTITION BY fullVisitorId, visitId 
      ORDER BY visitNumber, hitnumber 
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    )
  )
  WHERE grp = 0
  GROUP BY fullVisitorId, visitId, journey
)
GROUP BY fullVisitorId, visitId
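
To see what the grp column in the query above contains, here is a small sketch that runs the same window over made-up rows (the sample CTE is hypothetical data, not from any real dataset). Because the frame ends at 1 PRECEDING, each row only counts the Lead rows strictly before it.

```sql
#standardSQL
-- Made-up sample rows for a single visit.
WITH sample AS (
  SELECT '001' AS fullVisitorId, 1001 AS visitId, 1 AS visitNumber,
         1 AS hitNumber, 'Homepage' AS journey UNION ALL
  SELECT '001', 1001, 1, 2, 'Search' UNION ALL
  SELECT '001', 1001, 1, 3, 'Lead' UNION ALL
  SELECT '001', 1001, 1, 4, 'Search' UNION ALL
  SELECT '001', 1001, 1, 5, 'Lead'
)
SELECT
  hitNumber,
  journey,
  -- Number of 'Lead' rows strictly before the current hit within the visit.
  COUNTIF(journey = 'Lead') OVER (
    PARTITION BY fullVisitorId, visitId
    ORDER BY visitNumber, hitNumber
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
  ) AS grp
FROM sample
ORDER BY hitNumber
```

Hits 1 to 3 should get grp = 0 (the first Lead itself has no Lead before it), and hits 4 and 5 should get grp = 1. Filtering on grp = 0 therefore keeps everything up to and including the first Lead.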
So you can use it with your existing query as shown below:

#standardSQL
WITH `your_current_query` AS (
  SELECT
    fullVisitorId,  -- not aliased to id: the outer query references fullVisitorId
    visitId AS visitid,
    visitNumber AS visitnumber,
    h.hitNumber AS hitNumber,
    CASE
      WHEN h.eventInfo.eventAction = "Lead" THEN "Lead"
      WHEN h.eventInfo.eventAction = "Homepage" THEN "Homepage"
      WHEN h.eventInfo.eventAction = "Search" THEN "Search"
      WHEN h.eventInfo.eventAction = "High Intent Use" THEN "High Intent Use"
      WHEN h.eventInfo.eventAction = "Listing Page" THEN "Listing Page"
    END AS journey
  FROM
    `dataset`,
    UNNEST(hits) AS h
  WHERE
    h.type="PAGE"
    OR h.type="EVENT"
)
SELECT 
  fullVisitorId, visitId, 
  STRING_AGG(journey, ' -> ' ORDER BY visitNumber, hitnumber) journey_path
FROM (
  SELECT 
    fullVisitorId, visitId, 
    MIN(visitNumber) visitNumber, MIN(hitnumber) hitnumber, journey
  FROM (
    SELECT *, COUNTIF(journey = 'Lead') OVER(win) grp
    FROM `your_current_query`
    WINDOW win AS (
      PARTITION BY fullVisitorId, visitId 
      ORDER BY visitNumber, hitnumber 
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    )
  )
  WHERE grp = 0
  GROUP BY fullVisitorId, visitId, journey
)
GROUP BY fullVisitorId, visitId
--- ORDER BY fullVisitorId, visitId    
If applied to your sample result, the above should produce the result below:

Row fullVisitorId   visitId     journey_path     
1   001             1001        Homepage -> Search -> Listing Page -> Lead   
2   002             1002        Search   
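
Note that grouping by journey in the query above removes every repeated step within a visit, not only back-to-back ones. If only consecutive repeats should collapse, as the question describes, one possible variant compares each step with the previous one using LAG. This is a sketch on made-up rows (the sample and flagged CTEs are hypothetical), not a replacement for the answer's method.

```sql
#standardSQL
-- Made-up sample rows: visitor 002 returns to Search after Homepage.
WITH sample AS (
  SELECT '001' AS fullVisitorId, 1001 AS visitId, 1 AS visitNumber,
         1 AS hitNumber, 'Homepage' AS journey UNION ALL
  SELECT '001', 1001, 1, 2, 'Search' UNION ALL
  SELECT '001', 1001, 1, 3, CAST(NULL AS STRING) UNION ALL
  SELECT '001', 1001, 1, 4, 'Search' UNION ALL
  SELECT '001', 1001, 1, 5, 'Listing Page' UNION ALL
  SELECT '001', 1001, 1, 6, 'Lead' UNION ALL
  SELECT '001', 1001, 1, 7, 'Search' UNION ALL
  SELECT '001', 1001, 1, 8, 'Lead' UNION ALL
  SELECT '002', 1002, 1, 1, 'Search' UNION ALL
  SELECT '002', 1002, 1, 2, 'Homepage' UNION ALL
  SELECT '002', 1002, 1, 3, 'Search'
),
flagged AS (
  SELECT
    *,
    -- Leads strictly before this hit; 0 up to and including the first Lead.
    COUNTIF(journey = 'Lead') OVER (
      PARTITION BY fullVisitorId, visitId
      ORDER BY visitNumber, hitNumber
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS grp,
    -- Previous non-NULL step in the same visit.
    LAG(journey) OVER (
      PARTITION BY fullVisitorId, visitId
      ORDER BY visitNumber, hitNumber
    ) AS prev_journey
  FROM sample
  WHERE journey IS NOT NULL  -- drop unclassified hits before comparing neighbours
)
SELECT
  fullVisitorId,
  visitId,
  STRING_AGG(journey, ' -> ' ORDER BY visitNumber, hitNumber) AS journey_path
FROM flagged
WHERE grp = 0
  AND (prev_journey IS NULL OR journey != prev_journey)  -- keep only changes of step
GROUP BY fullVisitorId, visitId
ORDER BY fullVisitorId, visitId
```

On this sample, visitor 001 should yield "Homepage -> Search -> Listing Page -> Lead" and visitor 002 should yield "Search -> Homepage -> Search", whereas grouping by journey would give "Search -> Homepage" for visitor 002.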

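I took Mikhail's excellent approach and made a more scalable version for those with large amounts of data. The idea is the same, but it is applied as a subquery on the hits array: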

SELECT
  fullVisitorId AS id,
  visitId AS visitid,
  visitNumber AS visitnumber,
  ARRAY(
   (SELECT AS STRUCT * 
    FROM 
      (SELECT AS STRUCT
         hitNumber,
         page.pagePath, -- pagePath instead of CASE-WHEN with events
         count(page.pagePath) over (win) elNumber
       FROM t.hits 
       WHERE type IN ('PAGE', 'EVENT') 
       WINDOW win AS (
        PARTITION BY page.pagePath
        ORDER BY hitnumber 
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        )
       ORDER BY hitNumber)
    WHERE elNumber=0
      -- instead of 'Lead' I used '/signin.html' 
      AND hitNumber < (SELECT MIN(hitNumber) FROM t.hits WHERE page.pagePath='/signin.html')
    )
  ) AS journey
FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
limit 1000
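I used the actual sample data but could not find the events from your example in it, so I just used page paths. It should be easy to adapt, though.

In addition, this query returns nested data rather than a flat table, which again saves space when the result is saved as a table and makes queries on it faster.

There is also no grouping involved: everything in the subquery happens only on the array, which allows very fast processing thanks to parallelization.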
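
As a rough illustration of how the array approach might be adapted to event actions, here is a sketch on made-up data. The sessions CTE, the action field, and the prev_action column are hypothetical; in a real Google Analytics export the step name would come from something like the CASE expression in the question. Unlike the query above, this version collapses only consecutive repeats and keeps the first Lead.

```sql
#standardSQL
-- Made-up session with a nested hits array, mimicking the export layout.
WITH sessions AS (
  SELECT
    '001' AS fullVisitorId,
    1001 AS visitId,
    [STRUCT(1 AS hitNumber, 'Homepage' AS action),
     STRUCT(2, 'Search'),
     STRUCT(3, 'Search'),
     STRUCT(4, 'Listing Page'),
     STRUCT(5, 'Lead'),
     STRUCT(6, 'Search')] AS hits
)
SELECT
  fullVisitorId,
  visitId,
  ARRAY(
    SELECT action
    FROM (
      SELECT
        hitNumber,
        action,
        -- Previous step inside this session's array only; no cross-row grouping.
        LAG(action) OVER (ORDER BY hitNumber) AS prev_action
      FROM UNNEST(hits)
    )
    WHERE (prev_action IS NULL OR action != prev_action)
      -- Stop at the first Lead; if the session has no Lead, keep every hit.
      AND hitNumber <= IFNULL(
        (SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE action = 'Lead'),
        hitNumber)
    ORDER BY hitNumber
  ) AS journey
FROM sessions
```

For the sample session the journey array should contain Homepage, Search, Listing Page, Lead, in that order.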

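Ah, thanks @Ben, this helped me get to the next step! :) I am thinking of combining STRPOS with other functions (possibly SUBSTR) to do the trimming. I will update here again if I get a result.

STRING_AGG together with DISTINCT does not guarantee the correct order: with DISTINCT you can only ORDER BY journey, not by hitNumber, and without ORDER BY no order is guaranteed at all.

Hi @Mikhail, thanks for your answer! Could you explain what "COUNTIF(journey = 'Lead') OVER(win) grp" does? And why filter on "grp = 0"?

Ah OK, I looked at the query more closely and found that "grp = 0" captures all hits before the first Lead. Always learning something new from you, @Mikhail! Thank you :)

Sure. Right, it groups all events that precede each lead. Since you only want the events before the first lead, "grp = 0" does exactly that. Glad you worked out the answer yourself :)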