Conditional probability in SQL


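I think I have hit a bit of a dead end.

Suppose I have a data set, and it is fairly simple: person_id and book_id. It is a plain fact table that says person X bought books A, B and C.

I know how to find out how many people bought book X and book Y together. Here it is:

This is also where my brain decided to shut down. I know I probably need to do something like count(b.person_id) / (everyone who bought book A) * 100, but I am not entirely sure.

I hope I have made myself clear enough.

EDIT1: I am currently using SQL Server 2017, so I assume the right answer would be T-SQL? In the end, the output should look similar to the table below. Also, no one can buy book X three times.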

Book1 Book2 HowManyPeopleBoughtBook2
1     2     50%
1     3     7%
2     3     15%
2     1     40%
3     1     60%
3     2     20%

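EDIT2: Assume there are hundreds of thousands of rows in the database. Yes, this is related to a data science course I am taking, hence the large amount of data.

If you want to generate every possible pair of books bought together, along with the percentage of people who bought that pair, the following may help: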

create table data1(book_id int, person_id int);

-- sample data: person 300 bought books 1 and 2; person 301 bought books 1, 2 and 3
insert into data1
select *
from (values(1,300)
           ,(2,300)
           ,(2,301)
           ,(1,301)
           ,(3,301)
     )t(book_id,person_id);

-- in T-SQL the statement before a CTE must end with a semicolon, hence the terminators above
with books
  as (select distinct book_id
        from data1 a
      )
   ,tot_persons
    as (select count(distinct person_id) as tot_cnt
          from data1
        )
   ,pairs
    as (
   select a.book_id as col1 /* This block generates all possible pair combinations of books */
         ,b.book_id as col2
     from books a
     join books b
       on a.book_id<b.book_id
       )
       select a.col1,a.col2
              /* 100.0 avoids integer division; the denominator is ALL persons */
              ,count(b.person_id)*100.0/(select tot_cnt from tot_persons) as percent_of_persons_buying_both
         from pairs a
         join data1 b
           on a.col1=b.book_id
        where exists(select 1
                       from data1 b1
                      where b.person_id=b1.person_id
                        and a.col2=b1.book_id)
        group by a.col1,a.col2;

I'm on my phone, so apologies for typos.

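SELECT
  -- share of book A buyers who also bought book B
  SUM(bought_b) * 100.0 / COUNT(*)   AS pct_bought_b_given_a
FROM
(
  SELECT
    person_id,
    -- ELSE 0 keeps the flags numeric, so the result is 0 rather than NULL when nobody bought B
    MAX(CASE WHEN book_id = 'A' THEN 1 ELSE 0 END)   AS bought_a,
    MAX(CASE WHEN book_id = 'B' THEN 1 ELSE 0 END)   AS bought_b
  FROM
    data
  WHERE
    book_id IN ('A', 'B')
  GROUP BY
    person_id
)
  person_stats
WHERE
  bought_a = 1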
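
Since the question says the IDs are integers, the same single-pair calculation could also be sketched with variables instead of the 'A' and 'B' placeholders. This is only an illustration under assumptions: the variable names @book_a and @book_b and the column alias are made up here, and the table is assumed to be the same data(person_id, book_id) used in the query above.

```sql
-- Hypothetical parameters (not from the original answer): the two book ids to compare.
DECLARE @book_a int = 1, @book_b int = 2;

SELECT
  -- buyers of A who also bought B, divided by all buyers of A;
  -- NULLIF returns NULL instead of raising a divide-by-zero error when nobody bought A
  COUNT(b.person_id) * 100.0 / NULLIF(COUNT(*), 0)   AS pct_bought_b_given_a
FROM
  (SELECT DISTINCT person_id FROM data WHERE book_id = @book_a)   AS a
LEFT JOIN
  (SELECT DISTINCT person_id FROM data WHERE book_id = @book_b)   AS b
    ON b.person_id = a.person_id;
```

The DISTINCT in each derived table means duplicate purchase rows, if any exist, do not inflate either count.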

Edit: I just saw that you want all combinations, not just a single pair.

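WITH
  book AS
(
  SELECT DISTINCT book_id FROM data
)
SELECT
  book_a_id,
  book_b_id,
  -- buyers of both books divided by buyers of book A
  bought_b * 100.0 / bought_a   AS pct_bought_b_given_a
FROM
(
  SELECT
    book_a.book_id    AS book_a_id,
    book_b.book_id    AS book_b_id,
    COUNT(DISTINCT data_a.person_id)    AS bought_a,   -- people who bought book A
    COUNT(DISTINCT data_b.person_id)    AS bought_b    -- of those, people who also bought book B
  FROM
    book    AS book_a
  CROSS JOIN
    book    AS book_b
  INNER JOIN
    data    AS data_a
      ON data_a.book_id = book_a.book_id
  LEFT JOIN
    data    AS data_b
      ON  data_b.book_id   = book_b.book_id
      AND data_b.person_id = data_a.person_id   -- must be the same person
  WHERE
    book_a.book_id <> book_b.book_id            -- skip pairing a book with itself
  GROUP BY
    book_a.book_id,
    book_b.book_id
)
  stats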

You can extend the logic to do this:

select a.book_id as B1, b.book_id as B2,
       count(b.book_id) as bought_second_book,
       count(b.book_id) * 1.0 / book_cnt as ratio_Bought_Together
from (select a.*, count(*) over (partition by a.book_id) as book_cnt
      from dbo.data a
     ) a left join
     dbo.data b
     on a.person_id = b.person_id and a.book_id <> b.book_id
group by a.book_id, b.book_id, a.book_cnt;

This assumes that each person buys a given book only once. If there are duplicates, then count(distinct) will adjust for that.
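
If duplicate purchase rows are possible, one way to handle them is to de-duplicate first and then apply the same idea. The following is a minimal sketch rather than the answer's own query: it assumes the same dbo.data(person_id, book_id) table, and the CTE names purchases and buyers and the output column names are made up for illustration. SQL Server does not allow COUNT(DISTINCT ...) as a window function, which is why the per-book buyer count is computed in a separate CTE here.

```sql
-- Assumption: dbo.data(person_id, book_id) may contain repeated (person, book) rows.
WITH purchases AS (
    -- one row per person and book, whatever the number of duplicates
    SELECT DISTINCT person_id, book_id
    FROM dbo.data
),
buyers AS (
    -- number of distinct people who bought each book (the denominator)
    SELECT book_id, COUNT(*) AS book_cnt
    FROM purchases
    GROUP BY book_id
)
SELECT a.book_id AS B1,
       b.book_id AS B2,
       COUNT(*) AS bought_both,
       -- of the people who bought B1, the percentage who also bought B2
       COUNT(*) * 100.0 / MAX(bc.book_cnt) AS pct_of_B1_buyers
FROM purchases a
JOIN purchases b
  ON a.person_id = b.person_id
 AND a.book_id <> b.book_id
JOIN buyers bc
  ON bc.book_id = a.book_id
GROUP BY a.book_id, b.book_id
ORDER BY B1, B2;
```

Because this sketch uses inner joins, pairs of books that were never bought by the same person do not appear in the output at all, rather than appearing with 0%.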

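Telling us the name and version of the database you use will help.

@CaiusJard Thanks for the reminder. Updated.

Damn, I just saw the edit. Do you have a table that lists all the books?

That's right. I have one long table in which all the IDs are integers. So the person with id 1 bought the books with ids 0, 1, 2, 7... followed by the next person's purchases... The table itself is 780k rows long. I am trying to work your answer into the one I am currently attempting.

I understand what your question is really asking, and maybe this will help me get a bit further. But... is there a dimension table that lists all the books, from which you could generate the combinations of interest?

Thanks. This is exactly what I wanted. In fact, I had a similar answer earlier, but I thought it was wrong.

This percentage is wrong. The OP wants to know, of the people who bought the first book, what percentage also bought the second book. The denominator depends on how many people bought the first book, not on the total number of people.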