Conditional probability in SQL
I think I have hit a bit of a dead end.
Suppose I have a fairly simple dataset with two columns: person_id and book_id. It is a plain fact table that says, for example, person X bought books A, B and C.
I know how to find out how many people bought book X and book Y together.
That is also where my brain decided to shut down. I know I probably need something like
count(b.person_id) / (everyone who bought book A) * 100
but I am not entirely sure.
I hope I have made myself clear.
EDIT1: I am currently using SQL Server 2017, so I assume the correct answer would be T-SQL?
Also, nobody can buy the same book more than once. In the end, the output format should look similar to this:
Book1 Book2 HowManyPeopleBoughtBook2
1 2 50%
1 3 7%
2 3 15%
2 1 40%
3 1 60%
3 2 20%
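EDIT2: Assume there are hundreds of thousands of rows in the database. Yes, this is related to a data science course I am taking, hence the large data volume.

If you want to generate all possible combinations of book pairs that were bought together, along with the percentage of people who bought each combination, the following may help: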
create table data1(book_id int, person_id int)
insert into data1
select *
from (values(1,300)
,(2,300)
,(2,301)
,(1,301)
,(3,301)
)t(book_id,person_id); /* the statement before a WITH clause must end with a semicolon in T-SQL */
with books
as (select distinct book_id
from data1 a
)
,tot_persons
as (select count(distinct person_id) as tot_cnt
from data1
)
,pairs
as (
select a.book_id as col1 /* This block generates all possible pair combinations of books*/
,b.book_id as col2
from books a
join books b
on a.book_id<b.book_id
)
select a.col1,a.col2
,count(b.person_id)*100.0/(select tot_cnt from tot_persons) as percent_of_persons_buying_both /* 100.0 avoids integer division */
from pairs a
join data1 b
on a.col1=b.book_id
where exists(select 1
from data1 b1
where b.person_id=b1.person_id
and a.col2=b1.book_id)
group by a.col1,a.col2
I am on my phone, so apologies for any typos.
SELECT
SUM(bought_b) * 100.0 / COUNT(*)
FROM
(
SELECT
person_id,
MAX(CASE WHEN book_id = 'A' THEN 1 END) AS bought_a,
MAX(CASE WHEN book_id = 'B' THEN 1 END) AS bought_b
FROM
data
WHERE
book_id IN ('A', 'B')
GROUP BY
person_id
)
person_stats
WHERE
bought_a = 1
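To check the single-pair query above without a SQL Server instance, here is a minimal sketch that runs the same statement through Python's built-in `sqlite3` module. Using SQLite instead of SQL Server is an assumption made only to keep the example self-contained, and the rows in `PURCHASES` as well as the helper name `percent_b_given_a` are hypothetical, not taken from the question.

```python
import sqlite3

# Hypothetical purchases as (person_id, book_id); not data from the question.
PURCHASES = [
    ("p1", "A"), ("p1", "B"),
    ("p2", "A"),
    ("p3", "B"),
    ("p4", "A"), ("p4", "B"), ("p4", "C"),
]

# Same conditional-aggregation query as in the answer, unchanged in logic.
QUERY = """
SELECT SUM(bought_b) * 100.0 / COUNT(*)
FROM (
    SELECT person_id,
           MAX(CASE WHEN book_id = 'A' THEN 1 END) AS bought_a,
           MAX(CASE WHEN book_id = 'B' THEN 1 END) AS bought_b
    FROM data
    WHERE book_id IN ('A', 'B')
    GROUP BY person_id
) AS person_stats
WHERE bought_a = 1
"""


def percent_b_given_a(purchases):
    """Return the percentage of buyers of book A who also bought book B."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (person_id TEXT, book_id TEXT)")
    conn.executemany("INSERT INTO data VALUES (?, ?)", purchases)
    (pct,) = conn.execute(QUERY).fetchone()
    conn.close()
    return pct


if __name__ == "__main__":
    print(f"{percent_b_given_a(PURCHASES):.1f}%")
```

In this made-up data three people bought A (p1, p2, p4) and two of them also bought B, so the result is two thirds, roughly 66.7%. Person p3 bought only B and is filtered out by `bought_a = 1`, which is what makes the denominator "buyers of A" rather than "everyone".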
EDIT: I just saw that you want all the combinations, not just a single pair.
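WITH
book AS
(
SELECT DISTINCT book_id FROM data
)
SELECT
book_a_id,
book_b_id,
bought_b * 100.0 / bought_a  -- share of book A's buyers who also bought book B
FROM
(
SELECT
book_a.book_id AS book_a_id,
book_b.book_id AS book_b_id,
COUNT(DISTINCT data_a.person_id) AS bought_a,  -- everyone who bought book A
COUNT(DISTINCT data_b.person_id) AS bought_b   -- those of them who also bought book B
FROM
book AS book_a
CROSS JOIN
book AS book_b
INNER JOIN
data AS data_a
ON data_a.book_id = book_a.book_id
LEFT JOIN
data AS data_b
ON data_b.book_id = book_b.book_id
AND data_b.person_id = data_a.person_id  -- must be the same person
WHERE
book_a.book_id <> book_b.book_id  -- skip pairing a book with itself
GROUP BY
book_a.book_id,
book_b.book_id
)
stats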
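As a cross-check of the all-pairs logic, here is a minimal sketch that computes, for every ordered pair of books, the share of the first book's buyers who also bought the second. It runs on the five sample rows from the first answer. It again uses Python's built-in `sqlite3` rather than SQL Server (an assumption made for portability), and the query is a self-join variant written for this sketch, not the answer's exact statement. The name `conditional_percentages` is hypothetical.

```python
import sqlite3

# Sample rows from the first answer, as (book_id, person_id).
ROWS = [(1, 300), (2, 300), (2, 301), (1, 301), (3, 301)]

# For each ordered pair (book1, book2): people who bought both,
# divided by the number of people who bought book1.
QUERY = """
SELECT a.book_id AS book1,
       b.book_id AS book2,
       COUNT(DISTINCT b.person_id) * 100.0 / t.buyers AS pct
FROM data1 AS a
JOIN data1 AS b
  ON b.person_id = a.person_id
 AND b.book_id <> a.book_id
JOIN (SELECT book_id, COUNT(DISTINCT person_id) AS buyers
      FROM data1
      GROUP BY book_id) AS t
  ON t.book_id = a.book_id
GROUP BY a.book_id, b.book_id, t.buyers
ORDER BY a.book_id, b.book_id
"""


def conditional_percentages(rows):
    """Return (book1, book2, percent) tuples for pairs with a common buyer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data1 (book_id INTEGER, person_id INTEGER)")
    conn.executemany("INSERT INTO data1 VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for book1, book2, pct in conditional_percentages(ROWS):
        print(book1, book2, f"{pct:.0f}%")
```

Because this variant uses inner joins, a pair that nobody bought together simply does not appear in the result. If 0% rows are needed, a cross join of the distinct books with a left join (as in the query above) would be the design to prefer.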
You can extend the logic to do this:
select a.book_id as B1, b.book_id as B2,
count(b.book_id) as bought_second_book,
count(b.book_id) * 1.0 / book_cnt as ratio_Bought_Together
from (select a.*, count(*) over (partition by a.book_id) as book_cnt
from dbo.data a
) a left join
dbo.data b
on a.person_id = b.person_id and a.book_id <> b.book_id
group by a.book_id, b.book_id, a.book_cnt;
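This assumes that each person buys a given book only once. If there are duplicates, count(distinct) would adjust for that.

Telling us the name and version of the database you are using would help.
@CaiusJard Thanks for the reminder. Updated.
Damn, just saw the edit. Do you have a table that lists all the books?
That's right. I have one long table in which all the IDs are integers, so the person with id 1 bought the books with ids 0, 1, 2, 7..., followed by the next person and the books they bought... The table itself is 780k rows long.
I am trying to work your answer into what I am currently attempting. I understand what your question really means, and maybe it can help me get further.
But... is there a dimension table that lists all the books, from which you could generate the combinations of interest?
Thank you. This is exactly what I wanted. In fact, I had a similar answer earlier, but I thought it was wrong.
This percentage is wrong. The OP wants to know what percentage of the people who bought the first book also bought the second. The denominator depends on how many people bought the first book, not on the total number of people.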
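The point about the denominator can be written out as a formula. The notation is an assumption introduced here for illustration: buyers(X) stands for the set of distinct person_id values that bought book X.

```latex
\[
P(\text{Book2} \mid \text{Book1})
  = \frac{\lvert \operatorname{buyers}(\text{Book1}) \cap \operatorname{buyers}(\text{Book2}) \rvert}
         {\lvert \operatorname{buyers}(\text{Book1}) \rvert}
  \times 100\%
\]
```

Dividing by the total number of distinct people instead gives the joint share of customers who bought both books, which is a different quantity and is symmetric in the two books, whereas the conditional percentage generally is not.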