MySQL/PHP: Find similar/related items by tag/taxonomy
I have a cities table like this:
|id| Name |
|1 | Paris |
|2 | London |
|3 | New York|
I have a tags table like this:
|id| tag |
|1 | Europe |
|2 | North America |
|3 | River |
and a cities_tags table:
|id| city_id | tag_id |
|1 | 1 | 1 |
|2 | 1 | 3 |
|3 | 2 | 1 |
|4 | 2 | 3 |
|5 | 3 | 2 |
|6 | 3 | 3 |
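How do I calculate which cities are most closely related? For example, if I am looking at city 1 (Paris), the results should be: London (2), New York (3).
I have found a possible approach, but I am not sure how best to implement it. Would this be the right direction?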
SELECT c.name, (
    -- count the tags this city shares with the chosen city (id 1)
    SELECT COUNT(*)
    FROM cities_tags ct
    WHERE ct.city_id = c.id
      AND ct.tag_id IN (
        SELECT tag_id
        FROM cities_tags
        WHERE city_id = 1
    )
) AS matchCount
FROM cities c
WHERE c.id != 1
HAVING matchCount > 0
ORDER BY matchCount DESC
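What I tried is:
// Find city names: select cities.name where (subquery) as matchCount > 0
// Subquery: count the tags that the city also has (sub-subquery)
// Sub-subquery: select the ids of the tags that the original city has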
select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c
inner join
(
select city_id, count(*) as val
from cities_tags
where tag_id in (select tag_id from cities_tags where city_id=1)
and not city_id in (1)
group by city_id
) as cnt
on c.id=cnt.city_id
order by jaccard_index desc
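This query hard-codes city_id=1, so you will need to turn that value into a variable in both the where tag_id in (...) clause and the not city_id in (...) clause.
If I understand the Jaccard index correctly, this also returns the values ordered by "most closely related". The results for our example look like this: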
|name |jaccard_index |
|London |0.6667 |
|New York |0.3333 |
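As a rough sketch of turning the hard-coded city id into a variable, the snippet below binds it as a query parameter. It is only an illustration: it uses Python's sqlite3 module with an in-memory SQLite database as a stand-in for MySQL (an assumption, not the original setup), the function name related_by_shared_tags is hypothetical, and the CAST is there only because SQLite divides integers as integers.

```python
import sqlite3

def related_by_shared_tags(conn, city_id):
    """Return (name, score) rows for cities that share tags with city_id.

    Hypothetical helper: the score is shared-tag count / total city count,
    mirroring the query above. The city id is bound twice via placeholders.
    """
    sql = """
        SELECT c.name,
               CAST(cnt.val AS REAL) / (SELECT COUNT(*) FROM cities) AS score
        FROM cities c
        INNER JOIN (
            SELECT city_id, COUNT(*) AS val
            FROM cities_tags
            WHERE tag_id IN (SELECT tag_id FROM cities_tags WHERE city_id = ?)
              AND city_id NOT IN (?)
            GROUP BY city_id
        ) AS cnt ON c.id = cnt.city_id
        ORDER BY score DESC
    """
    return conn.execute(sql, (city_id, city_id)).fetchall()

# Sample data copied from the question (SQLite here, assumed in place of MySQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities_tags (id INTEGER PRIMARY KEY, city_id INTEGER, tag_id INTEGER);
    INSERT INTO cities VALUES (1, 'Paris'), (2, 'London'), (3, 'New York');
    INSERT INTO cities_tags VALUES
        (1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 3, 3);
""")

rows = related_by_shared_tags(conn, 1)
for name, score in rows:
    print(name, round(score, 4))
```

In PHP the same idea would apply with a prepared statement and bound parameters rather than string concatenation.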
select jaccard.city,
jaccard.intersect,
jaccard.union,
jaccard.intersect/jaccard.union as 'jaccard index'
from
(select
c2.name as city
,count(ct2.tag_id) as 'intersect'
,(select count(distinct ct3.tag_id)
from cities_tags ct3
where ct3.city_id in(c1.id, c2.id)) as 'union'
from
cities as c1
inner join cities as c2 on c1.id != c2.id
left join cities_tags as ct1 on ct1.city_id = c1.id
left join cities_tags as ct2 on ct2.city_id = c2.id and ct1.tag_id = ct2.tag_id
where c1.id = 1
group by c1.id, c2.id) as jaccard
order by jaccard.intersect/jaccard.union desc
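Edit: a better understanding of how to implement the Jaccard index.
After reading more about the Jaccard index on Wikipedia, I came up with a better way to implement the query for the example data set. Essentially, we compare the chosen city with each other city in the list independently, and divide the count of common tags by the count of distinct tags across the two cities.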
select c.name,
case -- when this city's tags are a subset of the chosen city's tags
when not_in.cnt is null
then -- then the union count is the chosen city's tag count
intersection.cnt/(select count(tag_id) from cities_tags where city_id=1)
else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1))
end as jaccard_index
-- Jaccard index is defined as the size of the intersection of a dataset, divided by the size of the union of a dataset
from cities c
inner join
(
-- select the count of tags for each city that match our chosen city
select city_id, count(*) as cnt
from cities_tags
where tag_id in (select tag_id from cities_tags where city_id=1)
and city_id!=1
group by city_id
) as intersection
on c.id=intersection.city_id
left join
(
-- select the count of tags for each city that are not in our chosen city's tag list
select city_id, count(tag_id) as cnt
from cities_tags
where city_id!=1
and not tag_id in (select tag_id from cities_tags where city_id=1)
group by city_id
) as not_in
on c.id=not_in.city_id
order by jaccard_index desc
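This query is somewhat long, and I don't know how well it scales, but it does implement a true Jaccard index, as requested in the question. Here are the results of the new query: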
+----------+---------------+
| name | jaccard_index |
+----------+---------------+
| London | 1.0000 |
| New York | 0.3333 |
+----------+---------------+
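For comparison, here is a shorter sketch of the same intersection-over-union idea, computing the union as tags(chosen) + tags(other) - shared. It is an assumption-laden illustration rather than the query above: it runs on SQLite through Python's sqlite3 module instead of MySQL, it assumes each (city_id, tag_id) pair appears only once, and the names JACCARD_SQL, mine and other are made up for the example.

```python
import sqlite3

# Assumed-equivalent formulation: |A ∩ B| / (|A| + |B| - |A ∩ B|).
JACCARD_SQL = """
    SELECT c.name,
           CAST(COUNT(mine.tag_id) AS REAL)
             / ((SELECT COUNT(*) FROM cities_tags WHERE city_id = :city)
                + COUNT(other.tag_id) - COUNT(mine.tag_id)) AS jaccard_index
    FROM cities c
    JOIN cities_tags other ON other.city_id = c.id
    LEFT JOIN cities_tags mine
           ON mine.city_id = :city AND mine.tag_id = other.tag_id
    WHERE c.id != :city
    GROUP BY c.id, c.name
    HAVING COUNT(mine.tag_id) > 0      -- keep only cities sharing a tag
    ORDER BY jaccard_index DESC
"""

# Sample data copied from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities_tags (id INTEGER PRIMARY KEY, city_id INTEGER, tag_id INTEGER);
    INSERT INTO cities VALUES (1, 'Paris'), (2, 'London'), (3, 'New York');
    INSERT INTO cities_tags VALUES
        (1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 3, 3);
""")

jaccard_rows = conn.execute(JACCARD_SQL, {"city": 1}).fetchall()
for name, index in jaccard_rows:
    print(name, round(index, 4))
```

Whether this form scales better than the longer query would need to be measured on real data; it is shown only to make the arithmetic visible.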
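Edited again to add comments to the query and to account for the case where the current city's tags are a subset of the chosen city's tags.
Regarding "For example, if I were looking at city 1 (Paris), the results should be: London (2), New York (3)": in the data set you provided there is only one thing to relate on, namely the tags that cities have in common, so the cities that share tags are the closest ones. Below is a query that uses subqueries to find the cities (other than the one supplied) that share common tags with it: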
SELECT * FROM `cities` WHERE id IN (
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )
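How it works
I assume you will supply a city id or name to find its closest cities; in my case, Paris has id 1.
SELECT tag_id FROM `cities_tags` WHERE city_id=1
This finds all the tag ids that Paris has. Then
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1
fetches all cities except Paris that share at least one tag with Paris.
Here is your fiddle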
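While reading about Jaccard similarity/index, I found some material that explains what the term actually means. Let's take this example, where we have two sets A and B:
Set A = {A, B, C, D, E}
Set B = {I, H, G, F, E, D}
The formula for Jaccard similarity is JS = (A intersect B) / (A union B)
A intersect B = {D, E} = 2
A union B = {A, B, C, D, E, I, H, G, F} = 9
JS = 2/9 = 0.2222
Now on to your scenario.
Paris has tag ids 1, 3, so we make a set from these and call it
Set P = {Europe, River}
London has tag ids 1, 3, so we make a set and call it
Set L = {Europe, River}
New York has tag ids 2, 3, so we make a set and call it
Set NW = {North America, River}
Calculate JS of Paris with London: JSPL = (P intersect L) / (P union L),
JSPL = 2/2 = 1
Calculate JS of Paris with New York: JSPNW = (P intersect NW) / (P union NW),
JSPNW = 1/3 = 0.3333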
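The arithmetic above can be checked with a few lines of code. This is just a sketch using Python sets; the jaccard function name is made up for the example, and returning 0.0 for two empty sets is an assumed convention, not something the answer defines.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:          # assumption: two empty sets score 0.0
        return 0.0
    return len(a & b) / len(union)

# The generic example: intersection {D, E}, union of 9 elements.
A = {"A", "B", "C", "D", "E"}
B = {"I", "H", "G", "F", "E", "D"}
js_ab = jaccard(A, B)

# The city example, using the tag names from the question.
paris = {"Europe", "River"}
london = {"Europe", "River"}
new_york = {"North America", "River"}
js_pl = jaccard(paris, london)
js_pnw = jaccard(paris, new_york)

print(round(js_ab, 4), round(js_pl, 4), round(js_pnw, 4))
```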
Here is the query so far, which includes the Jaccard index; you can see the fiddle example below.
SELECT a.*,
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index
FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` ,
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT GROUP_CONCAT(tag_id SEPARATOR ',') FROM `cities_tags` WHERE city_id= 1)AS parisset
FROM `cities_tags`
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC
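In the query above, I derived the result set through two sub-selects so that the aliases are available for the custom calculation.
You can add a filter to the query above so that the similarity of the city with itself is not calculated: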
SELECT a.*,
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index
FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` ,
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT GROUP_CONCAT(tag_id SEPARATOR ',') FROM `cities_tags` WHERE city_id= 1)AS parisset
FROM `cities_tags`
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`) WHERE cities.`id` !=1
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC
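So the result shows that Paris is most closely related to London, and then to New York.
This query uses no fancy functions, not even subqueries. It is fast. Just make sure cities.id, cities_tags.id, cities_tags.city_id and cities_tags.tag_id are indexed.
The query returns city1, city2, and the count of how many tags city1 and city2 have in common.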
select
c1.name as city1
,c2.name as city2
,count(ct2.tag_id) as match_count
from
cities as c1
inner join cities as c2 on
c1.id != c2.id -- change != into > if you dont want duplicates
left join cities_tags as ct1 on -- use inner join to filter cities with no match
ct1.city_id = c1.id
left join cities_tags as ct2 on -- use inner join to filter cities with no match
ct2.city_id = c2.id
and ct1.tag_id = ct2.tag_id
group by
c1.id
,c2.id
order by
c1.id
,match_count desc
,c2.id
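Change != to > to avoid returning each pair of cities twice. This means a city will no longer appear once in the first column and once in the second.
If you don't want to see city combinations with no tag matches, change the two left joins to inner joins.
Late to the party, but I think none of the answers is fully correct. I took the best part of each answer and put everything together to come up with my own:
- @m-khalid-junaid's explanation is very interesting and correct, but the implementation of (q.sets + q.parisset) AS union and (q.sets - q.parisset) AS intersect is very wrong.
- @n-lx's version is the right approach, but it needs the Jaccard index, which matters a lot: if a city with 2 tags matches two tags of another city that has 3 tags, the result is the same as a match with a city that has exactly the same two tags. I think the full match is the most relevant.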
With a cities table like this:
| id | Name |
| 1 | Paris |
| 2 | Florence |
| 3 | New York |
| 4 | São Paulo |
| 5 | London |
and a cities_tags table like this:
| city_id | tag_id |
| 1 | 1 |
| 1 | 3 |
| 2 | 1 |
| 2 | 3 |
| 3 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 1 |
| 5 | 2 |
| 5 | 3 |
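To make the difference between a plain match count and the Jaccard index concrete on these two tables, here is a small sketch. It is only an illustration in Python rather than SQL; the rank function and the tags_by_city mapping are hypothetical names, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import defaultdict

# Rows copied from the sample cities and cities_tags tables above.
cities = {1: "Paris", 2: "Florence", 3: "New York", 4: "São Paulo", 5: "London"}
cities_tags = [(1, 1), (1, 3), (2, 1), (2, 3), (3, 1),
               (3, 2), (4, 2), (5, 1), (5, 2), (5, 3)]

# Group tag ids per city so each city becomes a set of tags.
tags_by_city = defaultdict(set)
for city_id, tag_id in cities_tags:
    tags_by_city[city_id].add(tag_id)

def rank(chosen_id):
    """Return (name, match_count, jaccard) for every other city, best first."""
    chosen = tags_by_city[chosen_id]
    rows = []
    for city_id, name in cities.items():
        if city_id == chosen_id:
            continue
        other = tags_by_city[city_id]
        match_count = len(chosen & other)
        union = len(chosen | other)
        jaccard = match_count / union if union else 0.0
        rows.append((name, match_count, round(jaccard, 4)))
    # Sort by Jaccard index descending, then by name for stable ties.
    return sorted(rows, key=lambda r: (-r[2], r[0]))

ranking = rank(1)  # city 1 is Paris
for row in ranking:
    print(row)
```

Two cities can have the same match count against Paris while differing in Jaccard index, because the index also penalizes tags that are not shared.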
According to this sample data, Florence has a full matc