MySQL/PHP: Finding similar/related items by tag/taxonomy


I have a cities table like this:

|id| Name    |
|1 | Paris   |
|2 | London  |
|3 | New York|

I have a tags table like this:

|id| tag            |
|1 | Europe         |
|2 | North America  |   
|3 | River          |

And a cities_tags table:

|id| city_id | tag_id |
|1 | 1       | 1      | 
|2 | 1       | 3      | 
|3 | 2       | 1      |
|4 | 2       | 3      | 
|5 | 3       | 2      |     
|6 | 3       | 3      |
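
To try the queries in this thread, the three tables above can be created and filled as follows. This is only a sketch: the column types, the primary keys and the unique key uq_city_tag are assumptions, since the question shows only the data.

```sql
-- Assumed schema; only the table names, column names and rows come from the question.
CREATE TABLE cities (
    id   INT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE tags (
    id  INT UNSIGNED NOT NULL PRIMARY KEY,
    tag VARCHAR(100) NOT NULL
);

CREATE TABLE cities_tags (
    id      INT UNSIGNED NOT NULL PRIMARY KEY,
    city_id INT UNSIGNED NOT NULL,
    tag_id  INT UNSIGNED NOT NULL,
    -- assumption: a city carries each tag at most once
    UNIQUE KEY uq_city_tag (city_id, tag_id)
);

INSERT INTO cities (id, name) VALUES
    (1, 'Paris'), (2, 'London'), (3, 'New York');

INSERT INTO tags (id, tag) VALUES
    (1, 'Europe'), (2, 'North America'), (3, 'River');

INSERT INTO cities_tags (id, city_id, tag_id) VALUES
    (1, 1, 1), (2, 1, 3),
    (3, 2, 1), (4, 2, 3),
    (5, 3, 2), (6, 3, 3);
```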

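How do I calculate which cities are most closely related? For example, if I were looking at city 1 (Paris), the results should be: London (2), New York (3).

I have found the Jaccard index, but I'm unsure how best to implement it.

Would this be a step in the right direction?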

SELECT c.name, (
           -- number of tags this city shares with the chosen city (id = 1)
           SELECT COUNT(*)
           FROM cities_tags ct
           WHERE ct.city_id = c.id
             AND ct.tag_id IN (
                     -- tags of the chosen city
                     SELECT ref.tag_id
                     FROM cities_tags ref
                     WHERE ref.city_id = 1
                 )
       ) AS matchCount
FROM cities c
WHERE c.id <> 1
HAVING matchCount > 0
ORDER BY matchCount DESC

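What I tried is:

// find the city names:
Get cities.name (SUBQUERY) as matchCount FROM cities WHERE matchCount > 0

// subquery:
select the number of tags the city has which (SUBSUBQUERY) also has

// sub-subquery:
select the ids of the tags the original city has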

select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c 
inner join 
  (
  select city_id, count(*) as val 
  from cities_tags 
  where tag_id in (select tag_id from cities_tags where city_id=1) 
  and not city_id in (1)
  group by city_id
  ) as cnt 
on c.id=cnt.city_id
order by jaccard_index desc

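This query statically refers to city_id=1, so you would have to make that a variable in both the where tag_id in clause and the not city_id in clause.

If I understood the Jaccard index properly, it also returns that value ordered by "most closely related". The results for our example look like this: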

|name      |jaccard_index  |
|London    |0.6667         |
|New York  |0.3333         |
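
The point about city_id=1 being hard-coded can be illustrated with a server-side prepared statement. This is a minimal sketch, assuming MySQL's PREPARE/EXECUTE syntax; the statement name related_cities and the user variable @city_id are made up for illustration. From PHP, the same two placeholders would normally be bound through the database driver instead.

```sql
-- Same query as above, with both occurrences of the city id turned into placeholders.
PREPARE related_cities FROM '
  select c.name, cnt.val / (select count(*) from cities) as jaccard_index
  from cities c
  inner join (
    select city_id, count(*) as val
    from cities_tags
    where tag_id in (select tag_id from cities_tags where city_id = ?)
      and city_id <> ?
    group by city_id
  ) as cnt on c.id = cnt.city_id
  order by jaccard_index desc';

-- The chosen city is supplied once per placeholder.
SET @city_id = 1;
EXECUTE related_cities USING @city_id, @city_id;

DEALLOCATE PREPARE related_cities;
```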

select jaccard.city,
       jaccard.`intersect`,
       jaccard.`union`,
       jaccard.`intersect`/jaccard.`union` as `jaccard index`
from
(select
    c2.name as city
    ,count(ct2.tag_id) as `intersect`
    ,(select count(distinct ct3.tag_id)
      from cities_tags ct3
      where ct3.city_id in(c1.id, c2.id)) as `union`
from
    cities as c1
    inner join cities as c2 on c1.id != c2.id
    left join cities_tags as ct1 on ct1.city_id = c1.id
    left join cities_tags as ct2 on ct2.city_id = c2.id and ct1.tag_id = ct2.tag_id
where c1.id = 1
group by c1.id, c2.id) as jaccard
order by jaccard.`intersect`/jaccard.`union` desc

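Edit, with a better understanding of how to implement the Jaccard index:

After reading more about the Jaccard index on Wikipedia, I came up with a better way to implement the query for the example data set. Essentially, we compare our chosen city with each other city in the list independently, and divide the count of common tags by the count of distinct tags across the two cities.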

select c.name, 
  case -- when this city's tags are a subset of the chosen city's tags
    when not_in.cnt is null 
  then -- then the union count is the chosen city's tag count
    intersection.cnt/(select count(tag_id) from cities_tags where city_id=1) 
  else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
    intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1)) 
  end as jaccard_index
  -- Jaccard index is defined as the size of the intersection of a dataset, divided by the size of the union of a dataset
from cities c 
inner join 
  (
    --  select the count of tags for each city that match our chosen city
    select city_id, count(*) as cnt 
    from cities_tags 
    where tag_id in (select tag_id from cities_tags where city_id=1) 
    and city_id!=1
    group by city_id
  ) as intersection
on c.id=intersection.city_id
left join
  (
    -- select the count of tags for each city that are not in our chosen city's tag list
    select city_id, count(tag_id) as cnt
    from cities_tags
    where city_id!=1
    and not tag_id in (select tag_id from cities_tags where city_id=1)
    group by city_id
  ) as not_in
on c.id=not_in.city_id
order by jaccard_index desc

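This query is a bit long and I don't know how well it scales, but it does implement a true Jaccard index, as requested in the question. Here are the results of the new query: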

+----------+---------------+
| name     | jaccard_index |
+----------+---------------+
| London   |        1.0000 |
| New York |        0.3333 |
+----------+---------------+
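
The two CASE branches differ only in whether not_in.cnt is NULL, so they can be folded into one expression with COALESCE. This is a sketch under the same assumptions as the query above (chosen city id 1, tag_id never NULL); the derived table alias chosen is introduced here for illustration.

```sql
SELECT c.name,
       -- union size = chosen city's tag count + tags the other city has beyond those
       intersection.cnt / (COALESCE(not_in.cnt, 0) + chosen.cnt) AS jaccard_index
FROM cities c
INNER JOIN (
    -- tags each other city shares with the chosen city
    SELECT city_id, COUNT(*) AS cnt
    FROM cities_tags
    WHERE tag_id IN (SELECT tag_id FROM cities_tags WHERE city_id = 1)
      AND city_id <> 1
    GROUP BY city_id
) AS intersection ON intersection.city_id = c.id
LEFT JOIN (
    -- tags each other city has that the chosen city lacks
    SELECT city_id, COUNT(*) AS cnt
    FROM cities_tags
    WHERE tag_id NOT IN (SELECT tag_id FROM cities_tags WHERE city_id = 1)
      AND city_id <> 1
    GROUP BY city_id
) AS not_in ON not_in.city_id = c.id
CROSS JOIN (
    -- tag count of the chosen city, computed once
    SELECT COUNT(*) AS cnt FROM cities_tags WHERE city_id = 1
) AS chosen
ORDER BY jaccard_index DESC;
```

As in the original, cities that share no tag with the chosen city are dropped by the inner join rather than listed with an index of 0.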

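Edited again to add comments to the query and to account for the case where the current city's tags are a subset of the chosen city's tags.

"For example, if I were looking at city 1 (Paris), the results should be: London (2), New York (3)"

According to the data set you provided, there is only one thing to relate on, namely the tags the cities have in common, so the cities that share tags are the closest ones. Below is the subquery that finds the cities (other than the one whose closest cities we are looking for) that share common tags: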

SELECT * FROM `cities`  WHERE id IN (
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )

How it works: I assume you will input a city id or name to find its closest cities; in my example, "Paris" has id 1.

 SELECT tag_id FROM `cities_tags` WHERE city_id=1

This finds all the tag ids that Paris has. Then

SELECT city_id FROM `cities_tags` WHERE tag_id IN (
    SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1

fetches all the cities except Paris that have at least one of the tags Paris has.


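While reading about Jaccard similarity/index, I found some material that explains what the terms actually mean. Let's take this example: we have two sets, A and B.

Set A = {A, B, C, D, E}

Set B = {I, H, G, F, E, D}

The formula for Jaccard similarity is JS = (A intersect B) / (A union B)

A intersect B = {D, E} = 2

A union B = {A, B, C, D, E, I, H, G, F} = 9

JS = 2/9 = 0.2222

Now move to your scenario.

Paris has tag_ids 1,3, so we make a set of these and call it set P = {Europe, River}

London has tag_ids 1,3, so we make a set and call it set L = {Europe, River}

New York has tag_ids 2,3, so we make a set and call it set NW = {North America, River}

Calculating the JS of Paris with London: JSPL = P intersect L / P union L, JSPL = 2/2 = 1

Calculating the JS of Paris with New York: JSPNW = P intersect NW / P union NW, JSPNW = 1/3 = 0.3333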

Here is the query so far, which calculates the Jaccard index:

SELECT a.*,
       ( (CASE WHEN a.`intersect` = 0 THEN a.`union` ELSE a.`intersect` END) / a.`union` ) AS jaccard_index
FROM (
    SELECT q.*,
           (q.sets + q.parisset) AS `union`,
           (q.sets - q.parisset) AS `intersect`
    FROM (
        SELECT cities.`id`,
               cities.`name`,
               GROUP_CONCAT(tag_id SEPARATOR ',') sets,
               (SELECT GROUP_CONCAT(tag_id SEPARATOR ',')
                FROM `cities_tags`
                WHERE city_id = 1) AS parisset
        FROM `cities_tags`
        LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
        GROUP BY city_id
    ) q
) a
ORDER BY jaccard_index DESC

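In the query above I have wrapped the result set in two subselects in order to get my custom calculated aliases.

You can add a filter to the query above so that a city's similarity with itself is not calculated: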

SELECT a.*,
       ( (CASE WHEN a.`intersect` = 0 THEN a.`union` ELSE a.`intersect` END) / a.`union` ) AS jaccard_index
FROM (
    SELECT q.*,
           (q.sets + q.parisset) AS `union`,
           (q.sets - q.parisset) AS `intersect`
    FROM (
        SELECT cities.`id`,
               cities.`name`,
               GROUP_CONCAT(tag_id SEPARATOR ',') sets,
               (SELECT GROUP_CONCAT(tag_id SEPARATOR ',')
                FROM `cities_tags`
                WHERE city_id = 1) AS parisset
        FROM `cities_tags`
        LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
        WHERE cities.`id` != 1
        GROUP BY city_id
    ) q
) a
ORDER BY jaccard_index DESC

So the result shows that Paris is most closely related to London and then to New York.
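
The intersection-over-union arithmetic from the worked example can also be written with plain counts instead of concatenated id strings. This is a minimal sketch, assuming each (city_id, tag_id) pair appears at most once in cities_tags and using the identity |A union B| = |A| + |B| - |A intersect B|; the aliases ct, p, intersect_cnt and union_cnt are illustrative.

```sql
SELECT c.name,
       -- tags the other city shares with Paris (city 1)
       COUNT(p.tag_id) AS intersect_cnt,
       -- |Paris tags| + |other city's tags| - |shared tags|
       (SELECT COUNT(*) FROM cities_tags WHERE city_id = 1)
           + COUNT(*) - COUNT(p.tag_id) AS union_cnt,
       COUNT(p.tag_id)
           / ((SELECT COUNT(*) FROM cities_tags WHERE city_id = 1)
              + COUNT(*) - COUNT(p.tag_id)) AS jaccard_index
FROM cities c
-- one row per tag of the other city
JOIN cities_tags ct ON ct.city_id = c.id
-- matched only when Paris carries the same tag
LEFT JOIN cities_tags p ON p.city_id = 1 AND p.tag_id = ct.tag_id
WHERE c.id <> 1
GROUP BY c.id, c.name
ORDER BY jaccard_index DESC;
```

With the question's data this gives 2/2 for London and 1/3 for New York, matching the hand calculation. Cities with no tags at all do not appear, because of the inner join to cities_tags.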

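This query has no fancy functions, not even subqueries. It is fast. Just make sure cities.id, cities_tags.id, cities_tags.city_id and cities_tags.tag_id are indexed.
The query returns a result containing: city1, city2 and the count of how many tags city1 and city2 have in common.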

select
    c1.name as city1
    ,c2.name as city2
    ,count(ct2.tag_id) as match_count
from
    cities as c1
    inner join cities as c2 on
        c1.id != c2.id              -- change != into > if you dont want duplicates
    left join cities_tags as ct1 on -- use inner join to filter cities with no match
        ct1.city_id = c1.id
    left join cities_tags as ct2 on -- use inner join to filter cities with no match
        ct2.city_id = c2.id
        and ct1.tag_id = ct2.tag_id
group by
    c1.id
    ,c2.id
order by
    c1.id
    ,match_count desc
    ,c2.id

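Change != into > to avoid each pair of cities being returned twice. A city will then no longer appear once in the first column and once again in the second column.

If you don't want to see city combinations that have no tag in common, change the two left joins into inner joins.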
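
The indexing advice above could be applied along these lines. This is a sketch only: the index names are invented, and it assumes cities.id and cities_tags.id are already primary keys (and therefore indexed) while city_id and tag_id have no index yet.

```sql
-- Index the two join columns of the link table (names are illustrative).
CREATE INDEX idx_cities_tags_city ON cities_tags (city_id);
CREATE INDEX idx_cities_tags_tag  ON cities_tags (tag_id);

-- EXPLAIN shows whether the joins actually use the new indexes.
EXPLAIN
SELECT c1.name AS city1, c2.name AS city2, COUNT(ct2.tag_id) AS match_count
FROM cities AS c1
INNER JOIN cities AS c2 ON c1.id != c2.id
LEFT JOIN cities_tags AS ct1 ON ct1.city_id = c1.id
LEFT JOIN cities_tags AS ct2 ON ct2.city_id = c2.id AND ct1.tag_id = ct2.tag_id
GROUP BY c1.id, c2.id;
```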
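
This is late, but I think none of the answers is fully correct. I took the best part of each answer and put everything together to make my own answer:

@m-khalid-junaid's explanation is very interesting and correct, but the implementation of (q.sets + q.parisset) as union and (q.sets - q.parisset) as intersect is very wrong.
@n-lx's version is the way to go, but it needs the Jaccard index. This is very important: if a city has 2 tags and both of them match a city that has 3 tags, the result is the same as a match with another city that has exactly those two tags. I think full matches are the most relevant.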
My answer:

A cities table like this:

| id | Name      |
| 1  | Paris     |
| 2  | Florence  |
| 3  | New York  |
| 4  | São Paulo |
| 5  | London    |

A cities_tags table like this:

| city_id | tag_id |
| 1       | 1      |
| 1       | 3      |
| 2       | 1      |
| 2       | 3      |
| 3       | 1      |
| 3       | 2      |
| 4       | 2      |
| 5       | 1      |
| 5       | 2      |
| 5       | 3      |

According to this sample data, Florence has a complete match…