MySQL/PHP: Find similar/related items by tag/taxonomy

I have a cities table that looks like this:

|id| Name    |
|1 | Paris   |
|2 | London  |
|3 | New York|

I have a tags table that looks like this:

|id| tag            |
|1 | Europe         |
|2 | North America  |   
|3 | River          |

and a cities_tags table:

|id| city_id | tag_id |
|1 | 1       | 1      | 
|2 | 1       | 3      | 
|3 | 2       | 1      |
|4 | 2       | 3      | 
|5 | 3       | 2      |     
|6 | 3       | 3      |
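
How do I calculate which city is most closely related? For example, if I were looking at city 1 (Paris), the results should be: London (2), New York (3).

I have found the Jaccard index, but I am not sure how best to implement it.

Would this be the right direction?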

SELECT cities.name, ( 
                    SELECT cities.id FROM cities
                    JOIN cities_tags ON cities.id=cities_tags.city_id
                    WHERE tags.id IN(
                                     SELECT cities_tags.tag_id
                                     FROM cities_tags
                                     WHERE cities_tags.city_id=cities.id
                                     )
                    GROUP BY cities.id
                    HAVING count(*) > 0
                    ) as matchCount 
FROM cities
HAVING matchCount >0
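
What I tried is:

// Find the city names:
Get cities.name (subquery) as matchCount from cities where matchCount > 0

// Subquery:
Select the number of tags that the cities (sub-subquery) also have

// Sub-subquery:
Select the ids of the tags that the original city has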

select c.name, cnt.val/(select count(*) from cities) as jaccard_index
from cities c 
inner join 
  (
  select city_id, count(*) as val 
  from cities_tags 
  where tag_id in (select tag_id from cities_tags where city_id=1) 
  and not city_id in (1)
  group by city_id
  ) as cnt 
on c.id=cnt.city_id
order by jaccard_index desc

This query statically references city_id=1, so you will have to turn that into a variable in both the "where tag_id in" clause and the "not city_id in" clause.
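
As a rough illustration of turning the hard-coded city id into a variable, here is a minimal sketch. It assumes Python's built-in sqlite3 module with an in-memory copy of the question's sample tables as a stand-in for MySQL; the names make_db, related_cities and score are hypothetical and not part of the original answer. The multiplication by 1.0 is only there because SQLite would otherwise perform integer division.

```python
import sqlite3

def make_db():
    """Build an in-memory copy of the question's sample data (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE cities_tags (id INTEGER PRIMARY KEY, city_id INTEGER, tag_id INTEGER);
        INSERT INTO cities VALUES (1, 'Paris'), (2, 'London'), (3, 'New York');
        INSERT INTO cities_tags VALUES
            (1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 3, 3);
    """)
    return conn

# Same shape as the query above, with city_id=1 replaced by the named
# placeholder :city in both places where it occurred.
QUERY = """
select c.name, cnt.val * 1.0 / (select count(*) from cities) as score
from cities c
inner join (
    select city_id, count(*) as val
    from cities_tags
    where tag_id in (select tag_id from cities_tags where city_id = :city)
      and city_id != :city
    group by city_id
) as cnt on c.id = cnt.city_id
order by score desc
"""

def related_cities(conn, city_id):
    """Return (name, score) rows for cities sharing tags with city_id."""
    return conn.execute(QUERY, {"city": city_id}).fetchall()

if __name__ == "__main__":
    for name, score in related_cities(make_db(), 1):
        print(name, round(score, 4))
```

Binding the id as a parameter, rather than concatenating it into the SQL string, also avoids SQL injection when the id comes from user input.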

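If I have understood the Jaccard index correctly, then this also returns that value, ordered by "most closely related". The results for our example look like this: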

|name      |jaccard_index  |
|London    |0.6667         |
|New York  |0.3333         |

A variant that computes the intersection and union counts explicitly and divides one by the other:

select jaccard.city, 
       jaccard.intersect, 
       jaccard.union, 
       jaccard.intersect/jaccard.union as 'jaccard index'
from 
(select
    c2.name as city
    ,count(ct2.tag_id) as 'intersect' 
    ,(select count(distinct ct3.tag_id) 
      from cities_tags ct3 
      where ct3.city_id in(c1.id, c2.id)) as 'union'
from
    cities as c1
    inner join cities as c2 on c1.id != c2.id
    left join cities_tags as ct1 on ct1.city_id = c1.id
    left join cities_tags as ct2 on ct2.city_id = c2.id and ct1.tag_id = ct2.tag_id
where c1.id = 1
group by c1.id, c2.id) as jaccard
order by jaccard.intersect/jaccard.union desc

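Edit, with a better understanding of how to implement the Jaccard index:

After reading more about the Jaccard index on Wikipedia, I came up with a better way to implement the query for our example dataset. Essentially, we compare the chosen city with each of the other cities in the list independently, and divide the number of tags the two cities have in common by the number of distinct tags across the two cities.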

select c.name, 
  case -- when this city's tags are a subset of the chosen city's tags
    when not_in.cnt is null 
  then -- then the union count is the chosen city's tag count
    intersection.cnt/(select count(tag_id) from cities_tags where city_id=1) 
  else -- otherwise the union count is the chosen city's tag count plus everything not in the chosen city's tag list
    intersection.cnt/(not_in.cnt+(select count(tag_id) from cities_tags where city_id=1)) 
  end as jaccard_index
  -- Jaccard index is defined as the size of the intersection of a dataset, divided by the size of the union of a dataset
from cities c 
inner join 
  (
    --  select the count of tags for each city that match our chosen city
    select city_id, count(*) as cnt 
    from cities_tags 
    where tag_id in (select tag_id from cities_tags where city_id=1) 
    and city_id!=1
    group by city_id
  ) as intersection
on c.id=intersection.city_id
left join
  (
    -- select the count of tags for each city that are not in our chosen city's tag list
    select city_id, count(tag_id) as cnt
    from cities_tags
    where city_id!=1
    and not tag_id in (select tag_id from cities_tags where city_id=1)
    group by city_id
  ) as not_in
on c.id=not_in.city_id
order by jaccard_index desc
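
This query is a bit long and I don't know how well it will scale, but it does implement a true Jaccard index, as requested in the question. Here are the results of the new query: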

+----------+---------------+
| name     | jaccard_index |
+----------+---------------+
| London   |        1.0000 |
| New York |        0.3333 |
+----------+---------------+

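Edited again to add comments to the query, and to account for the case where the current city's tags are a subset of the chosen city's tags.

"How do I calculate which city is most closely related? For example, if I were looking at city 1 (Paris), the results should be: London (2), New York (3)"

Based on the dataset you provided, there is only one thing to relate on, namely the tags the cities have in common, so the cities that share common tags will be the closest ones. Below is the subquery that finds the cities (other than the one supplied to find its closest cities) that share common tags: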

SELECT * FROM `cities`  WHERE id IN (
SELECT city_id FROM `cities_tags` WHERE tag_id IN (
SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1 )

How it works: I assume you will supply a city id or name to find its closest cities; in my case, "Paris" has id 1.

 SELECT tag_id FROM `cities_tags` WHERE city_id=1

This finds all the tag ids that Paris has.

SELECT city_id FROM `cities_tags` WHERE tag_id IN (
    SELECT tag_id FROM `cities_tags` WHERE city_id=1) AND city_id !=1
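
This fetches all the cities, except Paris, that share some of Paris's tags.

Here is your fiddle.

While reading about the Jaccard similarity/index, I found some material that explains what the terms actually mean. Let's take an example with two sets, A and B: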

Set A = {A, B, C, D, E}

Set B = {I, H, G, F, E, D}

The formula for the Jaccard similarity is JS = (A intersect B) / (A union B)

A intersect B = {D, E} = 2

A union B = {A, B, C, D, E, I, H, G, F} = 9

JS = 2/9 = 0.2222
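
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the set formula, not of any SQL from this thread; the function name jaccard is a hypothetical helper, and returning 0.0 for two empty sets is an assumption made here to avoid dividing by zero.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as having similarity 0.0.
        return 0.0
    return len(a & b) / len(union)

set_a = {"A", "B", "C", "D", "E"}
set_b = {"I", "H", "G", "F", "E", "D"}

print(len(set_a & set_b))               # 2
print(len(set_a | set_b))               # 9
print(round(jaccard(set_a, set_b), 4))  # 0.2222
```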

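Now, on to your scenario.

Paris has tag_ids 1 and 3, so we make a set and call it Set P = {Europe, River}

London has tag_ids 1 and 3, so we make a set and call it Set L = {Europe, River}

New York has tag_ids 2 and 3, so we make a set and call it Set NW = {North America, River}

Calculating JS for Paris with London: JSPL = (P intersect L) / (P union L), so JSPL = 2/2 = 1

Calculating JS for Paris with New York: JSPNW = (P intersect NW) / (P union NW), so JSPNW = 1/3 = 0.3333

Here is the query so far, which is meant to compute the Jaccard index: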

SELECT a.*, 
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
 FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` , 
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset

FROM `cities_tags` 
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`)
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC 
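
In the query above I have wrapped the result set in two subselects in order to get aliases for my custom calculations.

You can add a filter to the query above so that it does not calculate a city's similarity with itself: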

SELECT a.*, 
( (CASE WHEN a.`intersect` =0 THEN a.`union` ELSE a.`intersect` END ) /a.`union`) AS jaccard_index 
 FROM (
SELECT q.* ,(q.sets + q.parisset) AS `union` , 
(q.sets - q.parisset) AS `intersect`
FROM (
SELECT cities.`id`, cities.`name` , GROUP_CONCAT(tag_id SEPARATOR ',') sets ,
(SELECT  GROUP_CONCAT(tag_id SEPARATOR ',')  FROM `cities_tags` WHERE city_id= 1)AS parisset

FROM `cities_tags` 
LEFT JOIN `cities` ON (cities_tags.`city_id` = cities.`id`) WHERE  cities.`id` !=1
GROUP BY city_id ) q
) a ORDER BY jaccard_index DESC

So the result shows that Paris is most closely related to London, and then to New York.


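This query uses no fancy functions and not even subqueries. It is fast. Just make sure that cities.id, cities_tags.id, cities_tags.city_id and cities_tags.tag_id are indexed.

The query returns rows containing city1, city2 and the count of how many tags city1 and city2 have in common.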

select
    c1.name as city1
    ,c2.name as city2
    ,count(ct2.tag_id) as match_count
from
    cities as c1
    inner join cities as c2 on
        c1.id != c2.id              -- change != into > if you dont want duplicates
    left join cities_tags as ct1 on -- use inner join to filter cities with no match
        ct1.city_id = c1.id
    left join cities_tags as ct2 on -- use inner join to filter cities with no match
        ct2.city_id = c2.id
        and ct1.tag_id = ct2.tag_id
group by
    c1.id
    ,c2.id
order by
    c1.id
    ,match_count desc
    ,c2.id
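
Change != to > if you don't want each pair of cities returned twice, once with a given city in the first column and once with it in the second.

If you don't want to see city combinations that have no matching tags, change the two left joins to inner joins.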
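
To see what switching != to > does to the result, here is a small sketch that runs the pairwise query both ways. It assumes Python's built-in sqlite3 module and an in-memory copy of the question's three-city sample data as a stand-in for MySQL; the names PAIR_QUERY and city_pairs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities_tags (id INTEGER PRIMARY KEY, city_id INTEGER, tag_id INTEGER);
    INSERT INTO cities VALUES (1, 'Paris'), (2, 'London'), (3, 'New York');
    INSERT INTO cities_tags VALUES
        (1, 1, 1), (2, 1, 3), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 3, 3);
""")

# The pairwise query from the answer, with the comparison operator left open.
PAIR_QUERY = """
select c1.name as city1, c2.name as city2, count(ct2.tag_id) as match_count
from cities as c1
inner join cities as c2 on c1.id {op} c2.id
left join cities_tags as ct1 on ct1.city_id = c1.id
left join cities_tags as ct2 on ct2.city_id = c2.id and ct1.tag_id = ct2.tag_id
group by c1.id, c2.id
order by c1.id, match_count desc, c2.id
"""

def city_pairs(conn, op):
    """Run the pairwise query with either '!=' or '>' as the join operator."""
    if op not in ("!=", ">"):
        raise ValueError("op must be '!=' or '>'")
    return conn.execute(PAIR_QUERY.format(op=op)).fetchall()

print(len(city_pairs(conn, "!=")))  # 6: every pair appears in both orders
print(len(city_pairs(conn, ">")))   # 3: every pair appears once
```

With > each unordered pair is reported once, so a lookup for one particular city then has to check both the city1 and the city2 column.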

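I'm late to this, but I don't think any of the answers is completely correct. I took the best part of each one and put it all together into my own answer:

  • The explanation by @m-khalid-junaid is very interesting and correct, but the implementation of (q.sets + q.parisset) as union and (q.sets - q.parisset) as intersect is very wrong.
  • The version by @n-lx is the right approach, but it needs the Jaccard index, and that matters a lot: if a city has 2 tags and both of them match another city that has 3 tags, the result is the same as for a match against another city that has exactly those two tags. I think the exact match should count as the most closely related.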
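
The second point can be illustrated with a short sketch. The tag sets below are made up for the illustration and the helper names match_count and jaccard are hypothetical; the point is only that a raw count of shared tags cannot tell an exact match from a partial one, while the Jaccard index can.

```python
def match_count(a, b):
    """Number of tags two cities have in common."""
    return len(set(a) & set(b))

def jaccard(a, b):
    """Shared tags divided by all distinct tags of the two cities."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

chosen = {"Europe", "River"}                      # the city we start from
same_tags = {"Europe", "River"}                   # exactly the same two tags
more_tags = {"Europe", "River", "North America"}  # the same two plus one more

# The raw match count is 2 for both candidates.
print(match_count(chosen, same_tags), match_count(chosen, more_tags))

# The Jaccard index ranks the exact match first: 1.0 versus about 0.6667.
print(jaccard(chosen, same_tags), round(jaccard(chosen, more_tags), 4))
```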

My answer: with a cities table like this:

| id | Name      |
| 1  | Paris     |
| 2  | Florence  |
| 3  | New York  |
| 4  | São Paulo |
| 5  | London    |

and a cities_tags table like this:

| city_id | tag_id |
| 1       | 1      | 
| 1       | 3      | 
| 2       | 1      |
| 2       | 3      | 
| 3       | 1      |     
| 3       | 2      |
| 4       | 2      |     
| 5       | 1      |
| 5       | 2      |
| 5       | 3      |

Based on this sample data, Florence has a full match