MySQL: get the most similar rows and order them by similarity (performance improvement)


I have an items table with a structure similar to this:

id
user_id
feature_1 
feature_2
feature_3
...
feature_20

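Most of the feature_... fields are numeric; 3-4 of them contain text.

Now, for a given item, I need to find the most similar items (rows with exactly the same field values, weighted per field) and order them by similarity.

I can do it like this: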

select (IF (feature_1 = 'xxx1', 100, 0) +  
        IF (feature_2 = 'xxx2', 100, 0) + 
        IF (feature_3 = 'xxx3', 100, 0) + 
        IF (feature_4 = 'xxx4', 1, 0) + 
        ...  + 
        IF (feature_20 = 'xxx20', 1, 0)) 
        AS score, id from `items` where `id` <> 'yyy' 
        group by `id` having `score` > '0' order by `score` desc;

Now it takes 1-2 seconds to get the same results. Am I missing something?

CREATE TABLE IF NOT EXISTS `features` (
`id` int(10) unsigned NOT NULL,
  `name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `weight` tinyint(3) unsigned NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
) ENGINE=InnoDB AUTO_INCREMENT=26 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE IF NOT EXISTS `feature_watch` (
`id` int(10) unsigned NOT NULL,
  `feature_id` int(10) unsigned NOT NULL,
  `watch_id` int(10) unsigned NOT NULL,
  `user_id` int(10) unsigned NOT NULL,
  `feature_value` varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB AUTO_INCREMENT=2142999 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

ALTER TABLE `features`
 ADD PRIMARY KEY (`id`), ADD UNIQUE KEY `features_name_unique` (`name`), ADD KEY `weight` (`weight`);

ALTER TABLE `feature_watch`
 ADD PRIMARY KEY (`id`), ADD KEY `feature_watch_user_id_foreign` (`user_id`), ADD KEY `feature_id` (`feature_id`,`feature_value`), ADD KEY `watch_id` (`watch_id`);

ALTER TABLE `features`
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=26;

ALTER TABLE `feature_watch`
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=2142999;

ALTER TABLE `feature_watch`
ADD CONSTRAINT `feature_watch_feature_id_foreign` FOREIGN KEY (`feature_id`) REFERENCES `features` (`id`),
ADD CONSTRAINT `feature_watch_user_id_foreign` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE CASCADE,
ADD CONSTRAINT `feature_watch_watch_id_foreign` FOREIGN KEY (`watch_id`) REFERENCES `watches` (`id`) ON DELETE CASCADE;

EDIT2

For the following query:

select if2.watch_id, sum(f.weight) AS `sum`
  from feature_watch if1
    inner join feature_watch if2
      on if1.feature_id = if2.feature_id
        and if1.feature_value = if2.feature_value
        and if1.watch_id <> if2.watch_id
    inner join features f
      on if2.feature_id = f.id
  where if1.watch_id = 71
    AND if2.`user_id` in (select `id` from `users` where `is_private` = '0')
    and if2.`user_id` <> '1'
  group by if2.watch_id
  ORDER BY sum DESC

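The query above executes in 0.5s. If I want to run it for more records than just id 71 (for example for 10 record ids), it gets that many times slower (around 5 seconds for 10 ids).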

I'd suggest you reorganize your table structure along these lines:

create table items (id integer primary key auto_increment);

create table features (
  id integer primary key auto_increment,
  feature_name varchar(25),
  feature_weight integer
);

create table item_features (  
  item_id integer,
  feature_id integer,  
  feature_value varchar(25)
);

This lets you run a relatively simple query that computes a feature-based similarity by summing the weights of the matching features:

select if2.item_id, sum(f.feature_weight)
  from item_features if1
    inner join item_features if2
      on if1.feature_id = if2.feature_id
        and if1.feature_value = if2.feature_value
        and if1.item_id <> if2.item_id
    inner join features f
      on if2.feature_id = f.id
   where if1.item_id = 1
   group by if2.item_id


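I know it doesn't match the table definition in your question, but repeating values like that in a table is a path to the dark side. Normalization really does make life easier.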

With indexes on item_features (feature_id, feature_value) and on features (feature_name), the query should be reasonably fast.
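
The indexes mentioned above could be declared roughly as follows. This is only a sketch on SQLite via Python's sqlite3 module; the index names are invented, and the third index on item_id is my own assumption (it is there to locate the rows of the reference item). MySQL's optimizer may well choose differently, so check the plan with EXPLAIN on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE features (
  id INTEGER PRIMARY KEY, feature_name TEXT, feature_weight INTEGER);
CREATE TABLE item_features (
  item_id INTEGER, feature_id INTEGER, feature_value TEXT);

-- Indexes named in the answer (index names are hypothetical).
CREATE INDEX idx_item_features_feature_value
  ON item_features (feature_id, feature_value);
CREATE INDEX idx_features_name ON features (feature_name);

-- Assumption, not from the answer: helps find the reference item's rows.
CREATE INDEX idx_item_features_item ON item_features (item_id);
""")

QUERY = """
SELECT if2.item_id, SUM(f.feature_weight)
  FROM item_features if1
    INNER JOIN item_features if2
      ON if1.feature_id = if2.feature_id
        AND if1.feature_value = if2.feature_value
        AND if1.item_id <> if2.item_id
    INNER JOIN features f ON if2.feature_id = f.id
 WHERE if1.item_id = ?
 GROUP BY if2.item_id
"""

# Print SQLite's plan; the last column of each row is the readable step.
for step in conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1,)):
    print(step[-1])

index_names = {
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
}
```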

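Here is my understanding of what you need. Please tell me whether I guessed correctly.

You have many items that belong to several users, as determined by user_id. In this example we have 3 users: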

CREATE TABLE items (
id int, 
`user_id` int, `f1` int, `f2` int, `f3` int,
primary key(id),
key(user_id));

INSERT INTO items
    (id, `user_id`, `f1`, `f2`, `f3`)
VALUES
    (1, 1, 2, 22, 30),
    (2, 1, 1, 21, 40),
    (3, 1, 9, 25, 50),
    (4, 2, 1, 21, 30),
    (5, 2, 1, 22, 40),
    (6, 2, 2, 22, 35),
    (7, 3, 9, 22, 31),
    (8, 3, 8, 20, 55),
    (9, 3, 7, 20, 55),
    (10, 3, 5, 26, 30)
;

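user_id is a parameter of the query. For a given user_id you want to find all items that belong to this user. Then, for each item found, you want to calculate a score that defines the "distance" between this item and every other item (not only items of this user, but every other item). Finally, you want to show all rows of the result ordered by score: not just the single most similar item, but all of them.

The score for a pair of items is calculated from the feature values of those two items. There is no constant set of feature values against which all items are compared; each pair of items may have its own score.

When calculating the score, each feature has a weight. These weights are predefined and constant (they do not depend on the item). Let's use these constants in this example: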

weight for f1 is 1
weight for f2 is 3
weight for f3 is 5

Here is one way to get the result in a single query, for user_id = 1 (the query and its output appear at the end of this page):

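If this is really what you want, I'm afraid there is no magic way to make it fast. For each item of the user you need to compare it with every other item to calculate the score. So, if the items table has N rows and the given user has M items, you have to calculate the score N*M times. Then you have to filter out the zero scores and sort the result. You cannot avoid reading the whole items table M times.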
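
The N*M cost can be seen in a plain brute-force version of the same calculation. The sketch below uses the sample rows and weights from this answer; the function names and the comparison counter are mine, added only for illustration.

```python
# Sample rows from the answer: (id, user_id, f1, f2, f3)
ITEMS = [
    (1, 1, 2, 22, 30),
    (2, 1, 1, 21, 40),
    (3, 1, 9, 25, 50),
    (4, 2, 1, 21, 30),
    (5, 2, 1, 22, 40),
    (6, 2, 2, 22, 35),
    (7, 3, 9, 22, 31),
    (8, 3, 8, 20, 55),
    (9, 3, 7, 20, 55),
    (10, 3, 5, 26, 30),
]
WEIGHTS = (1, 3, 5)  # weights for f1, f2, f3 as given in the answer


def score(a, b):
    """Sum the weights of the features on which two items agree exactly."""
    return sum(w for w, x, y in zip(WEIGHTS, a[2:], b[2:]) if x == y)


def similar_for_user(user_id):
    user_items = [it for it in ITEMS if it[1] == user_id]
    comparisons = 0
    rows = []
    for mine in user_items:      # M items of the user
        for other in ITEMS:      # N rows of the whole table
            comparisons += 1
            if mine[0] == other[0]:
                continue         # skip comparing an item with itself
            s = score(mine, other)
            if s > 0:
                rows.append((mine[0], other[0], s))
    # Order by the user's item, then by score, highest first.
    rows.sort(key=lambda r: (r[0], -r[2]))
    return rows, comparisons


rows, comparisons = similar_for_user(1)
print(comparisons)  # M * N = 3 * 10
for row in rows:
    print(row)
```

With 3 items for user 1 and 10 rows in the table, the inner loop runs 30 times regardless of how many pairs end up with a non-zero score.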

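Only if you have some external knowledge about the data can you perhaps "cheat" somehow and avoid reading the whole items table every time.

For example, you might know that the distribution of values of feature K is very uneven: 99% of the values are X and 1% are other values. It may be possible to use this knowledge to reduce the amount of calculation.

Another example is when the items are somehow clustered together (in the sense of your metric/distance/score). If you can pre-calculate these clusters, then instead of reading through the whole items table each time you can read only the small subset of items that belong to the same cluster, using a suitable index.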
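
One simple way to read only a relevant subset is sketched below. Note the substitution: instead of real clustering, it precomputes a plain lookup from (feature, value) to item ids, so that scoring an item only touches items sharing at least one value with it. The data and weights are the sample values from this answer; the structure and names are my own assumptions.

```python
from collections import defaultdict

# Sample items from the answer: id -> (f1, f2, f3)
ITEMS = {
    1: (2, 22, 30),
    2: (1, 21, 40),
    3: (9, 25, 50),
    4: (1, 21, 30),
    5: (1, 22, 40),
    6: (2, 22, 35),
    7: (9, 22, 31),
    8: (8, 20, 55),
    9: (7, 20, 55),
    10: (5, 26, 30),
}
WEIGHTS = (1, 3, 5)  # weights for f1, f2, f3

# Precomputed once: (feature position, value) -> ids of items with that value.
index = defaultdict(set)
for item_id, feats in ITEMS.items():
    for pos, value in enumerate(feats):
        index[(pos, value)].add(item_id)


def similar(item_id):
    """Score only the items that share at least one feature value."""
    scores = defaultdict(int)
    for pos, value in enumerate(ITEMS[item_id]):
        for other in index[(pos, value)]:
            if other != item_id:
                scores[other] += WEIGHTS[pos]
    # Highest score first; ties broken by item id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(similar(1))
print(similar(2))
print(similar(3))
```

Items with a zero score are never visited. How much this saves depends on the data: a value shared by most items (the 99% case mentioned above) still produces a very large bucket.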
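I don't suppose "fix your table design" is an option?

@pala_ I can consider it if it is the solution, but just in case: the feature_ fields do not hold values that are related to each other. For example, feature_1 can hold a color and feature_2 can hold a size, and the exact names of those feature_ fields differ (for example color, size and so on).

I really think this is a good idea. I think @pala_ has a point!

Let's focus on the original items table first. Is the id field unique? Please add more details about "a user has 100 items". How do you determine which items a user has? Only by user_id? When processing all items of a user, do you use the same set of feature values and weights (xxx1, xxx2, ..., xxx20) for all of them, or does each item have its own set of feature values and weights to compare against? I think it is possible to come up with one query instead of 100 queries. Simplified sample data with 5 features, 10 items and 2 users, together with the expected result, would be very helpful.

Thanks for your reply. I have tested your solution, but on my database it takes more time (1-2 seconds). Please see the edit in my question. Maybe I am missing something?

Did you create the necessary indexes?

Yes, I have included all the indexes in the edit to the question above. The problem seems to be the group by: without it the query takes almost no time, but after adding group by if2.watch_id (there is an index on watch_id) the execution time rises to over 1 second.

Adding item_id to the composite index may help. How many rows are we dealing with?

I have just added it, and it makes the query about 0.3s faster. As I wrote in the edited question, there are currently over 2 million rows in the table. I also needed to add user_id to the item_features table, plus the condition and if2.user_id in (select id from users where ...

The single-query approach described in the answer above (for user_id = 1), followed by the result it returns on the sample data: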

SELECT *
FROM
  (
    SELECT
      UserItems.id AS UserItemID
      ,AllItems.id AS AllItemID
      ,IF(AllItems.f1 = UserItems.f1, 1, 0)+
      IF(AllItems.f2 = UserItems.f2, 3, 0)+
      IF(AllItems.f3 = UserItems.f3, 5, 0) AS Score
    FROM
      (
        SELECT id, f1, f2, f3
        FROM items
        WHERE items.user_id = 1
      ) AS UserItems
      CROSS JOIN
      (
        SELECT id, f1, f2, f3
        FROM items
      ) AS AllItems
  ) AS Scores
WHERE
  UserItemID <> AllItemID
  AND Score > 0
ORDER BY UserItemID, Score desc

| UserItemID | AllItemID | Score |
|------------|-----------|-------|
|          1 |        10 |     5 |
|          1 |         4 |     5 |
|          1 |         6 |     4 |
|          1 |         5 |     3 |
|          1 |         7 |     3 |
|          2 |         5 |     6 |
|          2 |         4 |     4 |
|          3 |         7 |     1 |