PHP: millions of images?


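This is about Mr. Frank Denis's libpuzzle library for PHP. I'm trying to understand how to index and store the data in a MySQL database. Generating the vectors is no problem at all.

For example: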

# Compute signatures for two images
$cvec1 = puzzle_fill_cvec_from_file('img1.jpg');
$cvec2 = puzzle_fill_cvec_from_file('img2.jpg');

# Compute the distance between both signatures
$d = puzzle_vector_normalized_distance($cvec1, $cvec2);

# Are pictures similar?
if ($d < PUZZLE_CVEC_SIMILARITY_LOWER_THRESHOLD) {
  echo "Pictures are looking similar\n";
} else {
  echo "Pictures are different, distance=$d\n";
}

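I'd recommend splitting at least the "words" table into multiple tables and/or servers.

By default (lambdas=9), signatures are 544 bytes long. To save storage space, they can be compressed to 1/3 of their original size with the puzzle_compress_cvec() function. Before use, they must be uncompressed with puzzle_uncompress_cvec().

I think compressing is the wrong approach, because I would have to uncompress every vector before comparing it.

My question now is: how do I handle millions of images, and how do I compare them quickly and efficiently? I can't understand how "cutting the vector" is supposed to help me with this problem.

Many thanks. Maybe I can find someone here who works with libpuzzle.

Cheers.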

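I have experimented with libpuzzle before and got about as far as you have. I never really started on a proper implementation, and it also wasn't clear to me exactly how to do it. (I abandoned the project for lack of time, so I didn't really persist with it.)

Anyway, looking at it now, I'll try to offer my understanding. Maybe between us we can work it out :)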

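Queries use a two-stage process.

First, use the words table:

1. Take the "reference" image and work out its signature.
2. Work out its component words.
3. Consult the words table to find all the possible matches. This can use the database engine's indexes for efficient queries.
4. Compile a list of all sig_ids. (You will get some duplicates in step 3.)

Then consult the signatures table:

1. Retrieve and uncompress all the possible matches from the signatures table (because you have a prefiltered list, the number should be relatively small).
2. Use puzzle_vector_normalized_distance to work out the actual distance.
3. Sort and rank the results as required.

(i.e. you only use compression on the signatures table. The words stay uncompressed, so you can run fast queries on them.)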
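To make the two stages concrete, here is a minimal sketch that uses plain PHP arrays in place of the two database tables. The names `find_candidates`, `rank_candidates`, `$wordsIndex` and `$signatures` are made up for illustration, and the distance function is passed in as a callable so the sketch runs without the libpuzzle extension installed.

```php
<?php
// Sketch only: $wordsIndex and $signatures are hypothetical stand-ins for the
// words and signatures tables; $distance stands in for
// puzzle_vector_normalized_distance().

/**
 * Stage 1: collect the ids of all signatures that share at least one word.
 * $wordsIndex maps word => list of sig_ids (an inverted index).
 */
function find_candidates(array $queryWords, array $wordsIndex): array
{
    $hits = array();
    foreach ($queryWords as $word) {
        if (isset($wordsIndex[$word])) {
            foreach ($wordsIndex[$word] as $sigId) {
                // the same sig_id can be found through several words
                $hits[$sigId] = ($hits[$sigId] ?? 0) + 1;
            }
        }
    }
    return array_keys($hits); // de-duplicated candidate ids
}

/**
 * Stage 2: compute the real distance for each candidate, closest first.
 * $signatures maps sig_id => (uncompressed) signature.
 */
function rank_candidates(string $querySig, array $candidateIds, array $signatures, callable $distance): array
{
    $results = array();
    foreach ($candidateIds as $sigId) {
        $results[$sigId] = $distance($querySig, $signatures[$sigId]);
    }
    asort($results); // ascending by distance, sig_id keys preserved
    return $results;
}
```

With libpuzzle loaded, you would presumably pass 'puzzle_vector_normalized_distance' as `$distance` and run puzzle_uncompress_cvec() on each stored signature before handing it to stage 2.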

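The words table is a form of inverted index. In fact, I have in mind to replace the words database table with something designed specifically to be a very fast inverted index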

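... in theory, anyway.

Suppose you have a table that stores the information related to each image (path, name, description, etc.). In that table, you include a field for the compressed signature, which is calculated and stored when you initially populate the database. Let's define that table like so: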

CREATE TABLE images (
    image_id INTEGER NOT NULL PRIMARY KEY,
    name TEXT,
    description TEXT,
    file_path TEXT NOT NULL,
    url_path TEXT NOT NULL,
    signature TEXT NOT NULL
);

When you initially compute the signature, you also calculate a number of words from it:

// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is $k from the example
$wordcnt = 100; // this is $n from the example
for ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}

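Now you insert into that table, prefixing each word with the position index where it was found, so that when a word matches you know it matched at the same position in the signature: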

// the signature, along with all other data, has already been inserted into the images
// table, and $image_id has been populated with the resulting primary key
// ($dbobj is assumed here to be a PDO connection; adapt this to your own db layer)
$stmt = $dbobj->prepare("INSERT INTO img_sig_words (image_id, sig_word) VALUES (?, ?)");
foreach ($words as $index => $word) {
    // the words are raw signature bytes, so hex-encode them to get a printable value
    $sig_word = $index.'__'.bin2hex($word);
    $stmt->execute(array($image_id, $sig_word));
}
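To show why the position prefix matters, here is a small self-contained sketch. It assumes the signature is a binary string (so each word is hex-encoded to keep it printable), and the function names `signature_words` and `shared_words` are made up for this example.

```php
<?php
/**
 * Split a signature into position-prefixed words, e.g. "0__616263".
 * Assumption: $cvec is a binary string, so each word is hex-encoded.
 */
function signature_words(string $cvec, int $wordlen = 10, int $wordcnt = 100): array
{
    $words = array();
    $limit = min($wordcnt, strlen($cvec) - $wordlen + 1);
    for ($i = 0; $i < $limit; $i++) {
        $words[] = $i . '__' . bin2hex(substr($cvec, $i, $wordlen));
    }
    return $words;
}

/**
 * Count the words two signatures have in common. Because every word carries
 * its position, only matches at the same offset are counted.
 */
function shared_words(string $cvecA, string $cvecB, int $wordlen = 10, int $wordcnt = 100): int
{
    $wordsA = signature_words($cvecA, $wordlen, $wordcnt);
    $wordsB = signature_words($cvecB, $wordlen, $wordcnt);
    return count(array_intersect($wordsA, $wordsB));
}
```

Two signatures that contain the same bytes shifted by one position share no words under this scheme, which is the behaviour the prefix is meant to give you.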

With your data initialized this way, you can grab the images with matching words relatively easily:

// $image_id is set to the base image that you are trying to find matches to
$dbobj->query("SELECT i.*, COUNT(isw.sig_word) as strength FROM images i JOIN img_sig_words
    isw ON i.image_id = isw.image_id JOIN img_sig_words isw_search ON isw.sig_word =
    isw_search.sig_word AND isw.image_id != isw_search.image_id WHERE
    isw_search.image_id = $image_id GROUP BY i.image_id, i.name, i.description,
    i.file_path, i.url_path, i.signature ORDER BY strength DESC");

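You could improve the query by adding a HAVING clause that requires a minimum strength, thus further reducing your matching set.

I make no guarantees that this is the most efficient setup, but it should be roughly functional for what you're trying to do.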
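As a rough illustration of the HAVING idea, the query could be wrapped like this. The sketch assumes `$db` is a PDO connection and that an img_sig_words(image_id, sig_word) table exists as used in the snippets above; the function name `find_similar_images` and the reduced column list are my own choices, and the right threshold is something you would have to tune.

```php
<?php
/**
 * Return images that share at least $minStrength position-prefixed words
 * with the image $imageId, strongest matches first.
 * Assumptions: $db is a PDO connection; tables images and img_sig_words exist.
 */
function find_similar_images(PDO $db, int $imageId, int $minStrength): array
{
    $sql = "SELECT i.image_id, i.name, COUNT(isw.sig_word) AS strength
            FROM images i
            JOIN img_sig_words isw ON i.image_id = isw.image_id
            JOIN img_sig_words isw_search
              ON isw.sig_word = isw_search.sig_word
             AND isw.image_id != isw_search.image_id
            WHERE isw_search.image_id = :image_id
            GROUP BY i.image_id, i.name
            HAVING COUNT(isw.sig_word) >= :min_strength
            ORDER BY strength DESC";
    $stmt = $db->prepare($sql);
    // bound parameters keep the values out of the SQL string
    $stmt->bindValue(':image_id', $imageId, PDO::PARAM_INT);
    $stmt->bindValue(':min_strength', $minStrength, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

For this to stay fast on millions of rows, you would most likely want an index on img_sig_words.sig_word, since both joins look rows up by that column.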

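Basically, splitting and storing the words in this manner lets you do a rough distance check without having to run a specialized function on the signatures.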

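I have made a libpuzzle demo project on GitHub:

The project uses the approach proposed by Jason above.

The database schema is shown at:

I'll give some more information about libpuzzle signatures.

Now we have two images; let me compute their signatures.

The odd rows are for image 1 (the left one), and the even rows are for image 2.

You can see that, in most cases, the numbers at the same position are the same.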
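That observation can be checked with a few lines of PHP. This is only an illustration that assumes both signatures are byte strings of the same length; it is not how libpuzzle computes its distance, and `matching_positions` is a name made up for this sketch.

```php
<?php
/**
 * Count the positions at which two signatures hold the same value.
 * Assumption: both signatures are byte strings; any extra bytes in the
 * longer one are ignored.
 */
function matching_positions(string $sigA, string $sigB): int
{
    $len = min(strlen($sigA), strlen($sigB));
    $same = 0;
    for ($i = 0; $i < $len; $i++) {
        if ($sigA[$i] === $sigB[$i]) {
            $same++;
        }
    }
    return $same;
}
```

Dividing the result by the signature length gives a quick feel for how alike two signatures are, which is the intuition behind matching words position by position.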

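My English is poor, so I can't express my thoughts much further... I think anyone who wants to index millions of images should check out my GitHub repo of the libpuzzle demo.

I'm also working with libpuzzle in PHP, and I have some doubts about how to generate the words from the image signatures. Jason's answer above seems right, but I have a problem with this part: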

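// this will be run once for each image:
$cvec = puzzle_fill_cvec_from_file('img1.jpg');
$words = array();
$wordlen = 10; // this is $k from the example
$wordcnt = 100; // this is $n from the example
for ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {
    $words[] = substr($cvec, $i, $wordlen);
}

That's good information, thanks. I just wanted to clarify: have you actually tried this approach, or is it just "in theory"? It won't affect the bounty, but I'd definitely be interested in seeing a working implementation. In particular, your indexes may need tuning to run efficient queries.

This is theory. I have no direct experience with libpuzzle; I just thought I'd provide some code that extends the examples in the libpuzzle documentation, mostly as an exercise.

Just to note... we actually implemented (a slightly modified version of) the above... and it works great! Also... lo and behold, it's a bit more accurate than running the puzzle compare function image vs. image... So far we've tried a strength of 20... and got almost 100% accurate results on our image library, which is 4 million strong... Thanks!!!

I've tried your code, but the $words array comes back with 100 items that have no value?!

I'd love to help, but I'm not completely clear on what you're saying. Is it that the $words array contains 100 entries, but each entry is a blank string? If so, I may have made a typo, but it's been a year and a half since I looked at this and I don't remember all the details... Anyway, try $words[] = substr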