PHP: storing duplicate array elements

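I'm desperately trying to solve the following problem: from a series of sentences/news titles, I want to find the ones that are very similar (3 or 4 words in common) and put them into a new array. So, for this original array/list: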

'Title1: Hackers expose trove of snagged Snapchat images',
'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
'Title3: Family says goodbye at funeral for 16-year-old',
'Title4: New Jersey officials talk about Ebola quarantine',
'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
'Title6: Hackers expose Snapchat images'
the result should be:

Array
(
    [0] => Title1: Hackers expose trove of snagged Snapchat images
    [1] => Array
        (
            [duplicate] => Title6: Hackers expose Snapchat images
        )

    [2] => Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine
    [3] => Array
        (
            [duplicate] => Title4: New Jersey officials talk about Ebola quarantine
        )
    [4] => Title3: Family says goodbye at funeral for 16-year-old
    [5] => Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands
)
This is my code:

$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
);
$z = 1;
foreach ($titles as $feed)
{
    $feed_A = explode(' ', $feed);
    for ($i=$z; $i<count($titles); $i++)
    {
        $feed_B = explode(' ', $titles[$i]);
        $intersect_A_B = array_intersect($feed_A, $feed_B);
        if(count($intersect_A_B)>3)
        {
            $titluri[] = $feed;
            $titluri[]['duplicate'] = $titles[$i]; 
        }
        else 
        {
            $titluri[] = $feed;
        }
    }
    $z++;
}

Any suggestions would be greatly appreciated.

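I think this code may be what you're looking for (comments included). If not, let me know; it was written in a hurry and is untested. You may also want to consider an alternative approach: nested foreach loops can cause performance problems on a large site.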

<?php

$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
);
$titluri    =   array(); // unless it's declared elsewhere
// loop through each line of the array
foreach ($titles as $key => $originalFeed)
{
    $titluri[] = $originalFeed; // all feeds are listed in the new array
    $feed_A = explode(' ', $originalFeed);
    foreach ($titles as $newKey => $comparisonFeed)
    {
        // iterate through the array again and see if they intersect
        if ($key != $newKey) { // but don't compare a title against itself!
            $feed_B = explode(' ', $comparisonFeed);
            $intersect_A_B = array_intersect($feed_A, $feed_B);
            // do they share more than three words?
            if(count($intersect_A_B)>3)
            {
                // yes, add a duplicate entry
                $titluri[]['duplicate'] = $comparisonFeed; 
            }
        }
    }
}

Here is my solution, inspired by @DomWeldon, without the repeated entries:

<?php
$titles = array(
    'Title1: Hackers expose trove of snagged Snapchat images',
    'Title2: New Jersey officials order symptom-less NBC News crew into Ebola quarantine',
    'Title3: Family says goodbye at funeral for 16-year-old',
    'Title4: New Jersey officials talk about Ebola quarantine',
    'Title5: New Far Cry 4 Trailer Welcomes You to Kyrat Lowlands',
    'Title6: Hackers expose Snapchat images'
);
$titluri    =   array(); // unless it's declared elsewhere
$duplicateTitles = array();
// loop through each line of the array
foreach ($titles as $key => $originalFeed)
{
    if(!in_array($key, $duplicateTitles)){
        $titluri[] = $originalFeed; // list this feed; titles already recorded as duplicates are skipped
        $feed_A = explode(' ', $originalFeed);
        foreach ($titles as $newKey => $comparisonFeed)
        {
            // iterate through the array again and see if they intersect
            if ($key != $newKey) { // but don't compare a title against itself!
                $feed_B = explode(' ', $comparisonFeed);
                $intersect_A_B = array_intersect($feed_A, $feed_B);
                // do they share more than three words?
                if(count($intersect_A_B)>3)
                {
                    // yes, add a duplicate entry
                    $titluri[]['duplicate'] = $comparisonFeed;
                    $duplicateTitles[] = $newKey;
                }
            }
        }
    }
}

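I have some useful links for you that may help. You could also take a look at the similar_text function.

It's pretty dirty, but couldn't you use array_unique on $titluri after the loop to get the array you want?

@AlbanPommeret, array_unique won't work, I've already tried it.

I just replaced $i with $newKey; your code looks good to me!

Not sure whether this works, but it isn't very efficient: for example, the first pass compares Title1 with Title4, and a later pass compares Title4 with Title1 again, with practically the same result (the same goes for the other pairs). Using for loops (with counters) should be better.

You're right, @KingKing, this was written quickly, so please edit!

Using for loops is certainly better for performance, but it's more complex to implement in this case (you would save some calls to array_intersect). My comment was meant as a hint for the OP, who may want to try it himself (it probably really needs some testing).

@AlbanPommeret, your code does the job, but it repeats some entries, as you can see in the array: title1 grabs title6 and title2 grabs title4 because they are similar, but then title6 also gets title1 listed under it and title4 gets title2, which is the repetition I'm trying to avoid. Please print the resulting array to see what I mean.

Thanks, it seems to do the job well. I'll see whether I can adapt it to integrate the code into a larger scheme. Good luck, Alban!

This is another workaround (in_array performs some internal looping). It is certainly better than Dom Weldon's solution, but I think we could use 2 for loops (instead of 2 foreach) and performance would be better still. The first loop starts $i at 0, the second starts $j at $i+1. However, we may need a few more tweaks to make it work (not just changing the loops).

@KingKing, as you suggest, I already used 2 for loops, with $i and $j=$i+1, in my old non-working code. I'll try to use that in Alban's solution, but tomorrow morning. Thanks for the tip!

@VladAndrei I just scanned it again: using for loops still requires you to check whether an index has already been used (as a duplicate), but you should keep those indexes in a dedicated array (instead of checking with in_array), so the lookup performs better (because it is based on keys, not values). It needs a bit more memory, but not much.

@VladAndrei here is my version using 2 fors: $dup=array();for($i=0;$i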
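The two-for-loop version referred to in the comments is not shown in full, so the following is only a sketch of that idea, not the commenter's actual code. It assumes a hypothetical function groupSimilarTitles() with a hypothetical $minShared parameter (4 matches the count(...) > 3 test used in the answers), compares each pair only once by starting $j at $i + 1, and marks duplicates in a key-based array checked with isset() instead of in_array(). Unlike the foreach version, it also skips titles already marked as duplicates in the inner loop, which is a design choice of this sketch. The type declarations require PHP 7 or later.

```php
<?php
// Sketch only: groupSimilarTitles() and $minShared are illustrative names,
// not taken from the thread.
function groupSimilarTitles(array $titles, int $minShared = 4): array
{
    $titles = array_values($titles); // make sure the keys run 0..n-1
    $count  = count($titles);

    // Split every title into words once, instead of on every comparison.
    $words = array();
    foreach ($titles as $title) {
        $words[] = explode(' ', $title);
    }

    $result = array();
    $dup    = array(); // index => true for titles already listed as duplicates

    for ($i = 0; $i < $count; $i++) {
        if (isset($dup[$i])) {
            continue; // already attached to an earlier title
        }
        $result[] = $titles[$i];

        // Start at $i + 1 so each pair is compared only once.
        for ($j = $i + 1; $j < $count; $j++) {
            if (isset($dup[$j])) {
                continue; // already attached to an earlier title
            }
            if (count(array_intersect($words[$i], $words[$j])) >= $minShared) {
                $result[]  = array('duplicate' => $titles[$j]);
                $dup[$j]   = true;
            }
        }
    }

    return $result;
}
```

Calling groupSimilarTitles($titles) on the six sample titles and passing the result to print_r() should give the nested structure asked for in the question.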
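The similar_text function suggested in the comments could replace the shared-word count as the similarity test. A minimal sketch, assuming a hypothetical helper titlesLookSimilar() and an arbitrary 60 percent threshold (neither comes from the thread); the threshold would need tuning against real titles.

```php
<?php
// Sketch only: titlesLookSimilar() and the default threshold are assumptions.
function titlesLookSimilar(string $a, string $b, float $minPercent = 60.0): bool
{
    $percent = 0.0;
    // similar_text() returns the number of matching characters and writes
    // the similarity as a percentage into its third (by-reference) argument.
    similar_text($a, $b, $percent);

    return $percent >= $minPercent;
}
```

Such a helper could stand in for the count(array_intersect(...)) condition in any of the loops above. Note that it compares characters rather than whole words, so it will not group exactly the same titles as the word-based test.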