在PHP中快速比较两个数组_Php_Arrays_Csv_Sorting_Multidimensional Array

在PHP中快速比较两个数组

php arrays csv sorting

在PHP中快速比较两个数组,php,arrays,csv,sorting,multidimensional-array,Php,Arrays,Csv,Sorting,Multidimensional Array,所以我读了2个csv文件，大约有25k条记录。一个是旧CSV，一个是新CSV。我需要比较新CSV文件中的“主要联系人”字段是否与旧CSV记录不同，而旧CSV和新CSV中的“姓名”、“州”和“城市”字段是否相同新CSV: Array( [0] => Array ( [0] => ID [1] => NAME [2] => STATE [3] => CITY [4] =>

所以我读了2个csv文件，大约有25k条记录。一个是旧CSV，一个是新CSV。我需要比较新CSV文件中的“主要联系人”字段是否与旧CSV记录不同，而旧CSV和新CSV中的“姓名”、“州”和“城市”字段是否相同

新CSV:

Array(

[0] => Array
    (
        [0] => ID
        [1] => NAME
        [2] => STATE
        [3] => CITY
        [4] => COUNTY
        [5] => ADDRESS
        [6] => PHONE
        [7] => PRIMARY CONTACT
        [8] => POSITION
        [9] => EMAIL
    )

[1] => Array
    (
        [0] => 2002
        [1] => Abbeville Christian Academy
        [2] => Alabama
        [3] => Abbeville
        [4] => Henry
        [5] => Po Box 9 Abbeville, AL 36310-0009
        [6] => (334) 585-5100
        [7] => Ashley Carlisle
        [8] => Athletic Director
        [9] => acarlisle@acagenerals.org
    )

}

问题是，我做了两个foreach嵌套循环来比较，这对于smal记录来说很好，但是当我运行每个都包含25k条记录的新旧CSV文件时，这个过程需要花费很长时间才能完成

在两个CSV中都有一些重复项，所以我先删除它们

function multi_unique($data){
    $data = array_reverse($data);

    $result = array_reverse( // Reverse array to the initial order.
        array_values( // Get rid of string keys (make array indexed again).
            array_combine( // Create array taking keys from column and values from the base array.
                array_column($data, 1), 
                $data
            )
        )
    );

    return $result;
}

$old_csv=multi_unique($old_csv);
$new_csv=multi_unique($new_csv);

这是我的比较代码，我需要比这更快的东西

$name_index_no = 1;
$state_index_no = 2;
$city_index_no = 3;
$country_index_no = 4;
$address_index_no = 5;
$primary_contact_index_no = 7;
$new_export_records[] = $old_csv[0];

foreach($new_csv as $key=>$value){

    foreach($old_csv as $key1=>$value1){

        if( $old_csv[$key1][$state_index_no] == $new_csv[$key][$state_index_no] &&
            $old_csv[$key1][$city_index_no] == $new_csv[$key][$city_index_no] &&
            $old_csv[$key1][$name_index_no] == $new_csv[$key][$name_index_no] ){

            if($old_csv[$key1][$primary_contact_index_no] != 
               $new_csv[$key][$primary_contact_index_no]){

                $new_export_records[] = $new_csv[$key];

            }

            unset($old_csv[$key1]);
            break;
        }

    }

}

正如Michael指出的，您当前的解决方案运行了

n*m

次。每一个都是25k，这简直太多了。但是，如果您先运行旧数据，创建一个索引，然后运行新数据并对照该索引进行检查，您将在

m+n

迭代中完成

例如：

$name_index_no            = 1;
$state_index_no           = 2;
$city_index_no            = 3;
$country_index_no         = 4;
$address_index_no         = 5;
$primary_contact_index_no = 7;

$genKey = function ($row, $glue = '|') use ($state_index_no, $city_index_no, $name_index_no) {
    return implode($glue, [
        $row[$state_index_no],
        $row[$city_index_no],
        $row[$name_index_no],
    ]);
};


// create an index using the old data
$t = microtime(true);
$index = [];
foreach ($old_csv as $row) {
    $index[$genKey($row)] = $row;
}

printf('index generation: %.5fs', microtime(true) - $t);

// collect changed/new entries
$t = microtime(true);
$changed = [];
$new = [];
foreach ($new_csv as $row) {
    $key = $genKey($row);

    // key doesn't exist => new entry
    if (!isset($index[$key])) {
        $new[] = $row;
    }

    // primary contact differs => changed entry
    elseif ($row[$primary_contact_index_no] !== $index[$key][$primary_contact_index_no]) {
        $changed[] = $row;
    }
}

printf('comparison: %.5fs', microtime(true) - $t);
print_r($changed);
print_r($new);

正如Michael指出的，您当前的解决方案运行了

n*m

次。每一个都是25k，这简直太多了。但是，如果您先运行旧数据，创建一个索引，然后运行新数据并对照该索引进行检查，您将在

m+n

迭代中完成

例如：

$name_index_no            = 1;
$state_index_no           = 2;
$city_index_no            = 3;
$country_index_no         = 4;
$address_index_no         = 5;
$primary_contact_index_no = 7;

$genKey = function ($row, $glue = '|') use ($state_index_no, $city_index_no, $name_index_no) {
    return implode($glue, [
        $row[$state_index_no],
        $row[$city_index_no],
        $row[$name_index_no],
    ]);
};


// create an index using the old data
$t = microtime(true);
$index = [];
foreach ($old_csv as $row) {
    $index[$genKey($row)] = $row;
}

printf('index generation: %.5fs', microtime(true) - $t);

// collect changed/new entries
$t = microtime(true);
$changed = [];
$new = [];
foreach ($new_csv as $row) {
    $key = $genKey($row);

    // key doesn't exist => new entry
    if (!isset($index[$key])) {
        $new[] = $row;
    }

    // primary contact differs => changed entry
    elseif ($row[$primary_contact_index_no] !== $index[$key][$primary_contact_index_no]) {
        $changed[] = $row;
    }
}

printf('comparison: %.5fs', microtime(true) - $t);
print_r($changed);
print_r($new);

你认为什么是你需要的合理表现？你需要在瞬间得到答案吗？几秒钟？在当前的解决方案中，您的运行时复杂性为O（n^2），这不是很好：）您只是想知道这两个文件是否至少在其中一个字段中存在差异，还是想收集所有相关行（例如，在下一步中使用它）？@Michael，如果可以在一瞬间完成，那将非常棒。你有什么建议吗？@Yoshi，如果我说只有一个值不同，那么我如何快速比较并获得与该特定值不同的新CSV记录（“主要联系人”请参见上面的数组）？@再次。。这取决于你的需要，你需要像数据库一样思考这个问题，你有两个优势1。快速查询时间2。快速索引（或读取）时间。那么您需要读取文件并实时向用户显示输出吗？或者这是一个2个独立的过程？你认为什么是你的需要的合理表现？你需要在瞬间得到答案吗？几秒钟？在当前的解决方案中，您的运行时复杂性为O（n^2），这不是很好：）您只是想知道这两个文件是否至少在其中一个字段中存在差异，还是想收集所有相关行（例如，在下一步中使用它）？@Michael，如果可以在一瞬间完成，那将非常棒。你有什么建议吗？@Yoshi，如果我说只有一个值不同，那么我如何快速比较并获得与该特定值不同的新CSV记录（“主要联系人”请参见上面的数组）？@再次。。这取决于你的需要，你需要像数据库一样思考这个问题，你有两个优势1。快速查询时间2。快速索引（或读取）时间。那么您需要读取文件并实时向用户显示输出吗？或者这是一个2个独立的进程？我尝试了你的代码，但它只提供了9条记录$changed数组，但有2k+条记录是不同的，这是我从嵌套循环中获得的，在我的core i7机器上运行大约3分钟，在我的VPS主机上运行大约6分钟。检查的字段是否可能存在重复项（州、市、名）？我之前也删除了重复项，请参见我的更新后的上述回答。我将代码更改为使用字符串键（所需字段的串联），请尝试使用更新后的代码。我想我发现了问题，基本上我的互联网上传速度很慢（我正在上传CSV），因为我刚刚对第一行代码计时并在第一行之后退出，但这需要一些时间。但时间结果很快。我尝试了你的代码，但它只提供了9条记录$changed array，但有2k+条记录是不同的，这是我从嵌套循环中获得的，在我的core i7机器上运行大约3分钟，大约需要6分钟在我的VPS主机上，检查的字段（州、城市、名称）是否可能存在重复项？我之前删除了重复项，请参阅我的更新的上述回答我更改了代码以使用字符串键（所需字段的串联），请尝试使用更新的代码。我想我发现了问题，基本上我的互联网上传速度很慢（我正在上传CSV），因为我只是对第一行代码计时，然后在第一行之后退出，但这需要一些时间。但时间结果很快。