PHP: excluding domains from a list of matches in a string


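I have my server's access log, a file with a .log extension. It has about 150K lines containing URLs, and I want to write those URLs to a separate text file, one URL per line.

I want to exclude domains such as the http://www.google.com bot and http://www.example.com. All of them are added to the array below, and I will add more to the list. The logged URLs start with a domain such as example.com, but some carry different paths or query strings after it and others are just the plain domain.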

$string = '
166.137.126.16 - - [06/May/2017:02:32:33 +0530] "GET /files/adg3com_crypticpsyche2.mp3 HTTP/1.0" 200 906922 "http://paradiseconcertspresents.com/?dn=content&cn=page&sw=view&page_id=location"
66.249.92.82 - - [06/May/2017:02:32:36 +0530] "GET /wp/autotow/wp-content/uploads/sites/5/locations_bg2.jpg HTTP/1.0" 500 658 "-" "AdsBot-Google (+http://www.google.com/adsbot.html)"
100.6.157.102 - - [06/May/2017:02:32:36 +0530] "GET /files/food_icon1.png HTTP/1.0" 200 3681 "http://totopomex.com/" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X)"
100.6.157.102 - - [06/May/2017:02:32:36 +0530] "GET /files/food_icon3.png HTTP/1.0" 200 4028 "http://totopomex.com/" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X)"
97.83.34.133 - - [06/May/2017:02:32:38 +0530] "GET /files/1920x1200.jpg HTTP/1.0" 404 416 "http://thatsapizzami.com/odds-ends/"
77.49.52.0 - - [06/May/2017:02:32:40 +0530] "GET /files/favicon.png HTTP/1.0" 200 1239 "http://radionotios.gr/"
66.175.153.111 - - [06/May/2017:02:32:45 +0530] "GET /files/pixel_weave.png HTTP/1.0" 404 416 "http://www.mississippisportsmedicine.com/"
66.249.92.82 - - [06/May/2017:02:32:46 +0530] "GET /wp/wp-content/uploads/sites/5/subheader_bg.jpg HTTP/1.0" 500 658 "-" "AdsBot-Google (+http://www.google.com/adsbot.html)"
66.249.92.86 - - [06/May/2017:02:33:06 +0530] "GET /wp/autotow/wp-content/uploads/sites/5/locations_bg2.jpg HTTP/1.0" AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html)"
216.255.37.4 - - [06/May/2017:02:33:09 +0530] "GET /files/food_icon1.png HTTP/1.0" 200 3681 "http://spenglers.com/" "Mozilla/5.0 (iPad; CPU OS 9_3_5 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13G36"
141.70.4.75 - - [06/May/2017:02:33:09 +0530] "GET /wp-includes/js/jquery/ui/core.min.js?ver=1.11.4 HTTP/1.0" 200 2251 "http://www.example.com/medical/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:53.0) Gecko/20100101 Firefox/53.0"
141.70.4.75 - - [06/May/2017:02:34:09 +0530] "GET /wp-includes/js/jquery/ui/core.min.js?ver=1.11.4 HTTP/1.0" 200 2251 "http://www.example.com/medical/standard-post/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:53.0) Gecko/20100101 Firefox/53.0"
';

// Match every URL starting with http(s): or www., stopping at
// whitespace, a double quote, or a closing parenthesis
preg_match_all('~(?:https?:|www\.)[^\s")]+~', $string, $match);

/**
 * Domains to exclude. Logged URLs will not always match these exactly
 * (some carry different paths or query strings), so excluding anything
 * that starts with one of these domains would be ideal.
 */
$exclude_domains = array(
    'google'   => 'http://www.google.com/adsbot.html',
    'example' => 'http://www.example.com/',
    'msn'      => 'http://www.msn.com/adsbot.html',
    'yandex'   => 'http://www.yandex.com/adsbot.html',
);

// Excludes duplicate entries
$unique_match = array_unique($match[0]);

// Print each match on its own line
foreach ( $unique_match as $matchlink ){
    echo $matchlink ."\n";
}
What am I trying to do?
Now I want to exclude the domains listed above, but I don't know how to do it; this is as far as I have got.

You can do it like this:

<?php

$string = '
166.137.126.16 - - [06/May/2017:02:32:33 +0530] "GET /files/adg3com_crypticpsyche2.mp3 HTTP/1.0" 200 906922 "http://paradiseconcertspresents.com/?dn=content&cn=page&sw=view&page_id=location"
...
';

$logEntries = explode("\n", $string);

foreach ($logEntries as $index => $logEntry) {
    if (preg_match("(www\.google\.com|www\.example\.com)", $logEntry) > 0) {
        unset($logEntries[$index]);
    }
}
// $logEntries now contains remaining entries that do not contain the filtered out domains
foreach ($logEntries as $logEntry) {
    echo $logEntry . "\n";
}

$exclude_domains = array(
    'google'   => 'http://www.google.com/adsbot.html',
    'example' => 'http://www.example.com/',
    'msn'      => 'http://www.msn.com/adsbot.html',
    'yandex'   => 'http://www.yandex.com/adsbot.html',
);

// Quote each entry, escaping the "~" delimiter as well
$regex = '~' . implode('|', array_map(function ($domain) {
    return preg_quote($domain, '~');
}, $exclude_domains)) . '~';

// Excludes duplicate entries
$unique_match = array_unique($match[0]);

// Print each match on its own line, skipping excluded domains
foreach ( $unique_match as $matchlink ){
    if (!preg_match($regex, $matchlink)) {
        echo "$matchlink\n";
    }
}
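Here, a new regular expression is built from the domains to exclude (each one is passed through preg_quote first). Inside the foreach loop, every match is checked against it and printed only if it does not match.
Another approach would be to use negative lookaheads in the original expression.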
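One caveat: the array values are full URLs, so a logged URL such as http://www.google.com/mobile/adsbot.html would not be caught by the quoted http://www.google.com/adsbot.html alternative. Below is a minimal sketch of a variant that builds the pattern from host names only. It assumes the excluded hosts can be listed as bare host names; `$excludeHosts`, `$urls`, and `$kept` are hypothetical names used for illustration, not part of the code above.

```php
<?php
// Hypothetical list of hosts to exclude (assumption: bare host names, no scheme or path).
$excludeHosts = array('www.google.com', 'www.example.com', 'www.msn.com', 'www.yandex.com');

// Quote each host for use inside a "~"-delimited pattern.
$quoted = array_map(function ($host) {
    return preg_quote($host, '~');
}, $excludeHosts);

// Scheme, then one of the hosts, then a boundary (/, :, ?, # or end of string),
// so that a longer host that merely starts with a listed name is not excluded.
$regex = '~^https?://(?:' . implode('|', $quoted) . ')(?=[/:?#]|$)~i';

// Sample URLs based on the log excerpt in the question.
$urls = array(
    'http://totopomex.com/',
    'http://www.google.com/adsbot.html',
    'http://www.google.com/mobile/adsbot.html',
    'http://www.example.com/medical/standard-post/',
    'http://radionotios.gr/',
);

// Keep only the URLs whose host is not on the exclusion list.
$kept = array();
foreach ($urls as $url) {
    if (!preg_match($regex, $url)) {
        $kept[] = $url;
    }
}

echo implode("\n", $kept), "\n";
```

With this list, only the totopomex.com and radionotios.gr URLs are printed; both google.com URLs are dropped regardless of their path.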
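The negative-lookahead idea could look roughly like the sketch below. It is an illustration under stated assumptions rather than a drop-in replacement: the hosts are given as bare host names in a hypothetical `$excludeHosts` array, and the pattern matches only on the `http(s)://` scheme. The `www\.` alternative from the question is left out on purpose, because after the lookahead rejects `http://www.google.com/...` that alternative would match the same URL again starting at `www.`.

```php
<?php
// Hypothetical host list (assumption: bare host names rather than full URLs).
$excludeHosts = array('www.google.com', 'www.example.com');

$quoted = array_map(function ($host) {
    return preg_quote($host, '~');
}, $excludeHosts);

// Right after the scheme, the negative lookahead rejects any URL whose host
// is on the list; otherwise the URL is consumed up to whitespace, " or ).
$pattern = '~https?://(?!(?:' . implode('|', $quoted) . ')(?:[/:?#"\s]|$))[^\s")]+~i';

// Three lines taken from the log excerpt in the question.
$log = implode("\n", array(
    '66.249.92.82 - - [06/May/2017:02:32:36 +0530] "GET /wp/autotow/wp-content/uploads/sites/5/locations_bg2.jpg HTTP/1.0" 500 658 "-" "AdsBot-Google (+http://www.google.com/adsbot.html)"',
    '100.6.157.102 - - [06/May/2017:02:32:36 +0530] "GET /files/food_icon1.png HTTP/1.0" 200 3681 "http://totopomex.com/" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X)"',
    '141.70.4.75 - - [06/May/2017:02:33:09 +0530] "GET /wp-includes/js/jquery/ui/core.min.js?ver=1.11.4 HTTP/1.0" 200 2251 "http://www.example.com/medical/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:53.0) Gecko/20100101 Firefox/53.0"',
));

preg_match_all($pattern, $log, $match);

// Drop duplicates and reindex, then print one URL per line.
$unique = array_values(array_unique($match[0]));
echo implode("\n", $unique), "\n";
```

For these three lines the only URL printed is http://totopomex.com/. The trade-off compared with the two-step approach is that extraction and filtering happen in a single pattern, at the cost of a harder-to-read expression.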