Regex: Extracting a specific user agent from Apache domlogs with Perl

regex perl

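I am currently building a regular expression that should extract the name of the user agent of bots visiting a site. So far I have been able to get the expression to match, but it does not return the value I expect. Please see the following example: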

#!/usr/bin/perl

use strict; use warnings;

while (<>)
{
#Remove any unseen whitespace
chomp($_);

my $i = 0;


#Open every file in turn
open(my $domlog, "<", "$_") or die "cannot open file: $!";

#these were used for testing the open/closing of files
#print "Opened $_";
#print "Closed $_";

#for now confirm the file I'm searching through
print "Opened $_\n";

#Adding the name of the domain to the @domaind array for data processing later
push (@domain, $2) if $_ =~ m/(\/usr\/local\/apache\/domlogs\/.*\/)(.*)/;

#search through the currently opened domlog line by line
while (<$domlog>) {

#clear white space again
chomp $_;

#Print the record in full, then print the IP address of the visitor and what should be the useragent name
print "$_\n";
print "$1\n $2\n\n" if $_ =~ m/^(\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3})\s(.*)\s.*(\w+[crawl|bot|spider|yahoo|bing|google])?/i;

}

close $domlog;

}

Without sample expected output, I can only guess at what you are trying to achieve. But here are a few things worth pointing out about your script:
push (@domain, $2) if $_ =~ m/(\/usr\/local\/apache\/domlogs\/.*\/)(.*)/;

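You are already using the m operator, which lets you change the delimiter characters. There are also (?:...) non-capturing groups, but in this case you do not even need the group. A bare regex that is not bound with =~ always matches against $_, so you can drop that part. In list context, a match returns the contents of its capture groups. Putting all of this together: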
push @domain, m~/usr/local/apache/domlogs/.*/(.*)~;
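
To make the list-context behaviour concrete, here is a minimal sketch. The two input paths are made up for illustration; the point is that a failed match returns an empty list, so push adds nothing for it.

```perl
use strict;
use warnings;

my @domain;

# Hypothetical input paths; the second one is not under domlogs.
my @paths = (
    '/usr/local/apache/domlogs/someuser/example.com',
    '/tmp/other.log',
);

for (@paths) {
    # Bare match against $_ in list context: returns the capture on
    # success and an empty list on failure, so nothing is pushed then.
    push @domain, m~/usr/local/apache/domlogs/.*/(.*)~;
}

print "@domain\n";
```

With these inputs only example.com ends up in @domain.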

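Now on to the other expression. When things get complicated you should use the /x flag, which can greatly improve readability. The dot (.) is a special character in a regex that matches any character, so you probably want to escape it. Also, for the IP address match you can use a (?:...) group and repeat it, as in the full expression below.

[...] matches any one of the characters inside the brackets.
[crawl|bot|spider|yahoo|bing|google]

can be simplified to
[abcdeghilnoprstwy|]

and would do the same thing, which is obviously not what you want, but it highlights where you went wrong. What you probably want is a non-capturing group. If you make it optional it will most likely match nothing (so drop the ? after the group):
(?:crawl|bot|spider|yahoo|bing|google)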
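
The difference between the character class and the alternation group can be checked directly. This is only an illustrative sketch; the variable names are invented here, and the anchors \A and \z are added so that the class has to account for the whole string.

```perl
use strict;
use warnings;

# A character class matches ONE character out of the listed set.
my $class = qr/\A[crawl|bot|spider|yahoo|bing|google]\z/;
my $short = qr/\A[abcdeghilnoprstwy|]\z/;

# A non-capturing group with alternation matches one of the whole words.
my $group = qr/(?:crawl|bot|spider|yahoo|bing|google)/i;

# Printable ASCII characters on which the two classes disagree (none expected).
my @differ = grep { ($_ =~ $class) xor ($_ =~ $short) } map { chr($_) } 32 .. 126;

printf "classes disagree on %d characters\n", scalar @differ;
printf "class matches the word 'bot': %s\n", ('bot' =~ $class ? 'yes' : 'no');
printf "group matches 'AhrefsBot':    %s\n", ('AhrefsBot' =~ $group ? 'yes' : 'no');
```

The two classes accept the same single characters, the class cannot match the three-letter word, and the group finds Bot inside AhrefsBot.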

This is what the beast looks like:
if (/^(\d{1,3}(?:\.\d{1,3}){3})                  # $1 - ip address
     \s(.*)\s*                                   # $2 - within spaces
     (\w*(?:crawl|bot|spider|yahoo|bing|google)) # $3 - some bot string
    /xi){                                        # end of regex
  print ("$1\n$2\n$3\n");
}

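This may still not be what you want, but I do not know what that is. You may want to make the group for $2 non-greedy, (.*?). If you want to find matches inside parentheses, you can also escape some parentheses.
Finally, have a look around first, because someone may already have done this work for you.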
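
To illustrate the effect of making $2 non-greedy, here is a small sketch that runs both variants against the sample log line quoted later in this thread. The line is written out in Apache combined log format, which is an assumption about the original spacing, and the variable names are invented for the example.

```perl
use strict;
use warnings;

# Sample record from this thread (spacing assumed to follow combined log format).
my $line = '188.165.15.208 - - [13/Jan/2015:09:20:49 -0500] '
         . '"GET /?page_id=2 HTTP/1.1" 200 10574 "-" '
         . '"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"';

my $bots = qr/crawl|bot|spider|yahoo|bing|google/i;

# Greedy: (.*) grabs as much as possible, so the last group ends up on
# the LAST keyword in the line.
my ($ip_g, undef, $bot_g) =
    $line =~ /^(\d{1,3}(?:\.\d{1,3}){3})\s(.*)\s*(\w*(?:$bots))/;

# Non-greedy: (.*?) grabs as little as possible, so the last group ends
# up on the FIRST keyword in the line.
my ($ip_n, undef, $bot_n) =
    $line =~ /^(\d{1,3}(?:\.\d{1,3}){3})\s(.*?)\s*(\w*(?:$bots))/;

print "greedy:     $ip_g $bot_g\n";
print "non-greedy: $ip_n $bot_n\n";
```

On this line the greedy version captures bot (from the robot/ part of the URL), while the non-greedy version captures AhrefsBot.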
Here is the relevant documentation (these are perldoc pages, so if perldoc is installed on your system you can also run perldoc perlretut):

Regular expression tutorial
Documentation of regular expressions
A reference that will come in handy once you have at least read perlretut
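A character class can only contain characters. [crawl|bot|spider|yahoo|bing|google] is not interpreted the way you think; it is read as c or r or a, and so on.

Please provide some sample input for clarity.

In \s(.*)\s.* the \s is useless, because the greedy * will consume every \s, if there are any.

\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3} can be simplified to \d{1,3}(.\d{1,3}){3}

"it does not return the value I expect": you should share that value with us. A sample of the log would be very helpful.

Your comments "Remove any unseen whitespace" and "clear white space again" are both incorrect. chomp() removes the current value of $/ (usually a newline) from the end of the given string. It has no effect on any other whitespace characters.

Thanks for the help so far. Using the modified code you have provided, I am extracting the following incorrect information:
188.165.15.208 - - [13/Jan/2015:09:20:49 -0500] "GET /?page_id=2 HTTP/1.1" 200 10574 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
188.165.15.208 bot
I am trying to get this information:
188.165.15.208 - - [13/Jan/2015:09:20:49 -0500] "GET /?page_id=2 HTTP/1.1" 200 10574 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
188.165.15.208 AhrefsBot/5.0;
So the regex here is probably not quite accurate.

You still have not said what the expected output is, so nobody can tell you the correct solution.

@JPeck89: If you want the bot name (guessing again), you can match against /\((?:.*)*([^;]*(?:b(?:ot|ing)|crawl|yahoo|google|spider)[^;]*);/. The name will then be in $1. It uses the negated character class [^;] to match anything except ;.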
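
As a rough sketch of the idea in the last comment (picking the ;-delimited piece that contains a bot keyword), the following does not try to reproduce that exact pattern. It assumes the user agent is the last double-quoted field of the record, and bot_token is a hypothetical helper name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first piece of the user agent (split on
# ';' and parentheses) that contains a keyword, or undef if there is none.
sub bot_token {
    my ($line) = @_;

    # Assumption: the user agent is the last double-quoted field.
    my ($ua) = $line =~ /"([^"]*)"\s*\z/;
    return undef unless defined $ua;

    for my $part (split /[;()]/, $ua) {
        $part =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        return $part if $part =~ /(?:crawl|bot|spider|yahoo|bing|google)/i;
    }
    return undef;
}

# Sample record from this thread (spacing assumed to follow combined log format).
my $line = '188.165.15.208 - - [13/Jan/2015:09:20:49 -0500] '
         . '"GET /?page_id=2 HTTP/1.1" 200 10574 "-" '
         . '"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"';

my $token = bot_token($line);
print defined($token) ? "$token\n" : "no bot keyword found\n";
```

For the sample record this yields AhrefsBot/5.0. Restricting the search to the user agent field avoids accidental hits in the request path, such as a request for /robots.txt.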
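
On the chomp() remark in the comments above, a short sketch shows what it does and does not remove. The sample string is made up, and the explicit substitution is just one common way to trim other whitespace when that is actually wanted.

```perl
use strict;
use warnings;

my $line = "  padded line \t\n";

# chomp removes only a trailing $/ (a newline by default).
chomp(my $copy = $line);

# Leading and trailing whitespace has to be trimmed explicitly.
(my $trimmed = $copy) =~ s/^\s+|\s+$//g;

printf "after chomp: [%s]\n", $copy;
printf "after trim:  [%s]\n", $trimmed;
```

After chomp the leading spaces and the trailing space and tab are still there; only the substitution removes them.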