Perl script, web scraper

perl web web-scraping

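I'm new to Perl, and I have a script that fetches reviews from the Amazon website. Every time I run it, I get a compilation error. I was wondering if someone could explain exactly what is wrong with it.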

#!/usr/bin/perl
# get_reviews.pl
#
# A script to scrape Amazon, retrieve reviews, and write to a file
# Usage: perl get_reviews.pl <asin>
use strict;
use warnings;
use LWP::Simple;

# Take the asin from the command-line
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Assemble the URL from the passed asin.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

#Remove everything before the reviews
$content =~ s!.*?Number of Reviews:!!ms;

# Loop through the HTML looking for matches
while ($content =~ m!<img.*?stars-(\d)-0.gif.*?>.*?<b>(.*?)</b>, (.*?)[RETURN]
\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {

my($rating,$title,$date,$reviewer,$review) = [RETURN] 
($1||'',$2||'',$3||'',$4||'',$5||'');
$reviewer =~ s!<.+?>!!g;   # drop all HTML tags
$reviewer =~ s!\(.+?\)!!g;   # remove anything in parenthesis
$reviewer =~ s!\n!!g;      # remove newlines
$review =~ s!<.+?>!!g;     # drop all HTML tags
$review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.

# Print the results
print "$title\n" . "$date\n" . "by $reviewer\n" .
      "$rating stars.\n\n" . "$review\n\n";

}

Maybe you should try Web::Scraper. It will do the job in a cleaner way.

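[Edit] In any case, I checked the HTML of a random review and found that your pattern is outdated. For example, the reviewer's name is now introduced by "By", not by "Reviewer".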

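The syntax error seems to be caused by the "[RETURN]" that appears twice in the code. When I removed both occurrences, the code compiled without problems.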
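As a rough sketch of what the loop looks like once the two "[RETURN]" markers are gone (they are presumably line-wrap marks from wherever the script was copied, not Perl syntax), the version below joins the wrapped lines back together. The `parse_reviews` helper, the `$sample` HTML, and the final whitespace trim are assumptions added for illustration; the sample is invented to match the old pattern and does not reflect Amazon's current markup.

```perl
use strict;
use warnings;

# Same unescape table as in the question.
my %unescape    = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', keys %unescape;

# The question's pattern on a single line, without "[RETURN]".
# The dot in ".gif" is escaped here; otherwise the pattern is unchanged.
my $review_re = qr{<img.*?stars-(\d)-0\.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>}is;

# Hypothetical helper: returns one hash reference per review found.
sub parse_reviews {
    my ($content) = @_;
    my @reviews;

    # Remove everything before the reviews.
    $content =~ s!.*?Number of Reviews:!!s;

    while ($content =~ /$review_re/g) {
        # The list assignment is one statement; no "[RETURN]" between the lines.
        my ($rating, $title, $date, $reviewer, $review) =
            ($1 || '', $2 || '', $3 || '', $4 || '', $5 || '');

        $reviewer =~ s/<.+?>//g;        # drop all HTML tags
        $reviewer =~ s/\(.+?\)//g;      # remove anything in parentheses
        $reviewer =~ s/\n//g;           # remove newlines
        $reviewer =~ s/^\s+|\s+$//g;    # trim leftover whitespace (added here)
        $review   =~ s/<.+?>//g;        # drop all HTML tags
        $review   =~ s/($unescape_re)/$unescape{$1}/g;    # unescape entities

        push @reviews, {
            rating   => $rating,
            title    => $title,
            date     => $date,
            reviewer => $reviewer,
            review   => $review,
        };
    }
    return @reviews;
}

# Invented sample in the old page layout, used only to exercise the pattern.
my $sample = join "\n",
    'Number of Reviews: 1',
    '<img src="stars-4-0.gif" alt="4"> <b>Great book</b>, January 1, 2004',
    'Reviewer:',
    '<b>',
    'Jane Doe (Seattle)</b><table><tr><td>x</td></tr></table>',
    'Loved it &amp; learned a lot.<br>',
    '<br>';

for my $r (parse_reviews($sample)) {
    print "$r->{title}\n$r->{date}\nby $r->{reviewer}\n",
          "$r->{rating} stars.\n\n$r->{review}\n\n";
}
```

This only shows that the code compiles and matches the old layout; against a live page the pattern would still need updating.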

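Amazon doesn't really like people scraping their site. That's why they provide an API that gives you access to their content. There is also a Perl module for using that API. You should use it rather than fragile web scraping techniques.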

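Can you post more information about the error?

Basically, it says: Global symbol "$reviewer" requires explicit package name at C:\User\test.pl line 25. There is also a syntax error near "0(" at line 36. I know the global symbol error is because the variable isn't declared properly, but I don't understand the syntax error.

@user2916250 There is no $reviewer on line 25, and there is no syntax error in the code you posted. If you want anyone to be able to help, please post the actual code.

I think $reviewer needs to be declared right at the top, as in "my $reviewer".

Web::Scraper is a definite improvement over hand-written regular expressions, but it is still web scraping. Using Amazon's API is the best solution here.

I didn't know about the API. For this particular task, it is certainly the best solution. Web scraping modules should be considered when no API is available...

Hmm, I thought Amazon had gotten rid of the API for customer reviews? Has that changed?

Thanks for the information! The page about their API says: "The Product Advertising API helps you advertise Amazon products using product search and look up capability, product information and features such as Customer Reviews, Similar Products, Wish Lists and New and Used listings."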
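On the "Global symbol ... requires explicit package name" message raised in the comments: under `use strict`, a variable has to be declared (usually with `my`) before it is used. The sketch below compiles two one-line snippets with string `eval` and reports the result for each; `compile_error` is a hypothetical helper written only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet under "use strict" and return
# the error text, or an empty string if it compiled and ran cleanly.
sub compile_error {
    my ($snippet) = @_;
    return eval("use strict; $snippet; 1") ? '' : $@;
}

# Declared with "my": accepted under strict.
my $declared = compile_error(q{my $reviewer = 'Jane Doe'});

# Not declared: strict rejects it at compile time.
my $undeclared = compile_error(q{$reviewer = 'Jane Doe'});

print "declared:   ", ($declared || 'compiles cleanly'), "\n";
print "undeclared: $undeclared";
```

In the question's script the variable is declared inside the `while` loop, so this message is most likely a knock-on effect of the earlier syntax error rather than a missing declaration.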