PHP: extracting the title and abstract from a web page

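I am trying to extract the title and abstract from an arXiv page (for example, the one requested in the code below). My code currently looks like this: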

function get_title($url){
  $str = file_get_contents($url);
  if(strlen($str)>0){
    $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
    preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title); // ignore case
    return $title[1];
  }
}

echo get_title("http://arxiv.org/abs/1207.0102");

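When I run this code, I get the following error:

Warning: file_get_contents(): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in C:\wamp\www\mysite\Index.php

The problem does not occur when I try a different URL.

Does anyone know why this happens?

Also, is it possible to extract the abstract from this page?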

The site's response shows that it does not accept requests with an empty User-Agent:

HTTP/1.1 403 Forbidden

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>403 Forbidden</title></head>
<body>
<h1>Access Denied</h1>

 <p>Sadly, your client does not supply a proper User-Agent,
 and is consequently excluded.</p>
 <p>We have an inordinate number of problems with automated scripts
 which do not supply a User-Agent, and violate the automated access
 guidelines posted at arxiv.org
 -- hence we now exclude them all.</p>
 <p>(In rare cases, we have found that accesses through proxy servers
 strip the User-Agent information. If this is the case, you need to contact
 the administrator of your proxy server to get it fixed.)</p>


<p>If you believe this determination to be in error, see
<b>http://arxiv.org/denied.html</b> for additional information.</p>
</body>
</html>
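
Given that response, one way around the 403 is to send an explicit User-Agent with the request. The following is a minimal sketch, not a definitive fix: the function names (`build_http_context_options`, `fetch_page`, `get_title_from_html`) and the agent string in the usage note are made up for illustration, and the sketch assumes the site accepts any non-empty agent string, which the response above suggests but does not guarantee.

```php
<?php
// Sketch: pass an explicit User-Agent to file_get_contents() via a stream context.
// All function names here are hypothetical helpers, not part of PHP itself.

function build_http_context_options($userAgent)
{
    return array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent, // value sent in the User-Agent request header
            'timeout'    => 10,         // read timeout in seconds
        ),
    );
}

function fetch_page($url, $userAgent)
{
    $context = stream_context_create(build_http_context_options($userAgent));
    // '@' hides the warning; failure is reported through the null return value.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

function get_title_from_html($html)
{
    // Non-greedy match; 's' lets '.' span line breaks, 'i' ignores case.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(preg_replace('/\s+/', ' ', $m[1]));
    }
    return null; // no <title> element found
}
```

A call could then look like `$html = fetch_page("http://arxiv.org/abs/1207.0102", "MyTitleFetcher/0.1");` followed by `if ($html !== null) { echo get_title_from_html($html); }`. Choose an agent string that identifies your own script, and check the automated access guidelines the response mentions before fetching pages in bulk.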

Thanks, that's great. One last question: do you know how to extract the abstract from this page? I would be very grateful if you could help me.

@user3741635 Please ask a new question, and show what you have done so far to extract the abstract.
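
As a starting point for that follow-up, here is a rough sketch of pulling the abstract out of HTML that has already been fetched. It rests on an assumption the thread does not confirm: that the page wraps the abstract in a `<blockquote>` whose class list contains `abstract`, with an optional leading "Abstract:" label. The function name `get_abstract_from_html` is hypothetical; adjust the XPath expression to match the markup you actually see.

```php
<?php
// Sketch: extract the abstract text from an HTML string with DOMDocument and DOMXPath.
// Assumption (not confirmed by the thread): the abstract sits in <blockquote class="abstract ...">.

function get_abstract_from_html($html)
{
    if (!is_string($html) || $html === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match "abstract" as a whole class token, not as a substring of another class name.
    $query = '//blockquote[contains(concat(" ", normalize-space(@class), " "), " abstract ")]';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Collapse whitespace, then drop a leading "Abstract:" label if one is present.
    $text = trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
    return trim(preg_replace('/^Abstract:\s*/i', '', $text));
}
```

The function returns null when no matching element is found, so the caller can tell "no abstract" apart from an empty one. Pages containing non-ASCII characters may need their encoding declared before `loadHTML()` reads them correctly.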