PHP: Strip HTML to remove all JS/CSS/HTML tags and get the actual text (as displayed in the browser) for indexing and search
I have tried strip_tags, but it still leaves inline JS, such as (function(){..}), and inline CSS, such as #button{}.
I need to extract just the plain text from the HTML, without any JS functions, styles, or tags, so that I can index it and use it for a search feature.
html2text does not seem to solve the problem either.
Edit
PHP code:
$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
$fileHeaders = @get_headers($url);
if( $fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK")
{
$content = strip_tags(file_get_contents($url));
}
Output:
$content=
(function() { var a=window,c="jstiming",d="tick";var e=function(b){this.t={};this.tick=function(b,o,f){f=void 0!=f?f:(new Date).getTime();this.t[b]=[f,o]};this[d]("start",null,b)},h=new e;a.jstiming={Timer:e,load:h};if(a.performance&&a.performance.timing){var i=a.performance.timing,j=a[c].load,k=i.navigationStart,l=i.responseStart;0=k&&(j[d]("_wtsrt",void 0,k),j[d]("wtsrt_","_wtsrt",l))}
try{var m=null;a.chrome&&a.chrome.csi&&(m=Math.floor(a.chrome.csi().pageT));null==m&&a.gtbExternal&&(m=a.gtbExternal.pageT());null==m&&a.external&&(m=a.external.pageT);m&&(a[c].pt=m)}catch(n){};a.tickAboveFold=function(b){var g=0;if(b.offsetParent){do g+=b.offsetTop;while(b=b.offsetParent)}b=g;750>=b&&a[c].load[d]("aft")};var p=!1;function q(){p||(p=!0,a[c].load[d]("firstScrollTime"))}a.addEventListener?a.addEventListener("scroll",q,!1):a.attachEvent("onscroll",q);
})();
Everyman Software: Development Setup for Neo4j and PHP: Part 2
#navbar-iframe { display:block }
if(window.addEventListener) {
window.addEventListener('load', prettyPrint, false);
} else {
window.attachEvent('onload', prettyPrint);
}
var a=navigator,b="userAgent",c="indexOf",f="&m=1",g="(^|&)m=",h="?",i="?m=1";function j(){var d=window.location.href,e=d.split(h);switch(e.length){case 1:return d+i;case 2:return 0
2011-11-05
Development Setup for Neo4j and PHP: Part 2
This is Part 2 of a series on setting up a development environment for building projects using the graph database Neo4j and PHP. In Part 1 of this series, we set up unit test and development databases. In this part, we'll build a skeleton project that includes unit tests, and a minimalistic user interface.
All the files will live under a directory on our web server. In a real project, you'll probably want only the user interface files under the web server directory and your testing and library files somewhere more protected.
Also, I won't be using any specific PHP framework. The principles in t
This will strip <script type="text/javascript">alert('hello world');</script>
<a href="http://www.google.com">Google</a>
This will not be executed; it will simply be displayed on your site.
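Alternatively, try converting
Here is a small snippet that I always use to remove all the hidden text from a web page, including everything between tags such as <head>, <style>, and <script>. It also replaces any run of whitespace, of whatever kind, with a single space.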
<?php
$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
$fileHeaders = @get_headers($url);
// get_headers() returns false on failure, so check that before reading the status line
if( $fileHeaders && ($fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK"))
{
    // Pass the raw HTML here, not a string that strip_tags() has already processed
    $content = strip_html_tags(file_url_contents($url));
}
############################################
//To fetch the $url by using cURL
function file_url_contents($url){
    $crl = curl_init();
    $timeout = 30;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret === false ? '' : $ret; // curl_exec() returns false on failure
} //file_url_contents ENDS
//To remove all the hidden text not displayed on a webpage
function strip_html_tags($str){
    // Drop runs of three identical angle brackets (<<< or >>>)
    $str = preg_replace('/(<|>)\1{2}/is', '', $str);
    // The u modifier expects valid UTF-8; preg_replace() returns null otherwise
    $str = preg_replace(
        array(// Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
        ),
        "", //replace above with nothing
        $str );
    $str = replaceWhitespace($str);
    $str = strip_tags($str);
    return $str;
} //function strip_html_tags ENDS
//To replace all types of whitespace with a single space
function replaceWhitespace($str) {
    $result = $str;
    // Every pair of adjacent whitespace characters is reduced to its first character
    foreach (array(
        "  ", " \t", " \r", " \n",
        "\t\t", "\t ", "\t\r", "\t\n",
        "\r\r", "\r ", "\r\t", "\r\n",
        "\n\n", "\n ", "\n\t", "\n\r",
    ) as $replacement) {
        $result = str_replace($replacement, $replacement[0], $result);
    }
    // Repeat until a pass changes nothing
    return $str !== $result ? replaceWhitespace($result) : $result;
}
############################
?>
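Since the question mentions that pattern matching may not cover every case, a parser-based variant of the same idea can be sketched as follows. This is only a minimal sketch, not the method from the answer above: the function name html_to_visible_text and the sample markup are made up for illustration, and it assumes PHP's bundled DOM extension (DOMDocument, DOMXPath) is available. Note that loadHTML() falls back to ISO-8859-1 when the page declares no charset, so real pages may need their encoding handled first.

```php
<?php
// Hypothetical helper (not part of the original answer): collect the text nodes
// that are not inside <head>, <style>, <script> or <noscript>.
function html_to_visible_text(string $html): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//text()[not(ancestor::script or ancestor::style'
        . ' or ancestor::noscript or ancestor::head)]'
    );

    $parts = array();
    foreach ($nodes as $node) {
        $piece = trim($node->nodeValue);
        if ($piece !== '') {
            $parts[] = $piece;
        }
    }

    // Join with spaces so words from adjacent elements do not run together,
    // then collapse any remaining whitespace runs to a single space.
    return preg_replace('/\s+/', ' ', implode(' ', $parts));
}

// Made-up sample page with inline CSS and JS.
$sample = '<html><head><title>T</title><style>#button{color:red}</style></head>'
        . '<body><script>(function(){var a=1;})();</script>'
        . '<h1>Hello</h1><p>World   of <b>text</b></p></body></html>';

echo html_to_visible_text($sample), "\n"; // Hello World of text
```

Joining the text nodes with a space is a deliberate trade-off for indexing: it keeps words from neighbouring block elements apart, at the cost of splitting a word that is partly wrapped in an inline tag.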
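See it in action, with its output, here:
Please share your HTML code with the inline JS and CSS.
I need to write a generic function that can accept any URL and strip all non-text data (HTML tags/JS/CSS) so that the result can be fed to an indexing search engine. preg_match is not a solution, because there are too many possibilities I might need to filter. I don't want to end up writing another library, which would be a whole project in itself!
It still does not take inline/embedded JavaScript/CSS into account: (function() { var a=window,c="jstiming",d="tick";var e=function(b){this.t={};...
This is the code executed with the given URL: This is the output text: Look at this output text and let me know which lines/sections are unwanted. Please refer to them by line number.
Note that you need to pass the original string through this function, not the string obtained after using strip_tags(). In the latter case, the identifying tags such as <script> and <style> have already been stripped, so my function cannot tell hidden text from displayed text.
In answer to your first comment: on investigating, I found that the fragment you are referring to, function(){var a=window,c="jsti…., is on line 7 of the page source, enclosed in <script> & </script> tags. Please do not pass the whole string to my function after running strip_tags. Send the string intact, with all of its head, style, and script tags.
Thanks for this nice piece of code; it helped me remove JS and CSS from the string.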
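The point made in the comments about ordering can be illustrated with a short sketch. The sample page below is made up, and the two patterns are a simplified version of the ones in the answer, so treat it as an illustration under those assumptions rather than a complete solution.

```php
<?php
// Made-up sample page, used only to illustrate why the order matters.
$html = '<html><head><style>#button { color: red }</style></head>'
      . '<body><script>(function(){ var a = window; })();</script>'
      . '<p>Visible text</p></body></html>';

// strip_tags() removes the tags themselves but keeps the text between them,
// so the CSS rule and the JS body survive as plain text.
$tagsOnly = strip_tags($html);

// Removing whole <style>/<script> elements first, while their tags still mark
// where the hidden content starts and ends, leaves only the visible text.
$withoutHidden = preg_replace(
    array('@<style[^>]*>.*?</style>@si', '@<script[^>]*>.*?</script>@si'),
    '',
    $html
);
$visible = trim(strip_tags($withoutHidden));

var_dump(strpos($tagsOnly, 'var a = window') !== false); // bool(true): the JS leaked through
echo $visible, "\n"; // Visible text
```

Once strip_tags() has run, nothing marks where the script or style content began, which is why the raw HTML has to go through the element-removal step first.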
$content = '<a href="http://www.google.com">Google</a>';
// Single-quote the pattern so the double quotes inside it do not end the string
$regex = '#<a href=".*?">(.+?)</a>#';
preg_match($regex, $content, $match);
echo $match[1]; // Google
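The pattern above is tied to one exact attribute layout. As a hedged alternative, and assuming PHP's DOM extension is available, the link text can also be read through DOMDocument. The second link in the sample is invented to show markup that the fixed pattern would not match.

```php
<?php
// Hypothetical sample: the second link uses single quotes and an extra attribute,
// which a pattern written for <a href="..."> would miss.
$content = '<a href="http://www.google.com">Google</a> '
         . "<a class='ext' href='http://www.php.net'>PHP</a>";

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // tolerate imperfect markup quietly
$doc->loadHTML($content);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Collect the text of every <a> element, whatever its attributes look like.
$linkTexts = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $linkTexts[] = trim($anchor->textContent);
}

print_r($linkTexts);
```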