Google Apps Script: how can I check whether a URL is indexed by Google?
I want to create a Google script that checks whether a given URL is indexed by Google, so I wrote the following function:
function CheckURLForGoogleIndex(url, activesheet) {
  // Delete the https:// and http:// prefix
  var cururl = url.replace("https://", "");
  cururl = cururl.replace("http://", "");
  var googlesearchurl = "https://www.google.com/search?q=site:" + encodeURIComponent(cururl);
  var page = UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true}).getContentText();
  // Wait for 1 second before starting another fetch
  Utilities.sleep(1000);
  var number = page.match("did not match any documents");
  if (number) {
    activesheet.getSheetByName("Not Google Index").appendRow([url]);
  } else {
    activesheet.getSheetByName("Google Index").appendRow([url]);
  }
}
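However, when I debug the code, after the call to UrlFetchApp.fetch I can only see the title in the variable page.
I tested the function with both a Google-indexed URL and a non-indexed URL, but page.match returns null for both, so both end up in the "Google Index" sheet.
What is wrong with my function?
Thanks.
Note:
I have already asked this question elsewhere, but nobody answered, so I am asking here.
Sample input and output
Input 1:
url=
activesheet = a Google Sheets spreadsheet containing the "Google Index" and "Not Google Index" sheets
Expected output 1: because the URL is indexed by Google, it is added to the "Google Index" sheet
page = "site:www.datanumen.com/ - Google Search…"
Input 2:
url=
activesheet = a Google Sheets spreadsheet containing the "Google Index" and "Not Google Index" sheets
Expected output 2: because the URL is not indexed by Google, it is added to the "Not Google Index" sheet
page = "site:www.datanumen.com/notindexurl/ - G…"
Currently both Input 1 and Input 2 fail. The actual result is that the URL is always added to the "Google Index" sheet, because the fetched page never contains the text "did not match any documents".
Update
I added console.log(page); and debugged again. For Input 1, I get the following result: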
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://www.google.com/search?q=site:www.datanumen.com%2F</title></head>
<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById('captcha');if(e){e.focus();}">
<div style="max-width:400px;">
<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>
<form id="captcha-form" action="index" method="post">
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<script>var submitCallback = function(response) {document.getElementById('captcha-form').submit();};</script>
<div id="recaptcha" class="g-recaptcha" data-sitekey="6LfwuyUTAAAAAOAmoS0fdqijC2PbbdH4kjq62Y1b" data-callback="submitCallback" data-s="c5Hy4maqTFv3SzYRiWhpsqYF2isZmauUQnLVljOiED_PiaVWJWCsHMzRAZyh8HLCBHJ_mjET7yODJu8AlZ33_xGAQ8TcKuXAd7rQpsYakaGKPD8USiGSFhiII2ai-Cf_B26i1Ufpko-qYQ8V3rezhiSXxi5J2yHZ-_WwEj8ukzy5znxzVurTM_2cY243Q4ofwP7E7eWBaHIg6N3ofmPuFXd-uRIUU4z0cU_pas8"></div>
<input type='hidden' name='q' value='EgRrsuB5GKmx6oUGIhBKAdWty9nssg-nAtyy9n7hMgFy'><input type="hidden" name="continue" value="https://www.google.com/search?q=site:www.datanumen.com%2F">
</form>
<hr noshade size="1" style="color:#ccc; background-color:#ccc;">
<div style="font-size:13px;">
<b>About this page</b><br><br>
Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>
<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the <a href="//www.google.com/policies/terms/">Terms of Service</a>. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.<br><br>This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. <a href="//support.google.com/websearch/answer/86640">Learn more</a><br><br>Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
</div>
IP address: 107.178.224.121<br>Time: 2021-06-04T21:18:34Z<br>URL: https://www.google.com/search?q=site:www.datanumen.com%2F<br>
</div>
</div>
</body>
</html>
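Answer:
Unfortunately, doing this directly by web-scraping the search results with UrlFetchApp will not work. You can, however, use a third-party tool to get the number of search results.
More information:
I tested this with an exponential backoff approach, which can sometimes overcome 429 errors when UrlFetchApp makes fetch requests.
When you use UrlFetchApp for web scraping or to connect to an API, the server may reject the request with "Too Many Requests", that is, HTTP error 429.
Google Apps Script runs in the cloud, from a set of IP addresses in a pool owned by Google. You can actually see all of the IP ranges. Most websites (especially large companies like Google) have architecture in place to prevent bots from scraping their sites and slowing down traffic.
Sometimes this error can be overcome by mixing exponential backoff with random time intervals (full disclosure: I wrote this GitHub repository).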
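To make the idea of exponential backoff with random intervals concrete, here is a minimal sketch. It is not the code from the repository mentioned above; the names `backoffDelayMs` and `fetchWithBackoff`, the 1000 ms base delay, and the 32000 ms cap are all assumptions chosen for illustration. The fetch and sleep operations are passed in as functions so the retry logic can also be exercised outside Apps Script.

```javascript
// Delay before the retry that follows attempt number `attempt` (0-based):
// baseMs * 2^attempt plus a random jitter below baseMs, capped at maxMs.
// `random` is optional and injectable so the function can be tested.
function backoffDelayMs(attempt, baseMs, maxMs, random) {
  var jitter = Math.floor((random || Math.random)() * baseMs);
  return Math.min(maxMs, baseMs * Math.pow(2, attempt) + jitter);
}

// Calls fetchFn() until the response status is not 429 or maxAttempts is reached.
// fetchFn must return an object with getResponseCode(), like an HTTPResponse.
function fetchWithBackoff(fetchFn, sleepFn, maxAttempts) {
  var response = null;
  for (var attempt = 0; attempt < maxAttempts; attempt++) {
    response = fetchFn();
    if (response.getResponseCode() !== 429) {
      return response;
    }
    if (attempt < maxAttempts - 1) {
      sleepFn(backoffDelayMs(attempt, 1000, 32000)); // assumed base and cap
    }
  }
  return response; // still 429 after every attempt
}

// Possible use inside Apps Script (assumes googlesearchurl is defined):
// var response = fetchWithBackoff(
//   function () { return UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true}); },
//   function (ms) { Utilities.sleep(ms); },
//   5);
```

Retrying only helps when the server answers with a plain 429; it does nothing against a CAPTCHA page returned with a different status.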
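I assume that either Google blocks the Apps Script IP pool outright, or too many people are attempting the same thing, because with this same technique I could not get any response that did not involve entering a CAPTCHA, as we discussed in the comments above and as can be seen in the logged page.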
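Because the blocked response is an HTML page rather than an exception, the original function silently treats it as "indexed". A small check like the following sketch would at least make that case visible. The function name `classifySearchResponse` and its four labels are hypothetical; the `id="captcha-form"` marker is taken from the page logged in the question, and the "did not match any documents" text is the one the question already searches for. Google's markup may differ or change, so the 'indexed' verdict remains a guess.

```javascript
// Classify a fetched search response so that a CAPTCHA or 429 page
// is never mistaken for a real results page.
function classifySearchResponse(statusCode, body) {
  if (statusCode === 429 || body.indexOf('id="captcha-form"') !== -1) {
    return 'blocked';      // rate limited or CAPTCHA interstitial
  }
  if (statusCode !== 200) {
    return 'error';        // any other non-success status
  }
  if (body.indexOf('did not match any documents') !== -1) {
    return 'not-indexed';
  }
  return 'indexed';        // assumption: anything else is a results page
}

// Possible use inside Apps Script (assumes googlesearchurl is defined):
// var response = UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true});
// var verdict = classifySearchResponse(response.getResponseCode(), response.getContentText());
```

With this check, a URL would only be appended to either sheet when the verdict is 'indexed' or 'not-indexed'; 'blocked' and 'error' results can be logged or retried instead.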
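What can be done:
There are many third-party APIs that can be used for this, and I suggest searching for one that meets your needs.
I tested one that returns search engine indexing for different keywords. The API is asynchronous, so it can take up to a minute to get a response, which means a web app solution is needed.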
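Since the result arrives later as a callback, the Apps Script project has to be deployed as a web app that accepts POST requests. The sketch below shows one way the receiving side might look. It assumes the API posts a JSON body to the web app's exec URL; the shape of that body is not specified here, so the code only parses it and does not read any particular field. `parseCallbackBody` is a hypothetical helper name.

```javascript
// Parse the raw POST body delivered to the web app.
// Returns {ok: true, data: ...} on success or {ok: false, error: ...} on bad JSON.
function parseCallbackBody(rawBody) {
  try {
    return { ok: true, data: JSON.parse(rawBody) };
  } catch (err) {
    return { ok: false, error: String(err) };
  }
}

// Apps Script web app entry point, called when the API posts its callback.
function doPost(e) {
  var parsed = parseCallbackBody(e.postData.contents);
  if (parsed.ok) {
    // Assumption: what to do with parsed.data depends on the API's payload,
    // for example writing the result count to a sheet.
    console.log('Callback received');
  } else {
    console.log('Callback body was not valid JSON: ' + parsed.error);
  }
  return ContentService.createTextOutput(parsed.ok ? 'received' : 'invalid JSON');
}
```

The web app has to be deployed with access for anonymous users, otherwise the external service cannot reach `doPost`.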
The flow I used was as follows:
- (free)
- Create a new Apps Script project to make the API call:
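function makeApiCall(url, method, site) {
  // Credentials issued by the API provider
  const public_key = ""
  const private_key = ""
  const salt = ""
  let timestamp = Date.now()
  // Sign timestamp + public key + salt with the private key (HMAC-SHA256)
  const hash = Utilities.computeHmacSha256Signature(timestamp + public_key + salt, private_key)
  const headers = {
    "Authorization": "KeyAuth publicKey=" + public_key + " hash=" + toHexString(hash) + " ts=" + timestamp,
    "Content-Type": "application/json"
  }
  const requestParameters = {
    "search_engine": "google",
    "region": "us",
    "language": "en",
    "max_results": 100,
    "phrase": site,
    "search_type": "web",
    "user_agent": "pc",
    "parameters": {
      "priority": "standard"
    },
    "callback_type": "full",
    // Placeholder: replace with the exec URL of the deployed web app
    "callback": "script-web-app-exec-url"
  }
  const options = {
    "method": method,
    "headers": headers,
    "muteHttpExceptions": true,
    "payload": JSON.stringify(requestParameters)
  }
  const response = UrlFetchApp.fetch(url, options)
  return response
}
// Convert the signed byte array returned by Utilities into a lowercase hex string
function toHexString(byteArray) {
  const hexString = Array.from(byteArray, function(byte) {
    return ('0' + (byte & 0xFF).toString(16)).slice(-2)
  }).join('')
  return hexString
}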