Google Apps Script: How to check whether a URL is indexed by Google?


I want to create a Google script that checks whether a given URL is indexed by Google, so I wrote the following function:

function CheckURLForGoogleIndex(url, activesheet) {
  // Delete the https:// and http:// prefix
  var cururl = url.replace("https://", "");
  cururl = cururl.replace("http://", "");
  var googlesearchurl = "https://www.google.com/search?q=site:" + encodeURIComponent(cururl);
  var page = UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true}).getContentText();
  // Wait for 1 second before starting another fetch
  Utilities.sleep(1000);
  var number = page.match("did not match any documents");
  if (number) {
    activesheet.getSheetByName("Not Google Index").appendRow([url]);
  } else {
    activesheet.getSheetByName("Google Index").appendRow([url]);
  }
}
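However, when debugging the code, after UrlFetchApp.fetch is called I can see only the title of the variable page.

I tested the function with both a Google-indexed URL and a non-indexed URL, but page.match returns null for both, so both end up in the "Google Index" sheet.

What is wrong with my function?

Thanks.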

Note:

I have already asked this question elsewhere, but nobody answered, so I have to ask it here.

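Sample input and output

Input 1:

url=

activesheet = a Google Sheets spreadsheet containing the sheets "Google Index" and "Not Google Index"

Expected output 1: because the URL is indexed by Google, it should be added to the "Google Index" sheet

page = "site:www.datanumen.com/ - Google Search…"
Input 2:

url=

activesheet = a Google Sheets spreadsheet containing the sheets "Google Index" and "Not Google Index"

Expected output 2: because the URL is not indexed by Google, it should be added to the "Not Google Index" sheet

page = "site:www.datanumen.com/notindexurl/ - G…"
Currently both Input 1 and Input 2 show the problem. The actual result is that the URL is always added to the "Google Index" sheet, because the fetched search result never contains the text "did not match any documents".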

Update

I added console.log(page); and debugged again. For Input 1, I get the following result:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://www.google.com/search?q=site:www.datanumen.com%2F</title></head>
<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById('captcha');if(e){e.focus();}">
<div style="max-width:400px;">
<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>
<form id="captcha-form" action="index" method="post">
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<script>var submitCallback = function(response) {document.getElementById('captcha-form').submit();};</script>
<div id="recaptcha" class="g-recaptcha" data-sitekey="6LfwuyUTAAAAAOAmoS0fdqijC2PbbdH4kjq62Y1b" data-callback="submitCallback" data-s="c5Hy4maqTFv3SzYRiWhpsqYF2isZmauUQnLVljOiED_PiaVWJWCsHMzRAZyh8HLCBHJ_mjET7yODJu8AlZ33_xGAQ8TcKuXAd7rQpsYakaGKPD8USiGSFhiII2ai-Cf_B26i1Ufpko-qYQ8V3rezhiSXxi5J2yHZ-_WwEj8ukzy5znxzVurTM_2cY243Q4ofwP7E7eWBaHIg6N3ofmPuFXd-uRIUU4z0cU_pas8"></div>
<input type='hidden' name='q' value='EgRrsuB5GKmx6oUGIhBKAdWty9nssg-nAtyy9n7hMgFy'><input type="hidden" name="continue" value="https://www.google.com/search?q=site:www.datanumen.com%2F">
</form>
<hr noshade size="1" style="color:#ccc; background-color:#ccc;">

<div style="font-size:13px;">
<b>About this page</b><br><br>

Our systems have detected unusual traffic from your computer network.  This page checks to see if it&#39;s really you sending the requests, and not a robot.  <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>

<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the <a href="//www.google.com/policies/terms/">Terms of Service</a>. The block will expire shortly after those requests stop.  In the meantime, solving the above CAPTCHA will let you continue to use our services.<br><br>This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests.  If you share your network connection, ask your administrator for help &mdash; a different computer using the same IP address may be responsible.  <a href="//support.google.com/websearch/answer/86640">Learn more</a><br><br>Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
</div>

IP address: 107.178.224.121<br>Time: 2021-06-04T21:18:34Z<br>URL: https://www.google.com/search?q=site:www.datanumen.com%2F<br>
</div>
</div>
</body>
</html>

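Answer: Unfortunately, doing this directly by scraping the search results with UrlFetchApp will not work. You can, however, use a third-party tool to get the number of search results.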

More information: I tested this using exponential backoff, a method that can sometimes overcome 429 errors when UrlFetchApp makes fetch requests.

When UrlFetchApp is used to scrape a website or connect to an API, the server may reject the request with "Too Many Requests", that is, HTTP error 429.

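Google Apps Script runs in the cloud, from a set of IP addresses in a pool owned by Google. You can, in fact, view all of these IP ranges. Most websites (especially those of large companies such as Google) have architecture in place to prevent bots from scraping their pages and slowing down traffic.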

Sometimes this error can be overcome with a mix of exponential backoff and random time intervals (full disclosure: I wrote this GitHub repository).
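As a rough illustration of exponential backoff with random jitter, here is a minimal sketch. It is not the code from the repository mentioned above. The helper names `backoffDelayMs` and `fetchWithBackoff`, the 1-second base delay, and the 32-second cap are all assumptions made for this example. The fetch and sleep functions are passed in so the helper does not depend on Apps Script globals.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt plus
// random jitter of up to one base interval, capped at maxMs.
// All names and default values here are hypothetical.
function backoffDelayMs(attempt, baseMs, maxMs, random) {
  const exponential = baseMs * Math.pow(2, attempt);
  const jitter = Math.floor(random() * baseMs);
  return Math.min(maxMs, exponential + jitter);
}

// Calls fetchFn until it returns a response whose status code is not 429,
// or until maxAttempts calls have been made. fetchFn must return an object
// with a getResponseCode() method, as UrlFetchApp.fetch does.
function fetchWithBackoff(fetchFn, sleepFn, maxAttempts) {
  let response = null;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    response = fetchFn();
    if (response.getResponseCode() !== 429) {
      return response;
    }
    if (attempt < maxAttempts - 1) {
      sleepFn(backoffDelayMs(attempt, 1000, 32000, Math.random));
    }
  }
  return response; // still 429 after the last attempt
}
```

In an Apps Script project this could be called as `fetchWithBackoff(() => UrlFetchApp.fetch(url, {muteHttpExceptions: true}), (ms) => Utilities.sleep(ms), 5)`. Backoff only spaces the requests out; it does not help if the target blocks the whole IP pool.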

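I assume that either Google blocks the Apps Script IP pool outright, or too many people are trying the same thing, because with this technique I could not get any response that did not involve entering a CAPTCHA, as we discussed in the comments above and as can be seen in the log of the page.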
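One practical consequence is that a missing "did not match any documents" string does not prove the URL is indexed, since the CAPTCHA page does not contain it either. The sketch below distinguishes the cases using only strings that appear in the logged response and in the question's code. The function name `classifySearchPage` and its return labels are hypothetical, and the markers could stop matching if Google changes its pages.

```javascript
// Classifies the HTML returned for a "site:" query.
// Returns "blocked" for the CAPTCHA interstitial, "not_indexed" when the
// no-results message is present, and "unknown" otherwise. "unknown" is
// deliberate: the absence of the no-results text alone proves nothing.
function classifySearchPage(html) {
  if (html.indexOf('id="captcha-form"') !== -1 ||
      html.indexOf("unusual traffic") !== -1) {
    return "blocked";
  }
  if (html.indexOf("did not match any documents") !== -1) {
    return "not_indexed";
  }
  return "unknown";
}
```

With a check like this, the original function could log or retry on "blocked" rather than appending the URL to the "Google Index" sheet.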

What can be done: There are many third-party APIs that can do this, and I suggest searching for one that meets your needs.

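I tested one of them, which returns search engine indexing for different keywords. The API is asynchronous, so it can take up to a minute to get a response, which is why a Web App solution is needed.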
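Because the result arrives later as a callback, the Apps Script project has to be deployed as a web app that accepts a POST request. A minimal sketch of the receiving side might look like the following. The helper name `parseCallback` is hypothetical, and the structure of the callback body depends on the API provider, so the sketch only parses the JSON and does not assume any fields.

```javascript
// Extracts the JSON body from the event object that Apps Script passes
// to doPost(e). Returns null when there is no body or it is not valid JSON.
function parseCallback(e) {
  if (!e || !e.postData || !e.postData.contents) {
    return null;
  }
  try {
    return JSON.parse(e.postData.contents);
  } catch (err) {
    return null;
  }
}

// Web app entry point, invoked by Apps Script on each POST request.
// What to do with `data` depends on the provider's callback format,
// which is not shown here.
function doPost(e) {
  const data = parseCallback(e);
  return ContentService.createTextOutput(data === null ? "ignored" : "ok");
}
```

The URL of the deployed web app is what would be supplied as the callback address in the API request.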

The flow I used was as follows:

  • Sign up for the API (free)
  • Create a new Apps Script project to make the API calls:
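function makeApiCall(url, method, site) {
  const public_key = ""
  const private_key = ""
  const salt = ""
  let timestamp = Date.now()
  // Sign timestamp + public key + salt with the private key
  const hash = Utilities.computeHmacSha256Signature(timestamp + public_key + salt, private_key)
  const headers = {
    "Authorization": "KeyAuth publicKey=" + public_key + " hash=" + toHexString(hash) + " ts=" + timestamp,
    "Content-Type": "application/json"
  }
  const requestParameters = {
    "search_engine": "google",
    "region": "us",
    "language": "en",
    "max_results": 100,
    "phrase": site,
    "search_type": "web",
    "user_agent": "pc",
    "parameters": {
      "priority": "standard"
    },
    "callback_type": "full",
    "callback": "script web app exec url"
  }
  const options = {
    "method": method,
    "headers": headers,
    "muteHttpExceptions": true,
    "payload": JSON.stringify(requestParameters)
  }
  const response = UrlFetchApp.fetch(url, options)
  return response
}

// Converts the byte array returned by the signature call to a hex string
function toHexString(byteArray) {
  const hexString = Array.from(byteArray, function(byte) {
    return ('0' + (byte & 0xFF).toString(16)).slice(-2)
  }).join('')
  return hexString
}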
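The `& 0xFF` mask in the hex conversion matters because Apps Script returns signature bytes as signed values in the range -128 to 127. Without the mask, a negative byte would produce a string such as "-1" instead of "ff". The following standalone sketch uses a hypothetical helper name, `bytesToHex`, and a made-up byte array to show the effect.

```javascript
// Converts an array of signed bytes (-128..127) to a lowercase hex string.
// The mask maps each signed byte to its unsigned value (0..255) first.
function bytesToHex(byteArray) {
  return Array.from(byteArray, function (byte) {
    return ("0" + (byte & 0xFF).toString(16)).slice(-2);
  }).join("");
}

// Made-up sample bytes, including negative values.
const sample = [-1, 0, 15, -128];
console.log(bytesToHex(sample)); // "ff000f80"
```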