Google Apps Script: how to check whether a URL is indexed by Google?


I want to create a Google script that checks whether a given URL is indexed by Google, so I wrote the following function:

function CheckURLForGoogleIndex(url, activesheet) {
  // Delete the https:// and http:// prefix
  var cururl = url.replace("https://", "");
  cururl = cururl.replace("http://", "");
  var googlesearchurl = "https://www.google.com/search?q=site:" + encodeURIComponent(cururl);
  var page = UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true}).getContentText();
  // Wait for 1 second before starting another fetch
  Utilities.sleep(1000);
  var number = page.match("did not match any documents");
  if (number) {
    activesheet.getSheetByName("Not Google Index").appendRow([url]);
  } else {
    activesheet.getSheetByName("Google Index").appendRow([url]);
  }
}

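However, when I debug the code, I can only see the header of the variable page after UrlFetchApp.fetch has been called.

I tested the function with both a Google-indexed URL and a non-indexed URL, but page.match returns null for both, so both are put in the "Google Index" sheet.

What is wrong with my function?

Thanks

Note:

I have already asked this question, but nobody answered, so I have to ask it here.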

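Sample inputs and outputs

Input 1:

url =

activesheet = a Google Sheets spreadsheet containing the "Google Index" and "Not Google Index" sheets

Expected output 1: since the URL is indexed by Google, it will be added to the "Google Index" sheet

page = "site:www.datanumen.com/ - Google Search…"

Input 2:

url =

activesheet = a Google Sheets spreadsheet containing the "Google Index" and "Not Google Index" sheets

Expected output 2: since the URL is not indexed by Google, it will be added to the "Not Google Index" sheet

page = "site:www.datanumen.com/notindexurl/ - G…"

At present there is a problem with both Input 1 and Input 2. The actual result is that the URL is always added to the "Google Index" sheet, because the fetched page never contains the text "did not match any documents".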

Update

I added console.log(page); and debugged again. For Input 1, I get the following result:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://www.google.com/search?q=site:www.datanumen.com%2F</title></head>
<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById('captcha');if(e){e.focus();}">
<div style="max-width:400px;">
<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>
<form id="captcha-form" action="index" method="post">
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<script>var submitCallback = function(response) {document.getElementById('captcha-form').submit();};</script>
<div id="recaptcha" class="g-recaptcha" data-sitekey="6LfwuyUTAAAAAOAmoS0fdqijC2PbbdH4kjq62Y1b" data-callback="submitCallback" data-s="c5Hy4maqTFv3SzYRiWhpsqYF2isZmauUQnLVljOiED_PiaVWJWCsHMzRAZyh8HLCBHJ_mjET7yODJu8AlZ33_xGAQ8TcKuXAd7rQpsYakaGKPD8USiGSFhiII2ai-Cf_B26i1Ufpko-qYQ8V3rezhiSXxi5J2yHZ-_WwEj8ukzy5znxzVurTM_2cY243Q4ofwP7E7eWBaHIg6N3ofmPuFXd-uRIUU4z0cU_pas8"></div>
<input type='hidden' name='q' value='EgRrsuB5GKmx6oUGIhBKAdWty9nssg-nAtyy9n7hMgFy'><input type="hidden" name="continue" value="https://www.google.com/search?q=site:www.datanumen.com%2F">
</form>
<hr noshade size="1" style="color:#ccc; background-color:#ccc;">

<div style="font-size:13px;">
<b>About this page</b><br><br>

Our systems have detected unusual traffic from your computer network.  This page checks to see if it&#39;s really you sending the requests, and not a robot.  <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>

<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the <a href="//www.google.com/policies/terms/">Terms of Service</a>. The block will expire shortly after those requests stop.  In the meantime, solving the above CAPTCHA will let you continue to use our services.<br><br>This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests.  If you share your network connection, ask your administrator for help &mdash; a different computer using the same IP address may be responsible.  <a href="//support.google.com/websearch/answer/86640">Learn more</a><br><br>Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
</div>

IP address: 107.178.224.121<br>Time: 2021-06-04T21:18:34Z<br>URL: https://www.google.com/search?q=site:www.datanumen.com%2F<br>
</div>
</div>
</body>
</html>



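Answer: Unfortunately, doing this directly by scraping the search results with UrlFetchApp will not work. You can, however, use a third-party tool to get the number of search results.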

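More information: I tested this using an exponential backoff method, which is sometimes able to get past the error returned when UrlFetchApp makes fetch requests.

When using UrlFetchApp for web scraping or for connecting to an API, the server may reject the request on the grounds of Too Many Requests, that is, HTTP error 429.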
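Google Apps Script runs in the cloud, from a set of IP addresses drawn from a Google-owned pool. You can actually see all of the IP ranges. Most websites (especially those of big companies such as Google) have architecture in place to prevent bots from scraping their sites and slowing down traffic.
Sometimes this error can be overcome with a mixture of exponential backoff and random time intervals (full disclosure: I wrote this GitHub repository).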
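The combination of exponential backoff and random intervals mentioned above could look roughly like the sketch below. It is not the code from the repository the answer refers to. The names backoffDelayMs and fetchWithBackoff, the 1 second base delay, and the 32 second cap are all assumptions made for illustration. The fetch and sleep functions are passed in as parameters, so the retry logic does not depend on Apps Script services directly.

```javascript
// Delay before the next retry (attempt is 0-based):
// baseMs * 2^attempt plus a random jitter below baseMs, capped at maxMs.
function backoffDelayMs(attempt, baseMs, maxMs, random) {
  const exponential = baseMs * Math.pow(2, attempt);
  const jitter = Math.floor(random() * baseMs);
  return Math.min(maxMs, exponential + jitter);
}

// Calls fetchFn until it returns a response whose code is not 429,
// or until maxAttempts calls have been made. Returns the last response.
function fetchWithBackoff(fetchFn, sleepFn, maxAttempts) {
  let response = null;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    response = fetchFn();
    if (response.getResponseCode() !== 429) {
      return response;
    }
    // Do not sleep after the final attempt.
    if (attempt < maxAttempts - 1) {
      sleepFn(backoffDelayMs(attempt, 1000, 32000, Math.random));
    }
  }
  return response; // still HTTP 429 after all attempts
}

// Hypothetical wiring inside Apps Script:
// const response = fetchWithBackoff(
//   () => UrlFetchApp.fetch(googlesearchurl, { muteHttpExceptions: true }),
//   (ms) => Utilities.sleep(ms),
//   5
// );
```

Even with retries like these, the request can keep coming back with a CAPTCHA page, so the caller still has to handle a final 429 response.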
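I assume that either Google blocks the Apps Script IP pool outright, or too many people are trying the same thing: using this same technique, I could not get any response that did not involve entering a CAPTCHA, as we discussed in the comments above and as can be seen in the logged page.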
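Because the blocked response is a CAPTCHA page and not a results page, the original check for "did not match any documents" never matches, and every URL lands in the "Google Index" sheet. A minimal sketch of a guard that recognises this case is shown below. The function name classifySearchPage and its return values are hypothetical. The marker strings come from the page logged in the question and are an assumption, since the markup of that page is not a documented contract.

```javascript
// Classifies a fetched search page as "blocked", "no_results" or "other".
// The marker strings are assumptions based on the page logged in the question.
function classifySearchPage(statusCode, body) {
  const looksBlocked =
    statusCode === 429 ||
    body.indexOf('id="captcha-form"') !== -1 ||
    body.indexOf("unusual traffic") !== -1;
  if (looksBlocked) {
    return "blocked";
  }
  if (body.indexOf("did not match any documents") !== -1) {
    return "no_results";
  }
  // Anything else; the function in the question treats this case as "indexed".
  return "other";
}

// Hypothetical use inside Apps Script:
// const response = UrlFetchApp.fetch(googlesearchurl, { muteHttpExceptions: true });
// const kind = classifySearchPage(response.getResponseCode(), response.getContentText());
// if (kind === "blocked") { /* do not write to either sheet */ }
```

Checking for the blocked case first keeps a CAPTCHA page from being recorded as an indexed URL.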
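What can be done:
There are many third-party APIs that can be used for this; I suggest searching for one that meets your needs.
I tested one of them, which returns search engine indexing for different keywords. The API is asynchronous, so it can take a minute to get a response, which means a Web App solution is needed.
The flow I used was as follows:

(free)
Create a new Apps Script project to make the API call: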

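function makeApiCall(url, method, site) {
  // API credentials; fill these in with your own values
  const public_key = ""
  const private_key = ""
  const salt = ""
  let timestamp = Date.now()
  // Sign timestamp + public key + salt with the private key
  const hash = Utilities.computeHmacSha256Signature(timestamp + public_key + salt, private_key)
  const headers = {
    "Authorization": "KeyAuth publicKey=" + public_key + " hash=" + toHexString(hash) + " ts=" + timestamp,
    "Content-Type": "application/json"
  }
  const requestParameters = {
    "search_engine": "google",
    "region": "US",
    "language": "en",
    "max_results": 100,
    "phrase": site,
    "search_type": "web",
    "user_agent": "pc",
    "parameters": {
      "priority": "standard"
    },
    "callback_type": "full",
    "callback": "script-web-app-exec-url"
  }
  const options = {
    "method": method,
    "headers": headers,
    "muteHttpExceptions": true,
    "payload": JSON.stringify(requestParameters)
  }
  const response = UrlFetchApp.fetch(url, options)
  return response
}

// Converts the signed byte array returned by Utilities into a hex string
function toHexString(byteArray) {
  const hexString = Array.from(byteArray, function(byte) {
    return ('0' + (byte & 0xFF).toString(16)).slice(-2)
  }).join('')
  return hexString
}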
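Since the API is asynchronous and posts its result to the "callback" URL, the web app that receives the callback could be sketched as follows. This is only an illustration: the shape of the callback payload is not given in the answer, so the body is simply parsed as JSON and stored raw. The helper parseCallbackBody and the sheet name "Callback Log" are hypothetical, and the sketch assumes that the script is bound to a spreadsheet which already contains that sheet.

```javascript
// Parses the JSON body that the API is assumed to POST to the web app.
function parseCallbackBody(contents) {
  try {
    return { ok: true, data: JSON.parse(contents) };
  } catch (err) {
    return { ok: false, error: String(err) };
  }
}

// Web app entry point: Apps Script calls doPost(e) for HTTP POST requests
// sent to the deployed web app URL; e.postData.contents holds the raw body.
function doPost(e) {
  const parsed = parseCallbackBody(e.postData.contents);
  // Assumption: a container-bound script and an existing sheet "Callback Log".
  SpreadsheetApp.getActiveSpreadsheet()
    .getSheetByName("Callback Log")
    .appendRow([new Date(), parsed.ok ? JSON.stringify(parsed.data) : parsed.error]);
  return ContentService.createTextOutput(parsed.ok ? "ok" : "bad payload");
}
```

The URL of the deployed web app is what would go into the "callback" field of the request parameters.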