Crawling: scraper stops working due to structure changes

html web-scraping web-crawler

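While I am scraping a web page, its structure keeps changing (by that I mean it is dynamic), and this causes my scraper to stop working. Is there a mechanism that can detect a change in the page structure before the full crawler runs, so that I know whether the structure has changed?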

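If you can run your own JavaScript code in the web page, you can use a MutationObserver, which provides a mechanism for watching changes made to the DOM tree.

For example: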

function waitForDomStability(timeout: number): Promise<void> {
  return new Promise(resolve => {
    const waitResolve = (observer: MutationObserver) => {
      observer.disconnect();
      resolve();
    };

    let timeoutId: number;
    const observer = new MutationObserver((mutationList, observer) => {
      for (let i = 0; i < mutationList.length; i += 1) {
        // we only care about child nodes being added or removed
        if (mutationList[i].type === 'childList') {
          // restart the countdown timer
          window.clearTimeout(timeoutId);
          timeoutId = window.setTimeout(waitResolve, timeout, observer);
          break;
        }
      }
    });

    // resolve after `timeout` ms if the DOM does not change at all
    timeoutId = window.setTimeout(waitResolve, timeout, observer);

    // start observing document.body and all of its descendants
    observer.observe(document.body, { attributes: true, childList: true, subtree: true });
  });
}

I use this approach in an open-source scraping extension. For the full code, see /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repo.
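You can of course use "snapshots" to compare two versions of the same page. To achieve this, I implemented something similar to the Java String hashCode.
JavaScript code: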
/*
returns a DOM element snapshot as a hash code of its innerText
starting point is the Java String hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
keep everything fast: only work with a 32-bit hash and avoid explicit exponentiation
by updating the hash iteratively: hash = hash*31 + s[i], truncated to 32 bits
*/
function getSnapshot() {
    const snapshotSelector = 'body';
    const nodeToBeHashed = document.querySelector(snapshotSelector);
    if (!nodeToBeHashed) return 0;

    const { innerText } = nodeToBeHashed;

    let hash = 0;
    if (innerText.length === 0) {
      return hash;
    }

    for (let i = 0; i < innerText.length; i += 1) {
      // an integer between 0 and 65535 representing the UTF-16 code unit
      const charCode = innerText.charCodeAt(i);

      // multiply by 31 and add current charCode
      hash = ((hash << 5) - hash) + charCode;

      // convert to 32 bits as bitwise operators treat their operands as a sequence of 32 bits
      hash |= 0;
    }

    return hash;
}
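
To compare two versions of the same page, the hash from one run has to be kept and checked against the hash from the next run. The following is a minimal sketch of that comparison, not the extension's actual code. The names `hashText` and `detectChange` are hypothetical, and the text is passed in as a string (rather than read from `document`) so the sketch can also run outside a browser.

```javascript
// Same 32-bit rolling hash as getSnapshot above, but taking the text as an
// argument instead of reading innerText from the DOM (hypothetical helper).
function hashText(text) {
  let hash = 0;
  for (let i = 0; i < text.length; i += 1) {
    // hash * 31 + charCode, truncated to a signed 32-bit integer
    hash = ((hash << 5) - hash + text.charCodeAt(i)) | 0;
  }
  return hash;
}

// Compares the current page text with the hash saved by a previous run.
// previousHash is null on the first run, when there is nothing to compare.
function detectChange(previousHash, currentText) {
  const currentHash = hashText(currentText);
  return {
    changed: previousHash !== null && previousHash !== currentHash,
    currentHash,
  };
}

const firstRun = detectChange(null, 'Price: 10 EUR');
const secondRun = detectChange(firstRun.currentHash, 'Price: 10 EUR');
const thirdRun = detectChange(secondRun.currentHash, 'Price: 12 EUR');
console.log(firstRun.changed, secondRun.changed, thirdRun.changed); // false false true
```

In a real scraper, `currentHash` would be persisted between runs (in a file or database, for instance). Because the hash is computed over text, it also reports a change when only the content was updated and the markup stayed the same.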

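I want to compute the ASCII value of the web page and then compare it when I scrape the same page again. Do you think that is feasible?
It's clearer now. I thought you wanted the page to become "stable" before scraping.
"My scraper stops working": please provide an example.
It is easier in Python, because it has built-in hashing support.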
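
The comments above are about computing a value for the page and comparing it on the next scrape. Since the question is about structure rather than content, one possible variation is to compare the sequence of tag names instead of the page text. The sketch below illustrates that idea under stated assumptions: `structureSignature` is a hypothetical helper, and the regular expression is only a rough stand-in for a real HTML parser.

```javascript
// Builds a rough structural signature: the sequence of opening tag names.
// A regular expression only approximates HTML parsing; it will misread
// markup that appears inside comments or scripts.
function structureSignature(html) {
  const tagPattern = /<([a-zA-Z][a-zA-Z0-9-]*)\b[^>]*>/g;
  const names = [];
  for (const match of html.matchAll(tagPattern)) {
    names.push(match[1].toLowerCase());
  }
  return names.join('>');
}

const before = '<div class="item"><h2>Title A</h2><p>Text A</p></div>';
const newText = '<div class="item"><h2>Title B</h2><p>Text B</p></div>';
const newLayout = '<section><h3>Title B</h3><span>Text B</span></section>';

console.log(structureSignature(before)); // div>h2>p
console.log(structureSignature(before) === structureSignature(newText)); // true
console.log(structureSignature(before) === structureSignature(newLayout)); // false
```

Here a content update leaves the signature unchanged, while a layout change alters it, so the crawler can check the signature of a sample page before starting a full run.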