Python IndexError when scraping from a web page


python web-scraping web-crawler


I have been trying to use this code to scrape data from xhamster channels for research purposes:

import json
from multiprocessing.dummy import Pool as ThreadPool

from lxml import html

from util import req


def get_channel_urls(url):
    r = req(url)
    tree = html.fromstring(r.text)
    print("Done", url)
    return [x.attrib['href'] for x in tree.xpath('//div[@class="item"]/a')]

def write_channel_data(url):
    r = req(url)
    html_text = r.text
    tree = html.fromstring(html_text)
    json_data = json.loads(
        tree.xpath('//script[@id="initials-script"]/text()')[0].strip().split("window.initials =")[1][:-1].strip())
    with open("channel_html/{}".format(json_data['sponsorChannel']['inurl']), 'w', encoding='utf-8') as outfile:
        outfile.write(html_text)
    print("Written data for:", url)


def main():
    letters = '0abcdefghijklmnopqrstuvwxyz'
    index_urls = ['https://xhamster.com/channels/all/{}'.format(index_letter) for index_letter in letters]
    index_urls.extend(['https://xhamster.com/gay/channels/all/{}'.format(index_letter) for index_letter in letters])
    index_urls.extend(['https://xhamster.com/shemale/channels/all/{}'.format(index_letter) for index_letter in letters])
    channel_urls = []
    for url in index_urls:
        channel_urls.extend(get_channel_urls(url))

    with open('channel_urls', 'w') as channel_url_backup_file:
        channel_url_backup_file.write("\n".join(channel_urls))

    # with open('channel_urls') as i:  # THIS IS TO READ A PRE-DOWNLOADED URL FILE
    #     channel_urls = [url.strip() for url in i.read().split()]

    with ThreadPool(processes=10) as pool:
        pool.map(write_channel_data, channel_urls)


if __name__ == '__main__':
    main()

It does work for a while, but then I get this error. The error is apparently in the main() function, but I don't know how to fix it.

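You are getting an IndexError when json_data is defined in the write_channel_data function. I suspect it comes from ...split("window.initials =")[1], because if that text is not found, the string is not split and the resulting list has only one element. You should spend more time inspecting the contents of the html tree to find out why some channel URLs do not contain the text "window.initials =", since the error you are seeing points to the line tree.xpath('//script[@id="initials-script"]/text()').... Try breaking that line into several shorter lines to detect which part of it raises the error, and update the question if you still cannot fix it.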
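As a rough illustration of breaking that line apart, here is a minimal sketch. The helper name extract_initials and the MARKER constant are hypothetical and not part of the original code. It assumes script_texts is the list returned by tree.xpath('//script[@id="initials-script"]/text()'), and it strips a trailing semicolon with rstrip(";") instead of the original [:-1] slice. Each step that could raise an IndexError is checked separately, so a missing script tag can be told apart from a missing marker.

```python
import json

# Hypothetical constant: the text the original code splits on.
MARKER = "window.initials ="


def extract_initials(script_texts):
    """Return the parsed JSON dict, or None if any step fails.

    script_texts is assumed to be the list returned by
    tree.xpath('//script[@id="initials-script"]/text()').
    """
    # Step 1: the xpath may match nothing, in which case [0] would
    # raise an IndexError.
    if not script_texts:
        return None
    script = script_texts[0].strip()

    # Step 2: split() returns a one-element list when the marker is
    # absent, in which case [1] would raise an IndexError.
    parts = script.split(MARKER)
    if len(parts) < 2:
        return None

    # Step 3: drop the trailing semicolon (if any) and parse the JSON.
    payload = parts[1].strip().rstrip(";").strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None
```

In write_channel_data, the result could be checked for None before it is used, printing the url and returning early. That way a single unexpected page is reported instead of stopping the whole pool.map call.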