Python: IndexError when scraping from a web page
I have been trying to use this code to scrape data from xhamster channels for research purposes:
import json
from multiprocessing.dummy import Pool as ThreadPool

from lxml import html

from util import req


def get_channel_urls(url):
    r = req(url)
    tree = html.fromstring(r.text)
    print("Done", url)
    return [x.attrib['href'] for x in tree.xpath('//div[@class="item"]/a')]


def write_channel_data(url):
    r = req(url)
    html_text = r.text
    tree = html.fromstring(html_text)
    json_data = json.loads(
        tree.xpath('//script[@id="initials-script"]/text()')[0].strip().split("window.initials =")[1][:-1].strip())
    with open("channel_html/{}".format(json_data['sponsorChannel']['inurl']), 'w', encoding='utf-8') as outfile:
        outfile.write(html_text)
    print("Written data for:", url)


def main():
    letters = '0abcdefghijklmnopqrstuvqxyz'
    index_urls = ['https://xhamster.com/channels/all/{}'.format(index_letter) for index_letter in letters]
    index_urls.extend(['https://xhamster.com/gay/channels/all/{}'.format(index_letter) for index_letter in letters])
    index_urls.extend(['https://xhamster.com/shemale/channels/all/{}'.format(index_letter) for index_letter in letters])
    channel_urls = []
    for url in index_urls:
        channel_urls.extend(get_channel_urls(url))
    with open('channel_urls', 'w') as channel_url_backup_file:
        channel_url_backup_file.write("\n".join(channel_urls))
    # with open('channel_urls') as i:  # THIS IS TO READ A PRE-DOWNLOADED URL FILE
    #     channel_urls = [url.strip() for url in i.read().split()]
    with ThreadPool(processes=10) as pool:
        pool.map(write_channel_data, channel_urls)


if __name__ == '__main__':
    main()
It did work for a while, but then I got this error. The error is apparently in the main() function, but I don't know how to fix it.
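You are getting an IndexError where json_data is defined in the write_channel_data function. I suspect it comes from ...split("window.initials =")[1]: if that text is not found, split() returns a single-element list, so index 1 does not exist. You should spend more time inspecting the contents of the html tree to find out why some of the channel URLs do not contain the text "window.initials =", which indicates that the error occurs on the tree.xpath('//script[@id="initials-script"]/text()')... line. Try breaking that line into several shorter lines to detect which part of it fails, and update the question if you still cannot fix it.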
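As a rough illustration of breaking that line apart, here is a minimal sketch that checks each step separately instead of indexing blindly. The names extract_initials and MARKER are hypothetical and not part of the original code. The function takes the list returned by tree.xpath('//script[@id="initials-script"]/text()'), and it assumes the script body ends with an optional semicolon, which is what the original [:-1] appears to strip.

```python
import json

# Hypothetical constant: the text the original code splits on.
MARKER = "window.initials ="


def extract_initials(script_texts):
    """Return the parsed window.initials object, or None if it is missing.

    script_texts is expected to be the list returned by
    tree.xpath('//script[@id="initials-script"]/text()').
    """
    # Step 1: the xpath may match nothing, in which case [0] raises IndexError.
    if not script_texts:
        return None
    script = script_texts[0].strip()

    # Step 2: split() returns a one-element list when the marker is absent,
    # in which case [1] raises IndexError.
    parts = script.split(MARKER, 1)
    if len(parts) < 2:
        return None

    # Step 3: drop a trailing semicolon (assumed) before parsing the JSON.
    payload = parts[1].strip().rstrip(";").strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None
```

Inside write_channel_data, a None result could be used to print the offending URL and return early, which would show which channel pages lack the expected script without stopping the whole thread pool.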