Web scraping error when looping through all pages in Python
I'm trying to scrape a forum thread and loop through all of the pages inside a link. When I loop through all of the pages with the code below, it produces many duplicates:
import requests
from bs4 import BeautifulSoup

lst = []
comments = []
urls = ['https://www.f150forum.com/f118/2019-adding-adaptive-cruise-454662/',
        'https://www.f150forum.com/f118/adaptive-cruise-control-module-300894/']
for url in urls:
    with requests.Session() as req:
        for item in range(1, 33):
            response = req.get(f"{url}index{item}/")
            soup = BeautifulSoup(response.content, "html.parser")
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
            for item in soup.findAll('a', attrs={"class": "bigusername"}):
                lst.append([threadtitle.text])
            for div in soup.find_all('div', class_="ism-true"):
                try:
                    div.find('div', class_="panel alt2").extract()
                except AttributeError:
                    pass
                try:
                    div.find('label').extract()
                except AttributeError:
                    pass
                result = [div.get_text(strip=True, separator=" ")]
                comments.append(result)
The modified code below does not produce duplicates, but it skips the last page of each URL:
comments = []
for url in urls:
    with requests.Session() as req:
        index = 1
        while True:
            response = req.get(url + "index{}/".format(index))
            index = index + 1
            soup = BeautifulSoup(response.content, "html.parser")
            if 'disabled' in soup.select_one('a#mb_pagenext').attrs['class']:
                break
            posts = soup.find(id="posts")
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
            for item in soup.findAll('a', attrs={"class": "bigusername"}):
                lst.append([threadtitle.text])
            for div in soup.find_all('div', class_="ism-true"):
                try:
                    div.find('div', class_="panel alt2").extract()
                except AttributeError:
                    pass  # sometimes there is no 'panel alt2'
                try:
                    div.find('label').extract()
                except AttributeError:
                    pass  # sometimes there is no 'Quote'
                result = [div.get_text(strip=True, separator=" ")]
                comments.append(result)
Removing the line `if 'disabled' in soup.select_one('a#mb_pagenext').attrs['class']: break` makes the code loop forever. How can I loop through the pages without getting duplicates?

Just move the `if` condition to the bottom of the loop, so that it only checks whether the next-page button is disabled after every item on the current page has been scraped. When the check sits at the top of the loop, the loop breaks before capturing the values on the last page.
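The "scrape first, then test the stop condition" pattern described above can be sketched in isolation. The `pages` list below is hypothetical stand-in data for the parsed soups; the point is only the position of the `break`:

```python
# Stand-in data for parsed pages; the last page's next-page
# button is disabled, just like the forum's pagination.
pages = [
    {"items": ["a", "b"], "next_disabled": False},
    {"items": ["c"], "next_disabled": False},
    {"items": ["d", "e"], "next_disabled": True},  # last page
]

collected = []
index = 0
while True:
    page = pages[index]
    index += 1
    collected.extend(page["items"])  # scrape the current page first...
    if page["next_disabled"]:        # ...then decide whether to stop
        break
```

With the check at the bottom, `collected` ends up holding the items from every page, including the last one; with the check at the top, `["d", "e"]` would be lost.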
comments = []
for url in urls:
    with requests.Session() as req:
        index = 1
        while True:
            response = req.get(url + "index{}/".format(index))
            index = index + 1
            soup = BeautifulSoup(response.content, "html.parser")
            posts = soup.find(id="posts")
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
            for item in soup.findAll('a', attrs={"class": "bigusername"}):
                lst.append([threadtitle.text])
            for div in soup.find_all('div', class_="ism-true"):
                try:
                    div.find('div', class_="panel alt2").extract()
                except AttributeError:
                    pass  # sometimes there is no 'panel alt2'
                try:
                    div.find('label').extract()
                except AttributeError:
                    pass  # sometimes there is no 'Quote'
                result = [div.get_text(strip=True, separator=" ")]
                comments.append(result)
            if 'disabled' in soup.select_one('a#mb_pagenext').attrs['class']:
                break
My tips for handling pagination: 1) read the page number of the last page from the pagination control, or 2) iterate over pages until the element that links to the next page can no longer be found. Pick whichever you prefer. I did that in the first part of my code, and it gives many duplicates.
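Tip 1 can be sketched as follows. This is a minimal illustration, not the forum's actual markup: it assumes the pagination control is a `td.vbmenu_control` element containing text like "Page 1 of 33" (common in vBulletin-style forums); the class name and text format are assumptions that should be checked against the real page:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pagination snippet modeled on vBulletin-style markup;
# the class name and "Page X of Y" wording are assumptions.
html = '<td class="vbmenu_control">Page 1 of 33</td>'

def last_page_number(soup):
    """Read the total page count from a 'Page X of Y' control."""
    control = soup.select_one("td.vbmenu_control")
    match = re.search(r"Page \d+ of (\d+)", control.get_text())
    return int(match.group(1))

soup = BeautifulSoup(html, "html.parser")
total = last_page_number(soup)

# Build one URL per page up front, so a plain for loop visits
# every page exactly once -- no duplicates, no skipped last page.
base = "https://www.f150forum.com/f118/2019-adding-adaptive-cruise-454662/"
page_urls = [f"{base}index{i}/" for i in range(1, total + 1)]
```

Knowing the exact page count up front avoids both failure modes from the question: a hard-coded `range(1, 33)` that re-fetches or over-fetches pages, and a `disabled` check that exits before the last page is scraped.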