Python 3.x Pagination Web Scraping - Python 3 - BS4 - While Loop


I have finished the scraper for one page and extracted the href for the next page.

I can't get the scraper to run in a loop over each subsequent page. I tried a while True loop, but that breaks my results for the first page.

This code works for the first page:

import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup

myurl = ('https://www.podiuminfo.nl/concertagenda/')
uClient = ireq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

filename = "db.csv"
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)

DayContainer = page_soup.findAll("section",{"class":"overflow"})
print("Days on page: " + str(len(DayContainer)) + "\n")

def NextPage():
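    # NOTE: this only prints the href of the next page; it never fetches or returns it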
    np = page_soup.findAll("section", {"class":"next_news"})
    np = np[0].find('a').attrs['href']
    print(np)

for days in DayContainer: 
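    # keep only the sold-out shows ("concert_uitverkocht") for this day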
    shows = days.findAll("span", {"class":"concert_uitverkocht"})

    for soldout in shows:
        if shows:
            soldoutPlu = shows[0].parent.parent.parent

            artist = soldoutPlu.findAll("div", {"class":"td_2"})
            artist = artist[0].text.strip()

            venue = soldoutPlu.findAll("div", {"class":"td_3"})
            venue = venue[0].text

            city = soldoutPlu.findAll("div", {"class":"td_4"})
            city = city[0].text

            date = shows[0].parent.parent.parent.parent.parent
            date = date.findAll("section", {"class":"concert_agenda_date"})
            date = date[0].text
            date = date.strip().replace("\n", " ")
            print("Datum gevonden!")

            print("Artiest: " + artist)
            print("Locatie: " + venue)
            print("Stad: " + city) 
            print("Datum: " + date+ "\n")

            f.write(artist + "," + date + "," + city + "," + venue + "\n")

        else: 
            pass

NextPage()
I figured I don't need the baseurl + page-number approach, because I can extract the correct url from each page with findAll. I'm new to this, so the mistake is probably something silly.

Thanks for your help.

Your mistake: you have to actually fetch the url you found at the end of your file. You only call NextPage(), but all that function does is print out the url.

That was your mistake :)

Recap: for the sake of understandability I made it one big loop, but it works :)
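A minimal sketch of what that loop could look like, assuming the next-page link really is the a tag inside section.next_news, exactly as in your NextPage() function. The selector, the urljoin call, and the stop condition are assumptions based on your snippet, not verified against the live site; if the last page still carries a next_news link you will need another stop condition (the empty-page check in the other answer, for instance):

from urllib.request import urlopen as ireq
from urllib.parse import urljoin
from bs4 import BeautifulSoup as soup

base = 'https://www.podiuminfo.nl/concertagenda/'
url = base

while url:
    # fetch and parse the current page
    page_soup = soup(ireq(url).read(), "html.parser")

    for days in page_soup.findAll("section", {"class": "overflow"}):
        pass  # scrape artist/venue/city/date here exactly as in your existing inner loops

    # follow the href from the next_news section instead of only printing it
    np = page_soup.findAll("section", {"class": "next_news"})
    if np and np[0].find('a'):
        url = urljoin(base, np[0].find('a').attrs['href'])
    else:
        url = None  # no next-page link found: stop

The key change is that the href your NextPage() produced is actually requested on the next iteration instead of only being printed.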


You need to make a few adjustments so that you don't end up with duplicate names and dates in db.csv. Try the following script: it collects the desired fields while traversing the different pages and writes them to the csv file accordingly. I tried to clean up your repetitive code and applied a slightly cleaner approach in its place. Give it a go:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

link = 'https://www.podiuminfo.nl/concertagenda/?page={}&input_plaats=&input_datum=2018-06-30&input_podium=&input_genre=&input_provincie=&sort=&input_zoek='

with open("output.csv","w",newline="",encoding="utf-8") as infile:
    writer = csv.writer(infile)
    writer.writerow(['Artist','Venue','City'])

    pagenum = -1   #make sure to get the content of the first page as well which is "0" in the link
    while True:
        pagenum+=1
        res = urlopen(link.format(pagenum)).read()
        soup = BeautifulSoup(res, "html.parser")
        container = soup.find_all("section",class_="concert_rows_info")
        if len(container)<=1:break  ##as soon as there is no content the scraper should break out of the loop

        for items in container:
            artist = items.find(class_="td_2")("a")[0].get_text(strip=True)
            venue = items.find(class_="td_3").get_text(strip=True)
            city = items.find(class_="td_4").get_text(strip=True)
            writer.writerow([artist,city,venue])
            print(f'{artist}\n{venue}\n{city}\n')
Comments:
You've given us the website, but we don't know what kind of code update you're after!
Your answer has taught me a lot.. thanks for everything, it got me through the weekend!
See the edit. I tried to build the loop in such a way that you don't run into an infinite loop.
Thanks, this is exactly what I wanted to achieve. You've made the code so much cleaner. I really learned a lot from the way you approach this kind of problem, and thanks for the edit as well, I can see what's going on now!
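As a purely illustrative aside on the infinite-loop concern mentioned above: besides the empty-page check, one could add a hard page cap as a second safety net. MAX_PAGES here is a hypothetical, arbitrary limit, not anything the site or the script above requires:

MAX_PAGES = 100  # arbitrary illustrative cap, tune it to the site

pagenum = -1
while True:
    pagenum += 1
    if pagenum > MAX_PAGES:
        break  # hard stop even if the empty-page check never fires
    # fetch and parse page `pagenum` here and break as soon as no
    # concert_rows_info sections are found, exactly as in the script above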