Python and BeautifulSoup: soup.find_all does not return multiple tag elements

python web-scraping

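I need to scrape all "a" tags with the class "result-title", as well as all "span" tags with the classes "result-price" and "result-hood". The output should then be written across multiple columns to a .csv file. The current code does not write anything to the csv file. This may be bad syntax, but I really can't see what I am missing. Thanks.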

f = csv.writer(open(r"C:\Users\Sean\Desktop\Portfolio\Python - Web Scraper\RE Competitor Analysis.csv", "wb"))

def scrape_links(start_url):
    for i in range(0, 2500, 120):
        source = urllib.request.urlopen(start_url.format(i)).read()
        soup = BeautifulSoup(source, 'lxml')
        for a in soup.find_all("a", "span", {"class" : ["result-title hdrlnk", "result-price", "result-hood"]}):
            f.writerow([a['href']], span['results-title hdrlnk'].getText(), span['results-price'].getText(), span['results-hood'].getText() )
        if i < 2500:
            sleep(randint(30,120))
        print(i)


scrape_links('my_url')

If you want to find multiple tags with a single call to find_all, you should pass them in a list. For example:
soup.find_all(["a", "span"])
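To make the difference concrete, here is a minimal sketch run against a small invented HTML fragment (the markup and the extra "result-tags" class are assumptions, not the real page). As far as I can tell, in the question's call the second positional argument lands in `attrs`, where a bare string is treated as a class filter, so `find_all("a", "span", ...)` looks for `<a class="span">` and finds nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; the real page may differ.
html = """
<li class="result-row">
  <a href="/apt/1.html" class="result-title hdrlnk">Sunny 2BR</a>
  <span class="result-price">$1200</span>
  <span class="result-hood"> (Downtown)</span>
  <span class="result-tags">pic</span>
</li>
"""

# html.parser ships with Python, so the sketch does not need lxml.
soup = BeautifulSoup(html, "html.parser")

# A string in the second position is read as a class filter:
# this searches for <a class="span"> and matches nothing here.
no_match = soup.find_all("a", "span")

# A list of tag names matches any of them; a list of classes matches
# elements that carry at least one of the listed classes.
matches = soup.find_all(
    ["a", "span"],
    class_=["result-title", "result-price", "result-hood"],
)

for tag in matches:
    print(tag.name, tag.get("class"), tag.get_text(strip=True))
```

Note that each class is listed on its own ("result-title", not "result-title hdrlnk"); a multi-word string is only matched against the exact full class attribute.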

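Without access to the page you are scraping, it is hard to give a complete solution, but I recommend extracting one variable at a time and printing it to help you debug. For example: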
# Grab the first matching link, then pull out its URL and text.
a = soup.find('a', class_='result-title')
a_link = a['href']
a_text = a.text

# A list of classes matches spans carrying either class.
spans = soup.find_all('span', class_=['result-price', 'result-hood'])

row = [a_link, a_text] + [s.text for s in spans]
print(row)  # verify we are getting the results we expect

f.writerow(row)
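The snippet above builds a single row from the first link on the page. If the page holds many listings, one possible approach is to loop over each listing's container and search inside it, so the link, price and neighborhood stay together. This is only a sketch: the container tag and its "result-row" class, the sample markup, and the helper `text_or_blank` are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the container class "result-row" is an assumption.
html = """
<ul>
  <li class="result-row">
    <a href="/apt/1.html" class="result-title hdrlnk">Sunny 2BR</a>
    <span class="result-price">$1200</span>
    <span class="result-hood"> (Downtown)</span>
  </li>
  <li class="result-row">
    <a href="/apt/2.html" class="result-title hdrlnk">Cozy studio</a>
    <span class="result-price">$800</span>
  </li>
</ul>
"""


def text_or_blank(tag):
    """Return the stripped text of a tag, or '' when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else ""


soup = BeautifulSoup(html, "html.parser")

rows = []
for listing in soup.find_all("li", class_="result-row"):
    link = listing.find("a", class_="result-title")
    if link is None:
        continue  # skip containers that have no title link
    rows.append([
        link["href"],
        text_or_blank(link),
        text_or_blank(listing.find("span", class_="result-price")),
        text_or_blank(listing.find("span", class_="result-hood")),
    ])

for row in rows:
    print(row)  # one list per listing, ready for writerow
```

Searching inside each container keeps the columns aligned even when a listing lacks a price or neighborhood, which a flat `find_all` over the whole page would not guarantee.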

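Can you give an example of the file being filtered? I'm guessing it is some kind of XML or HTML?

Will these classes apply to everything in the list? Any idea how to output to csv?

It's a bit unclear what you are trying to write to the csv. Are you expecting exactly one a tag and three span tags?

Yes, the "a" tag holds a hyperlink and also contains the article title, so I want to scrape the "a" tag's href and text and output them to two separate columns.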
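On the csv question: the code in the question opens the file with "wb", but in Python 3 the csv module writes strings, so the file should be opened in text mode with `newline=""`. Below is a minimal sketch of writing one column per field. The rows, header names and temporary output path are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped data.
rows = [
    ["/apt/1.html", "Sunny 2BR", "$1200", "(Downtown)"],
    ["/apt/2.html", "Cozy studio", "$800", ""],
]

# A temporary path keeps the sketch self-contained; substitute a real path.
out_path = os.path.join(tempfile.mkdtemp(), "listings.csv")

# Text mode with newline="" is what the csv module expects in Python 3.
# The with-block closes the file, so buffered rows are flushed to disk.
with open(out_path, "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["link", "title", "price", "neighborhood"])  # header row
    writer.writerows(rows)  # each inner list becomes one CSV line

# Read the file back to confirm what was written.
with open(out_path, newline="", encoding="utf-8") as fh:
    written = list(csv.reader(fh))
print(written)
```

Each call to `writerow` takes a single list, and every element of that list becomes its own column.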