Python: How do I iterate over hrefs with Selenium?


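I've been trying to get all the hrefs from the homepage of a news site. Eventually I want to build something that gives me the n most-used words across all the news articles. To do that, I figured I first need the hrefs and then have to click through them one by one.

With a lot of help from another user on this platform, I now have the following code: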

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://ad.nl'

# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Chrome()
driver.get(url)

# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()

# grab the html. It'll wait here until the page is finished loading
html = driver.page_source

# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")

for i in articles:
    article = driver.find_element_by_class_name('ankeiler')
    hrefs = article.find_element_by_css_selector('a').get_attribute('href')
    print(hrefs)
driver.quit()

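It gives me what I believe is the first href, but it doesn't move on to the next one. It just prints the first href as many times as the loop iterates. Does anyone know how I can get it to go to the next href instead of staying on the first one?

If anyone has suggestions on how to take my little project further, feel free to share them, since I still have a lot to learn about Python and programming.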

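To get all the hrefs inside an article, you can do the following:

# The leading dot keeps the XPath relative to `article`;
# a bare '//a' would match every link on the page
hrefs = article.find_elements_by_xpath('.//a')
# OR: article.find_elements_by_css_selector('a')

for href in hrefs:
    print(href.get_attribute('href'))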

To move the project forward, though, the following might help:

hrefs = article.find_elements_by_xpath('.//a')
links = [href.get_attribute("href") for href in hrefs]

for link in links:
    driver.get(link)
    # Add all words in the article to a dictionary with the key being the word
    # and the value being the number of times it occurs
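The counting step described in the comments above can be tried without a browser. This is only a minimal sketch, assuming the article text has already been fetched into plain strings; `count_words` and the sample strings are hypothetical names and data made up for illustration, and `collections.Counter` stands in for the hand-written dictionary.

```python
from collections import Counter


def count_words(texts):
    """Count how often each whitespace-separated word occurs across all texts."""
    counts = Counter()
    for text in texts:
        # lower() makes the count case-insensitive; split() with no argument
        # splits on any whitespace and never yields empty strings
        counts.update(text.lower().split())
    return counts


# Hypothetical article texts standing in for what Selenium would return
sample_texts = ["Het nieuws van vandaag", "Vandaag is het nieuws kort"]
word_counts = count_words(sample_texts)
print(word_counts.most_common(3))
```

In the real loop, each `driver.get(link)` would be followed by collecting the page's text and passing it to a function like this one.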

Instead of using Beautiful Soup, how about this:

articles = driver.find_elements_by_css_selector('article')

for i in articles:
    href = i.find_element_by_css_selector('a').get_attribute('href')
    print(href)
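The reason the original loop printed the same href every time is that calling `find_element_...` on `driver` inside the loop always returns the first match on the page; the loop above fixes that by searching inside each `article` element instead. Note also that the `find_element_by_*` helpers used in this thread were deprecated in Selenium 4 and removed in later 4.x releases. A hedged sketch of the same idea with the `find_elements(By..., ...)` form follows; the helper name `collect_links` and the default selector `"article a"` are assumptions for illustration, not something taken from the site.

```python
try:
    from selenium.webdriver.common.by import By
    CSS_SELECTOR = By.CSS_SELECTOR
except ImportError:
    # Lets the helper be exercised without Selenium installed;
    # this is the string value of By.CSS_SELECTOR
    CSS_SELECTOR = "css selector"


def collect_links(driver, selector="article a"):
    """Return the unique, non-empty hrefs of all elements matching selector."""
    links = []
    for element in driver.find_elements(CSS_SELECTOR, selector):
        href = element.get_attribute("href")
        # Skip anchors without an href and links that were already seen
        if href and href not in links:
            links.append(href)
    return links
```

After `driver.get(url)` and the cookie click, `links = collect_links(driver)` would give a list of URLs to visit one by one. Collecting the URLs as strings before navigating matters, because the elements themselves become stale once the driver loads another page.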

To improve on my previous answer, I have written a complete solution to your problem:

from selenium import webdriver

url = 'https://ad.nl'

#Set up selenium driver
driver = webdriver.Chrome()
driver.get(url)

#Click the accept cookies button
btn = driver.find_element_by_name('action')
btn.click()

#Get the links of all articles
article_elements = driver.find_elements_by_xpath('//a[@class="ankeiler__link"]')
links = [link.get_attribute('href') for link in article_elements]

#Create a dictionary for every word in the articles
words = dict()

#Iterate through every article
for link in links:
    #Get the article
    driver.get(link)

    #get the elements that are the body of the article
    article_elements = driver.find_elements_by_xpath('//*[@class="article__paragraph"]')

    #Initialise an empty string
    article_text = ''

    #Add all the text from the elements to the one string
    for element in article_elements:
        article_text+= element.text + " "

    #Convert all character to lower case  
    article_text = article_text.lower()

    #Remove every character other than a-z and spaces
    for char in article_text:
        if ord(char) > 122 or ord(char) < 97:
            if ord(char) != 32:
                article_text = article_text.replace(char,"")

    #Split the article into words (split() with no argument skips empty strings)
    for word in article_text.split():
        #If the word is already in the dictionary update the count
        if word in words:
            words[word] += 1
        #Otherwise make a new entry
        else:
            words[word] = 1

#Print the final dictionary (Very large so maybe sort for most occurring words and display top 10)
#print(words)

#Sort words by most used
most_used = sorted(words.items(), key=lambda x: x[1],reverse=True)

#Print top 10 used words (slicing is safe even with fewer than 10 entries)
print("TOP 10 MOST USED: ")
for entry in most_used[:10]:
    print(entry)

driver.quit()


It works fine for me; let me know if you run into any errors.
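One thing to be aware of: a filter that keeps only the ASCII letters a-z also drops accented letters such as é and ë, which occur in Dutch text. A possible alternative, sketched here under the assumption that the article text is already available as a string, is to tokenize with a Unicode-aware regular expression and let `collections.Counter` do the ranking. The name `top_words` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

# [^\W\d_] matches any Unicode letter: a word character that is
# neither a digit nor an underscore
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def top_words(text, n=10):
    """Return the n most common words in text as (word, count) pairs."""
    words = WORD_PATTERN.findall(text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample text
sample = "één café, twee cafés en één fiets."
print(top_words(sample, 3))
```

`most_common(n)` simply returns fewer pairs when the text has fewer than `n` distinct words, so no index check is needed.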

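This does work. But there are a few hrefs I don't want. The hrefs I want all have a class called "ankeiler__link". Do you know how I can select only those? Thanks.

I have written a complete solution, posted as another answer, that uses "ankeiler__link". If I could speak Dutch, I would understand what the most common words are :)

This is amazing... The only thing I would add is that it should not include the words "the", "a" and "an". I do have some questions, though. Does driver.get(link) replace the driver's value, going from ad.nl to whatever the link is? Why do you have to create a new empty string instead of just using element.text? How do you know that ord has to be greater than 122 or

The answer to the first question is yes: the driver goes from "ad.nl" to "ad.nl/article1". For the second question, the empty string is needed because every article has multiple paragraphs, so the text of all the paragraphs has to be added together. I know ord() has to be between 97 and 122 because 97-122 are the ASCII codes of the characters a-z. It is also worth noting that 32 is a space, which we don't want to remove because spaces separate the words. You can find these numbers by looking up an "ASCII table".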
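The ASCII ranges and the stop-word idea from these comments can be checked with a small standalone sketch. It assumes plain strings as input; `clean_text`, `count_without_stop_words` and the `STOP_WORDS` set are hypothetical names for illustration, and for Dutch articles the set would presumably hold Dutch words instead.

```python
STOP_WORDS = {"the", "a", "an"}  # words to leave out of the count


def clean_text(text):
    """Lowercase text and keep only the letters a-z (97-122) and spaces (32)."""
    kept = []
    for char in text.lower():
        code = ord(char)
        if 97 <= code <= 122 or code == 32:
            kept.append(char)
    return "".join(kept)


def count_without_stop_words(text):
    """Count the words in text, skipping everything listed in STOP_WORDS."""
    counts = {}
    for word in clean_text(text).split():
        if word in STOP_WORDS:
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts


print(ord("a"), ord("z"), ord(" "))  # 97 122 32
print(count_without_stop_words("The cat saw a cat, and an owl!"))
```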