Python BeautifulSoup: how to scrape multiple links and then scrape the contents of those links


python selenium web-scraping


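I am trying to scrape a site whose landing page has various links (the 5 sub-categories along the top):

Within each category there is a list of products.

Each product listed has a link to get more details (a separate page linked directly to the product).

So far, what I have put together builds the list of all the individual page links I need. However, when I loop over each product link to get the data, I can't seem to get BeautifulSoup to pick up the data from those links. It is as if it stays on the previous page (if you will).
What am I missing to allow the second "hop" to the "product link" address (eg), and allow me to scrape the data from there? I thought I might need to add a time.sleep(5) timer to allow everything to load, but I still got nothing.

Code:

PS Apologies for the extra imports. They were copied and pasted from a previous script and will be removed once confirmed as not needed.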

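When you use a browser, that information is pulled dynamically from a script tag. When you use requests, it will not be in the place you might be looking. Instead, extract the information from the script tag itself.

In this example, I extract all the information in the script that relates to a given model and generate a DataFrame. I use ast.literal_eval to convert the string inside the script tag into a Python object. I add the product url and product title to the DataFrame.

Each df is added to a list, which is converted into the final DataFrame. As I don't know what final header names are wanted, I have left some default names.

I have added handling for the case where no model options are listed for a given product.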
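The script-tag extraction described above can be tried in isolation. This is a minimal sketch, assuming the script text embeds the options as a nested list literal; SCRIPT_TEXT and parse_options are hypothetical names used only for illustration, and the sample string is not taken from the real site.

```python
import ast
import re

# Hypothetical script content (an assumption): a nested list of model options.
SCRIPT_TEXT = "var opts = [['Model A', '12', '30in'], ['Model B', '20', '28in']];"


def parse_options(script_text):
    """Return the nested list embedded in script_text, or None if there is none."""
    # Grab everything from the first '[[' to the last ']]'.
    match = re.search(r"(\[\[.*\]\])", script_text, re.DOTALL)
    if match is None:
        return None
    try:
        # literal_eval only accepts Python literals, so it never runs code.
        return ast.literal_eval(match.group(1))
    except (ValueError, SyntaxError):
        return None


print(parse_options(SCRIPT_TEXT))
print(parse_options("var x = 1;"))
```

Returning None for the missing case keeps the "no options listed" decision in the caller, which can then build whatever placeholder row it needs.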

You can then look at speeding things up and tidying up:

import ast
import os
import re
from multiprocessing import Pool, cpu_count

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

BASE_URL = "https://mcavoyguns.co.uk/contents/en-uk/"


def get_models_df(product_link):
    """Return one DataFrame of model options for a single product page."""
    res = requests.get(product_link)
    soup = BeautifulSoup(res.text, 'lxml')
    title = soup.select_one('.ProductTitle').text

    try:
        # The options sit in a script tag as a nested list literal.
        script_text = soup.select_one('.ProductOptions script').string
        options = ast.literal_eval(re.search(r'(\[\[.*\]\])', script_text).group(1))
        df = pd.DataFrame(options)
        df.iloc[:, -1] = product_link  # overwrite the last column with the product url
    except Exception:
        # No model options listed for this product: emit a placeholder row.
        placeholder = ['No options listed'] * 8
        placeholder.append(product_link)
        df = pd.DataFrame([placeholder])

    df.insert(0, 'title', title)
    return df


def get_all_pages(a_link):
    """Return the product links found on one category page."""
    res = requests.get(a_link)
    soup = BeautifulSoup(res.text, 'lxml')
    return [BASE_URL + i['href'] for i in soup.select('.center-content > a')]


if __name__ == '__main__':
    os.environ["PYTHONIOENCODING"] = "utf-8"

    # Selenium is only used to read the submenu links from the landing page.
    # Selenium 4 takes the driver path through a Service object.
    browser = webdriver.Chrome(service=Service('C:/Users/admin/chromedriver.exe'))
    browser.get(BASE_URL + "d410_New_Browning_over___under_shotguns.html")
    all_outlinks = [i.get_attribute('href') for i in WebDriverWait(browser, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".idx2Submenu a")))]
    browser.quit()

    # Keep at least one worker, even on a single-core machine.
    with Pool(max(1, cpu_count() - 1)) as p:
        nested_links = p.map(get_all_pages, all_outlinks)
        flat_list = [link for links in nested_links for link in links]
        results = p.map(get_models_df, flat_list)

    final = pd.concat(results)
    #print(final)
    final.to_csv('guninfo.csv', encoding='utf-8-sig', index=False)

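So I said I would have a look at the other requested items, and they are indeed available via requests. Some things that needed handling:

Different products have different headers; some headers are missing

Some unicode characters (there are still some encoding issues to look into)

Handling cases where the description is missing

Handling more sections

Updating certain output values so that Excel does not convert them to dates

Handling nan headers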
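One of the items above, stopping Excel from turning values into dates, can be handled before the CSV is written. This is only a sketch of one common workaround, not necessarily what was done here: wrapping a date-like string as a text formula so Excel shows it unchanged. The names DATE_LIKE and excel_safe, and the pattern for what counts as "date-like", are assumptions for illustration.

```python
import re

# Assumption: values such as "12/76" or "1-2-2020" are the ones Excel mangles.
DATE_LIKE = re.compile(r"^\d{1,2}[/-]\d{1,2}([/-]\d{2,4})?$")


def excel_safe(value):
    """Return value as text, wrapped as ="..." when it looks like a date."""
    text = str(value)
    if DATE_LIKE.match(text):
        # Excel reads ="12/76" as a formula that evaluates to the literal text.
        return '="{}"'.format(text)
    return text


for sample in ["12/76", "1-2-2020", "Browning B525", "30in"]:
    print(sample, "->", excel_safe(sample))
```

With pandas, such a helper could be applied to the affected columns before calling to_csv. The wrapped values are only meant for Excel; other CSV consumers will see the formula text.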

To do:

One of the functions has now turned into a crazy monster and needs refactoring into smaller, friendlier function calls.


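As QHarr pointed out, Selenium was the answer. That gave me the direction to look at the problem with different eyes, and it let me find the answer.

I am posting this as my own answer, but I credit @QHarr's work: it builds on what was provided earlier and on the continued help in reaching a solution.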

import os
import re
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

os.environ["PYTHONIOENCODING"] = "utf-8"

BASE_URL = "https://mcavoyguns.co.uk/contents/en-uk/"

#selenium requests
# Selenium 4 takes the driver path through a Service object.
browser = webdriver.Chrome(service=Service('C:/Users/andrew.glass/chromedriver.exe'))
browser.get(BASE_URL + "d410_New_Browning_over___under_shotguns.html")
time.sleep(2)

all_Outlinks = []
all_links = []

# Step 1: collect the sub-category links from the rendered landing page.
soup = BeautifulSoup(browser.page_source, features="lxml")
submenuFind = soup.find("div", "idx2Submenu")
submenuItems = submenuFind.find_all("li", "GC34 idx2Sub")

for submenuItem in submenuItems:
    for link in submenuItem.select('a[href]'):
        all_Outlinks.append(BASE_URL + link['href'])
#print(all_Outlinks)

# Step 2: collect the product links from each sub-category page.
for a_link in all_Outlinks:
    res = requests.get(a_link)
    soup = BeautifulSoup(res.text, 'html.parser')
    pageLinkDivs = soup.find_all("div", "column full")
    for pageLinkDiv in pageLinkDivs:
        for pageLink in pageLinkDiv.select('a[href]'):
            all_links.append(BASE_URL + pageLink['href'])
#print(all_links)

# Step 3: load each product page in the browser so the model options are rendered.
for product_link in all_links:
    browser.get(product_link)
    time.sleep(5)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    modelFind = soup.find('select', attrs={'name': re.compile('model')})
    if modelFind is None:
        # Some products list no model options.
        print(product_link, 'No options listed')
        continue
    modelList = [x['origvalue'][:14] for x in modelFind.find_all('option')[1:]]
    print(modelList)

browser.quit()

The printed model list is still a bit messy, but it can be cleaned up once all the requirements have been gathered.
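The fixed time.sleep(5) in the loop above always waits the full five seconds, even when the page is ready sooner. A polling wait is the usual alternative, and it is the idea behind Selenium's WebDriverWait. The sketch below shows that idea with the standard library only; wait_until is a hypothetical helper, not a Selenium function, and the fake "page" is a stand-in for a real browser check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)


# Stand-in for a page that becomes ready on the third check (an assumption).
checks = {"count": 0}


def page_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(page_is_ready, timeout=2.0, poll=0.01)
print("ready after", checks["count"], "checks")
```

In the real script the condition would test the browser, for example whether the model select element is present in the page yet.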


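In a roundabout way, by "what am I missing to allow the second 'hop' to the 'product link' address (eg), and allow me to scrape the data from there?" I meant all of the product data on the individual product page. Sorry if I was not entirely clear in the question.

I just re-read it and saw that. I will need to have another look after work. I suspect this needs selenium, but I won't know until I revisit it.

I don't think it will be hard to adapt, as you already have the list of product urls. You just need to, instead, use a custom function that parses the page (possibly with a wait condition) and returns a row (a list), which gets appended to a global list. That global list receives one row/list from each product page visited (each item you want from the page is a column in that row). Convert the whole thing to a df at the end.

Hi @QHarr - I did not see your reply until this morning. I will go through it in detail and see what it looks like. It looks like it grabs a lot of information that would be very useful to me, so if it is what it appears to be, I will mark your answer as the accepted answer instead of my own.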
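The row-per-page approach suggested in the comments might look roughly like the sketch below. It is an illustration under stated assumptions: parse_product_page and collect_rows are hypothetical names, the pages come from an in-memory dict instead of a browser, and the regular expressions stand in for the BeautifulSoup selectors used in the answers. The rows are collected in a local list that is returned, which does the same job as the global list mentioned in the comment.

```python
import re

# Hypothetical stand-ins for rendered product pages (not real site content).
FAKE_PAGES = {
    "p1.html": '<h1 class="ProductTitle">Gun A</h1><select name="model">'
               '<option value="">Choose</option>'
               '<option origvalue="A1">A1</option>'
               '<option origvalue="A2">A2</option></select>',
    "p2.html": '<h1 class="ProductTitle">Gun B</h1>',
}


def parse_product_page(url, html):
    """Return one row (a list) for a product page: [url, title, models]."""
    title_match = re.search(r'class="ProductTitle"[^>]*>(.*?)<', html, re.DOTALL)
    title = title_match.group(1).strip() if title_match else ""
    models = re.findall(r'<option[^>]*\borigvalue="([^"]*)"', html)
    return [url, title, "; ".join(models) if models else "No options listed"]


def collect_rows(urls, fetch):
    """Visit every url with fetch(url) and gather one row per page."""
    rows = []
    for url in urls:
        rows.append(parse_product_page(url, fetch(url)))
    return rows


rows = collect_rows(["p1.html", "p2.html"], FAKE_PAGES.get)
for row in rows:
    print(row)
```

At the end, the rows can be turned into a DataFrame in one step, for example with pandas.DataFrame(rows, columns=["url", "title", "models"]). In the real script, fetch would load the page in the browser and return its page source.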
