
Scraping dynamic elements with Python


Here is my code. It works, but sometimes it does not. I suspect the cause is dynamic elements on the page. What is the solution for dynamic elements?

import requests
from bs4 import BeautifulSoup


def collect_bottom_url(product_string):
    """
    collect_bottom_url:
    Accept a product name as an argument, build the search URL for that
    product, and collect all the URLs listed at the bottom of the results page.

    :return: list_of_urls
    """

    url = 'https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=' + product_string
    # download the main webpage of product
    webpage = requests.get(url)

    # Store the main URL of Product in a list
    list_of_urls = list()
    list_of_urls.append(url)

    # Parse the downloaded page with the lxml parser
    my_soup = BeautifulSoup(webpage.text, "lxml")

    # Find every element with class "pagnLink" (the pagination links)
    urls_at_bottom = my_soup.find_all(class_='pagnLink')

    empty_list = list()
    for b_url in urls_at_bottom:
        empty_list.append(b_url.find('a')['href'])

    for item in empty_list:
        item = "https://www.amazon.in/" + item
        list_of_urls.append(item)
    print(list_of_urls)


collect_bottom_url('book')
Here is output 1, which is correct:

['https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book', 'https://www.amazon.in//book/s?ie=UTF8&page=2&rh=i%3Aaps%2Ck%3Abook', 'https://www.amazon.in//book/s?ie=UTF8&page=3&rh=i%3Aaps%2Ck%3Abook']
Here is output 2, which is incorrect:

['https://www.amazon.in/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=book']
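A side note on output 1: the page URLs contain a double slash (`https://www.amazon.in//book/...`) because the base string ends with `/` and each `href` starts with `/`. The sketch below shows one way to build the URLs with the standard library instead of string concatenation. The helper names `build_search_url` and `absolute_url` are hypothetical, and it assumes the `href` values are site-relative paths like the ones in the output above.

```python
from urllib.parse import quote_plus, urljoin

BASE_URL = "https://www.amazon.in/"


def build_search_url(product_string):
    """Build the search URL, percent-encoding the product name."""
    return (
        BASE_URL
        + "s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords="
        + quote_plus(product_string)
    )


def absolute_url(href):
    """Resolve a site-relative href against BASE_URL without doubling slashes."""
    return urljoin(BASE_URL, href)


print(build_search_url("python book"))
print(absolute_url("/book/s?ie=UTF8&page=2&rh=i%3Aaps%2Ck%3Abook"))
```

`quote_plus` also keeps the query valid when the product name contains spaces or other special characters.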

The page is not dynamic. The site is asking for a CAPTCHA because you are using the default User-Agent, so change it:

headers = {"User-Agent": 'Mozilla/5.0.............'}
def collect_bottom_url(product_string):
    .....
    webpage = requests.get(url, headers=headers)
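Put together, the change might look like the sketch below. Several parts are assumptions rather than part of the answer: the `User-Agent` value is a placeholder (the real string is elided above), the `extract_bottom_urls` helper is hypothetical, the `"captcha"` substring check is only a rough heuristic, and `html.parser` stands in for `lxml` to avoid an extra dependency. Parsing is split from downloading so it can be checked against saved HTML without hitting the site.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.amazon.in/"
# Placeholder value (assumption): the answer does not give the full string.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def extract_bottom_urls(html):
    """Return absolute URLs of the pagination links (class 'pagnLink') in html."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for element in soup.find_all(class_="pagnLink"):
        link = element.find("a")
        # Skip pagination elements that carry no link
        if link is not None and link.has_attr("href"):
            urls.append(urljoin(BASE_URL, link["href"]))
    return urls


def collect_bottom_url(product_string):
    """Download the search page with a custom User-Agent and collect page URLs."""
    url = (
        BASE_URL
        + "s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords="
        + product_string
    )
    webpage = requests.get(url, headers=HEADERS, timeout=10)
    webpage.raise_for_status()

    # Heuristic (assumption): a CAPTCHA page is likely to mention "captcha"
    if "captcha" in webpage.text.lower():
        print("Warning: the response looks like a CAPTCHA page")

    return [url] + extract_bottom_urls(webpage.text)
```

Calling `collect_bottom_url('book')` would then return the list instead of printing it. If only the first URL comes back, the warning gives a hint about why.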

For dynamic pages, use ...

Could you explain what you did in these lines? What does headers = {"User-Agent": "Mozilla/5.0........."} mean?

It is used as the headers sent with the request.
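To make the effect of `headers` concrete, the sketch below compares the default User-Agent that `requests` sends with the one set through the `headers` argument. It builds the request without sending it, so nothing goes over the network. The User-Agent string is a placeholder assumption, since the thread elides the real value.

```python
import requests

url = "https://www.amazon.in/"
# Placeholder value (assumption): the thread does not give the full string.
headers = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}

# Without a headers argument, requests identifies itself as
# "python-requests/<version>", which sites can easily recognise as a script.
print(requests.utils.default_user_agent())

# prepare() builds the HTTP request without sending it, so the
# headers that would go out can be inspected.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

The server sees whatever the second `print` shows, so a browser-like value makes the request look like ordinary browser traffic.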