
Python: Scraping multiple URLs with BeautifulSoup


I have a dataframe with a column containing more than 4,000 different article URLs. I've implemented the following code to extract all the text from each URL. It seems to work for one or two URLs, but not for all of them:

for i in df.url:

    http = urllib3.PoolManager() 
    response = http.request('GET', i)
    soup = bsoup(response.data, 'html.parser')


# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
break

In the first `for` loop, you assign every parsed URL to the same variable, `soup`. When the loop finishes, that variable holds the parsed content of the last URL only, not all of the URLs as you expected. That is why you only see one output.
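The effect is easy to see in miniature. In this sketch (hypothetical data, nothing from the question), the loop rebinds the same name on every pass, so only the last value survives:

```python
pages = ["page-1", "page-2", "page-3"]

for page in pages:
    result = page.upper()  # each pass overwrites the previous assignment

print(result)  # only the value from the last iteration remains: PAGE-3
```

Your `soup` variable behaves the same way: code placed after the loop only ever sees the final page.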

You can put all of the code inside the loop:

http = urllib3.PoolManager()  # create the pool once and reuse it for every request

for url in df.url:
    response = http.request('GET', url)
    soup = bsoup(response.data, 'html.parser')

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())

    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(url)
    print(text)
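If you want to keep each page's text rather than only print it, the cleanup steps can be factored into a helper and the results collected into a list. This is a sketch: the request loop and `df` are assumed to come from the answer above, and the helper is demonstrated on a small inline page instead of a fetched one.

```python
from bs4 import BeautifulSoup as bsoup

def clean_text(html):
    """Strip <script>/<style> elements and collapse a page to plain text."""
    soup = bsoup(html, 'html.parser')
    for tag in soup(["script", "style"]):
        tag.extract()                      # drop non-content markup
    lines = (line.strip() for line in soup.get_text().splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return '\n'.join(chunk for chunk in chunks if chunk)

# Inside the loop this would be texts.append(clean_text(response.data));
# here a small inline page stands in for a fetched one.
sample = "<body><style>p{}</style><p>  First  </p><script>x=1</script><p>Second</p></body>"
print(clean_text(sample))  # two cleaned lines: First, then Second
```

Appending each result to a list (and assigning it back with something like `df['text'] = texts`) keeps the scraped text aligned with the URL column.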

I've done the same with the following imports: `import pandas as pd`, `from bs4 import BeautifulSoup as bsoup`, `import urllib3`, `import lxml`, `import html.parser`.
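With over 4,000 URLs, some requests will inevitably time out or return errors, which can make a run that "works for one or two URLs" die partway through the column. A defensive sketch around the fetch step (the timeout and retry values here are illustrative assumptions, not part of the answer):

```python
import urllib3

# One pooled client with conservative limits, reused for every request.
http = urllib3.PoolManager(
    timeout=urllib3.Timeout(connect=5.0, read=10.0),
    retries=urllib3.Retry(total=2),
)

def fetch(url):
    """Return the raw page body, or None when the request fails."""
    try:
        response = http.request('GET', url)
        if response.status != 200:
            return None
        return response.data
    except urllib3.exceptions.HTTPError:  # base class of urllib3's errors
        return None
```

Skipping failed URLs (and recording which ones returned `None`) lets the loop finish over the whole column instead of stopping at the first dead link.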