Separating URLs with bs4 and Python

I am scraping a bunch of links from a site. They sit inside a single HTML div tag, separated by <br> tags as line breaks, but when I try to get all the URLs from that div they come back as one single string, and I can't split them into a list. My code is below.

With the code below I'm scraping all the links:

links = soup.find('div', id='dle-content').find('div', class_='full').find(
            'div', class_='full-news').find('div', class_='quote').text
Here is the HTML from the site:

<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
The output I currently get:

https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/

The output I want:
[
"https://example.com/asd.html",
"https://example.net/abc",
"https://example.org/v/kjg/"
]
Try this:

from bs4 import BeautifulSoup

sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""

soup = BeautifulSoup(sample, "html.parser").find_all("div", class_="quote")
print([i.getText().split() for i in soup])
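If there is only one quote div on the page, a slightly shorter variant of the same idea (a sketch under that assumption) works on find() directly and returns a flat list rather than one list per matching div:

from bs4 import BeautifulSoup

sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""

# find() returns only the first matching div; get_text() keeps the newlines
# that surround each <br>, so split() yields one URL per entry.
quote = BeautifulSoup(sample, "html.parser").find("div", class_="quote")
print(quote.get_text().split())
# ['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']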

You can fix it with string manipulation:

new_output = ' http'.join(output.split('http')).split()

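Stepping through that one-liner on the concatenated string from the question (a small worked example using the sample URLs; output is assumed to hold the scraped text):

output = 'https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'

# output.split('http') -> ['', 's://example.com/asd.html', 's://example.net/abc', 's://example.org/v/kjg/']
# ' http'.join(...)    -> ' https://example.com/asd.html https://example.net/abc https://example.org/v/kjg/'
# .split()             -> one URL per list entry, with the empty first piece dropped
new_output = ' http'.join(output.split('http')).split()
print(new_output)
# ['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']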

Split the string, then put it back together with a list comprehension:

output = 'https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'
split_output = output.split()
new_output = [x for x in split_output if x != '']
Output:

print(new_output)
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']

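Note that split() with no arguments splits on whitespace, so this relies on the scraped text still containing the newlines that surround each <br> in the page source; on the fully concatenated string shown in the question it would come back as a single element, and the 'http'-based split above would be needed instead. A small illustration, assuming the newlines survive extraction:

text = '\nhttps://example.com/asd.html\nhttps://example.net/abc\nhttps://example.org/v/kjg/\n'

# split() ignores leading/trailing whitespace and never yields empty strings,
# so the extra list comprehension is only a safety net here.
print(text.split())
# ['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']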

Another way to achieve the desired output:

from bs4 import BeautifulSoup

html = """
    <div class="quote">
    <!--QuoteEBegin-->
    https://example.com/asd.html
    <br>
    https://example.net/abc
    <br>
    https://example.org/v/kjg/
    <br>
    <br>
    <!--QuoteEEnd-->
    </div>
"""

soup = BeautifulSoup(html,"html.parser")
print([i.strip() for i in soup.find("div",class_="quote").strings if i!='\n'])

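One caveat with the filter above: i != '\n' only skips strings that are exactly a single newline, so indentation-only strings such as '\n    ' can slip through and turn into empty entries after strip(). A minimal tweak (a sketch, not part of the original answer) is to filter on the stripped value instead:

from bs4 import BeautifulSoup

html = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Keep a string only if something is left after stripping whitespace; this
# also drops whitespace-only strings that the i != '\n' check lets through.
print([i.strip() for i in soup.find("div", class_="quote").strings if i.strip()])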

Some of them are http:// only and others are https://, how can I account for that now? Sure, you can, with a condition, but in fact it will probably just work the way I edited it above.
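For the http:// versus https:// question, a hedged sketch (not from the thread): both schemes can be pulled out of the concatenated text with a regular expression, using the next scheme prefix, or the end of the string, as the boundary between links.

import re

text = 'http://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'

# Each match runs from one scheme prefix up to (but not including) the next
# scheme prefix, or to the end of the string for the last link.
links = re.findall(r'https?://.+?(?=https?://|$)', text)
print(links)
# ['http://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']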