Separating URLs with bs4 and Python

I am scraping a bunch of links from a site. They sit inside a single HTML div tag, separated by <br> tags as line breaks, but when I try to get all the URLs from that div they come back as one single string, and I can't split them into a list. My code is below.

With the code below I'm scraping all the links:

links = soup.find('div', id='dle-content').find('div', class_='full').find(
            'div', class_='full-news').find('div', class_='quote').text
Here is the HTML from the site:

<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>
The output I currently get:

https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/

The output I want:
[
"https://example.com/asd.html",
"https://example.net/abc",
"https://example.org/v/kjg/"
]
Try this:

from bs4 import BeautifulSoup

sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""

soup = BeautifulSoup(sample, "html.parser").find_all("div", class_="quote")
print([i.getText().split() for i in soup])
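If there is only one quote div on the page, a slightly shorter variant of the same idea (a sketch under that assumption) works on find() directly and returns a flat list rather than one list per matching div:

from bs4 import BeautifulSoup

sample = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""

# find() returns only the first matching div; get_text() keeps the newlines
# that surround each <br>, so split() yields one URL per entry.
quote = BeautifulSoup(sample, "html.parser").find("div", class_="quote")
print(quote.get_text().split())
# ['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']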

You can fix it with string manipulation:

new_output = ' http'.join(output.split('http')).split()

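Stepping through that one-liner on the concatenated string from the question (a small worked example using the sample URLs; output is assumed to hold the scraped text):

output = 'https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'

# output.split('http') -> ['', 's://example.com/asd.html', 's://example.net/abc', 's://example.org/v/kjg/']
# ' http'.join(...)    -> ' https://example.com/asd.html https://example.net/abc https://example.org/v/kjg/'
# .split()             -> one URL per list entry, with the empty first piece dropped
new_output = ' http'.join(output.split('http')).split()
print(new_output)
# ['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']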

Split the string, then put it back together with a list comprehension:

output = 'https://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'
split_output = output.split()
new_output = [x for x in split_output if x != '']
Output:

print(new_output)
['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']

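Note that split() with no arguments splits on whitespace, so this relies on the scraped text still containing the newlines that surround each <br> in the page source; on the fully concatenated string shown in the question it would come back as a single element, and the 'http'-based split above would be needed instead. A small illustration, assuming the newlines survive extraction:

text = '\nhttps://example.com/asd.html\nhttps://example.net/abc\nhttps://example.org/v/kjg/\n'

# split() ignores leading/trailing whitespace and never yields empty strings,
# so the extra list comprehension is only a safety net here.
print(text.split())
# ['https://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']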

Another way to achieve the desired output:

from bs4 import BeautifulSoup

html = """
    <div class="quote">
    <!--QuoteEBegin-->
    https://example.com/asd.html
    <br>
    https://example.net/abc
    <br>
    https://example.org/v/kjg/
    <br>
    <br>
    <!--QuoteEEnd-->
    </div>
"""

soup = BeautifulSoup(html,"html.parser")
print([i.strip() for i in soup.find("div",class_="quote").strings if i!='\n'])

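One caveat with the filter above: i != '\n' only skips strings that are exactly a single newline, so indentation-only strings such as '\n    ' can slip through and turn into empty entries after strip(). A minimal tweak (a sketch, not part of the original answer) is to filter on the stripped value instead:

from bs4 import BeautifulSoup

html = """<div class="quote">
<!--QuoteEBegin-->
https://example.com/asd.html
<br>
https://example.net/abc
<br>
https://example.org/v/kjg/
<br>
<br>
<!--QuoteEEnd-->
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Keep a string only if something is left after stripping whitespace; this
# also drops whitespace-only strings that the i != '\n' check lets through.
print([i.strip() for i in soup.find("div", class_="quote").strings if i.strip()])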

Some of them are http:// only and others are https://, how can I account for that now? Sure, you can, with a condition, but in fact it will probably just work the way I edited it above.
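For the http:// versus https:// question, a hedged sketch (not from the thread): both schemes can be pulled out of the concatenated text with a regular expression, using the next scheme prefix, or the end of the string, as the boundary between links.

import re

text = 'http://example.com/asd.htmlhttps://example.net/abchttps://example.org/v/kjg/'

# Each match runs from one scheme prefix up to (but not including) the next
# scheme prefix, or to the end of the string for the last link.
links = re.findall(r'https?://.+?(?=https?://|$)', text)
print(links)
# ['http://example.com/asd.html', 'https://example.net/abc', 'https://example.org/v/kjg/']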