Python: Downloading a web page and finding a specific string in its HTML code (python, beautifulsoup, html-parsing)


I have the code below, which tries to download the HTML code of a web page and print the second song in the list to the shell window:

from urllib.request import urlopen

#-----

url1 = 'http://www.itunescharts.net/aus/charts/songs/2020/10/03'


#-----
# Get a link to the web page from the server, using one
# of the URLs above
itunes_page = urlopen(url1)

#-----
# Extract the web page's content as a Unicode string
html_code = itunes_page.read().decode('UTF-8')

#----
# close the connection to the web server
itunes_page.close()

#-----
#finding second song on the chart 
start_marker = '<span class="no">2</span> <span class="artist">'
end_marker = '</span>'
start_position = html_code.find(start_marker)
end_position = html_code.find(end_marker)
if start_position == -1 or end_position == -1:
    print('Error: Unable to Second Artist')
else:
    print('\n' + html_code[start_position + len(start_marker) : end_position].upper())


This is the HTML that the start and end markers are meant to match:

<li id="chart_aus_songs_2" class="no-move">
<span class="no">2</span>
<span class="artist">Jawsh 685, Jason Derulo & BTS</span> - <span class="entry">



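I would like to know how to change my markers so that the result in the shell window is "Jawsh 685, Jason Derulo & BTS". When I run the code, I get a blank response. Any help is much appreciated.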

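You can use the BeautifulSoup library to parse the HTML document instead of searching for the markers yourself (see its documentation).

To get the artist's name from the HTML document, you can do the following: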

from urllib.request import urlopen
from bs4 import BeautifulSoup

#-----

url1 = 'http://www.itunescharts.net/aus/charts/songs/2020/10/03'


#-----
# Get a link to the web page from the server, using one
# of the URLs above
itunes_page = urlopen(url1)

#-----
# Extract the web page's content as a Unicode string
html_code = itunes_page.read().decode('UTF-8')

#----
# close the connection to the web server
itunes_page.close()

# Pass your HTML doc to BeautifulSoup and parse it using 'html.parser'
soup = BeautifulSoup(html_code, 'html.parser')

# Find the HTML element with id = "chart". This is the list of your songs.
chart = soup.find(id="chart")

# The index of the song you want to find. So if you want the 10th song in the list, set song_index = 9
song_index = 1

# Get a list of all <li> elements with class "no-move" in the chart, and get the song_index item from the list
song = chart.find_all("li",class_="no-move")[song_index]

# Find the element containing artist's name in the selected song
artist = song.find("span",class_="artist")

# Get the text of the found artist name element
print(artist.get_text())


Of course, you could simplify the search above with CSS selectors, but this should get you started.
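As a rough illustration of the CSS-selector variant mentioned above, here is a minimal sketch that runs against an inline HTML sample rather than the live page. The `<ul id="chart">` wrapper and the "Placeholder" entries are assumptions made up for this example; only the second `<li>` follows the markup shown in the question. `select` and `select_one` are BeautifulSoup's CSS-selector methods.

```python
from bs4 import BeautifulSoup

# Inline sample so the sketch runs offline. The <ul id="chart"> wrapper and
# the "Placeholder" values are assumptions; the second <li> mirrors the
# snippet from the question.
sample_html = """
<ul id="chart">
  <li id="chart_aus_songs_1" class="no-move">
    <span class="no">1</span>
    <span class="artist">Placeholder Artist</span> - <span class="entry">Placeholder Song</span>
  </li>
  <li id="chart_aus_songs_2" class="no-move">
    <span class="no">2</span>
    <span class="artist">Jawsh 685, Jason Derulo &amp; BTS</span> - <span class="entry">Placeholder Song</span>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# One selector replaces the find / find_all / find chain:
# every artist <span> inside a "no-move" <li> inside the element with id "chart".
artists = soup.select("#chart li.no-move span.artist")
song_index = 1  # zero-based, so 1 is the second song
second_artist = artists[song_index].get_text()

# Alternative: address the list item directly by its id.
by_id = soup.select_one("#chart_aus_songs_2 span.artist").get_text()

print(second_artist)
print(by_id)
```

If an entry on the real page does not carry the `no-move` class, the first selector would skip it and the index would shift, so selecting by the item's id may be the more robust of the two.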


How would I do this without Beautiful Soup, since I can't use that library? I think my markers are almost right; it is just the line break that is tripping me up. Any ideas?
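One possible way to do this with only `str.find`, offered as a sketch rather than a definitive fix. It works around two likely problems in the original code. First, the start marker contains a single space between the two `<span>` tags, while the page has a line break there. Second, `html_code.find(end_marker)` searches from the top of the document, so it returns the first `</span>` on the page, which lies before the start marker and produces an empty slice. The helper name `find_artist` and the "Placeholder" entry in the sample are made up for this example.

```python
def find_artist(html_code, position):
    """Return the artist text for a chart position, or None if not found.

    Hypothetical helper: it searches in three steps, each starting where the
    previous one ended, so whitespace between the tags does not matter.
    """
    number_marker = '<span class="no">' + str(position) + '</span>'
    artist_marker = '<span class="artist">'
    end_marker = '</span>'

    # Step 1: locate the chart number.
    number_at = html_code.find(number_marker)
    if number_at == -1:
        return None

    # Step 2: locate the next artist <span> after that number.
    artist_at = html_code.find(artist_marker, number_at + len(number_marker))
    if artist_at == -1:
        return None

    # Step 3: locate the closing tag, searching only from the artist text onward.
    text_start = artist_at + len(artist_marker)
    text_end = html_code.find(end_marker, text_start)
    if text_end == -1:
        return None

    return html_code[text_start:text_end].strip()


# Inline sample; the first entry is a made-up placeholder, the second mirrors
# the markup shown in the question.
sample = '''<li id="chart_aus_songs_1" class="no-move">
<span class="no">1</span>
<span class="artist">Placeholder Artist</span> - <span class="entry">
<li id="chart_aus_songs_2" class="no-move">
<span class="no">2</span>
<span class="artist">Jawsh 685, Jason Derulo & BTS</span> - <span class="entry">'''

artist = find_artist(sample, 2)
if artist is None:
    print('Error: unable to find the second artist')
else:
    print(artist)
```

With the downloaded page you would pass `html_code` instead of `sample`. If the server sends the ampersand as `&amp;`, passing the result through `html.unescape` from the standard library would turn it back into `&`.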