Extracting text (music artist / title) from an HTML file with Python
Tags: python, beautifulsoup

I want to extract the artist and the song title from a page.

Page:

    No son of mine
    Genesis
This repeats several times on the page (see the swr3.de link at the top), but I don't know how to use beautifulsoup and python to build a list like this:
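    Genesis - No son of mine
    Double You - Please don't go

Use a combination of requests, BeautifulSoup, and lxml.

First, install the prerequisites: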
pip install beautifulsoup4
pip install requests
pip install lxml
swr3.py:

    import requests
    from bs4 import BeautifulSoup

    parsedsongs = []
    result = requests.get('http://www.swr3.de//-/id=47424/cf=42/did=65794/93avs/index.html?hour=5&date=2015-10-23')
    # The "lxml" parser only needs the lxml package installed; no import is required.
    soup = BeautifulSoup(result.content, "lxml")

    # Each song sits in its own <div class="detail-body">.
    detailbodys = soup.find_all('div', 'detail-body')
    for detailbody in detailbodys:
        title = detailbody.h4.string.strip()
        # The artist is in an <h5> when one is present, otherwise in a <span>.
        if detailbody.h5:
            artist = detailbody.h5.string.strip()
        else:
            artist = detailbody.span.string.strip()
        parsedsongs.append({'artist': artist, 'title': title})

    for entry in parsedsongs:
        print('Artist: {}\tTitle: {}'.format(entry['artist'], entry['title']))
Output:

    (swr3)macbook:swr3 joeyoung$ python swr3.py
    Artist: Vaya Con Dios Title: Nah neh nah
    Artist: Genesis Title: No son of mine
    Artist: Genesis Title: No son of mine
    Artist: Double You Title: Please don't go
    Artist: Stereo MC's Title: Step it up
    Artist: Cranberries Title: Zombie
    Artist: La Bouche Title: Sweet dreams
    Artist: Die Prinzen Title: Du mußt ein Schwein sein
    Artist: Bad Religion Title: Punk rock song
    Artist: Bellini Title: Samba de Janeiro
    Artist: Dion, Celine; Bee Gees Title: Immortality
    Artist: Jones, Tom; Mousse T. Title: Sex bomb
    Artist: Yanai, Kate Title: Bacardi feeling (Summer dreamin')
    Artist: Heroes Del Silencio Title: Entre dos tierras
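
The question asked for lines of the form "Artist - Title", and the output above lists the Genesis entry twice. The following is a minimal sketch of one way to get from the list of dicts to that format while dropping repeated entries. The function name `format_songs` and the variable `sample` are made up for illustration and are not part of the original script; the sample data is copied from the first lines of the output above.

```python
def format_songs(songs):
    """Return 'Artist - Title' strings, skipping repeated songs but keeping order."""
    seen = set()
    lines = []
    for song in songs:
        key = (song['artist'], song['title'])
        if key in seen:
            continue  # this artist/title pair was already emitted
        seen.add(key)
        lines.append('{} - {}'.format(song['artist'], song['title']))
    return lines


# Illustrative input: the same shape as the list of dicts built by the script.
sample = [
    {'artist': 'Vaya Con Dios', 'title': 'Nah neh nah'},
    {'artist': 'Genesis', 'title': 'No son of mine'},
    {'artist': 'Genesis', 'title': 'No son of mine'},
    {'artist': 'Double You', 'title': "Please don't go"},
]

for line in format_songs(sample):
    print(line)
```

In the script itself, the list of parsed songs would be passed in instead of `sample`. Whether a repeated entry should be dropped is a judgment call: if the page lists a song twice because it was played twice, keeping both may be what you want.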
Do you have any code? What have you tried?

Try following this guide: but take the artist's name only once..
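
One possible reading of the comment above about taking the artist's name only once is to group titles under each artist, so that every artist appears a single time. The sketch below assumes that reading. The name `group_by_artist` and the variable `sample` are hypothetical and used only for illustration; the sample entries are taken from the answer's output.

```python
def group_by_artist(songs):
    """Map each artist to the list of their distinct titles, in first-seen order."""
    grouped = {}
    for song in songs:
        # setdefault creates the artist's list the first time the artist is seen.
        titles = grouped.setdefault(song['artist'], [])
        if song['title'] not in titles:
            titles.append(song['title'])
    return grouped


# Illustrative input with the same keys the answer's script uses.
sample = [
    {'artist': 'Genesis', 'title': 'No son of mine'},
    {'artist': 'Genesis', 'title': 'No son of mine'},
    {'artist': 'Double You', 'title': "Please don't go"},
]

for artist, titles in group_by_artist(sample).items():
    print('{}: {}'.format(artist, ', '.join(titles)))
```

This relies on dicts preserving insertion order, which is guaranteed from Python 3.7 onward.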