Extracting text (music artist / title) from an HTML file with Python


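I want to extract the artist and song title from a page.

The page:


No son of mine
Genesis

This pattern repeats several times on the page (see the swr3.de link at the top), but I don't know how to use BeautifulSoup and Python to build a list like:

Genesis - No son of mine
Double You - Please don't go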

Use a combination of requests, BeautifulSoup, and lxml:

First, install the prerequisites:

pip install beautifulsoup4
pip install requests
pip install lxml

swr3.py:

import requests, lxml
from bs4 import BeautifulSoup

parsedsongs = []
result = requests.get('http://www.swr3.de//-/id=47424/cf=42/did=65794/93avs/index.html?hour=5&date=2015-10-23')
soup = BeautifulSoup(result.content, "lxml")
# Each song entry sits in a <div class="detail-body">
detailbodys = soup.find_all('div', 'detail-body')
for detailbody in detailbodys:
    # The song title is in the <h4>
    title = detailbody.h4.string.strip()
    # The artist is in an <h5> when one exists, otherwise in a <span>
    if detailbody.h5:
        artist = detailbody.h5.string.strip()
    else:
        artist = detailbody.span.string.strip()
    parsedsongs.append({'artist': artist, 'title': title})

for entry in parsedsongs:
    print('Artist: {}\tTitle: {}'.format(entry['artist'], entry['title']))

Output:

(swr3)macbook:swr3 joeyoung$ python swr3.py
Artist: Vaya Con Dios   Title: Nah neh nah
Artist: Genesis Title: No son of mine
Artist: Genesis Title: No son of mine
Artist: Double You  Title: Please don't go
Artist: Stereo MC's Title: Step it up
Artist: Cranberries Title: Zombie
Artist: La Bouche   Title: Sweet dreams
Artist: Die Prinzen Title: Du mußt ein Schwein sein
Artist: Bad Religion    Title: Punk rock song
Artist: Bellini Title: Samba de Janeiro
Artist: Dion, Celine; Bee Gees  Title: Immortality
Artist: Jones, Tom; Mousse T.   Title: Sex bomb
Artist: Yanai, Kate Title: Bacardi feeling (Summer dreamin')
Artist: Heroes Del Silencio Title: Entre dos tierras
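
The question asked for a list of "Artist - Title" strings, and the run above prints the Genesis entry twice. A minimal sketch of one way to get that list, assuming the songs are held as dicts with 'artist' and 'title' keys like the parsedsongs list built by swr3.py. The helper name format_songs and the sample list are hypothetical, and dropping repeated entries is an assumption about what the asker wants, not something the answer does.

```python
def format_songs(songs):
    """Return unique 'Artist - Title' strings, keeping first-seen order."""
    seen = set()
    lines = []
    for song in songs:
        key = (song['artist'], song['title'])
        if key in seen:
            # Skip an entry that was already emitted
            continue
        seen.add(key)
        lines.append('{} - {}'.format(song['artist'], song['title']))
    return lines


# Hypothetical sample, copied from the first few entries of the output
sample = [
    {'artist': 'Vaya Con Dios', 'title': 'Nah neh nah'},
    {'artist': 'Genesis', 'title': 'No son of mine'},
    {'artist': 'Genesis', 'title': 'No son of mine'},
    {'artist': 'Double You', 'title': "Please don't go"},
]

for line in format_songs(sample):
    print(line)
```

If repeated plays should be kept (a song can legitimately air twice in an hour), the seen check can simply be removed.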

Do you have any code? What have you tried?
I tried following this guide: but it only grabs the artist name once.
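
One plausible reason for getting only a single artist (an assumption, since the guide's link did not survive) is calling find(), which returns only the first match, where find_all() returns every match. The sketch below illustrates the difference on a small inline document. SAMPLE_HTML is hypothetical markup modelled on the selectors used in the answer (div.detail-body containing an h4 title and an h5 or span artist), not the real swr3.de page, and the extract helper is likewise made up for illustration. It uses Python's built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the answer's selectors, not the real page
SAMPLE_HTML = """
<div class="detail-body"><h4>No son of mine</h4><h5>Genesis</h5></div>
<div class="detail-body"><h4>Please don't go</h4><span>Double You</span></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def extract(block):
    """Build an 'Artist - Title' string from one detail-body block."""
    title = block.h4.get_text(strip=True)
    # Fall back to the <span> when the block has no <h5>
    artist_tag = block.h5 or block.span
    artist = artist_tag.get_text(strip=True)
    return '{} - {}'.format(artist, title)


# find() stops at the first matching <div>
first_only = soup.find('div', 'detail-body')
# find_all() returns every matching <div>
all_blocks = soup.find_all('div', 'detail-body')

songs = [extract(block) for block in all_blocks]
print(extract(first_only))
print(songs)
```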