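Python: Why do I first get the authors, titles, abstracts and journals separately, and only then together? They should appear together, for each title. (Python, BeautifulSoup)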


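I am trying to extract information from the HTML file of the linked page. For each paper title I need the authors, the journal name, and the abstract. Before I get them together, however, each of them is printed once on its own. Please help. That is, I first get a list of all titles, then all authors, then all journals, then all abstracts, and only after that do I get them grouped by title: the title first, followed by its authors, journal name, and abstract. I only need the grouped output, not the separate lists.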

from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests
import re

f = open('acmpage.html', 'r') #open html file stores locally
html = f.read() #read from the html file and store the content in 'html'
soup = BeautifulSoup(html)
pret = soup.prettify()
soup1 = BeautifulSoup(pret)
for content in soup1.find_all("table"):
    soup2 = BeautifulSoup(str(content))
    pret2 = soup2.prettify()
    soup3 = BeautifulSoup(pret2)

    for titles in soup3.find_all('a', target = '_self'): #to print title
        print "Title: ", 
        print titles.get_text()
    for auth in soup3.find_all('div', class_ = 'authors'): #to print authors
        print "Authors: ", 
        print auth.get_text()
    for journ in soup3.find_all('div', class_ = 'addinfo'): #to print name of journal
        print "Journal: ", 
        print journ.get_text()
    for abs in soup3.find_all('div', class_ = 'abstract2'): # to print abstract
        print "Abstract: ", 
        print abs.get_text()

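You need to find the first addinfo div, then walk forward to find the publisher in a div further along in the document. To do that, go up the tree to the enclosing tr, then get the next sibling tr, and search inside that tr for the next data item (the publisher).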

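Once you have done this for every item you need to display, issue a single print command for all the items found. If you search for each kind of information in a separate loop, each kind of information will naturally be listed separately.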
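As a minimal sketch of that navigation, assume a simplified, made-up table in which each title sits in one row and its details in the row that follows; the real ACM page layout may well differ. The HTML snippet and the variable names are illustrative assumptions, and the code is written for Python 3 with bs4:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each title row is followed by a row holding its details.
html = """
<table>
  <tr><td><a href="p1" target="_self">Paper One</a></td></tr>
  <tr><td><div class="authors">A. Author</div>
          <div class="addinfo">Journal X</div></td></tr>
  <tr><td><a href="p2" target="_self">Paper Two</a></td></tr>
  <tr><td><div class="authors">B. Writer</div>
          <div class="addinfo">Journal Y</div></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for link in soup.find_all("a", target="_self"):
    title_row = link.find_parent("tr")              # enclosing row of the title
    detail_row = title_row.find_next_sibling("tr")  # the row that follows it
    authors = detail_row.find("div", class_="authors").get_text(strip=True)
    journal = detail_row.find("div", class_="addinfo").get_text(strip=True)
    records.append((link.get_text(strip=True), authors, journal))

# One print per paper, issued only after all of its fields have been found.
for title, authors, journal in records:
    print("Title: {}\nAuthors: {}\nJournal: {}\n".format(title, authors, journal))
```

Collecting the fields first and printing once per paper is what keeps each title next to its own authors and journal.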

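Your code is also full of redundancy: you only need to import one version of BeautifulSoup (the first import is shadowed by the second), and there is no need to re-parse the elements two more times. You also import two different URL-loading libraries (urllib2 and requests) and then ignore both by loading a local file.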
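To illustrate the point about re-parsing, here is a small sketch that parses the document once and searches nested elements directly. The inline HTML string is a made-up stand-in for the contents of the local file, and the parser choice ("html.parser") is an assumption, not something the question specifies:

```python
from bs4 import BeautifulSoup  # a single import is enough

# Hypothetical stand-in for the contents of the local HTML file.
html = "<table><tr><td><div class='authors'>A. Author</div></td></tr></table>"

soup = BeautifulSoup(html, "html.parser")  # parse once

authors = []
for table in soup.find_all("table"):
    # A Tag supports find_all directly; no prettify() or second parse is needed.
    for div in table.find_all("div", class_="authors"):
        authors.append(div.get_text(strip=True))

print(authors)
```

When reading from disk, an open file handle can be passed to BeautifulSoup in place of the string.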

Search for the table rows that contain the title information, then parse the contained information out of each row.

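For this page, whose more complex (and, frankly, disorganized) layout contains multiple tables, the easiest approach is to go up to the enclosing table row for each title link found: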

from bs4 import BeautifulSoup
import requests

resp = requests.get('http://dl.acm.org/results.cfm', 
                    params={'CFID': '376026650', 'CFTOKEN': '88529867'})
soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)

for title_link in soup.find_all('a', target='_self'):
    # find the parent row to base the rest of the search on
    row = next(p for p in title_link.parents if p.name == 'tr')
    title = title_link.get_text()
    authors = row.find('div', class_='authors').get_text()
    journal = row.find('div', class_='addinfo').get_text()
    abstract = row.find('div', class_='abstract2').get_text()

The next() call loops over a generator expression that walks through each parent of the title link until a <tr> element is found.

Now you have all the information grouped per title.
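The loop above collects the fields but does not print them, and row.find returns None when a div is missing. The following is a hedged, self-contained variation on the same idea: it runs against a made-up HTML snippet instead of the live page, uses a hypothetical helper named text_of to tolerate missing fields, and prints one block per paper. It assumes Python 3 and the "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one row per paper, with all of its fields inside the row.
html = """
<table>
  <tr><td>
    <a href="p1" target="_self">Paper One</a>
    <div class="authors">A. Author</div>
    <div class="addinfo">Journal X</div>
    <div class="abstract2">First abstract.</div>
  </td></tr>
  <tr><td>
    <a href="p2" target="_self">Paper Two</a>
    <div class="authors">B. Writer</div>
    <div class="addinfo">Journal Y</div>
  </td></tr>
</table>
"""


def text_of(row, css_class):
    """Return the stripped text of the first matching div, or '' if absent."""
    div = row.find("div", class_=css_class)
    return div.get_text(strip=True) if div is not None else ""


soup = BeautifulSoup(html, "html.parser")
papers = []
for title_link in soup.find_all("a", target="_self"):
    # Nearest enclosing <tr>; None if the link is not inside a table row.
    row = next((p for p in title_link.parents if p.name == "tr"), None)
    if row is None:
        continue
    papers.append({
        "title": title_link.get_text(strip=True),
        "authors": text_of(row, "authors"),
        "journal": text_of(row, "addinfo"),
        "abstract": text_of(row, "abstract2"),
    })

# Print each paper's fields together, one block per title.
for paper in papers:
    print("Title: {title}\nAuthors: {authors}\n"
          "Journal: {journal}\nAbstract: {abstract}\n".format(**paper))
```

Passing a default of None to next() avoids a StopIteration error for links that are not inside any table row.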