Python: extracting a specific heading from HTML with Beautiful Soup


python html parsing


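This is the patent example I am working with, and below is the code I am using. I want the code to show only the count from "Cited By (3)", so that I know how many times this patent has been cited. How can I make the output show only that count of 3? Please help.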

 
soup = BeautifulSoup(patent, 'html.parser')
cited_section =soup.findAll({"h2":"Cited By"})

print(cited_section)
Output I get is:
[<h2>Info</h2>, <h2>Links</h2>, <h2>Images</h2>, <h2>Classifications</h2>, <h2>Abstract</h2>, <h2>Description</h2>, <h2>Claims (<span itemprop="count">57</span>)</h2>, <h2>Priority Applications (5)</h2>, <h2>Applications Claiming Priority (1)</h2>, <h2>Related Parent Applications (1)</h2>, <h2>Publications (2)</h2>, <h2>ID=38925605</h2>, <h2>Family Applications (1)</h2>, <h2>Country Status (1)</h2>, <h2>Cited By (3)</h2>, <h2>Families Citing this family (12)</h2>, <h2>Citations (306)</h2>, <h2>Patent Citations (348)</h2>, <h2>Non-Patent Citations (23)</h2>, <h2>Cited By (4)</h2>, <h2>Also Published As</h2>, <h2>Similar Documents</h2>, <h2>Legal Events</h2>]



The citation count is created dynamically by JavaScript. However, you can count the number of elements with itemprop="forwardReferencesFamily" instead. For example:

import requests
from bs4 import BeautifulSoup


url = 'https://patents.google.com/patent/EP1208209A1/en?oq=medicinal+chemistry'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print(len(soup.select('tr[itemprop="forwardReferencesFamily"]')))

This prints the number of matching tr elements.
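If the "Cited By (3)" heading is already present in the saved HTML, as the output in the question suggests, the number can also be read from the heading text. The call in the question passes a dict as the tag name, and Beautiful Soup appears to treat that as a collection of tag names (the dict's keys), which would explain why every h2 is returned. The following is a minimal sketch of a text-based match; `SAMPLE_HTML` and the helper `cited_by_counts` are hypothetical names for illustration, not part of the original code.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment that mimics a few headings from the question's output.
SAMPLE_HTML = """
<h2>Claims (<span itemprop="count">57</span>)</h2>
<h2>Cited By (3)</h2>
<h2>Families Citing this family (12)</h2>
<h2>Cited By (4)</h2>
"""

# Match headings whose full text is "Cited By (<number>)".
CITED_BY = re.compile(r"^\s*Cited By \((\d+)\)\s*$")


def cited_by_counts(html):
    """Return the numbers found in every 'Cited By (N)' h2 heading."""
    soup = BeautifulSoup(html, "html.parser")
    counts = []
    for h2 in soup.find_all("h2"):
        match = CITED_BY.match(h2.get_text())
        if match:
            counts.append(int(match.group(1)))
    return counts


counts = cited_by_counts(SAMPLE_HTML)
print(counts)
```

Because the page in the question contains two "Cited By" headings, the helper returns a list; picking the first entry is an assumption about which section is wanted.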

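Hi, for this link I would like the code to print the patent citations, giving the publication number and title of each. I then want to use pandas to put the publication numbers in one column and the titles in another. So far I have used BeautifulSoup to convert the HTML file into a readable format. I selected the backward-references HTML tag and want to print the cited publication numbers and titles under that tag. I have given only one example, but I have a folder full of HTML files that I will process later.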

x=soup.select('tr[itemprop="backwardReferences"]') 
y=soup.select('td[itemprop="title"]') # this line gives all the titles in the document not particularly under the patent citations
print(y)

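The page appears to be rendered asynchronously. I suggest you use Selenium.

Hi, on the same topic: in the HTML file I only want to find the patent numbers and titles cited under that HTML tag. I tried this, but it prints all the titles in the HTML file.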

html_file = open(filename, 'r', encoding='utf-8')  # open the file in read mode
patent = html_file.read()
# print(patent)
total = 0
soup = BeautifulSoup(patent, 'html.parser')
x = soup.select('tr[itemprop="backwardReferences"]')
y = soup.select('td[itemprop="title"]')
print(y)
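For the folder of saved HTML files mentioned earlier in the thread, the same row count can be computed per file. This is only a sketch: the helper names `count_rows` and `count_rows_in_folder`, the `*.html` file pattern, and the demo files written to a temporary directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def count_rows(html, itemprop="forwardReferencesFamily"):
    """Count the tr elements that carry the given itemprop value."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(f'tr[itemprop="{itemprop}"]'))


def count_rows_in_folder(folder, itemprop="forwardReferencesFamily"):
    """Map each *.html file name in the folder to its row count."""
    counts = {}
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8")
        counts[path.name] = count_rows(html, itemprop)
    return counts


# Demo with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.html").write_text(
        '<table><tr itemprop="forwardReferencesFamily"><td>x</td></tr>'
        '<tr itemprop="forwardReferencesFamily"><td>y</td></tr></table>',
        encoding="utf-8",
    )
    Path(tmp, "b.html").write_text("<table></table>", encoding="utf-8")
    demo_counts = count_rows_in_folder(tmp)

print(demo_counts)
```

Reading each file with `Path.read_text` also avoids leaving file handles open, which the bare `open(...)` call above does not guarantee.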

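@宇航员 I suggest asking a new question here, in which you describe the problem and what you have tried. I will try to take a look.

This is the new discussion. I tried the approach using the HTML tag 'tr[itemprop="backwardReferencesFamily"]', but I cannot get it to print only the titles and publication numbers under this tag. I think this is the only HTML tag common to all the patents. Anything else may not be a consistent pattern.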
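One possible way to keep only the titles that belong to the citation rows is to search inside each matched tr rather than in the whole document. The sketch below runs on a made-up fragment: the `itemprop="publicationNumber"` attribute, the sample markup, and the helper name `cited_patents` are assumptions and should be checked against the real saved pages.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the structure of the real pages may differ.
SAMPLE_HTML = """
<table>
  <tr itemprop="backwardReferences">
    <td><span itemprop="publicationNumber">US0000001A</span></td>
    <td itemprop="title">First cited patent</td>
  </tr>
  <tr itemprop="backwardReferences">
    <td><span itemprop="publicationNumber">US0000002A</span></td>
    <td itemprop="title">Second cited patent</td>
  </tr>
  <tr itemprop="forwardReferences">
    <td><span itemprop="publicationNumber">US0000003A</span></td>
    <td itemprop="title">A citing patent, not wanted here</td>
  </tr>
</table>
"""


def cited_patents(html, itemprop="backwardReferences"):
    """Collect publication number and title from each matching row only."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select(f'tr[itemprop="{itemprop}"]'):
        # Search within this row, not the whole document.
        number = tr.select_one('[itemprop="publicationNumber"]')
        title = tr.select_one('[itemprop="title"]')
        rows.append({
            "publication_number": number.get_text(strip=True) if number else None,
            "title": title.get_text(strip=True) if title else None,
        })
    return rows


citations = cited_patents(SAMPLE_HTML)
for row in citations:
    print(row)
```

Passing `itemprop="backwardReferencesFamily"` selects the family rows instead, and `pandas.DataFrame(citations)` would turn the resulting list of dicts into a table with one column per key.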