Getting only the main text from an HTML document in Python


I have (one of these) that I would like to parse to get the main text. I was able to parse it successfully with the following code

import requests
from bs4 import BeautifulSoup

url = "https://seekingalpha.com/article/4253393-boeing-bear-wakens"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Remove script and style elements so only the visible text remains
for script in soup(["script", "style"]):
    script.extract()

text = soup.get_text()
text = text.encode('ascii', 'ignore').decode('ascii')  # drop non-ASCII characters

print(text)
The text I get back looks like this:

The Boeing Bear Wakens - The Boeing Company (NYSE:BA) | Seeking AlphaMarketplaceSeeking AlphaSUBSCRIBEPortfolioMy PortfoliosAll Portfolios+Create PortfolioModel PortfolioPeopleNewsAnalysisSign In/Join NowHelpKnowledge BaseFeedbackQuick Picks & Lists | The Boeing Bear WakensIndustrial Goods. Sep. 6, 2019 6:30 AM ETAbout: The Boeing Company (BA)By: Dhierin BechaiDhierin BechaiAerospace, airlines, commercial aircraft marketSummaryBoeing production temporarily reduced. Little is known about the duration of the reduction, but the decision to lower the production rate could be a sign of a prolonged grounding. The production rate cut adds to the decline in Boeing shares. With the Boeing 737 MAX fleet being grounded and deliveries to customers halted, Boeing is feeling the heat from two sides. While insurers have part of the damages covered

It includes all the page chrome, such as SUBSCRIBE, About, the timestamp, Join Now, and so on.

I need help on two fronts:

  • Is there a generic way to parse only the main text and skip the other elements?
  • For the other elements, can I get them back separately? For example, if I want to know how much traction the article got on social media (say, comments, or shares on different platforms).
  • To check for generality, please try it on


    Thanks, as always, for your help.

    You can pull the JSON out of the script tag and work with that:

    import requests
    from bs4 import BeautifulSoup
    import json

    url = "https://seekingalpha.com/article/4253393-boeing-bear-wakens"

    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # The page embeds its data as JSON assigned to window.SA in a <script> tag
    jsonObj = None
    for script in soup(["script"]):
        if 'window.SA = ' in script.text:
            jsonStr = script.text.split('window.SA = ')[1]
            jsonStr = jsonStr.rsplit(';', 1)[0]
            jsonObj = json.loads(jsonStr)
            break

    title = jsonObj['pageConfig']['Data']['article']['title']
    print(title)
    
    There's a lot of information in there. To get the article itself:

    # The article body is marked with itemprop="articleBody";
    # its paragraphs carry the class "p p1"
    article = soup.find('div', {'itemprop': 'articleBody'})
    ps = article.find_all('p', {'class': 'p p1'})
    for para in ps:
        print(para.text)
    
    Output:

    The Boeing Bear Wakens
    
    Article:

    With the Boeing (NYSE:BA) 737 MAX fleet being grounded and deliveries to customers being halted, Boeing is feeling the heat from two sides. While insurers have part of the damages covered, it is unlikely that a multi-month grounding will be fully covered. Initially, it seemed that Boeing was looking for a relatively fast fix to minimize disruptions as it was relatively quick with presenting a fix to stakeholders. Based on that quick roll-out, it seemed that Boeing was looking to have the fleet back in the air within 3 months. However, as the fix got delayed and Boeing and the FAA came under international scrutiny, it seems that timeline has slipped significantly as additional improvements are to be made. Initially, I expected that Boeing would be cleared to send the 737 MAX back to service in June/July, signalling a 3-4-month grounding and expected that Boeing's delivery target for the full year would decline by 40 units.
    
    
    
    Source: Everett Herald
    On the 5th of April, Boeing announced that it would be reducing the production rate for the Boeing 737 temporarily, which is a huge decision:
    As we continue to work through these steps, we're adjusting the 737 production system temporarily to accommodate the pause in MAX deliveries, allowing us to prioritize additional resources to focus on software certification and returning the MAX to flight. We have decided to temporarily move from a production rate of 52 airplanes per month to 42 airplanes per month starting in mid-April.
    
    You can also grab a JSON representation of the comments:

    # Comments are served by an AJAX endpoint that returns JSON
    url = 'https://seekingalpha.com/account/ajax_get_comments?id=4253393&type=Article&commentType=topLiked'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

    jsonObj_comments = requests.get(url, headers=headers).json()
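The exact shape of that comments JSON isn't documented in the answer, so as a minimal sketch, assuming each comment carries its replies under a `children` key (an assumed field name, not a confirmed Seeking Alpha schema), you could count total engagement like this:

```python
def count_comments(comment_tree):
    """Recursively count comments in a nested list of comment dicts.

    Assumes replies live under a 'children' key; this shape is an
    assumption for illustration, not the documented API schema.
    """
    total = 0
    for comment in comment_tree:
        total += 1
        total += count_comments(comment.get('children', []))
    return total

# Hypothetical sample mirroring the assumed structure
sample = [
    {'id': 1, 'content': 'Great analysis', 'children': [
        {'id': 2, 'content': 'Agreed', 'children': []},
    ]},
    {'id': 3, 'content': 'Not convinced', 'children': []},
]
print(count_comments(sample))  # 3
```

Once you confirm the real response structure (e.g. by printing `jsonObj_comments.keys()`), the same recursive walk would give you the comment count the question asks about.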
    
    As far as a generic method goes, that will be difficult, since every site has its own structure, formatting, use of tags, attribute names, and so on. However, I did notice that both sites you provided use `<p>` tags within their articles, so I suppose you could extract the text from those. With a generic approach, though, you'll get somewhat generic output, meaning you may end up with excess text, or with bits of the article missing:

    import requests
    from bs4 import BeautifulSoup

    url1 = "https://seekingalpha.com/article/4253393-boeing-bear-wakens"
    url2 = "https://www.dqindia.com/accenture-helps-del-monte-foods-unlock-innovation-drive-business-growth-cloud/"

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

    # Swap in url2 to try the second site
    response = requests.get(url1, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    paragraphs = soup.find_all('p')
    for p in paragraphs:
        print(p.text)
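One way to trim the excess text that a generic `<p>` scrape produces (my own suggestion, not part of the answer above) is to keep only paragraphs above a word-count threshold, since menus, buttons, and bylines tend to be short while article prose runs long:

```python
def keep_substantial(paragraphs, min_words=10):
    """Keep only paragraph strings with at least min_words words.

    A crude heuristic: page chrome is usually short, article prose is
    long. The threshold of 10 words is arbitrary and site-dependent.
    """
    return [p.strip() for p in paragraphs if len(p.split()) >= min_words]

# Example input resembling a mixed <p> scrape
texts = [
    "SUBSCRIBE",
    "Sign in / Join now",
    "Boeing announced that it would be reducing the production rate "
    "for the Boeing 737 temporarily, which is a huge decision.",
]
print(keep_substantial(texts))
```

Fed the list of `p.text` values from the scrape above, this keeps the article sentence and drops the navigation strings; tune `min_words` per site.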
    

    The approach using the JSON works great!

    Hi @chitown88, thank you very much for the answer, but I'm looking for a generic method, which is why I also mentioned the second url in the question. Can we extend this to a generic algorithm?

    Sorry, I completely missed the second url. I'll take a look. My initial thought, though, is that there is no generic algorithm that works across multiple sites. Unlike tables, which are clearly defined in HTML with a `<table>` tag containing rows (`<tr>`) and cells within each row (`<td>`), a written article can have an endless number of different combinations of tags, class names, and ids. Mainly this is because every site is structured differently and may use different tag names and attributes. If you have multiple pages within the same site, there's a chance a generic solution will work, since their structure will likely be similar. But even that depends. I'll look at the second url now.

    Looking closer at the two sites, you could probably do something generic, since they both use `<p>` tags for the body of the article (which makes sense). The only caveat I can think of is trying to distinguish which `<p>` tags are related to the article and which are not. I suppose one could find the first `<p>` tag of the body and then get all the `<p>` tags that are its siblings. That's just off the top of my head; I'll give it a shot and see what comes back.
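The sibling-`<p>` idea from the comment above can be sketched as follows. The markup here is a made-up example (real pages will use different container tags and class names), but the traversal pattern is the same:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a real page: article <p> tags are siblings
# inside one container, while chrome <p> tags sit elsewhere
html = """
<html><body>
  <p>SUBSCRIBE | Sign in</p>
  <div class="article">
    <p>First paragraph of the article body.</p>
    <p>Second paragraph of the article body.</p>
    <p>Third paragraph of the article body.</p>
  </div>
  <p>Footer links</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the first <p> inside the article container, then collect it
# together with all of its following sibling <p> tags
first = soup.find('div', {'class': 'article'}).find('p')
body = [first.text] + [sib.text for sib in first.find_next_siblings('p')]
print(body)
```

Because the navigation and footer paragraphs are not siblings of the article's first `<p>`, they're excluded automatically; the hard part on a real page is locating that first body paragraph in the first place.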