Python: web scraping with BeautifulSoup

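I need some help with web scraping using the BeautifulSoup library.

I need to extract text from a web page.

My goal is to extract the text of the page. I am extracting all the "p" tags and their text, but some "p" tags contain "a" tags that also hold text.

So my questions are:

  • How do I convert Unicode (“”) to a plain string, as the text appears on the web page? When I extract just the "p" tags, BeautifulSoup returns the text as Unicode, and even special characters come out as Unicode, so I want to convert the extracted Unicode text to plain text. How can I do that?

  • How do I extract the text inside a "p" tag that contains "a" tags? I mean that I want the complete text inside the "p" tag, including the text inside nested tags.

  • I have tried the following code: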

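    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://thehill.com/…/365407-sean-diddy-combs-wants-to-buy-c…").content
    news_soup = BeautifulSoup(html, "html.parser")
    a_text = news_soup.find_all('p')

    # Fails: find_all() returns a ResultSet (a list), which has no .string attribute
    y = a_text[1].find_all('a').string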
    

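    You can use a nested list comprehension to find all the links inside each paragraph tag, and use encode("ascii", "ignore") to convert the Unicode text to ASCII, dropping any characters that have no ASCII equivalent: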

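    # Python 2 code: in Python 3, urllib.urlopen is urllib.request.urlopen.
    import urllib
    from bs4 import BeautifulSoup as soup
    s = soup(str(urllib.urlopen('http://thehill.com/blogs/blog-briefing-room/365407-sean-diddy-combs-wants-to-buy-carolina-panthers-and-sign-kaepernick').read()), 'lxml')
    # Full text of every <p>, including the text inside nested tags such as <a>
    all_text = [i.text.encode("ascii", 'ignore') for i in s.find_all('p')]
    # Link texts grouped per <p>; filter(None, ...) drops paragraphs without links
    all_paragraphs = filter(None, [[b.text.encode("ascii", 'ignore') for b in i.find_all('a')] for i in s.find_all('p')])
    print(all_text)
    print(all_paragraphs)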
    
    Output:

    ['Hip hop mogul Sean Diddy Combs said Sunday night hes interested in buying the Carolina Panthers and signing quarterback Colin Kaepernick, who has been unemployed this season after kneeling during the national anthem in 2016.', 'Panthers owner Jerry Richardson announced Sunday he would be selling the team after the 2017 season, just hours after Sports Illustrated published accusations of sexual misconduct from former employees. Richardson also allegedly used a racial slur about a team scout.', 'Diddy took to Twitter soon after the Panthers announced the upcoming sale, declaring his desire to own a team and increase diversity among NFL ownership.', 'I would like to buy the @Panthers. Spread the word. Retweet!', 'There are no majority African American NFL owners. Lets make history.', '', 'Kaepernick respondedSundaymorning, saying I want in on the ownership group!', 'I want in on the ownership group! Lets make it happen!', 'Other athletes, including NBA starStephen Curryandformer NFL playerGreg Jennings,responded to Combs saying they were interested in part-owning the team.', "Former league MVP Cam Newton is the team's current quarterback.", 'Kaepernick has been a free agent since the end of the 2016 season, when he made headlinesfor kneeling during the national anthem before games to protest issues of racial inequality.', 'President TrumpDonald John TrumpHouse Democrat slams Donald Trump Jr. for serious case of amnesia after testimony Skier Lindsey Vonn: I dont want to represent Trump at Olympics Poll: 4 in 10 Republicans think senior Trump advisers had improper dealings with Russia MORE hascriticized Kaepernick directly, saying the NFL should have suspended him for the demonstration. He has since taken aim at other players who have knelt or sat during the anthem during the 2017 season.', '- This story was updated at 11:03 A.M. EST.', 'View the discussion thread.', 'The Hill 1625 K Street, NW Suite 900 Washington DC 20006 | 202-628-8500 tel | 202-628-8503 fax', 'The contents of this site are 2017 Capitol Hill Publishing Corp., a subsidiary of News Communications, Inc.']
    [['Sports Illustrated'], ['@Panthers'], ['Stephen Curry', 'former NFL player'], ['President Trump', 'Donald John Trump', 'House Democrat slams Donald Trump Jr. for serious case of amnesia after testimony', 'Skier Lindsey Vonn: I dont want to represent Trump at Olympics', 'Poll: 4 in 10 Republicans think senior Trump advisers had improper dealings with Russia', 'MORE', 'criticized Kaepernick directly', 'knelt or sat'], ['View the discussion thread.']]
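
For anyone on Python 3, here is a minimal sketch of the same idea that runs without a network connection. It is not the code from the answer above: SAMPLE_HTML is made-up markup standing in for the downloaded page, and the helper names (paragraph_texts, paragraph_links, to_ascii) are hypothetical. It relies on get_text(), which returns the text of a tag together with the text of everything nested inside it. In Python 3 that text is already an ordinary str, so no conversion is needed unless you specifically want ASCII only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
<p>Diddy said he\u2019s interested in buying the <a href="#">Carolina Panthers</a>.</p>
<p>Other athletes, including <a href="#">Stephen Curry</a> and <a href="#">Greg Jennings</a>, responded.</p>
<p>No links here.</p>
</body></html>
"""


def paragraph_texts(html):
    """Return the full text of every <p>, including text inside nested tags."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


def paragraph_links(html):
    """Return one list of link texts per <p> that contains at least one <a>."""
    soup = BeautifulSoup(html, "html.parser")
    per_paragraph = [
        [a.get_text() for a in p.find_all("a")] for p in soup.find_all("p")
    ]
    # Keep only the paragraphs that actually contain links.
    return [links for links in per_paragraph if links]


def to_ascii(text):
    """Drop every non-ASCII character, then turn the bytes back into a str."""
    return text.encode("ascii", "ignore").decode("ascii")


texts = paragraph_texts(SAMPLE_HTML)
links = paragraph_links(SAMPLE_HTML)
print(texts)
print(links)
print(to_ascii(texts[0]))
```

Note that encode("ascii", "ignore") silently discards characters such as the curly apostrophe, which is why the output above contains "hes", "dont" and "Lets". If you want to keep those characters, skip to_ascii and use the str values as they are.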