Python: How to extract text from HTML pages parsed in pandas

python html pandas beautifulsoup

Consider this simple example:

df = pd.DataFrame({'link' : ['https://en.wikipedia.org/wiki/World%27s_funniest_joke',
                             'https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World']})

df
Out[169]: 
                                                           link
0         https://en.wikipedia.org/wiki/World%27s_funniest_joke
1  https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World

I want to parse each link with Beautiful Soup and store the parsed content in another column of the dataframe. The following seems to work:

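import requests
from bs4 import BeautifulSoup

def puller(mylink):
    doc = requests.get(mylink)
    return BeautifulSoup(doc.content, 'html5lib')

# apply puller to the 'link' column (df.apply would pass whole columns instead of URLs)
df['parsed'] = df['link'].apply(puller)
df['mytag'] = df.parsed.apply(lambda x: x.find_all('p'))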

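The problem is that I get lists, and I need to work with the text inside them. In particular, I am trying to keep only the paragraphs that mention "joke" somewhere in their text, but I cannot get it to work: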

def extractor(mylist):
    return list(filter(lambda x: re.search('joke', x), mylist))

df.mytag.apply(lambda x: extractor(x))
TypeError: expected string or bytes-like object

What is the best approach here?

Thanks!

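Each entry of df['mytag'] is a list of BeautifulSoup elements. You can write a function that takes this list and returns the text of the paragraphs containing your word, then use .apply over df['mytag'] so that it runs on every row.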

def myfunc(list_of_ps, word='joke'):
    '''
    This will return a list of string text paragraphs 
    containing the word.
    '''
    result_ps = []
    for p in list_of_ps:
        if word in p.text:
            result_ps.append(p.text) # p if p itself is needed

    return result_ps if result_ps else None

df['mytag'].apply(myfunc)

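Edit: the error in your question reflects the point made above: each entry is a list of BeautifulSoup elements, not strings. re.search needs a string as its argument; in other words, x in that function call must be a string or bytes-like object. Here x is a single BeautifulSoup element. The error goes away if you take the element's text with x.text.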
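
As a minimal sketch of that fix, the asker's extractor can stay almost unchanged if re.search runs on each element's text rather than on the element itself. The inline HTML and the built-in "html.parser" backend are assumptions made here so the example runs without network access; the code in the question uses html5lib on downloaded pages.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page.
html = """
<p>The winning joke was submitted by a psychiatrist.</p>
<p>This paragraph is about something else.</p>
<p>Researchers also rated computer-generated jokes.</p>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")  # list of bs4 Tag objects, not strings


def extractor(tags, pattern="joke"):
    """Return the text of every tag whose text matches the regex pattern."""
    return [tag.get_text() for tag in tags if re.search(pattern, tag.get_text())]


matches = extractor(paragraphs)
print(matches)
```

With a real dataframe the same function could be passed directly to df['mytag'].apply(extractor).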

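The pandas API is designed for more primitive data types; it is better to write a function that performs the link -> text conversion you need and then call apply. Here is one solution: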

import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.DataFrame({'link' : [
        'https://en.wikipedia.org/wiki/World%27s_funniest_joke',
        'https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World'
    ]
})

def parse_link(mylink):
    doc = requests.get(mylink)
    return BeautifulSoup(doc.content, 'html5lib')

def matching_paragraphs(soup, text):
    res = [p.get_text() for p in soup.find_all("p") if text in p.get_text()]
    return res
   
def apply_func(link, text):
    soup = parse_link(link)
    res = matching_paragraphs(soup, text=text)
    return res
    

df['text'] = df.link.apply(apply_func, args=("joke",))

Output:

                                                link                                               text
0  https://en.wikipedia.org/wiki/World%27s_funnie...  [The "world's funniest joke" is a term used by...
1  https://en.wikipedia.org/wiki/The_Funniest_Jok...  ["The Funniest Joke in the World" (also "Joke ...

With the dataframe in this shape, you can then more sensibly turn the lists of strings into rows:

df.explode(column="text", ignore_index=True)

Result:

                                                 link                                               text
0   https://en.wikipedia.org/wiki/World%27s_funnie...  The "world's funniest joke" is a term used by ...
1   https://en.wikipedia.org/wiki/World%27s_funnie...  The winning joke, which was later found to be ...
2   https://en.wikipedia.org/wiki/World%27s_funnie...  Researchers also included five computer-genera...
3   https://en.wikipedia.org/wiki/The_Funniest_Jok...  "The Funniest Joke in the World" (also "Joke W...
4   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The sketch appeared in the first episode of th...
5   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The sketch is framed in a documentary style an...
6   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The British Army are soon eager to determine "...
7   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The German version is described as being "over...
8   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The Germans attempt counter-jokes, but each at...
9   https://en.wikipedia.org/wiki/The_Funniest_Jok...  The British joke is said to have been laid to ...
10  https://en.wikipedia.org/wiki/The_Funniest_Jok...  The footage of Adolf Hitler is taken from Leni...
11  https://en.wikipedia.org/wiki/The_Funniest_Jok...  If the German version of the joke is entered i...
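
To see what explode does without fetching any pages, here is a small sketch on made-up data. The link names and paragraph strings are hypothetical placeholders. It also shows one case the Wikipedia example does not hit: a page with no matching paragraphs yields an empty list, which explode turns into a single row holding NaN.

```python
import pandas as pd

# Hypothetical data: one list of matching paragraphs per link.
df = pd.DataFrame({
    "link": ["page_a", "page_b", "page_c"],
    "text": [["first joke", "second joke"], ["third joke"], []],
})

# One row per list element; the link value is repeated for each element.
exploded = df.explode("text", ignore_index=True)
print(exploded)

# Rows that came from an empty list hold NaN in "text"; drop them if unwanted.
cleaned = exploded.dropna(subset=["text"]).reset_index(drop=True)
print(cleaned)
```

Whether to drop the NaN rows is a design choice: keeping them records which links had no matches at all.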

This may not be the right use case for explode. Also, you should illustrate the nature of df['mytag'] with an example.

Added some more information. Thanks.

Simplified and clarified the question.