Python: how to extract text from HTML pages parsed in pandas
Tags: python, html, pandas, beautifulsoup

Consider this simple example:
df = pd.DataFrame({'link' : ['https://en.wikipedia.org/wiki/World%27s_funniest_joke',
'https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World']})
df
Out[169]:
link
0 https://en.wikipedia.org/wiki/World%27s_funniest_joke
1 https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World
I want to parse each link with Beautiful Soup and store the parsed content in another column of the DataFrame. The following seems to work well:
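import re
import requests
from bs4 import BeautifulSoup

def puller(mylink):
    doc = requests.get(mylink)
    return BeautifulSoup(doc.content, 'html5lib')
# apply to the 'link' column; df.apply would pass whole columns to puller
df['parsed'] = df['link'].apply(puller)
df['mytag'] = df['parsed'].apply(lambda x: x.find_all('p'))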
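The problem is that I get lists, and I need to work with the text inside them. Specifically, I am trying to keep only the paragraphs that mention a joke somewhere in their text, but I cannot get it to work: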
def extractor(mylist):
    return list(filter(lambda x: re.search('joke', x), mylist))

df.mytag.apply(lambda x: extractor(x))
TypeError: expected string or bytes-like object
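What is the best approach here?
Thanks!

Each entry of df['mytag'] is a list of BeautifulSoup elements (the tags returned by find_all('p')). You can write a function that takes this list and returns the text of the paragraphs containing your word, then use .apply over df['mytag'] to run it on every row: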
def myfunc(list_of_ps, word='joke'):
    '''
    Return a list of paragraph text strings
    containing the word.
    '''
    result_ps = []
    for p in list_of_ps:
        if word in p.text:
            result_ps.append(p.text)  # append p instead if the element itself is needed
    return result_ps if result_ps else None

df['mytag'].apply(myfunc)
Edit:
The error in your question reflects the fact noted above. re.search expects a string as its argument; in other words, x in that function call must be a string or bytes-like object. Here it is a single BeautifulSoup element instead. The error can be fixed by passing the element's text, x.text.
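The fix can be tried without any network access. The sketch below uses a hypothetical stand-in class, FakeTag, that only mimics the .text attribute of a BeautifulSoup element (it is not part of bs4, and the sample sentences are made up). It shows that re.search fails on the object itself but works on its text.

```python
import re

class FakeTag:
    """Hypothetical stand-in for a bs4 element; it only provides .text."""
    def __init__(self, text):
        self.text = text

def extractor(mylist, pattern='joke'):
    # Search the text of each element, not the element itself
    return [x.text for x in mylist if re.search(pattern, x.text)]

tags = [FakeTag('The winning joke was submitted online.'),
        FakeTag('An unrelated paragraph.')]

try:
    re.search('joke', tags[0])  # passing the object reproduces the TypeError
    raised = False
except TypeError:
    raised = True

matches = extractor(tags)
```

With real parsed pages, the same extractor should work on the lists stored in df['mytag'], since bs4 elements expose .text as well.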
The pandas API is designed for more primitive data types; it is better to write a function that performs the link -> text conversion you need and then call apply. Here is one solution:
import pandas as pd
import requests
from bs4 import BeautifulSoup

df = pd.DataFrame({'link': [
    'https://en.wikipedia.org/wiki/World%27s_funniest_joke',
    'https://en.wikipedia.org/wiki/The_Funniest_Joke_in_the_World'
]})

def parse_link(mylink):
    # download the page and parse it
    doc = requests.get(mylink)
    return BeautifulSoup(doc.content, 'html5lib')

def matching_paragraphs(soup, text):
    # text of every <p> element that contains the search string
    return [p.get_text() for p in soup.find_all("p") if text in p.get_text()]

def apply_func(link, text):
    soup = parse_link(link)
    return matching_paragraphs(soup, text=text)

df['text'] = df.link.apply(apply_func, args=("joke",))
Output:
link text
0 https://en.wikipedia.org/wiki/World%27s_funnie... [The "world's funniest joke" is a term used by...
1 https://en.wikipedia.org/wiki/The_Funniest_Jok... ["The Funniest Joke in the World" (also "Joke ...
Since the result lives in a DataFrame, it is more natural to turn each list of strings into separate rows:
df.explode(column="text", ignore_index=True)
Result:
link text
0 https://en.wikipedia.org/wiki/World%27s_funnie... The "world's funniest joke" is a term used by ...
1 https://en.wikipedia.org/wiki/World%27s_funnie... The winning joke, which was later found to be ...
2 https://en.wikipedia.org/wiki/World%27s_funnie... Researchers also included five computer-genera...
3 https://en.wikipedia.org/wiki/The_Funniest_Jok... "The Funniest Joke in the World" (also "Joke W...
4 https://en.wikipedia.org/wiki/The_Funniest_Jok... The sketch appeared in the first episode of th...
5 https://en.wikipedia.org/wiki/The_Funniest_Jok... The sketch is framed in a documentary style an...
6 https://en.wikipedia.org/wiki/The_Funniest_Jok... The British Army are soon eager to determine "...
7 https://en.wikipedia.org/wiki/The_Funniest_Jok... The German version is described as being "over...
8 https://en.wikipedia.org/wiki/The_Funniest_Jok... The Germans attempt counter-jokes, but each at...
9 https://en.wikipedia.org/wiki/The_Funniest_Jok... The British joke is said to have been laid to ...
10 https://en.wikipedia.org/wiki/The_Funniest_Jok... The footage of Adolf Hitler is taken from Leni...
11 https://en.wikipedia.org/wiki/The_Funniest_Jok... If the German version of the joke is entered i...
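To see what explode does to a list-valued column without fetching any pages, here is a minimal sketch on a made-up frame (the link names and strings are placeholders, not real output). It assumes pandas 1.1 or later, where explode accepts ignore_index.

```python
import pandas as pd

# Toy data: one row with two matches, one with a single match, one with none
toy = pd.DataFrame({
    'link': ['page-a', 'page-b', 'page-c'],
    'text': [['first match', 'second match'], ['only match'], []],
})

# Each list element becomes its own row; the link value is repeated
exploded = toy.explode(column='text', ignore_index=True)
```

A row whose list is empty is kept with a missing value in text, so a dropna(subset=['text']) may be wanted afterwards if pages without matches should be discarded.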
This may not be the right use case for explode. Also, you should show what df['mytag'] looks like with an example.
Added some more info. Thanks.
Simplified and clarified the question.