Python: extracting text from a web page

python html parsing web-scraping web-crawler

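I am working on a project in which I crawl thousands of websites to extract text data; the end use case is natural language processing.

EDIT: Since I am crawling thousands of websites, I cannot write custom scraping code for each one, which means I cannot search for specific element ids. I am looking for a generic solution.

I am aware of solutions such as the .get_text() function in Beautiful Soup. The problem with this approach is that it returns all the text on the page, much of which is unrelated to the page's main topic. In most cases a page focuses on a single topic, but the sidebars, header, and footer may contain links or text about other topics, promotions, or other content.

.get_text() returns all the text on a page in one go. The problem is that it merges everything (relevant and irrelevant parts) together. Is there another function similar to .get_text() that returns all the text, but as a list in which each item is a distinct section of text, so that it is possible to tell where a new topic starts and ends?

Also, is there a way to identify the main body of text on a web page?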

Below are some snippets showing how to query the data you need with BeautifulSoup4 and Python 3:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the page you want to fetch
response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')
# Placeholder CSS selector used further down; adjust it to your page
required_css_selectors = 'div.some-class'

# Print the first child node of <body> (soup.body.contents is the full list)
print(soup.body.contents[0])
# Print the first <div> found on the page
print(soup.find('div'))
# Print all <div> elements on the page as a list
print(soup.find_all('div'))
# Print the element whose id is 'required_element_id'
print(soup.find(id='required_element_id'))
# Print all elements that match the CSS selector, as a list
print(soup.select(required_css_selectors))
# Print the value of one attribute of an element (find() returns None if the id is absent)
print(soup.find(id='someid').get("attribute-name"))
# You can also break one large query into several smaller ones
parent = soup.find(id='someid')
# get_text() returns the text between the opening and closing tag
print(parent.select(".some-class")[0].get_text())

For more advanced requirements, you can also check out [link missing]. If you run into any difficulty implementing this, or if your requirements are different, let me know.
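For the generic case raised in the question, where no site-specific ids are available, one possible heuristic is to strip elements that usually hold page chrome and then keep the container with the most paragraph text. The sketch below is only an illustration under stated assumptions: `extract_main_text`, `BOILERPLATE_TAGS`, and the sample HTML are hypothetical names and data made up for this example, not part of BeautifulSoup, and the heuristic will misfire on pages that do not put their main text in `<p>` tags.

```python
from bs4 import BeautifulSoup

# Tags that usually hold navigation, ads, or page chrome rather than the article.
# This list is an assumption for illustration; tune it for your own crawl.
BOILERPLATE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]


def extract_main_text(html):
    """Return the text of the container that holds the most paragraph text."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that rarely contain main content.
    for tag in soup.find_all(BOILERPLATE_TAGS):
        tag.decompose()

    best_text, best_score = "", 0
    for container in soup.find_all(["article", "main", "section", "div"]):
        # Score a container by the total length of its direct <p> children.
        paragraphs = container.find_all("p", recursive=False)
        texts = [p.get_text(" ", strip=True) for p in paragraphs]
        score = sum(len(t) for t in texts)
        if score > best_score:
            best_text, best_score = "\n".join(texts), score
    return best_text


# Made-up page: a nav bar, a promotional sidebar, the article, and a footer.
sample_html = """
<html><body>
  <nav><a href="/">Home</a> <a href="/deals">Deals</a></nav>
  <div id="sidebar"><p>Subscribe now!</p></div>
  <div id="content">
    <p>Python is a popular language for text processing.</p>
    <p>Libraries such as BeautifulSoup make HTML parsing easier.</p>
  </div>
  <footer><p>Copyright notice</p></footer>
</body></html>
"""

main_text = extract_main_text(sample_html)
print(main_text)
```

Because the function returns the paragraphs joined by newlines, calling `main_text.split("\n")` gives the list-of-sections form the question asks about. Scoring by direct `<p>` children keeps an outer wrapper `<div>` from winning just because it contains everything.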

Maybe you could try using regex to get the links you need.

@MustardTiger, have you tried using find_all, which lets you search for elements by tag and attributes, and then calling text?

Hi, I edited the question to make things clearer.
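The find_all suggestion in the comments could look roughly like the sketch below, which returns the page text as a list with one entry per element instead of a single merged string. The sample HTML and the class names `intro` and `ad` are made up for illustration; real pages will need different tag and attribute filters.

```python
from bs4 import BeautifulSoup

# Made-up page used only for this illustration.
sample_html = """
<html><body>
  <h1>Main topic</h1>
  <p class="intro">First paragraph about the main topic.</p>
  <p>Second paragraph with <b>more</b> detail.</p>
  <div class="promo"><p class="ad">Buy one, get one free!</p></div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Search by tag name only: one list entry per heading or paragraph,
# in document order.
blocks = [
    el.get_text(" ", strip=True)
    for el in soup.find_all(["h1", "h2", "h3", "p"])
]

# Search by tag name and CSS class: only <p> elements with class "intro".
intro = [el.text.strip() for el in soup.find_all("p", class_="intro")]

# Search by an attribute dictionary: only <p> elements with class "ad".
ads = [el.text.strip() for el in soup.find_all("p", attrs={"class": "ad"})]

for block in blocks:
    print(block)
```

Keeping each element's text as its own list item preserves the boundaries between sections, so unrelated blocks such as the promotion can be filtered out afterwards.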