Extracting web elements from a website using Python

I want to extract various elements from the table and the paragraph text on this website.

This is the code I am using:

import lxml
from lxml import html
from lxml import etree
import urllib2
source = urllib2.urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30656&SSO=1').read()
x = etree.HTML(source)
# Single quotes on the outside so the double-quoted id inside the XPath does not end the string.
growth = x.xpath('//*[@id="home_feature_container"]/div/div[2]/div/table[2]/tbody/tr[3]/td[2]/p')
growth
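What is the best way to extract the desired elements from the website without having to change the XPath in the code every time? New data is published on the same site every month, but the XPath sometimes seems to change slightly.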

BeautifulSoup to the rescue:

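from bs4 import BeautifulSoup
import urllib2

r = urllib2.urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655')
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(r, 'html.parser')
# The third positional argument of find() is `recursive`, so the trailing 'h4'
# in the original call was not acting as a tag filter; it is dropped here.
soup.find('div', {'id': 'home_feature_container'})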
This code comes close to what you describe. If you use soup.find().contents, you get a list of every item contained in the element.


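As for changes to the page, that really depends. If the changes are drastic, you will have to change the soup.find() call. Otherwise, you may be able to write code that is generic enough to keep working. (For example, if the page always uses a div named home_feature_container, you will never have to change that part.)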
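To illustrate the point about anchoring on a stable id, here is a minimal sketch that finds the container by id and then searches below it, instead of spelling out every level of the path. It uses the standard-library xml.etree.ElementTree on a small, well-formed, made-up page; `PAGE` and its contents are assumptions for illustration, and only the id `home_feature_container` comes from the question. Real-world HTML is rarely well-formed, so on the actual site you would probably parse with lxml or BeautifulSoup and apply the same idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page; only the id is taken from the question.
PAGE = """
<html><body>
  <div id="sidebar"><table><tr><td>unrelated</td></tr></table></div>
  <div id="home_feature_container">
    <div><h4>Report title</h4>
      <table><tr><td>first table</td></tr></table>
      <table><tr><td>second table</td></tr></table>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Anchor on the stable id, then search anywhere below it.
container = root.find(".//div[@id='home_feature_container']")
tables = container.findall(".//table")

print(len(tables))
print(tables[1].find(".//td").text)
```

If a wrapper div is added or removed inside the container, the relative search still finds the tables, whereas an absolute path such as div/div[2]/div/table[2] would stop matching.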

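If the items change position regularly, try retrieving them by name. For example, here is how to extract the elements from the "New Orders" row of the table: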
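The snippet that belonged here is not present in this copy of the answer. As a stand-in, here is a minimal sketch of looking a row up by its label rather than by its position. It uses the standard-library xml.etree.ElementTree on a small, well-formed, made-up table; `SAMPLE`, the helper name `cells_for_row`, and every cell value are assumptions for illustration, not data from the site. On the real page you would likely parse with lxml and use a comparable XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the report table; the values are placeholders, not real data.
SAMPLE = """
<table>
  <tr><th>Index</th><th>Series Index</th><th>Direction</th></tr>
  <tr><td>PMI</td><td>50.0</td><td>Growing</td></tr>
  <tr><td>New Orders</td><td>50.0</td><td>Growing</td></tr>
</table>
"""

def cells_for_row(root, label):
    """Return the cell texts of the first row that has a <td> whose text equals `label`.

    Returns None when no such row exists. The label is pasted into the path
    expression as-is, so this sketch does not handle labels containing quotes.
    """
    row = root.find(".//tr[td='{}']".format(label))
    if row is None:
        return None
    return ["".join(td.itertext()).strip() for td in row.findall("td")]

table = ET.fromstring(SAMPLE)
print(cells_for_row(table, "New Orders"))
print(cells_for_row(table, "No Such Row"))
```

Because the row is selected by its label, it keeps matching if rows are added above it or reordered.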

Or, if you need the whole HTML table:

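# Assumes `tree` is the parsed page, e.g. tree = etree.HTML(source) as in the question.
# From the matching <th>, "/../.." climbs two levels: to its row, then to the row's parent.
data = tree.xpath('//th[text()="MANUFACTURING AT A GLANCE"]/../..')

for element in data:
    print(etree.tostring(element, pretty_print=True))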
Another example, using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
content = requests.get(url).content
soup = BeautifulSoup(content, "lxml")

# Take the second <table> on the page and its <tbody>.
table = soup.find_all('table')[1]
table_body = table.find('tbody')

# Collect the non-empty cell texts of every row.
data = []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

print(data)
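Since the question is about not depending on positions, one possible follow-up is to index the collected rows by their first cell, so that a value can be looked up by name. This is a minimal sketch, not part of the original answer; the helper name `index_rows` and the sample rows are assumptions, with placeholder values in the shape the loop above produces (rows that have no td cells come out as empty lists).

```python
# Hypothetical rows shaped like the `data` list built above; values are placeholders.
data = [
    [],  # e.g. a header row that contains no <td> cells
    ["PMI", "50.0", "Growing"],
    ["New Orders", "50.0", "Growing"],
]

def index_rows(rows):
    """Map each non-empty row's first cell to the list of its remaining cells."""
    return {row[0]: row[1:] for row in rows if row}

by_name = index_rows(data)
print(by_name.get("New Orders"))
```

Using .get() returns None for a label that is missing, which makes it easy to notice when the site renames a row.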

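Which elements do you want? Your XPath is invalid and cannot be tested on this page.

I have changed the XPath. I need the elements from the "Manufacturing at a Glance" table, and also the paragraph text.

Hi, can you give me a code example that returns values? There is a "Manufacturing at a Glance" table. Could you show some elements being extracted and displayed with your technique? Thanks a lot!!

Hey Ettore, a quick question. I described it here: Thanks!!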