Python: getting data from a table on a website
Tags: python, xpath, web-scraping, beautifulsoup, request


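I need help extracting (scraping) data from a table on a web page. I am using BeautifulSoup, but I cannot extract table number 6. Any help would be appreciated.

I need all of the row data from table 6. The page contains several tables, but I only need the data from the Compliance Information table, and I don't know how to get it.

The URL is given in the code.

My code is as follows: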

import random
import time

import requests
import unicodecsv
from bs4 import BeautifulSoup
from lxml import html

link = ["http://ec.europa.eu/environment/ets/ohaDetails.do?returnURL=&languageCode=en&accountID=&registryCode=&buttonAction=all&action=&account.registryCode=&accountType=&identifierInReg=&accountHolder=&primaryAuthRep=&installationIdentifier=&installationName=&accountStatus=&permitIdentifier=&complianceStatus=&mainActivityType=-1&searchType=oha&resultList.currentPageNumber=1&nextList=Next%C2%A0%3E&selectedPeriods="]

# start and end are slice bounds that are not defined in the code as posted
for pagenum, links in enumerate(link[start:end]):
  print(links)
  r = requests.get(links)
  # Pause 2 to 5 seconds between requests
  time.sleep(random.randint(2,5))
  soup = BeautifulSoup(r.content,"lxml")
  tree = html.fromstring(str(soup))
  value = []
  data_block = soup.find_all("table", {"class": "bordertb"})
  print (data_block)
  output = []
  for item in data_block:
    table_data = item.find_all("td", {"class": "tabletitle"})[0].table
    value.append([table_data])
    print (value)

  # Write the collected values to a tab-separated file
  with open("Exhibit_2_EXP_data.tsv", "wb") as outfile:
    writer = unicodecsv.writer(outfile, delimiter="\t")
    writer.writerow(["Data_Output"])
    for item in value:
      writer.writerow(item)

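Try this. The script below should fetch the content of that table. The table you want has no identifier of its own, so start from the table just before it (which has a unique ID) and then navigate to the next sibling to reach the table you need. Here is what I did to achieve that: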

import requests
from bs4 import BeautifulSoup

url = "http://ec.europa.eu/environment/ets/ohaDetails.do?returnURL=&languageCode=en&accountID=&registryCode=&buttonAction=all&action=&account.registryCode=&accountType=&identifierInReg=&accountHolder=&primaryAuthRep=&installationIdentifier=&installationName=&accountStatus=&permitIdentifier=&complianceStatus=&mainActivityType=-1&searchType=oha&resultList.currentPageNumber=1&nextList=Next%C2%A0%3E&selectedPeriods="

r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
for items in soup.find(id="tblInstallationContacts").find_next_sibling().find_all("tr")[:-5]:
    data = [item.get_text(strip=True) for item in items.find_all("td")]
    print(data)
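
To see the sibling-navigation idea in isolation, here is a minimal sketch that runs without a network request. The HTML sample, its cell values, and the helper name `rows_after_anchor` are assumptions made up for illustration; only the `tblInstallationContacts` ID comes from the script above. The real page also has trailing rows that the script above drops with `[:-5]`, which this sketch does not reproduce.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: a table with a known ID followed by
# an unlabelled table that holds the rows we actually want.
SAMPLE_HTML = """
<table id="tblInstallationContacts"><tr><td>contact</td></tr></table>
<table>
  <tr><td>Year</td><td>Status</td></tr>
  <tr><td>2005</td><td>A</td></tr>
  <tr><td>2006</td><td>B</td></tr>
</table>
"""


def rows_after_anchor(html_text, anchor_id):
    """Return the cell texts of the first table that follows the element with anchor_id."""
    soup = BeautifulSoup(html_text, "html.parser")
    anchor = soup.find(id=anchor_id)
    if anchor is None:
        return []
    # Passing a tag name skips the whitespace text nodes between the tables.
    target = anchor.find_next_sibling("table")
    if target is None:
        return []
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in target.find_all("tr")
    ]


rows = rows_after_anchor(SAMPLE_HTML, "tblInstallationContacts")
for row in rows:
    print(row)
```

Returning an empty list when the anchor or the sibling is missing avoids the `AttributeError` that chained calls would raise if the page layout changes.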

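Thanks SIM, it works fine and the data comes out in the required format. Thanks for your help :)

Don't forget to tick the grey check mark between the up/down vote buttons next to my answer to mark it as the accepted solution.

Thank you. Please help me separate the table's output values based on the span HTML tags, so that the output for the URL looks like 195640 421****.

Can anyone help with this?

To solve this, use str.split and str.join.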
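
One possible reading of that last suggestion, as a rough sketch: when a cell holds several `<span>` tags, `get_text(strip=True)` glues their texts together, so the values can be extracted with a marker character, split on it, and joined back with spaces. The cell markup, the values `100` and `200`, and the helper name `cell_values` are assumptions for illustration, since the real cell markup is not shown in the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: two values wrapped in separate <span> tags.
CELL_HTML = "<table><tr><td><span>100</span><span> 200 </span></td></tr></table>"


def cell_values(td, marker="|"):
    """Return the span texts of a cell separated by single spaces.

    marker is assumed to be a character that never occurs in the data.
    """
    raw = td.get_text(separator=marker)
    # str.split breaks the text at the marker; str.join puts the
    # non-empty, stripped pieces back together with one space each.
    parts = [part.strip() for part in raw.split(marker)]
    return " ".join(part for part in parts if part)


soup = BeautifulSoup(CELL_HTML, "html.parser")
td = soup.find("td")

joined = td.get_text(strip=True)  # span texts run together
separated = cell_values(td)       # span texts kept apart

print(joined)
print(separated)
```

In the answer's loop, the same helper could replace `item.get_text(strip=True)` for each `td`, assuming the span layout matches this sample.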