
Python: How do I extract all the text from a table built with tr, td, and span?


In Python 3, I want to extract all of the text content from a table. But the information is organized not as a traditional table, but with tr, td, and span elements.

The information is in the "Movimentações" block on the page.

The program I have so far to start the extraction:

import requests
from bs4 import BeautifulSoup
import urllib3; urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

res = requests.get("https://esaj.tjsp.jus.br/cpopg/show.do?processo.codigo=2S000YR9Q0000&processo.foro=100&paginaConsulta=2&conversationId=&dadosConsulta.localPesquisa.cdLocal=100&cbPesquisa=NMPARTE&dadosConsulta.tipoNuProcesso=UNIFICADO&dadosConsulta.valorConsulta=Google&uuidCaptcha=&pbEnviar=Pesquisar", verify=False)

soup = BeautifulSoup(res.content, 'lxml')

# I get the case number to organize what will be extracted
# (the guard now tests the same "Processo:" cell it extracts from,
# instead of checking the "Assunto:" cell)
processo_td = soup.select_one('td:has(> .labelClass:contains("Processo:")) + td')
num_processo = processo_td.text.strip() if processo_td is not None else 'N/A'

# This is where the table is
table = soup.find_all("tbody", {"id": "tabelaUltimasMovimentacoes"})
I just want to extract all the text from the rows and organize it like this (example for the first rows):

Here is the text content of the rows inside "tabelaUltimasMovimentacoes":


22/04/2019  Certidão de Publicação Expedida
            Relação: 0130/2019  Data da Disponibilização: 22/04/2019  Data da Publicação: 23/04/2019  Número do Diário: 2792  Página: 402/420

16/04/2019  Remetido ao DJE
            Relação: 0130/2019  Teor do ato: Ante o exposto, julgo PROCEDENTES os pedidos, com resolução do mérito, nos termos do artigo 487, inciso I, do Código de Processo Civil, para que GOOGLE e FACEBOOK procedam [...] https://www.youtube.com/channel/UCOMI2Kd2YtfpicY5UJXiXhg e https://www.facebook.com/leiamirandaoficial1/ [...] confirmando a tutela de urgência [...] Transitado em julgado, ao arquivo, dando-se baixa na Distribuição. P.I.C.  Advogados: Celso de Faria Monteiro (OAB 138436/SP), Fabio Rivelli (OAB 297608/SP), Rafael Gomes Anastacio (OAB 320579/SP)

Please, does anyone know how I can extract all the text and build a dictionary from it?
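For reference, a minimal sketch of walking the rows directly with BeautifulSoup and building a list of dicts. The HTML string below is hypothetical, since the real markup was not preserved above; the only identifier taken from the question is tabelaUltimasMovimentacoes, and the assumed row shape is one td holding the date plus a final td whose spans hold the movement text:

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real markup; only the tbody id comes from the question.
html = """
<table>
  <tbody id="tabelaUltimasMovimentacoes">
    <tr>
      <td>22/04/2019</td>
      <td><span>Certidão de Publicação Expedida</span>
          <span>Relação: 0130/2019 Data da Publicação: 23/04/2019</span></td>
    </tr>
    <tr>
      <td>16/04/2019</td>
      <td><span>Remetido ao DJE</span></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'lxml')
movimentacoes = []
for tr in soup.select('tbody#tabelaUltimasMovimentacoes tr'):
    tds = tr.find_all('td')
    if len(tds) < 2:
        continue
    movimentacoes.append({
        'data': tds[0].get_text(strip=True),
        # get_text flattens every nested span of the last cell into one string
        'movimento': tds[-1].get_text(' ', strip=True),
    })

print(movimentacoes)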

It's a bit of a hack, but you can pass the html from requests to pandas to extract the tables, then do a little cosmetic work on the result.

import pandas as pd
import requests
import urllib3; urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

pd.options.mode.chained_assignment = None
r = requests.get('https://esaj.tjsp.jus.br/cpopg/show.do?processo.codigo=2S000YR9Q0000&processo.foro=100&paginaConsulta=2&conversationId=&dadosConsulta.localPesquisa.cdLocal=100&cbPesquisa=NMPARTE&dadosConsulta.tipoNuProcesso=UNIFICADO&dadosConsulta.valorConsulta=Google&uuidCaptcha=&pbEnviar=Pesquisar', verify=False)
# read_html parses every <table> on the page; the movements table is the fifth one
tables = pd.read_html(r.content)
result = tables[4].head()  # .head() keeps only the first five rows shown by default
# the middle column is an empty spacer, so drop it
result.drop(['Unnamed: 1'], axis=1, inplace=True)
print(result)
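If a dictionary is the end goal, the DataFrame converts directly; a small sketch, assuming result holds the two surviving columns (date and movement text):

# each row becomes one dict keyed by the column headers
movimentacoes = result.to_dict('records')
print(movimentacoes[0])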

Using selenium to click and show all the information:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

url = 'https://esaj.tjsp.jus.br/cpopg/show.do?processo.codigo=2S000YR9Q0000&processo.foro=100&paginaConsulta=2&conversationId=&dadosConsulta.localPesquisa.cdLocal=100&cbPesquisa=NMPARTE&dadosConsulta.tipoNuProcesso=UNIFICADO&dadosConsulta.valorConsulta=Google&uuidCaptcha=&pbEnviar=Pesquisar'
d = webdriver.Chrome()
d.get(url)
# click the "Movimentações" link so the full list of rows is rendered
WebDriverWait(d, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#linkmovimentacoes'))).click()
time.sleep(1)
pd.options.mode.chained_assignment = None
# the table of interest immediately follows the element with id divLinksTituloBlocoMovimentacoes
table_html = d.find_element(By.CSS_SELECTOR, '#divLinksTituloBlocoMovimentacoes + table').get_attribute('outerHTML')
tables = pd.read_html(table_html)
result = tables[0]
result.drop(['Unnamed: 1'], axis=1, inplace=True)  # drop the empty spacer column
print(result)
d.quit()
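As a side note, the fixed time.sleep(1) could be replaced with an explicit wait on the expanded table; a sketch using the same locator, assuming the table only enters the DOM after the click expands the block:

# wait up to 10s for the table after the Movimentações heading to appear
WebDriverWait(d, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#divLinksTituloBlocoMovimentacoes + table'))
)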

I suggest a more careful search of Stack Overflow. There are several existing questions whose answers relate closely to yours and could help you solve the problem.

Thank you very much, @QHarr. I can't seem to extract all the rows. Note that clicking ">> Listar todas as movimentações." opens the whole table; that is what I want to get.

I had to use selenium to get the complete list. Answer edited.

Thank you very much, @QHarr. Inspecting the element on the site, the part of the HTML matched by '#divLinksTituloBlocoMovimentacoes + table' has the XPath '/html/body/div/table[4]/tbody/tr/td/table[5]'. Please, I just don't understand how you arrived at '#divLinksTituloBlocoMovimentacoes + table'.

The element just before the table of interest has the id divLinksTituloBlocoMovimentacoes. So I use that id to specify that I want the table immediately following the element with id divLinksTituloBlocoMovimentacoes. I could see this relationship by looking at the table's html with inspect element.
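To illustrate that last comment, a minimal sketch (with made-up surrounding markup) showing how the adjacent sibling combinator '+' selects the table that directly follows the element with that id:

from bs4 import BeautifulSoup

# Made-up markup mirroring the relationship described in the comments:
# a div with the id, immediately followed by the table we want.
html = """
<div id="divLinksTituloBlocoMovimentacoes">Movimentações</div>
<table><tr><td>22/04/2019</td></tr></table>
"""

soup = BeautifulSoup(html, 'lxml')
# '+' matches the sibling that comes immediately after the first selector
table = soup.select_one('#divLinksTituloBlocoMovimentacoes + table')
print(table.td.text)  # 22/04/2019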