Python: Parsing a large HTML document with lxml.html


This returns all of the text, but not the text I need.

How about this:

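from bs4 import BeautifulSoup
import requests
from lxml import html, cssselect

link = "https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/a10-qq320186302018.htm"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
str_soup = str(soup)
doc = html.document_fromstring(str_soup)

# Print the text of every <font> element whose style matches exactly.
target_style = "font-family:Helvetica,sans-serif;font-size:9pt;"
for col in doc.cssselect('font'):
    # .get() returns None when there is no style attribute, and col.text is
    # None when the element has no leading text, so check both explicitly
    # instead of hiding errors behind a bare except.
    if col.get('style') == target_style and col.text is not None:
        print(col.text.strip())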

Output:

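from bs4 import BeautifulSoup
import re

# `html` is assumed to hold the HTML snippet as a string.
soup = BeautifulSoup(html, 'html.parser')

x = soup.find_all('font')
# Strip all whitespace (spaces, tabs, newlines) from the extracted text.
name = re.sub(r"\s+", "", x[0].get_text())
value = re.sub(r"\s+", "", x[3].get_text())

print(name, 'costs', value)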

I'm not getting exactly what I want, but this is what I have come up with so far, and I'm building on it:

iPhone costs 29,906

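I think this approach would work if the document were only what I posted, but the document I am actually looking at is much larger, so it is hard to find the index of every value I need. I always know that I need the "iPhone" field; only the number after it changes.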
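Since the label is known but its position is not, one option is to look the row up by its label rather than by index. The following is only a minimal sketch under assumptions: the `SAMPLE` markup and the `values_for_label` helper are hypothetical, made up for illustration, and the real filing's table layout may well differ (the "Other" row and its value are placeholders, not data from the filing).

```python
from lxml import html

# Hypothetical sample markup, loosely shaped like a statement table.
# The actual filing's structure may differ.
SAMPLE = """
<table>
  <tr><td><font>Net sales by product:</font></td></tr>
  <tr><td><font>iPhone</font></td><td><font>$</font></td><td><font>29,906</font></td></tr>
  <tr><td><font>Other</font></td><td><font>$</font></td><td><font>1,234</font></td></tr>
</table>
"""

def values_for_label(doc, label):
    """Return the cell texts that follow `label` in the first row starting with it."""
    for row in doc.iter('tr'):
        # Collect the visible text of each cell, dropping empty cells
        # and the stand-alone currency symbol.
        cells = [cell.text_content().strip() for cell in row.iter('td')]
        cells = [c for c in cells if c and c != '$']
        if cells and cells[0] == label:
            return cells[1:]
    return []  # label not found

doc = html.fromstring(SAMPLE)
print(values_for_label(doc, 'iPhone'))  # prints ['29,906']
```

The lookup does not depend on how many `font` elements come before the row, so it should keep working as the document grows, provided the label text stays the same.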

from bs4 import BeautifulSoup
import requests
from lxml import html, cssselect
import csv

link = "https://www.sec.gov/Archives/edgar/data/320193/000032019318000100/a10-qq320186302018.htm"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
str_soup = str(soup)
doc = html.document_fromstring(str_soup)

# Cells that carry only formatting symbols and no data.
SKIP = {"", "$", "%", ")"}

# newline='' stops the csv module from writing blank lines on Windows.
with open('AAPL_financials.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for col in doc.cssselect('tr'):
        row = []
        for text in col.cssselect('font'):
            if text.text is None:
                continue
            value = text.text.strip()
            if value in SKIP:
                continue
            # A leading "(" marks a negative number, e.g. "(123" -> "-123".
            if value[0] == "(":
                value = value.replace("(", "-")
            row.append(value)
        writer.writerow(row)
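
The rows written this way are still strings such as "29,906" or "-1,234". If numeric values are needed later, a small conversion step could be added. This is a rough sketch only: `to_number` is a hypothetical helper, not part of lxml or csv, and it assumes US-style thousands separators and parentheses for negatives.

```python
def to_number(cell):
    """Convert a cell such as '29,906', '-1,234' or '(1,234' to a number.

    Returns None when the cell is not numeric (for example a row label).
    """
    text = cell.replace(',', '').strip()
    # Treat an opening parenthesis as a minus sign (accounting notation).
    if text.startswith('('):
        text = '-' + text[1:]
    text = text.rstrip(')')
    try:
        return float(text) if '.' in text else int(text)
    except ValueError:
        return None

print(to_number('29,906'))  # prints 29906
print(to_number('(1,234'))  # prints -1234
print(to_number('iPhone'))  # prints None
```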