Extracting data from an HTML page (Python)

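I am trying to extract some data from the page at the URL below. I want to extract any text between two strings ("Item 1A Risk Factors" and "Item 1B Unresolved Staff Comments"). I am having trouble finding the right regular expression to do this.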

import re
import urllib
import html2text

url = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"
html = urllib.urlopen(url).read()

text = html2text.html2text(html)

regex= '(?<=Item 1A Risk Factors)(.*)(?=Item 1B Unresolved)'

match = re.search(regex, text, flags=re.IGNORECASE)

print match

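If you want to use a regex, you can use the following code, which runs under Python 3.5.2. Try printing your "text" to see how Item 1A actually appears; it differs from what you see on the web page (in the raw HTML the space is written as the entity &#160;, as in ITEM&#160;1A). Hope this helps.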

import urllib.request
from urllib.error import URLError, HTTPError
import re
import contextlib

mainpage = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"

try:
    # Download the filing and decode the raw HTML (tags and entities included).
    with contextlib.closing(urllib.request.urlopen(mainpage)) as url:
        htmltext = url.read().decode('utf-8')
        #print(htmltext)
except HTTPError as e:
    print("HTTPError")
except URLError as e:
    print("URLError")
else:
    # Match against the raw HTML, where the space after ITEM is the entity &#160;.
    results = re.findall(r'(?=ITEM\&\#160\;1A\.(.*)(RISK FACTORS))(.*)(?=ITEM\&\#160\;1B\.(.*)(UNRESOLVED))',htmltext)
    print(results)
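As a complement to the code above, here is a minimal sketch of a variation: decode the entities first so that &#160; becomes an ordinary non-breaking space, which \s matches in Python 3 string patterns, and let the dot span newlines with re.DOTALL. The helper name extract_between and the sample string are made up for illustration; they are not taken from the filing.

```python
import html
import re


def extract_between(raw_html, start_pat, end_pat):
    """Return the text between two heading patterns, or None if not found.

    Hypothetical helper for illustration only.
    """
    # html.unescape turns "&#160;" into "\xa0", which \s matches for str patterns.
    text = html.unescape(raw_html)
    pattern = re.compile(
        start_pat + r"(.*?)" + end_pat,
        flags=re.IGNORECASE | re.DOTALL,  # DOTALL lets "." cross line breaks
    )
    match = pattern.search(text)
    return match.group(1) if match else None


# Invented sample that imitates the entity-encoded headings.
sample = (
    "<b>ITEM&#160;1A.&#160;RISK FACTORS</b>\n"
    "Some risk text.\n"
    "<b>ITEM&#160;1B.&#160;UNRESOLVED STAFF COMMENTS</b>"
)

section = extract_between(
    sample,
    r"ITEM\s+1A\.\s*RISK\s+FACTORS",
    r"ITEM\s+1B\.\s*UNRESOLVED",
)
print(repr(section))
```

The captured text still contains any tags that sit between the two headings, so it usually needs a tag-stripping pass afterwards.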

You can strip the HTML tags with this.

Find (the beginning of the pattern is garbled beyond recovery; only the tail survives):

…[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?)>

Replace with nothing.

Then run this on the resulting string:

1A\s*\.\s*Risk\s+Factors(.*)1B\s*\.\s*Unresolved\s+Staff\s+Comments

What you want is capture group 1.
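The same two steps (strip the tags, then capture the text between the headings) can be sketched in Python. This is not the tag-stripping regex from this answer: it swaps in the standard library's html.parser.HTMLParser instead. The names TextExtractor, strip_tags and SECTION_RE, the sample markup, and the added Item\s+ prefixes in the pattern are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &#160;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(raw_html):
    """Return the document text with all tags removed."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    return parser.text()


# Assumed pattern: the answer's regex with "Item\s+" added and a lazy group.
SECTION_RE = re.compile(
    r"Item\s+1A\s*\.\s*Risk\s+Factors(.*?)"
    r"Item\s+1B\s*\.\s*Unresolved\s+Staff\s+Comments",
    re.IGNORECASE | re.DOTALL,
)

# Invented sample markup, not the real filing.
sample = (
    "<html><head><style>p{}</style></head><body>"
    "<p>Item&#160;1A. <b>Risk Factors</b></p>"
    "<p>The risks described below.</p>"
    "<p>Item 1B. Unresolved Staff Comments</p>"
    "</body></html>"
)

plain = strip_tags(sample)
match = SECTION_RE.search(plain)
section = match.group(1).strip() if match else None
print(section)
```

On a real filing the headings may also appear in the table of contents, so the first match is not necessarily the section body; checking the length of the captured text is a cheap sanity test.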

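You can wrap the text in your own application, or:

Paste the group 1 string into an application document, then right-click for the context menu -> Other utilities -> Word wrap.
Enter a maximum line length of about 60.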
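If you would rather do the wrapping in your own code, a minimal sketch with Python's standard textwrap module could look like this. The helper name wrap_section and the sample sentence are illustrative; the width of 60 follows the value suggested above.

```python
import textwrap


def wrap_section(text, width=60):
    """Collapse runs of whitespace and wrap the text to the given width."""
    collapsed = " ".join(text.split())
    return textwrap.fill(collapsed, width=width)


sample = (
    "The risks described below could materially and adversely affect "
    "our business, results of operations, financial condition and liquidity."
)

wrapped = wrap_section(sample)
print(wrapped)
```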

The result is about 5k of wrapped text, like this (truncated):

The risks described below could materially and adversely
affect our business, results of operations, financial
condition and liquidity.  Our business operations could also
be affected by additional factors that apply to all
companies operating in the U.S. and globally.Strategic
RisksGeneral or macro-economic factors, both domestically
and internationally, may materially adversely affect our
financial performance.General economic conditions, globally
or in one or more of the markets we serve, may adversely
affect our financial performance.  Higher interest rates,
lower or higher prices of petroleum products, including
crude oil, natural gas, gasoline, and diesel fuel, higher
costs for electricity and other energy, weakness in the
housing market, inflation, deflation, increased costs of
essential services, such as medical care and utilities,
higher levels of unemployment, decreases in consumer
disposable income, unavailability of consumer credit, higher
consumer debt levels, changes in consumer spending and
shopping patterns, fluctuations in currency exchange rates,
higher tax rates, imposition of new taxes and surcharges,
other changes in tax laws, other regulatory changes, overall

Why not parse the HTML without regular expressions? Could you use CSS selectors or XPath with an actual parser?

The HTML source contains neither the string "Item 1A Risk Factors" nor "Item 1B Unresolved". "Item 1A Risk Factors" and "Item 1B Unresolved" are in the actual text. That is why I strip the HTML tags first and then try the regex. Hope that makes sense.

It may be worth noting that html2text converts HTML source into valid Markdown, not plain text.

Thanks, @异常的