Extract URLs and their names from an HTML file stored on disk and print them in pairs - Python


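I am trying to extract and print URLs and their names (the text between each link's opening and closing tags) from an HTML file saved on disk, without using BeautifulSoup or any other library. This is just beginner-level Python code. The output format should be: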

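    http://..filepath/filename.pdf
    File's Name
    so on...

I can extract and print all the URLs, or all the names, but I cannot collect each name, which appears a little later in the markup before the closing tag, and print it below its URL. My code is getting messy and I am stuck. This is what I have so far: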

    import os

    with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
        txt = html.read()

    # for urls
    nolp = 0
    urlarrow = []
    while nolp == 0:
        pos = txt.find("href")
        if pos >= 0:
            txtcount = len(txt)
            txt = txt[pos:txtcount]
            pos = txt.find('"')
            txtcount = len(txt)
            txt = txt[pos+1:txtcount]
            pos = txt.find('"')
            url = txt[0:pos]
            if url.startswith("http") and url.endswith("pdf"):
                urlarrow.append(url)
        else:
            nolp = 1

    for item in urlarrow:
        print(item)

    # for names: almost identical code to the above
    html.close()

How can I make this work? I think I need to merge the two passes into a single function, but how?

Note: I have posted an answer below, but I think there is probably a simpler, more Pythonic solution.

This produces the output I need, but I am sure there is a better way:

    import os

    # expanduser() is needed so that '~' resolves to the home directory
    with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
        txt = html.read()
    text = txt

    # for urls
    nolp = 0
    urlarrow = []
    while nolp == 0:
        pos = txt.find("href")
        if pos >= 0:
            txtcount = len(txt)
            txt = txt[pos:txtcount]
            pos = txt.find('"')
            txtcount = len(txt)
            txt = txt[pos+1:txtcount]
            pos = txt.find('"')
            url = txt[0:pos]
            if url.startswith("http") and url.endswith("pdf"):
                urlarrow.append(url)
        else:
            nolp = 1

    with open(os.path.expanduser('~/SomeFolder/page.html'), 'r') as html:
        text = html.read()

    # for names
    noloop = 0
    namearrow = []
    while noloop == 0:
        posB = text.find("title")
        if posB >= 0:
            textcount = len(text)
            text = text[posB:textcount]
            posB = text.find('"')
            textcount = len(text)
            text = text[posB+19:textcount]  # because the string starts 19 chars after posB
            posB = text.find('</')
            name = text[1:posB]
            if text[0].startswith('>'):
                namearrow.append(name)
        else:
            noloop = 1

    # interleave urls and names so each name prints below its url
    fullarrow = []
    for pair in zip(urlarrow, namearrow):
        for item in pair:
            fullarrow.append(item)

    for instance in fullarrow:
        print(instance)

    html.close()
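One possible way to fold the two passes into a single function is sketched below. It is only a sketch under some assumptions that the question does not confirm: every link is written as `<a href="...">Name</a>` with a double-quoted `href`, the link text contains no nested tags, and no attribute value contains a `>` character. The function name `extract_links` and the `sample` markup are made up for illustration. It still uses nothing beyond `str.find` and slicing, and it avoids the fixed 19-character offset by looking for the `>` that closes the opening tag.

```python
def extract_links(html_text):
    """Return a list of (url, name) pairs for every PDF link in html_text."""
    pairs = []
    pos = 0
    while True:
        # Locate the next href attribute and the quotes around its value.
        href = html_text.find('href="', pos)
        if href == -1:
            break
        url_start = href + len('href="')
        url_end = html_text.find('"', url_start)
        if url_end == -1:
            break
        url = html_text[url_start:url_end]

        if url.startswith('http') and url.endswith('pdf'):
            # The link text sits between the '>' that closes the opening
            # tag and the next '</a>' (assumption: no nested tags).
            name_start = html_text.find('>', url_end)
            name_end = html_text.find('</a>', name_start)
            if name_start != -1 and name_end != -1:
                name = html_text[name_start + 1:name_end].strip()
                pairs.append((url, name))

        # Continue scanning just after this URL so no later link is skipped.
        pos = url_end
    return pairs


# Hypothetical markup standing in for the saved page.
sample = (
    '<a href="http://example.com/docs/first.pdf" title="First">First File</a>\n'
    '<a href="/local/page.html">Skip me</a>\n'
    '<a href="http://example.com/docs/second.pdf">Second File</a>\n'
)

for url, name in extract_links(sample):
    print(url)
    print(name)
```

To run it against the saved page, read the file as in the code above (`open(os.path.expanduser('~/SomeFolder/page.html'), 'r')`) and pass the resulting text to `extract_links`. Because each URL and its name are found in the same loop iteration, they stay paired even when a non-PDF link sits between two PDF links, which is not guaranteed when two separately built lists are combined with `zip`.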
