Python-解析HTML列表并从中创建JSON数组_Python_Html_Json_Parsing_Lxml

Python-解析HTML列表并从中创建JSON数组

python html json parsing

Python-解析HTML列表并从中创建JSON数组,python,html,json,parsing,lxml,Python,Html,Json,Parsing,Lxml,我正在使用Flask创建一个restful API，对于其中一个公开的方法，我正在尝试解析如下所示的HTML： <li class="product some_product"> <div class="product_wrap"> <div class="basic_stat product_title"> <a href="/product/type/title1"> The Never Ending Story

我正在使用Flask创建一个restful API，对于其中一个公开的方法，我正在尝试解析如下所示的HTML：

<li class="product some_product">
  <div class="product_wrap">
   <div class="basic_stat product_title">
    <a href="/product/type/title1">
     The Never Ending Story
    </a>
   </div>
   <div class="basic_stat product_score score">
    <div class="score_w">
     100
    </div>
   </div>

例如，给我一个空数组

如何只过滤我需要的行（文本），而在每行的开头没有任何空格？这是唯一让我头疼的部分

比如：

我可以通过转换成JSON自己来管理以后的工作

我不确定我是否走上了正确的道路，或者我应该尝试一种完全不同的方法。我想先得到这样的文本值，然后创建dict或json应该是一件简单的事情

我真的很感激你的帮助/usr/bin/python3

#!/usr/bin/python3
#==========================================================

titles  = 'movies.html'

defaultDir  = 'C:\\Users\\geekiechic\\Documents\\'  ##  win
# defaultDir  = '/home/eli/Documents/'  ##  linux
##=========================================================

movies = []

with open((defaultDir + titles), 'r') as data:
    even = -1
    for line in data:
        if not line .lstrip() .startswith('<'):
            even += 1
            if even % 2 == 0:
                movies += ("{'Title': '%s" % line.strip())
            else:
                movies += ("', 'Score': '%s'}," % line.strip())

output  = '' .join(movies) .rstrip(',')
print('[' + output + ']')

#========================================================== titles='movies.html' defaultDir='C:\\Users\\geekiechic\\Documents\\\'##win #defaultDir='/home/eli/Documents/'###linux ##========================================================= 电影=[] 以open（（defaultDir+标题），“r”）作为数据：偶数=-1 对于行输入数据：

如果不是line.lstrip（）.startswith（“伙计们，谢谢你们的帮助。我是SO的新手，作为海报，所以如果需要请标记

我在做了一些测试后，用这种漂亮的汤解决了这个问题：

<li class="product some_product">
  <div class="product_wrap">
   <div class="basic_stat product_title">
    <a href="/product/type/title1">
     The Never Ending Story
    </a>
   </div>
   <div class="basic_stat product_score score">
    <div class="score_w">
     100
    </div>
   </div>

我首先分析了如何浏览增强的html，然后提取了标题和分数，创建了两个单独的列表

titles, scores = [],[]
for line in movies_html.findAll('a'):
    titles.append(line.getText(strip=True))

for line in movies_html.findAll('div', class_="basic_stat product_score"):
    scores.append(line.getText(strip=True))

#然后，我使用zip将这些列表合并成一个dict列表

result= [ { "title": str(t), "score": int(s) } for t, s in zip(titles, scores) ]

#最后，我使用json模块转储为json格式

 json = jsom.dumps(result, indent=2) 
 return json

也许有一个更干净或更漂亮的方法，但我不认为这有那么难看？

只是给你一个警告，要求广泛建议的问题在技术上被认为是离题的，结果可能会被标记。哎呀，对不起，我不太确定这会被认为是“广泛建议”事实上，我是stackoverflow的新手。当我测试得更深入、更接近我的目标时，我会回来的-对不起。@geekiechic你能粘贴更多的HTML吗？我需要看看几个产品的结构才能帮助你。

 json = jsom.dumps(result, indent=2) 
 return json