Python: Tags in scraped content must be in the same order as in the original HTML file


python python-3.x web-scraping


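I am trying to build a web scraper. My scraper must find all the lines that correspond to the selected tags and save them, in the same order as in the original HTML, to a new file.md file.
The tags are specified in an array: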

list_of_tags_you_want_to_scrape = ['h1', 'h2', 'h3', 'p', 'li']
Then this gives me only the content within the specified tag:

from bs4 import BeautifulSoup

soup_each_html = BeautifulSoup(particular_page_content, "html.parser")
inner_content = soup_each_html.find("article", "container")
Let's say this is the result:

<article class="container"> <h1>this is headline 1</h1> <p>this is paragraph</p> <h2>this is headline 2</h2> <a href="bla.html">this won't be shown bcs 'a' tag is not in the array</a> </article>

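My code then writes the matching elements to file.md, and it does work. But the problem is that it goes through the array (for each) and saves all the <h1> elements first, then all the <h2> elements second, and so on. That is because this is the order specified in the list_of_tags_you_want_to_scrape array.
This would be the result:

<article class="container"> <h1>this is headline 1</h1> <h2>this is headline 2</h2> <p>this is paragraph</p> </article>

So I would like to have them in the right order, the same as in the original HTML. After the first <h1> there should be the <p> element.
That means I would probably also need a for-each loop over inner_content that checks whether each line of inner_content matches at least one of the tags in the array. If it does, save it and move on to the next line. I tried to do that and went through inner_content line by line, but it gave me an error, and I do not know whether this is the right way to do it. (This is my first day using the BeautifulSoup module.)
Any hints or advice on how to modify my approach to achieve this? Thank you!

To maintain the original order of the HTML input, you can use recursion to loop over the soup.contents attribute:

from bs4 import BeautifulSoup as soup

def parse(content, to_scrape=['h1', 'h2', 'h3', 'p', 'li']):
    # Yield the current node if its tag name is one of the wanted tags
    if content.name in to_scrape:
        yield content
    # Visit the children in document order; nodes without contents yield nothing
    for i in getattr(content, 'contents', []):
        # Pass to_scrape along so a custom tag list also applies to nested nodes
        yield from parse(i, to_scrape)

For example:

html = """
<html>
  <body>
    <h1>My website</h1>
    <p>This is my first site</p>
    <h2>See a listing of my interests below</h2>
    <ul>
      <li>programming</li>
      <li>math</li>
      <li>physics</li>
    </ul>
    <h3>Thanks for visiting!</h3>
  </body>
</html>
"""
result = list(parse(soup(html, 'html.parser')))

Every bs4 object contains a name and a contents attribute. The name attribute is the tag name itself, while the contents attribute stores all the child HTML. parse uses a generator to first check whether the passed bs4 object has a tag that belongs to the to_scrape list and, if so, yields that value. Finally, parse iterates over the contents of content and calls itself on each element.
The result can then be written to the file:

with open('file.md', 'w') as f:
    f.write('\n'.join(map(str, result)))

Two things: first, as a general matter, you should paste code rather than a picture of it where possible (in this case, your output). Second, you should include your inner_content (or at least a representative portion of it) in your question so people can see what you are dealing with.

Thanks for pointing that out.

@MapeSVK With your most recent example, shouldn't the order be [h1, h2, p], since that is how the HTML itself is structured?

@Ajax1234 No, it should not. The original order is h1->p->h2. This is just an example. I need to stick to the correct order. This is not a topic about how to structure HTML. It is a topic about the correct order after scraping.

@MapeSVK Is the input always in the form of {something}?

I just changed the line result = list(parse(soup(html, 'html.parser'))) to result = list(parse(inner_content)), because my content is already a soup object. Thanks again!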
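
For comparison, the single-pass idea from the question (walk the content once and keep each element whose tag is in the list) can also be sketched without BeautifulSoup, using only the standard library's html.parser. This is not the method from the answer in this thread. The OrderedTagCollector class and its attribute names are hypothetical, and the sketch collects (tag, text) pairs rather than bs4 objects.

```python
from html.parser import HTMLParser


class OrderedTagCollector(HTMLParser):
    """Hypothetical helper: collect the text of wanted tags in document order."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.open_slots = []   # indexes into results for wanted tags still open
        self.results = []      # [tag, text] entries, appended in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            # Reserve the slot when the tag opens, so order follows the source
            self.results.append([tag, ""])
            self.open_slots.append(len(self.results) - 1)

    def handle_endtag(self, tag):
        # Close the innermost open wanted tag if it matches this end tag
        if self.open_slots and self.results[self.open_slots[-1]][0] == tag:
            self.open_slots.pop()

    def handle_data(self, data):
        # Text is credited to the innermost open wanted tag, if there is one
        if self.open_slots:
            self.results[self.open_slots[-1]][1] += data


html = (
    '<article class="container"> <h1>this is headline 1</h1> '
    '<p>this is paragraph</p> <h2>this is headline 2</h2> '
    '<a href="bla.html">skipped</a> </article>'
)

collector = OrderedTagCollector(['h1', 'h2', 'h3', 'p', 'li'])
collector.feed(html)
collector.close()

ordered = [(tag, text.strip()) for tag, text in collector.results]
print(ordered)
```

Because the parser reports tags as it meets them, the order of the list passed in does not affect the output order. With BeautifulSoup itself, passing the whole list in one call, as in inner_content.find_all(list_of_tags_you_want_to_scrape), should likewise return the matches in document order.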