Putting scraped data into an HTML file (Python 3)


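I am creating a script that will:

1. Scrape data from the news section of a local company's website (news date, news headline, and link to the article).
2. Put that information at specific places in a web page (an HTML file) so that it can be displayed elsewhere. (By reading the file, copying it line by line, and replacing keywords with the scraped data. Maybe this is not the best approach; if so, let me know.)

I have managed to complete the first part (the scraping), but I am having trouble figuring out how to get the data into the HTML file. At the moment I am trying to copy the HTML file while replacing the markers.

Clearly, one (or both) of my method and/or the place where I call it needs to change.

One more thing worth mentioning: there are 10 news headlines, and I want to put all 10 of them into another HTML file (for now this is a very standard HTML table; see below the Python code).

My Python 3 code:

Target HTML document (the places where the data goes are marked @1 @2 @3, for 1: date, 2: headline text, 3: link):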


HTML Table
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3
@1    @2    @3

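If I understand your question correctly:

You want to retrieve the date, headline, and URL of each news item from a news site
You then want to put them into an HTML page as a table

You have already done the web scraping.

I suggest your script do the following:

Initialize an HTML content string containing the header and the start of the table
Do the web scraping
For each news item you find:
- Retrieve the data you are looking for: date, headline, URL
- Append a table row containing this data to the content string
Finish the content string by closing the table and adding the footer
Write the content string to an HTML file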
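
The steps above can be sketched roughly as follows. This is a minimal sketch, not the asker's code: `build_page`, the `news` list, and the output file name `news_table.html` are hypothetical, and the scraping step is replaced by hard-coded sample data so the example runs on its own.

```python
import html


def build_page(news):
    """Build a complete HTML page with one table row per news item.

    `news` is a list of (date, headline, url) tuples; the name and the
    tuple layout are assumptions made for this sketch.
    """
    # Start the content string with the header and the opening of the table.
    parts = ["<!DOCTYPE html>", "<html>", "<body>", "<h2>HTML Table</h2>", "<table>"]

    for date, headline, url in news:
        # Escape the scraped text so stray <, > or & cannot break the markup.
        parts.append(
            '  <tr><td>{}</td><td>{}</td><td><a href="{}">{}</a></td></tr>'.format(
                html.escape(date),
                html.escape(headline),
                html.escape(url, quote=True),
                html.escape(url),
            )
        )

    # Close the table and add the footer.
    parts += ["</table>", "</body>", "</html>"]
    return "\n".join(parts)


# Stand-in for the scraped data (hypothetical values).
news = [
    ("01-02", "First headline", "https://example.com/news/1"),
    ("03-04", "Fish & chips <new>", "https://example.com/news/2"),
]

page = build_page(news)

# Write the finished string to an HTML file.
with open("news_table.html", "w", encoding="utf-8") as out:
    out.write(page)
```

Collecting the pieces in a list and joining them once keeps the loop simple; plain string concatenation would work just as well for ten rows.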

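This can be done in bash or Python, and probably in other languages too.

If your HTML page is more complex, you can also build only the table as a string and place it between HTML header and footer files, or replace a single marker in a complete HTML file (with CSS, etc.) with the whole table.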
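
The single-marker variant might look like the sketch below. The marker text `@TABLE@`, the helper name `fill_template`, and the in-memory template are all assumptions for illustration; in practice the template would be read from the existing HTML file that carries the CSS and layout.

```python
def fill_template(template_text, table_html, marker="@TABLE@"):
    """Return template_text with the single marker replaced by table_html."""
    if marker not in template_text:
        raise ValueError("marker {!r} not found in template".format(marker))
    # Replace only the first occurrence; the template is expected to have one.
    return template_text.replace(marker, table_html, 1)


# A small in-memory template standing in for the real HTML file.
template_text = """<html>
<head><style>table { border-collapse: collapse; }</style></head>
<body>
@TABLE@
</body>
</html>"""

# Hypothetical scraped rows: (date, headline, link).
rows = [("01-02", "First headline", "https://example.com/news/1")]

# Build only the table as a string.
table_html = "<table>\n" + "\n".join(
    "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(*row) for row in rows
) + "\n</table>"

result = fill_template(template_text, table_html)
```

Raising an error when the marker is missing makes a mistyped marker visible immediately instead of silently producing a page without the table.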

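You can use Jinja2 in your project to generate the HTML (if you are not using a framework that already comes with a template engine).

I would refactor your code to: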

#import requirements
import bs4 as bs
import requests
import urllib.request
import re
import jinja2


# set values
the_url = '[redacted]'
base_url = '[redacted]'
html_output_file = "testpagina.html"


# get&interpret html
sauce = urllib.request.urlopen(the_url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# select the unordered list named link-list bigger
section = soup.find('ul', class_='link-list bigger')

# select all list items in $section
subsections = section.find_all('li')

items = []
# for each news item in the news section performs actions described in comments.
for subsection in subsections:

    # selects href from single news headline, concatenates it with base-url.
    news_link = base_url + subsection.a.get('href')

    # selects text part from single news headline, strips empty white spaces before and after
    stripped_text = subsection.text.strip()

    # separates release-date from headline for a single news headline
    news_date = stripped_text[0:5]
    news_headline = stripped_text[5:]

    # create dict to use in template and append
    item = {}
    item['datum'] = news_date
    item['headline'] = news_headline
    item['link'] = news_link
    items.append(item)

jinja2.Template("""
    <!DOCTYPE html>
    <html>
      <head>
      </head>
      <body>
        <h2>HTML Table</h2>
        <table>
        {%for item in items %}
          <tr>
            <td>{{ item.datum }}</td>
            <td>{{ item.headline }}</td>
            <td>{{ item.link }}</td>
          </tr>
        {% endfor %}
        </table>
      </body>
    </html>
    """).stream(items=items).dump('hello.html')


To clarify my changes:

Removed the copyandapt() function (I think you were trying to create the HTML there)

Added the following lines:

import jinja2

items = []

# create dict to use in template and append
item = {}
item['datum'] = news_date
item['headline'] = news_headline
item['link'] = news_link
items.append(item)


jinja2.Template("""
<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <h2>HTML Table</h2>
    <table>
      {%for item in items %}
      <tr>
      <td>{{ item.datum }}</td>
      <td>{{ item.headline }}</td>
      <td>{{ item.link }}</td>
    </tr>
    {% endfor %}
    </table>
  </body>
</html>
""").stream(items=items).dump('hello.html')


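For anyone interested, and for future readers, this is how I eventually got it working (that said, Laurent C. and Henrique Goncalves both gave very valuable and applicable answers! Many thanks to them!)

My final code: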

#import requirements
import bs4 as bs
import requests
import urllib.request
import re


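def ReplaceInFile(counter, tableRow):
    # The first pass reads the untouched template; later passes read the
    # partially filled output so that earlier replacements are kept.
    source = "input.html" if counter == 0 else "outputfile.html"
    with open(source, "rt") as fin:
        data = fin.read()

    # Replace the marker for this row (@0, @1, ...) with the table row.
    data = data.replace('@' + str(counter), tableRow)

    with open("outputfile.html", "wt") as fout:
        fout.write(data)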

# set values
the_url = '[redacted]'
base_url = '[redacted]'
html_output_file = "testpagina.html"


# get&interpret html
sauce = urllib.request.urlopen(the_url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# select the unordered list named link-list bigger
section = soup.find('ul', class_='link-list bigger')

# select all list items in $section
subsections = section.find_all('li')

#create start of table
table_start = "<h2>HTML Table</h2>\n<table>"

# for each news item in the news section performs actions described in comments.
counter = 0
for subsection in subsections:

    # selects href from single news headline, concatenates it with base-url.
    news_link = base_url + subsection.a.get('href')

    # selects text part from single news headline, strips empty white spaces before and after
    stripped_text = subsection.text.strip()

    # separates release-date from headline for a single news headline
    news_date = stripped_text[0:5]
    news_headline = stripped_text[5:]

    #create row for table from single news subject
    table_row = "<tr>\n    <th>"+news_date+"</th>\n    <th>"+news_headline+"</th>\n    <th>"+news_link+"</th>\n  </tr>"

    ReplaceInFile(counter, table_row)
    counter = counter + 1
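
The code above re-reads and rewrites outputfile.html once per news item. A possible alternative, sketched here under the assumption that the markers are `@0`, `@1`, ... as in the code above, is to fill in all markers in memory in a single pass and write the file once. `fill_markers` is a hypothetical helper name, and the template and rows are sample data.

```python
import re


def fill_markers(template_text, rows):
    """Replace each marker @N in template_text with rows[N].

    Markers without a matching row are left untouched.
    """
    def substitute(match):
        index = int(match.group(1))
        return rows[index] if index < len(rows) else match.group(0)

    # \d+ matches the whole number, so @1 and @10 are treated as different markers.
    return re.sub(r"@(\d+)", substitute, template_text)


# Sample template and rows standing in for input.html and the scraped data.
template_text = "<table>\n@0\n@1\n@2\n</table>"
rows = [
    "<tr><td>01-02</td><td>First headline</td></tr>",
    "<tr><td>03-04</td><td>Second headline</td></tr>",
]

filled = fill_markers(template_text, rows)
```

Note that the pattern matches any `@` followed by digits, so a template containing such text for other reasons would need a more distinctive marker.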
