Minor error when printing text to a file with Python


I am writing some text into a csv file with Python. Here is a screenshot of how the data gets written in the file.

You can see that in the "Channel Social Media Links" column, all of the links are written fine in separate cells on the following rows, but the first link is not written in the "Channel Social Media Links" column. How can I write it like that?

Here is my Python script:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myUrl='https://www.youtube.com/user/HolaSoyGerman/about'


uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("h1",{"class":"branded-page-header-title"})

filename="Products2.csv"
f = open(filename,"w")

headers = "Channel Name,Channel Description,Channel Social Media Links\n"

f.write(headers)

channel_name = containers[0].a.text 
print("Channel Name :" + channel_name)

# For About Section Info
aboutUrl='https://www.youtube.com/user/HolaSoyGerman/about'


uClient1 = uReq(aboutUrl)
page_html1 = uClient1.read()
uClient1.close()

page_soup1 = soup(page_html1, "html.parser")

description_div = page_soup.findAll("div",{"class":"about-description branded-page-box-padding"})
channel_description = description_div[0].pre.text
print("Channel Description :" + channel_description)
f.write(channel_name+ "," +channel_description)
links = page_soup.findAll("li",{"class":"channel-links-item"})
for link in links: 
    social_media = link.a.get("href")
    f.write(","+","+social_media+"\n")
f.close()

It would help if you used Python's csv library when writing to the file. It can convert a list of items into correctly comma-separated values:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

myUrl = 'https://www.youtube.com/user/HolaSoyGerman/about'

uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("h1",{"class":"branded-page-header-title"})
filename = "Products2.csv"

with open(filename, "w", newline='') as f:
    csv_output = csv.writer(f)
    headers = ["Channel Name", "Channel Description", "Channel Social Media Links"]
    csv_output.writerow(headers)

    channel_name = containers[0].a.text 
    print("Channel Name :" + channel_name)

    # For About Section Info
    aboutUrl = 'https://www.youtube.com/user/HolaSoyGerman/about'

    uClient1 = uReq(aboutUrl)
    page_html1 = uClient1.read()
    uClient1.close()

    page_soup1 = soup(page_html1, "html.parser")

    description_div = page_soup.findAll("div",{"class":"about-description branded-page-box-padding"})
    channel_description = description_div[0].pre.text
    print("Channel Description :" + channel_description)

    links = [link.a.get('href') for link in page_soup.findAll("li",{"class":"channel-links-item"})]
    csv_output.writerow([channel_name, channel_description, links[0]])

    for link in links[1:]:
        csv_output.writerow(['', '', link])
This will give you a single row with each href in the last column, for example:

Channel Name,Channel Description,Channel Social Media Links
HolaSoyGerman.,Los Hombres De Verdad Usan Pantuflas De Perrito,http://www.twitter.com/germangarmendia
,,http://instagram.com/germanchelo
,,http://www.youtube.com/juegagerman
,,http://www.youtube.com/juegagerman
,,http://www.twitter.com/germangarmendia
,,http://instagram.com/germanchelo
,,https://plus.google.com/108460714456031131326
Each writerow() call writes the list of values to the file as comma-separated values and automatically adds a newline at the end for you. All that is needed is to build the list of values for each row. First, take the first link and make it the last value in the row, after the channel description. Secondly, write a row for each remaining link, with blank values for the first two columns.
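As a quick illustration of that behaviour (using io.StringIO as a stand-in for the real file, with made-up values), csv.writer also quotes any value that itself contains a comma, which plain write() calls would not do:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# One row: the first value contains a comma, so csv quotes it automatically.
writer.writerow(["Channel, with comma", "Description", "http://example.com"])

print(buf.getvalue())
# → "Channel, with comma",Description,http://example.com
```

This is why building a list per row and letting writerow() handle the separators is safer than concatenating commas by hand.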


To answer your comment, start with something like the following:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

def get_data(url, csv_output):

    if not url.endswith('/about'):
        url += '/about'

    print("URL: {}".format(url))
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("h1", {"class":"branded-page-header-title"})

    channel_name = containers[0].a.text 
    print("Channel Name :" + channel_name)

    description_div = page_soup.findAll("div", {"class":"about-description branded-page-box-padding"})
    channel_description = description_div[0].pre.text
    print("Channel Description :" + channel_description)

    links = [link.a.get('href') for link in page_soup.findAll("li", {"class":"channel-links-item"})]
    csv_output.writerow([channel_name, channel_description, links[0]])

    for link in links[1:]:
        csv_output.writerow(['', '', link])

    #TODO - get list of links for the related channels

    return related_links


my_url = 'https://www.youtube.com/user/HolaSoyGerman'
filename = "Products2.csv"

with open(filename, "w", newline='') as f:
    csv_output = csv.writer(f)
    headers = ["Channel Name", "Channel Description", "Channel Social Media Links"]
    csv_output.writerow(headers)

    for _ in range(5):
        next_links = get_data(my_url, csv_output)
        my_url = next_links[0]      # e.g. follow the first of the related links
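Filling in the #TODO would need a selector for the related-channels list. As a rough sketch only (the class name branded-page-related-channels-item and the sample markup below are assumptions for illustration; check the actual page markup in your browser), the extraction could mirror the channel-links code:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the related-channels section of the
# about page; the real class names must be verified against the live page.
sample_html = """
<ul>
  <li class="branded-page-related-channels-item">
    <a href="https://www.youtube.com/user/juegagerman"></a>
  </li>
  <li class="branded-page-related-channels-item">
    <a href="https://www.youtube.com/user/somechannel"></a>
  </li>
</ul>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")

# Same pattern as the channel-links extraction in get_data().
related_links = [item.a.get("href")
                 for item in page_soup.findAll(
                     "li", {"class": "branded-page-related-channels-item"})]
print(related_links)
```

With something like this inside get_data(), the return related_links line has a value to return and the loop above can follow the first related channel each time.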

f.write(channel_name+ "," +channel_description) is not followed by a newline, so of course the first link continues on the same line. Also note that ","+"," == ",,", and the csv module supports writing from a sequence rather than you adding the commas yourself.

So how can I achieve this? Please give an example. I do not want the social media links separated by commas; I want all of the links written in new cells on the following rows of the Social Media Links column. I am using web scraping to get information from a YouTube channel and save it in a csv file. Now, once any YouTube channel's information has been fetched into the csv file, I want the URL of the first channel from its "Related channels" section to be stored in a variable automatically, and then the whole process should run again, five times in total. How can I do that?

I think that is an entirely different question. You would need to explain it better with an example. If my answer solved your first problem, I suggest you accept the solution (click the grey tick under the up/down buttons) and then start a more detailed second question with more details.

I want to explain my script flow: 1) get information (name, social media links, description) from a YouTube channel (the first time, I hard-code the channel's URL); 2) get another channel URL from the "Related channels" section, repeat step 1 again, and repeat these steps 5 times. I have done the first step with the hard-coded URL, but now I want the second step. How can I do it?

I will give it a try in the next day or so.
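The missing newline described in that comment can be reproduced in isolation (io.StringIO stands in for the real file, and the name/description values are made up):

```python
import io

f = io.StringIO()

# As in the original script: the description write has no trailing "\n"...
f.write("HolaSoyGerman." + "," + "Some description")

# ...so the first link's write (note ","+"," is just ",,") lands on the
# same line instead of in the third column of a new row.
f.write("," + "," + "http://www.twitter.com/germangarmendia" + "\n")

print(f.getvalue())
# → HolaSoyGerman.,Some description,,http://www.twitter.com/germangarmendia
```

That single run-on line, with two empty columns in the middle, is exactly what the screenshot in the question shows, and why switching to csv.writer with one list per row fixes it.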