Scraping text from multiple web pages in Python


I've been tasked with scraping all the text from every web page on one of our client's hosts. I've managed to write a script that scrapes the text from a single web page, and you can manually swap out the URL in the code each time you want to scrape a different page, but obviously that's very inefficient. Ideally, I'd like Python to work from a list containing all the URLs I need, iterate over it, and write all the scraped text into a single CSV. I tried to write a "test" version of this by making a list two URLs long and having my code scrape both of them. However, as you can see, my code only keeps the text from the most recent URL in the list and doesn't retain the first page it scraped. I think that's because of a flaw in my print statement, since it keeps overwriting itself. Is there a way to hold everything I scrape somewhere until the loop has gone through the whole list, and then write it all out at once?

Feel free to throw out my code entirely. I know nothing about computer languages; I just keep getting assigned these tasks and do the best I can with Google.

import urllib.request
import re
from bs4 import BeautifulSoup

data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv'
urlTable = ['url1','url2']

def extractText(string):
    page = urllib.request.urlopen(string)
    soup = BeautifulSoup(page, 'html.parser')

##Extracts all paragraph and header variables from URL as GroupObjects
    text = soup.find_all("p")
    headers1 = soup.find_all("h1")
    headers2 = soup.find_all("h2")
    headers3 = soup.find_all("h3")

##Forces GroupObjects into str
    text = str(text)
    headers1 = str(headers1)
    headers2 = str(headers2)
    headers3 = str(headers3)

##Strips HTML tags and brackets from extracted strings
    text = text.strip('[')
    text = text.strip(']')
    text = re.sub('<[^<]+?>', '', text)

    headers1 = headers1.strip('[')
    headers1 = headers1.strip(']')
    headers1 = re.sub('<[^<]+?>', '', headers1)

    headers2 = headers2.strip('[')
    headers2 = headers2.strip(']')
    headers2 = re.sub('<[^<]+?>', '', headers2)

    headers3 = headers3.strip('[')
    headers3 = headers3.strip(']')
    headers3 = re.sub('<[^<]+?>', '', headers3)

##Writes the scraped text to the CSV ('w' truncates the file on each call)
    print_to_file = open(data_file_name, 'w', encoding='utf')
    print_to_file.write(text + headers1 + headers2 + headers3)
    print_to_file.close()


for i in urlTable:
    extractText(i)
Try this: 'w' opens the file with the stream positioned at the beginning of the file. You want the stream positioned at the end of the file:

print_to_file = open(data_file_name, 'a', encoding='utf')

Here are all the different read/write modes for future reference:

The argument mode points to a string beginning with one of the following
 sequences (Additional characters may follow these sequences.):

 ``r''   Open text file for reading.  The stream is positioned at the
         beginning of the file.

 ``r+''  Open for reading and writing.  The stream is positioned at the
         beginning of the file.

 ``w''   Truncate file to zero length or create text file for writing.
         The stream is positioned at the beginning of the file.

 ``w+''  Open for reading and writing.  The file is created if it does not
         exist, otherwise it is truncated.  The stream is positioned at
         the beginning of the file.

 ``a''   Open for writing.  The file is created if it does not exist.  The
         stream is positioned at the end of the file.  Subsequent writes
         to the file will always end up at the then current end of file,
         irrespective of any intervening fseek(3) or similar.

 ``a+''  Open for reading and writing.  The file is created if it does not
         exist.  The stream is positioned at the end of the file.  Subse-
         quent writes to the file will always end up at the then current
         end of file, irrespective of any intervening fseek(3) or similar.

Thank you so much! This is exactly what I was looking for. I imagine once I get a real list of URLs from the client, I can apply the same principle. Thanks again!