How to crawl multiple websites to find common words (BeautifulSoup, Requests, Python3)

python pandas


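I would like to know how to scrape several different websites with Beautiful Soup and Requests without repeating my code over and over.

Here is my current code: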

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content)
texts = soup.findAll(text=True)
a = Counter([x.lower() for y in texts for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)

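Ideally, I want to scrape 5 different websites, find all the words on them, find the frequency of each word on each site, add up the frequencies for each word across the sites, and combine everything into a single DataFrame that can be exported with Pandas.

The desired output looks like this: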

Word     Frequency
the       200
man       300
is        400
tired     300

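Right now my code only does this for one website at a time, and I am trying to avoid repeating it.

I could do this by hand: repeat the code for each individual website, then concatenate the resulting DataFrames. That seems very unpythonic, though. Does anyone have a faster way or any suggestions? Thanks, everyone!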

Just loop over the URLs and update a main Counter dict:

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

main_c = Counter()  # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards","http://stackoverflow.com/questions/tagged/python"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.find_all(string=True)
    # count the words on this page
    a = Counter(x.lower() for y in texts for x in y.split())
    # pass the Counter itself (not a.most_common()) so the counts are summed
    main_c.update(a)
make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)

Unlike the ordinary dict.update method, Counter's update method adds to the existing values instead of replacing them.
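To make the difference concrete, here is a minimal sketch with made-up counts (page_one and page_two are hypothetical stand-ins for two scraped pages). It also shows a pitfall worth checking for: handing update() the list returned by most_common() counts each (word, count) tuple as a key, rather than adding the counts.

```python
from collections import Counter

# Hypothetical per-page word counts.
page_one = Counter({"the": 3, "man": 1})
page_two = Counter({"the": 2, "tired": 4})

# Counter.update adds the incoming counts to the existing ones.
total = Counter()
total.update(page_one)
total.update(page_two)

# dict.update overwrites the value stored under a repeated key.
plain = {}
plain.update(page_one)
plain.update(page_two)

# Pitfall: most_common() returns (word, count) tuples. Passed to update(),
# each tuple is counted once as a key and the counts are never summed.
wrong = Counter()
wrong.update(page_one.most_common())
wrong.update(page_two.most_common())

print(total)
print(plain)
print(wrong)
```

If the final frame shows one row per word per site instead of a single summed row, this pitfall is a likely cause.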

As a style note, use lowercase letters with underscores for variable names, e.g. make_a_frame.

Make a function:

import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

cnt = Counter()  # running total across all websites

def get_data(url):
    # download one page and add its word counts to the running total
    website = requests.get(url)
    soup = BeautifulSoup(website.content, "html.parser")
    texts = soup.find_all(string=True)
    a = Counter(x.lower() for y in texts for x in y.split())
    cnt.update(a)  # pass the Counter itself so the counts are summed

websites = ['http://www.nerdwallet.com/the-best-credit-cards','http://www.other.com']
for url in websites:
    get_data(url)

makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
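A possible variation on the function approach, offered as a sketch rather than as either answerer's method: return the counts instead of updating a global, and accept the page fetcher as a parameter so the counting logic can be tried without network access. The names fetch_text, count_words, count_words_across and the example.com pages are all hypothetical. Using get_text() in place of find_all(string=True) is an assumption about what counts as a word source.

```python
from collections import Counter


def fetch_text(url):
    """Download a page and return its text (requires requests and bs4)."""
    # Imported here so the counting helpers below work without these packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.get_text(separator=" ")


def count_words(text):
    """Return a Counter of lower-cased, whitespace-separated words."""
    return Counter(word.lower() for word in text.split())


def count_words_across(urls, fetch=fetch_text):
    """Sum the word counts of every URL into a single Counter."""
    total = Counter()
    for url in urls:
        total.update(count_words(fetch(url)))
    return total


# Offline demonstration with a stand-in fetcher and made-up page contents.
fake_pages = {
    "http://example.com/a": "The man is tired",
    "http://example.com/b": "the man sleeps",
}
totals = count_words_across(fake_pages, fetch=fake_pages.get)
print(totals.most_common())
```

With real sites, calling count_words_across(urls) uses fetch_text by default, and pd.DataFrame(totals.most_common(), columns=['Words', 'Frequency']) yields the two-column frame the question asks for.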

Just turn your code into a function that takes a URL as input? Then you won't need to repeat the code... Perhaps add a ... somewhere.

This will create a separate DataFrame for each call; the OP wants to merge them into one.

Hi Vizjerei, although this code lets me crawl multiple websites, it doesn't combine the data the way I need. What I am hoping for is to add up all the frequencies across the websites and produce two columns: column A with the words and column B with the total frequencies. I edited my original post to hopefully make this clearer. Thanks for your time!

Hi Padraic, although this code lets me crawl multiple websites, it doesn't combine the data the way I need. What I am hoping for is to add up all the frequencies across the websites and produce two columns: column A with the words and column B with the total frequencies. I edited my original post to hopefully make this clearer. Thanks for your time!

Unfortunately, I am still running into the same problem. It sorts the words from most frequent to least frequent.

Padraic, the problem with the edited code seems to be that the values in the "Frequency" column don't change no matter how many URLs I enter. It looks like it is only scraping the first page?

Did you find the problem?

What happens is that the code scrapes every page, but instead of adding the data together it creates a duplicate of each word for each website (i.e. -250, -400, -300). From there, it is just a matter of grouping the data to find the answer. Thanks for your help!

I have edited the code above.
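The "grouping the data" step mentioned in the comments could look roughly like the sketch below, which assumes one DataFrame per site with the same Words and Frequency columns used in the question. The three frames and their numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical per-site results, shaped like the frames in the question.
site_a = pd.DataFrame({"Words": ["the", "man", "is"], "Frequency": [250, 100, 40]})
site_b = pd.DataFrame({"Words": ["the", "man", "tired"], "Frequency": [400, 200, 30]})
site_c = pd.DataFrame({"Words": ["the", "tired"], "Frequency": [300, 5]})

# Stack the frames, then sum the frequencies of rows that share a word.
combined = pd.concat([site_a, site_b, site_c], ignore_index=True)
totals = (
    combined.groupby("Words", as_index=False)["Frequency"]
    .sum()
    .sort_values("Frequency", ascending=False)
    .reset_index(drop=True)
)
print(totals)
```

The summed frame can then be exported with, for example, totals.to_csv("word_frequencies.csv", index=False); the file name is only a placeholder. Summing the Counters before building any frame avoids the grouping step entirely.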