Python: fixing an encoding error in a BeautifulSoup4 loop
Tags: python, twitter, web-scraping, beautifulsoup

This is a follow-up to two earlier questions. I did not use the Twitter API because it does not let you view tweets by hashtag that far back.

Edit: The error described here occurs only on Windows 7. As bernie reported, the code runs as expected on Linux (see the comments below), and I was able to run it on OS X 10.10.2 without the encoding error.

The encoding error appears when I try to loop the code that scrapes the tweet content. This first snippet grabs only the first tweet and, as expected, gets everything in the tag:
amessagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
amessage = amessagetext[0]
However, when I try to scrape all of the tweets with a loop in this second snippet:
messagetexts = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
messages = [messagetext for messagetext in messagetexts]
I get this well-known cp437.py encoding error:
File "C:\Anaconda3\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 4052: character maps to <undefined>
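The traceback above can be reproduced in isolation. The em dash (U+2014) that appears in tweet text has no mapping in code page 437, the legacy console encoding that print() falls back to on some Windows setups, while UTF-8 handles it fine. A minimal sketch (the sample string is hypothetical, not from the scraped data):

```python
# The em dash (U+2014) is not representable in code page 437,
# so encoding to cp437 -- which print() effectively does on a
# cp437 console -- raises the UnicodeEncodeError shown above.
text = "Bangkok \u2014 breaking news"

try:
    text.encode("cp437")
except UnicodeEncodeError as e:
    # e.object[e.start] is the offending character
    print("cp437 cannot encode:", repr(e.object[e.start]))

# UTF-8 encodes the same string without trouble
print(text.encode("utf-8"))
```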
I solved this by adding encoding="utf-8" to both open() calls, specifying the encoding for the HTML file being scraped and for the CSV output file, and by removing the print statements I had been using for error checking:
from bs4 import BeautifulSoup
import requests
import sys
import csv
import re
from datetime import datetime
from pytz import timezone
url = input("Enter the name of the file to be scraped:")
with open(url, encoding="utf-8") as infile:
    soup = BeautifulSoup(infile, "html.parser")
#url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
#headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#r = requests.get(url, headers=headers)
#data = r.text.encode('utf-8')
#soup = BeautifulSoup(data, "html.parser")
names = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
usernames = [name.contents for name in names]
handles = soup('span', {'class': 'username js-action-profile-name'})
userhandles = [handle.contents[1].contents[0] for handle in handles]
athandles = [('@')+abhandle for abhandle in userhandles]
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
fullurls = [permalink for permalink in urls]
timestamps = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
datetime = [timestamp["title"] for timestamp in timestamps]  # note: shadows the datetime import above
messagetexts = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
messages = [messagetext for messagetext in messagetexts]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcounts = [retweet.contents[3].contents[1].contents[1].string for retweet in retweets]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcounts = [favorite.contents[3].contents[1].contents[1].string for favorite in favorites]
images = soup('div', {'class': 'content'})
imagelinks = [src.contents[5].img if len(src.contents) > 5 else "No image" for src in images]
#print (usernames, "\n", "\n", athandles, "\n", "\n", fullurls, "\n", "\n", datetime, "\n", "\n",retweetcounts, "\n", "\n", favcounts, "\n", "\n", messages, "\n", "\n", imagelinks)
rows = zip(usernames,athandles,fullurls,datetime,retweetcounts,favcounts,messages,imagelinks)
rownew = list(rows)
#print (rownew)
newfile = input("Enter a filename for the table:") + ".csv"
with open(newfile, 'w', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter=",")
    writer.writerow(['Usernames', 'Handles', 'Urls', 'Timestamp', 'Retweets', 'Favorites', 'Message', 'Image Link'])
    for row in rownew:
        writer.writerow(row)
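If the print statements for error checking need to stay, another option is to encode defensively rather than remove them. A minimal sketch (the helper name and sample string are hypothetical, not from the original post):

```python
def safe_text(s, encoding="cp437"):
    """Replace characters the console encoding cannot represent with '?',
    so print() never raises UnicodeEncodeError on a legacy code page."""
    return s.encode(encoding, errors="replace").decode(encoding)

# The em dash becomes '?' instead of crashing the script
print(safe_text("Bangkok \u2014 breaking news"))
```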
Comments:

- Nicely put together question. Upvoted. FWIW, I cannot reproduce the error on Linux (Ubuntu 14.04).
- Thanks for the info. I'm using Console2 on Windows 7, in case that helps.
- That may help, thanks for the tip. I'll look into your setup.
- After reading bernie's comment, I tried it on a laptop running OS X 10.10.2 and got the results I wanted with no encoding error. I'll leave the question open, though, since I'm interested in a fix for Windows. I've edited the original post to include this information.
- I just tried this again on Windows 7 and still cannot reproduce the error. Interesting. Hopefully someone else can come along and reproduce it.