Python: Unicode encoding conflict

python python-3.x unicode web-crawler

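Update: I tried including the full path in the crontab job, but the same problem happened again. I only have trouble with the one article whose title contains the Latin character "ë" ("Moët").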

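I am new to Python 3 and need help with a problem related to a "Unicode encoding conflict".

I am building a web scraper that saves online articles locally.

What I want to do is:

1. Use BeautifulSoup to get the article title.
2. Check whether the article title is already in the list of locally saved articles.
3. If the title matches, print "file exists" and do nothing.
4. If the title does not match, capture the article content and generate a .txt file.

The code is below: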

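import codecs
import os

from bs4 import BeautifulSoup

# Fragment of a scraper method; self.request and articles_URL are defined elsewhere.
article_html = self.request(articles_URL)
soup = BeautifulSoup(article_html.text, 'html.parser')
title_modify = soup.title.string
title_real = title_modify + '.txt'
current_path = os.getcwd()
# Names of the files already saved in the working directory.
article_names = os.listdir(current_path)
if title_real in article_names:
    print(title_real, 'exists, no need to re-create')
else:
    # (code for capturing the article content omitted)
    with codecs.open(title_real, "a", encoding='utf-8') as f:
        f.write(XXX)  # XXX stands for the captured article text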

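I then scheduled it as a crontab job on CentOS 7 so that it runs automatically. Every day it checks the same web URL and tries to capture new articles as .txt files.

It works fine, but today I noticed that it fails for article titles containing non-ASCII Latin characters. Ideally the program would print "file exists" and move on to the next article. Instead, the directory listing shows that it created duplicate files: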

Aug 26 09:50 XXX with Moët XXX.txt

Aug 27 09:29 XXX with Moët XXX (Unicode Encoding Conflict (1)).txt

Aug 26 20:30 XXX with Moët xxx (Unicode Encoding Conflict).txt

The strange thing is that when I run the Python script manually, it works fine:

python test.py

XXX with Moët XXX.txt exists, no need to re-create

I would be grateful if anyone could help.

Cook

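Crontab most likely runs with a stripped-down environment, which can cause unexpected behavior. The following will most likely solve your problem.

Basically, you need to provide the full path to the python executable, which you can get by running

which python

Your crontab entry will then look like this: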

20 4 * * * your_python_path your_program_path.py
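
To check whether the environment really differs between cron and an interactive shell, a small diagnostic can be run both ways and the output compared. This is only a sketch: the function name `describe_environment` is made up for illustration, and the idea that locale variables such as `LANG` are unset under cron is an assumption, not something confirmed in this thread. If they are unset, Python may pick a different encoding for filenames, which could explain why `os.listdir` results stop matching the scraped title.

```python
import locale
import os
import sys


def describe_environment():
    """Collect the settings that influence how Python encodes text and filenames."""
    return {
        "executable": sys.executable,                    # which interpreter is running
        "LANG": os.environ.get("LANG"),                  # None if the variable is unset
        "LC_ALL": os.environ.get("LC_ALL"),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, "=", value)
```

Run it once from the shell and once from a temporary crontab entry that redirects the output to a file. If the encodings differ, setting the locale variables at the top of the crontab would be the next thing to try.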

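What are the actual bytes in the string? My money is on a Unicode normalization problem. This may be a duplicate.

Thank you very much for all the help :)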
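
The normalization hypothesis from the comment above can be illustrated as follows. This is a sketch under the assumption that the scraped title and the name on disk use different Unicode forms of "ë"; the thread does not confirm this, and the helper names `nfc` and `already_saved` are hypothetical. It prints the UTF-8 bytes of both spellings and then compares names after normalizing them to NFC.

```python
import unicodedata


def nfc(name):
    """Return the NFC (composed) form of a string."""
    return unicodedata.normalize("NFC", name)


# Two visually identical spellings of the same file name.
composed = "Mo\u00ebt.txt"     # 'ë' as a single code point, U+00EB
decomposed = "Moe\u0308t.txt"  # 'e' followed by U+0308 COMBINING DIAERESIS

print(composed == decomposed)            # False
print(composed.encode("utf-8"))          # b'Mo\xc3\xabt.txt'
print(decomposed.encode("utf-8"))        # b'Moe\xcc\x88t.txt'
print(nfc(composed) == nfc(decomposed))  # True


def already_saved(title, existing_names):
    """Check for a title among existing names, ignoring normalization differences."""
    wanted = nfc(title)
    return any(nfc(name) == wanted for name in existing_names)
```

In the scraper, `already_saved(title_real, os.listdir(current_path))` could stand in for the plain `in` test, and writing the file under `nfc(title_real)` would keep the names on disk in one consistent form.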