Reading a text file containing English and Arabic text with Python

python python-3.x encoding


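I'm trying to read a text file that contains publicly posted Instagram images and their metadata. Each line holds one complete post with all of its metadata. Part of the text is written in Arabic. When I read the file with Python and print the lines, the Arabic text does not show up. Instead it appears as escape sequences such as \xd9\x8a\xd8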

This is the code I use to read the .txt file

import codecs

test_file = codecs.open('instagram_info.txt', mode='r', encoding='utf-8')
print("reading  images URLs file")
counter = 0
for line in test_file:
    print("Line: ", line.encode("utf-8"))
    counter += 1
    print(counter)
    if counter == 50:
        break
test_file.close()

This is a sample line from the text file

100158441   25.256887893    51.507485363    Centerpoint 4f09c7a6e4b090ef234993e3               http://scontent.cdninstagram.com/hphotos-xpa1/outbound-distilleryimage9/t0.0-17/OBPTH/9ecde7ecac7811e3b87a12bcaa646ac5_8.jpg sarrah80    25.256887893    51.507485363    2014-03-15 19:37:45 1394912265  16144       ولا راضي يوقف يم الارنوب عشان اصوره dody_nasser said "هههه اكيد خايف الجبان 
Python 3 supports Unicode natively. You do not need codecs.open; the built-in open will work.
.encode is what's causing it to turn into this: \xd9\x8a\xd8 . You can remove that function call: print("Line: ", line)
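To make the comment above concrete, here is a minimal sketch of why the escapes appear. It assumes a short Arabic sample taken from the example line; the variable names are only illustrative. Calling .encode() turns the str into a bytes object, and printing a bytes object shows its repr with \x escapes.

```python
# Short Arabic sample copied from the example line in the question.
text = "ولا راضي"

# .encode() returns a bytes object, not text.
encoded = text.encode("utf-8")

# Printing bytes shows its repr, with every non-ASCII byte escaped as \x..
print(type(encoded).__name__)
print(repr(encoded))

# Decoding gives back the original str unchanged.
round_trip = encoded.decode("utf-8")
print(round_trip == text)
```

So the file was read correctly; the escapes come from printing the encoded bytes rather than the decoded line.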
The problem is not with reading the text. The problem is with print(). Your console may not be capable of displaying the Unicode text. Try writing the result to a file and look inside it using a Unicode-capable text editor.
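As a rough way to see whether the console is the limiting factor, the sketch below reports the encoding print() writes with and checks whether a sample string can be encoded in a given code page. The helper name can_encode is hypothetical, and cp1252 is only an assumed example of a legacy Windows code page. Being encodable does not guarantee the console font can draw the glyphs.

```python
import sys

# The encoding print() uses for this console or pipe (may differ per machine).
print("stdout encoding:", sys.stdout.encoding)


def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "ولا راضي"  # Arabic sample from the question's example line

# cp1252 is an assumed example of a legacy code page without Arabic letters.
print("cp1252:", can_encode(sample, "cp1252"))
print("utf-8:", can_encode(sample, "utf-8"))
```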

First, follow NightShadeQueen's suggestions. Then try copying the lines to another file to check:

#!python3
with open('instagram_info.txt', mode='r', encoding='utf-8') as fin, \
     open('output.txt', 'w', encoding='utf-8') as fout:
    for n, line in enumerate(fin, 1):
        fout.write(line)
        if n == 50:
            break

#!python3
with open('instagram_info.txt', encoding='utf-8') as f:
    for n, line in enumerate(f, 1):
        print(line, end='')
        if n == 50:
            break

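Learn the with construct, which closes the file objects automatically. enumerate() counts the lines for you. With the first example and the sample stored in instagram_info.txt as UTF-8, you should get the same content in output.txt (the first 50 lines).

Then try the second example, which uses print(), in the same situation. Notice the end='' in print: it suppresses the automatically added newline, because the newline is already part of the line.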
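The effect of end='' can be checked without a real file. This is a small sketch that assumes two made-up lines and captures what print() writes into a string buffer:

```python
import io
from contextlib import redirect_stdout

# Made-up lines; each keeps its trailing newline, as lines read from a file do.
lines = ["first\n", "second\n"]

# With end='' print() adds nothing, so the output matches the input lines.
buf = io.StringIO()
with redirect_stdout(buf):
    for line in lines:
        print(line, end='')
single_spaced = buf.getvalue()

# Without end='' print() appends its own newline after each line.
buf = io.StringIO()
with redirect_stdout(buf):
    for line in lines:
        print(line)
double_spaced = buf.getvalue()

print(repr(single_spaced))
print(repr(double_spaced))
```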
If you are on Windows, go to the cmd window and try running

c:\...\>chcp 65001

Then run the Python script again. The console may still be unable to display all the characters (consoles are rather dumb). It may be easier to display the text in some Python GUI window.

Don't encode the lines; print the Unicode text directly:

#!/usr/bin/env python3
from itertools import islice

with open('instagram_info.txt', encoding='utf-8-sig') as file:
    print("reading  images URLs file")
    for line in islice(file, 50): # read no more than 50 lines from the file
        print("Line: ", line, end='')
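A file saved with a byte order mark starts with the character '\ufeff' when it is decoded as plain utf-8, which is one plausible reason to open it with utf-8-sig instead. The sketch below assumes a hypothetical temporary file, bom_sample.txt, written with a BOM, and compares the two codecs. It prints with ascii() so the result is readable on any console.

```python
import os
import tempfile

# Hypothetical sample file: UTF-8 bytes preceded by a byte order mark.
path = os.path.join(tempfile.mkdtemp(), "bom_sample.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + "Centerpoint ولا\n".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the first line.
with open(path, encoding="utf-8") as f:
    with_bom = f.readline()

# utf-8-sig strips the BOM if it is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.readline()

# ascii() escapes non-ASCII characters, so this prints safely anywhere.
print(ascii(with_bom))
print(ascii(without_bom))
```

utf-8-sig also reads files that have no BOM, so it is a reasonable default when the origin of the file is unknown.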
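I tried your suggestion @NightShadeQueen, but it gives another error, see below: return codecs.charmap_encode(input, self.errors, encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

Interesting characters. Are you sure the input is UTF-8 and not UTF-16?

Yes, the text file is encoded as UTF-8 @NightShadeQueen.

Don't use chcp 65001. To print arbitrary text to the Windows console,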