Python: I am trying to tokenize a text/JSON file. I don't know why, but only the first tweet gets tokenized. The script defines the regex-based tokenize() and preprocess() functions (the full listing is reproduced in the answer below) and then reads the file like this:

with open('../script/iphone.txt', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        # do_something_else(tokens)
        print(json.dumps(tokens, indent=4))

From the traceback, you can see that the second line in ./script/iphone.txt is not valid JSON. That is why the code fails.
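To confirm which lines are the problem, a small check like the one below may help. It is only a sketch: find_invalid_lines and the sample list are hypothetical names made up for illustration, and the sample stands in for the lines of the real file.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, error_message) for every line that is not valid JSON."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            # Record the 1-based line number and the parser's message.
            bad.append((number, str(exc)))
    return bad


# Hypothetical sample: two tweets separated by an empty line.
sample = ['{"text": "first tweet"}\n', '\n', '{"text": "second tweet"}\n']
problems = find_invalid_lines(sample)
print(problems)
```

With a real file you could pass the open file object instead of the list, since iterating over it also yields one line at a time.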

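I ran into the same problem because there was an empty line in the JSON file, and an empty line cannot be parsed as JSON. Try adding:

newline='\r\n'

to the open() call, so the code that reads the JSON file looks like this: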

from nltk.tokenize import word_tokenize
import json
import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
    ]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)


def tokenize(s):
    return tokens_re.findall(s)


def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens


#tweet = "RT @marcobonzanini: just an example! :D http://example.com #NLP"
#print(preprocess(tweet))
# ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

with open('../script/iphone.txt', 'r', newline='\r\n') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        #do_something_else(tokens)
        print(json.dumps(tokens, indent=4))

Hope this helps.

I also tried the same code on a JSON file on Windows and got the same error. I think on Linux you only need '\n':

with open('data/stream_sample.json', 'r', newline='\r\n') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        print(tokens)
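If you would rather not depend on the platform's line endings, another option is to skip blank lines before parsing. This is a minimal sketch, not the code from the question: iter_tweets and the sample list are hypothetical, and it assumes every non-blank line holds one complete JSON object.

```python
import json


def iter_tweets(lines):
    """Yield one parsed tweet per non-blank line."""
    for line in lines:
        line = line.strip()  # drops '\r', '\n' and surrounding spaces
        if not line:
            continue  # an empty line is not valid JSON, so skip it
        yield json.loads(line)


# Hypothetical sample mixing Windows and Unix line endings.
sample = ['{"text": "first tweet"}\r\n', '\r\n', '{"text": "second tweet"}\n']
texts = [tweet['text'] for tweet in iter_tweets(sample)]
print(texts)
```

In the original loop you would pass the open file object f to iter_tweets and call preprocess(tweet['text']) on each result.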