How to use JSON-to-CSV logic in Python

I have removed stopwords from the headlines of different newspapers, keeping only the headline words, the date, and the newspaper source, and I generate a CSV from a JSON file. I use the following code:
import json
import os
import nltk
import csv
# Download nltk packages used in this example
nltk.download('stopwords')
BLOG_DATA = "resources/ch05-webpages/newspapers/timesofindia.json"
blog_data = json.loads(open(BLOG_DATA).read())
blog_posts = []
stop_words = nltk.corpus.stopwords.words('english') + [
'.',
',',
'--',
'\'s',
'?',
')',
'(',
':',
'\'',
'\'re',
'"',
'-',
'}',
'{',
u'—',
'a', 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after'
]
for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['title'])
    words = [w.lower() for sentence in sentences for w in
             nltk.tokenize.word_tokenize(sentence)]
    fdist = nltk.FreqDist(words)
    sentence1 = nltk.tokenize.sent_tokenize(post['date'])
    source = nltk.tokenize.sent_tokenize(post['source'])
    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())
    # Hapaxes are words that appear only once
    num_hapaxes = len(fdist.hapaxes())
    top_10_words_sans_stop_words = [w for w in fdist.items() if w[0]
                                    not in stop_words][:100]
    t = ['%s (%s)' % (w[0], w[1]) for w in top_10_words_sans_stop_words]
    #print t
    blog_posts.append((t, sentence1, source))
print blog_posts
out_file = os.path.join('resources', 'ch05-webpages','stopwords','timesofindia3.csv')
f = open(out_file, 'wb')
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
#f.write(json.dumps(blog_posts, indent=1))
wr.writerows(blog_posts)
f.close()
print 'Wrote output file to %s' % (f.name, )
This produces output like this:
"[u'3 (1)', u'6 (1)', u'acquitted (1)', u'case (1)', u'convicted (1)', u'kandhamal (1)', u'nun (1)', u'rape (1)']",
but I want the CSV to look like this:
3 (1), 6 (1), acquitted (1), case (1), convicted (1), kandhamal (1), nun (1), rape (1)
How can I achieve that?

If you want one column for each item in
t
plus a column for sentence1
and one for source
, change
blog_posts.append((t, sentence1, source))
so that it appends a single flat list instead of a nested tuple.
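A minimal sketch of that flattened-row idea, using made-up placeholder values standing in for the real t, sentence1, and source lists:

```python
# Hypothetical stand-ins for one post's processed data;
# t, sentence1 and source are all lists in the original code.
t = ['3 (1)', '6 (1)', 'acquitted (1)']
sentence1 = ['2014-03-19']
source = ['Times of India']

# Concatenating the lists yields one flat row: each headline word
# gets its own column, followed by the date and the source.
row = t + sentence1 + source
print(row)
```

csv.writer then writes each element of the flat list into its own cell, instead of stringifying the nested tuple.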
To avoid Unicode errors, encode the Unicode strings to a suitable encoding before passing them to
csv.writer
, for example:
t = [(u'%s (%s)' % (w[0], w[1])).encode("utf8") for w in top_10_words_sans_stop_words]
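As a sketch, the same idea can be wrapped in a small hypothetical helper that encodes every cell of a row before writing (the isinstance check is an assumption so the snippet also runs unchanged on Python 3, where encoding produces bytes):

```python
# Hypothetical helper: encode each Unicode-string cell of a row,
# leaving any non-string cells untouched. On Python 2, csv.writer
# only accepts byte strings, so this avoids UnicodeEncodeError.
def encode_row(row, encoding="utf8"):
    return [cell.encode(encoding) if isinstance(cell, type(u"")) else cell
            for cell in row]

# Example row containing a non-ASCII character (em dash).
row = [u'nun (1)', u'rape (1)', u'\u2014']
encoded = encode_row(row)
print(encoded)
```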
For Unicode, just convert to a string. @MONTYHS: csv
already handles that; something else must be going on, such as calling str() on the rows. The reported error was:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u200b' in position 0: ordinal not in range(128)
which is fixed by encoding first:
t = [(u'%s (%s)' % (w[0], w[1])).encode("utf8") for w in top_10_words_sans_stop_words]
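For what it's worth, on Python 3 the csv module accepts Unicode text directly, so no per-cell encoding is needed; a sketch writing to an in-memory buffer for illustration:

```python
import csv
import io

# On Python 3, csv.writer works on text streams, so Unicode cells
# (here an em dash) can be written without encoding them first.
# For a real file you would use open(path, 'w', newline='', encoding='utf8').
buf = io.StringIO()
wr = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
wr.writerow([u'nun (1)', u'\u2014', '2014-03-19'])
print(buf.getvalue().strip())
```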