How to write JSON-to-CSV logic in Python


I have removed the stopwords from the titles of different newspapers, keeping only the words, the date, and the title of each newspaper, and I want to generate a CSV from the JSON file, so I used the following code:

import json
import os
import nltk
import csv
# Download nltk packages used in this example
nltk.download('stopwords')

BLOG_DATA = "resources/ch05-webpages/newspapers/timesofindia.json"

blog_data = json.loads(open(BLOG_DATA).read())
blog_posts = []
stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
   'a', 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after'
]

for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['title'])

    words = [w.lower() for sentence in sentences for w in
             nltk.tokenize.word_tokenize(sentence)]

    fdist = nltk.FreqDist(words)
    sentence1 = nltk.tokenize.sent_tokenize(post['date'])
    source = nltk.tokenize.sent_tokenize(post['source'])
    num_words = sum([i[1] for i in fdist.items()])
    num_unique_words = len(fdist.keys())

    # Hapaxes are words that appear only once

    num_hapaxes = len(fdist.hapaxes())


    top_10_words_sans_stop_words = [w for w in fdist.items() if w[0]
                                not in stop_words][:100]

    t=(['%s (%s)'% (w[0], w[1]) for w in top_10_words_sans_stop_words])
    #print t 

    blog_posts.append((t,sentence1,source))
print blog_posts
out_file = os.path.join('resources', 'ch05-webpages', 'stopwords', 'timesofindia3.csv')
f = open(out_file, 'wb')
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
# f.write(json.dumps(blog_posts, indent=1))
wr.writerows(blog_posts)
f.close()

print 'Wrote output file to %s' % (f.name, )
This produces output like this:

"[u'3 (1)', u'6 (1)', u'acquitted (1)', u'case (1)', u'convicted (1)', u'kandhamal (1)', u'nun (1)', u'rape (1)']",
But I want the CSV to look like this:

3 (1), 6 (1), acquitted (1), case (1), convicted (1), kandhamal (1), nun (1), rape (1)

How can I achieve this?

If you want one column for each item in t, plus a column each for sentence1 and source, change

blog_posts.append((t,sentence1,source))

so that each appended row is a single flat list rather than a nested tuple, as sketched below.
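One possible way to build that flat row, sketched here under the assumption that sentence1 and source are the one-element lists returned by sent_tokenize (the [0] indexing and the variable name row are my additions, not code from the original answer):

# Sketch: one cell per word count, then the date and the source.
# sentence1[0] and source[0] unwrap the single-sentence lists from sent_tokenize.
row = t + [sentence1[0], source[0]]
blog_posts.append(row)

With rows built this way, csv.writerows produces one cell per word instead of one quoted list per row.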


To avoid Unicode errors, encode the Unicode strings to your preferred encoding before passing them to csv.writer, for example:

t= [(u'%s (%s)'% (w[0], w[1])).encode("utf8") for w in top_10_words_sans_stop_words]

Comments:

For Unicode, just convert it to a string.

@MONTYHS: csv already handles that fine; something else is going on, for example calling str() on the rows.

UnicodeEncodeError: 'ascii' codec can't encode character u'\u200b' in position 0: ordinal not in range(128)

t= [(u'%s (%s)'% (w[0], w[1])).encode("utf8") for w in top_10_words_sans_stop_words]
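Putting both suggestions together, here is a minimal sketch of the writing step for Python 2.7 (my own illustration, not the original answer's code); it assumes blog_posts already holds flat rows as in the sketch above, and the helper name to_utf8 is hypothetical:

def to_utf8(value):
    # Encode unicode cells to UTF-8 byte strings so the Python 2 csv module
    # does not fall back to the default ascii codec; byte strings pass through.
    if isinstance(value, unicode):
        return value.encode('utf-8')
    return value

out_file = os.path.join('resources', 'ch05-webpages', 'stopwords', 'timesofindia3.csv')
f = open(out_file, 'wb')
wr = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
for row in blog_posts:
    wr.writerow([to_utf8(col) for col in row])
f.close()

Because every cell is encoded before it reaches the writer, characters such as the u'\u200b' from the traceback above no longer trigger the ascii-codec error.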