Python: Using nested loops and regular expressions to break up files by date (python, regex, csv, nltk)

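I have a folder full of text documents, and I want to write a script that breaks each text file into three-word sets (trigrams), counts those sets, and writes the results to a CSV file in the form "date, three-word set, frequency, relative frequency." The title of each text file looks like this: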

2014.5.30RNC Chairman Priebus Statement.txt
2012.8.17Homeless Veterans Need More From Obama.txt
2012.9.6GLARING OMISSION #16/ Shinseki Glosses.txt
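For reference, the date prefix in titles of this form can be separated from the rest of the name with a regular expression. The following is only a minimal sketch: `split_name` and `DATE_PREFIX` are hypothetical names, and the pattern assumes every title starts with year.month.day (no zero padding required) and ends in `.txt`.

```python
import re

# Assumed layout: "<year>.<month>.<day>" immediately followed by the title text.
DATE_PREFIX = re.compile(r'^(\d{4})\.(\d{1,2})\.(\d{1,2})(.*)\.txt$')

def split_name(filename):
    """Return (year, month, day, title), or None if the name does not match."""
    m = DATE_PREFIX.match(filename)
    if m is None:
        return None
    year, month, day, title = m.groups()
    return int(year), int(month), int(day), title

print(split_name('2014.5.30RNC Chairman Priebus Statement.txt'))
```

A title whose text begins with a digit directly after the day would be ambiguous under this pattern, so the sketch is only suitable for names like the three shown.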
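I wrote the script below. It does nothing, but it does not produce any error messages either. I think this means something is wrong with my regular expression or my nested loops, but I don't know how to troubleshoot this without an error message. Thanks in advance for your help.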

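import csv
import string
from decimal import Decimal

import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_root = '/Users/jolijttamanaha/Desktop/thesis2/RNC/Data2'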

for year in range(2015, 1990, 1):
    for month in range(12, 9, 1):
        speeches = PlaintextCorpusReader(corpus_root, r'^{}\.{}\.\d*[\s\S]*'.format(year,month))
        raw = speeches.raw().lower()
        tokens = nltk.word_tokenize(raw.encode('utf-8').translate(None, string.punctuation))
        tgs = nltk.trigrams(tokens) 
        fdist = nltk.FreqDist(tgs) 
        minscore = 1
        numwords = len(raw)    
        print "Words in corpus:" 
        print numwords
        c = csv.writer(open("RNCngramsbymonth.csv", "a"))
        for k,v in fdist.items():
            if v > minscore:
                rf = Decimal(v)/Decimal(numwords)
                firstword, secondword, thirdword = k #splits up the list hidden in k 
                trigram = firstword + " " + secondword + " " + thirdword #turns the list in k into one string
                time = year + month
                results = time,trigram,v,rf 
                c.writerow(results)
                print firstword, secondword, thirdword, v, rf

        print "Done with month {}".format(month)
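One likely reason the script runs silently: `range(2015, 1990, 1)` and `range(12, 9, 1)` both have a start above the stop with a positive step, so they are empty and the loop bodies never execute. Two smaller points in the same script: `time = year + month` adds two integers rather than building a date label, and `len(raw)` counts characters, not words. The sketch below is written for Python 3 and shows the empty ranges plus one possible way to build the loops, the file pattern, and the label. The helper names `year_months`, `fileid_pattern`, and `time_label` are hypothetical, and the year and month bounds are assumptions about what was intended.

```python
import re

# With a positive step and start > stop, range() yields nothing.
print(list(range(2015, 1990, 1)))  # []
print(list(range(12, 9, 1)))       # []

def year_months(first_year, last_year):
    """Yield (year, month) pairs, newest year first, months 1 through 12."""
    for year in range(last_year, first_year - 1, -1):  # negative step counts down
        for month in range(1, 13):
            yield year, month

def fileid_pattern(year, month):
    """Regex for file names that start with '<year>.<month>.<day>'."""
    return r'^{}\.{}\.\d+.*'.format(year, month)

def time_label(year, month):
    """Build a 'year.month' string instead of adding the two integers."""
    return '{}.{}'.format(year, month)

for year, month in year_months(2014, 2014):
    if re.match(fileid_pattern(year, month),
                '2014.5.30RNC Chairman Priebus Statement.txt'):
        print(time_label(year, month))
```

In the original script, the pattern from `fileid_pattern(year, month)` would take the place of the format string passed to `PlaintextCorpusReader`, and `time_label(year, month)` would replace `year + month`. A month with no matching files may still need separate handling.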