Extracting columns from multiple text files with Python

python text csv

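I have a folder of text files, five for each of several study sites.

The file names follow this format:

Rockspring_18_SW.417712.WRFc36.ET.2000-2050.txt

Rockspring_18_SW.417712.WRFc36.RAIN.2000-2050.txt

WICA.399347.WRFc36.ET.2000-2050.txt

WICA.399347.WRFc36.RAIN.2000-2050.txt

So the file names basically follow the pattern (site name).(site number).(WRFc36).(some variable).(2000-2050).txt

The text files all share the same layout, with no header row: year month day value (about 18,500 rows in each file).

I would like Python to find matching file names (same site name and site number), take columns 1 to 3 from one of those files, and write them to a new txt file. I also want to copy the fourth column from each of the site's variable files (rain, et, etc.) and write those columns to the new file in a specific order.

I know how to use the csv module to get the data out of all the files (defining a new dialect for the space delimiter) and print it to a new text file, but I don't know how to automatically create a new file for each site name/number and make sure my variables end up in the correct order.

The output I want is one text file per site (not 5), formatted as (year, month, day, variable1, variable2, variable3, variable4, variable5), with about 18,500 rows.

I'm sure I'm overlooking something very simple here. This seems pretty basic, but any help would be greatly appreciated!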

Update
==========

I've updated the code to reflect the comments below.

from collections import defaultdict
import glob
import csv

#Create dictionary of lists--   [A] = [Afilename1, Afilename2, Afilename3...]
#                               [B] = [Bfilename1, Bfilename2, Bfilename3...] 
def get_site_files():
    sites = defaultdict(list)
    #to start, I have a bunch of files in this format ---
    #"site name(unique)"."site num(unique)"."WRFc36"."Variable(5 for each site name)"."2000-2050"
    for fname in glob.glob("*.txt"):
        #split name at every instance of "."
        parts = fname.split(".")
        #check to make sure i only use the proper files-- having 6 parts to name and having WRFc36 as 3rd part
        if len(parts)==6 and parts[2]=='WRFc36':
            #Make sure site name is the full unique identifier, the first and second "parts"
            sites[parts[0]+"."+parts[1]].append(fname)
    return sites

#hardcode the variables for method 2, below
Var=["TAVE","RAIN","SMOIS_INST","ET","SFROFF"]

def main():
    for site_name, files in get_site_files().iteritems():
        print "Working on *****"+site_name+"*****"
####Method 1- I'd like to not hardcode in my variables (as in method 2), so I can use this script in other applications.
        for filename in files:
            reader = csv.reader(open(filename, "rb"))
            WriteFile = csv.writer(open("XX_"+site_name+"_combined.txt","wb"))
            for row in reader:
                row = reader.next()
####Method 2 works (mostly), but skips a LOT of random lines of first file, and doesn't utilize the functionality built into my dictionary of lists...            
##        reader0 = csv.reader(open(site_name+".WRFc36."+Var[0]+".2000-2050.txt", "rb"))    #I'd like to copy ALL columns from the first file
##        reader1 = csv.reader(open(site_name+".WRFc36."+Var[1]+".2000-2050.txt", "rb"))    #    and just the fourth column from all the rest of the files
##        reader2 = csv.reader(open(site_name+".WRFc36."+Var[2]+".2000-2050.txt", "rb"))    #    (the columns 1-3 are the same for all files)
##        reader3 = csv.reader(open(site_name+".WRFc36."+Var[3]+".2000-2050.txt", "rb"))
##        reader4 = csv.reader(open(site_name+".WRFc36."+Var[4]+".2000-2050.txt", "rb"))
##        WriteFile = csv.writer(open("XX_"+site_name+"_COMBINED.txt", "wb"))               #creates new command to write a text file
##
##        for row in reader0:
##            row  = reader0.next()
##            row1 = reader1.next()
##            row2 = reader2.next()
##            row3 = reader3.next()
##            row4 = reader4.next()
##            WriteFile.writerow(row + row1 + row2 + row3 + row4)
##        print "***finished with site***"

if __name__=="__main__":
    main()

As far as getting the file names goes, I would use the following:

import os

# Gets a list of all file names that end in .txt
# ON *nix
file_names = os.popen('ls *.txt').read().split('\n')

# ON Windows
file_names = os.popen('dir /b *.txt').read().split('\n')

Then, to get the elements of a name that are separated by periods, use:

# For some file_name in file_names
file_name.split('.')

You can then go on to compare the parts and extract the columns you need (using open(file_name, 'r') or your CSV parser).
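The splitting and comparing steps above could be sketched as follows. This is only an illustration, assuming the six-part naming pattern from the question; `clean_names` and `parse_name` are hypothetical helper names, not part of any library, and the code is written for Python 3.

```python
def clean_names(listing):
    """Split the text output of ls/dir into file names, dropping empty strings."""
    return [name.strip() for name in listing.split("\n") if name.strip()]


def parse_name(file_name):
    """Split 'site.number.WRFc36.VAR.2000-2050.txt' into (site, number, variable).

    Returns None for names that do not match the assumed six-part pattern.
    """
    parts = file_name.split(".")
    if len(parts) != 6 or parts[2] != "WRFc36":
        return None
    site, number, _model, variable, _years, _ext = parts
    return site, number, variable


if __name__ == "__main__":
    listing = "WICA.399347.WRFc36.ET.2000-2050.txt\nnotes.txt\n"
    for name in clean_names(listing):
        print(name, "->", parse_name(name))
```

Two files belong to the same site when the first two elements of their tuples are equal, and the third element says which variable the file holds.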

Michael G.

Here is a simpler way to iterate over the files, grouped by site:

from collections import defaultdict
import glob

def get_site_files():
    sites = defaultdict(list)
    for fname in glob.glob('*.txt'):
        parts = fname.split('.')
        if len(parts)==6 and parts[2]=='WRFc36':
            sites[parts[0]].append(fname)
    return sites

def main():
    for site,files in get_site_files().iteritems():
        # you need to better explain what you are trying to do here!
        print site, files

if __name__=="__main__":
    main()

I still don't understand your column cutting and pasting; you need to explain more clearly what you are trying to accomplish.
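Going by the description in the question (date columns from one file, plus the fourth column of every variable file, in a fixed order), the combining step might look roughly like the sketch below. It is written for Python 3 and makes several assumptions: the variable order is the `Var` list from the question, every site has a file for each variable, all files for a site have the same number of rows, and the output name `XX_<site>_combined.txt` follows the question's code. `group_by_site`, `read_rows`, and `combine_site` are hypothetical names.

```python
import glob
import os
from collections import defaultdict

# Assumed output column order, copied from the Var list in the question.
VARIABLES = ["TAVE", "RAIN", "SMOIS_INST", "ET", "SFROFF"]


def group_by_site(pattern="*.txt"):
    """Map 'site.number' -> {variable: file name} for files matching the pattern."""
    sites = defaultdict(dict)
    for fname in glob.glob(pattern):
        parts = os.path.basename(fname).split(".")
        if len(parts) == 6 and parts[2] == "WRFc36":
            sites[parts[0] + "." + parts[1]][parts[3]] = fname
    return sites


def read_rows(fname):
    """Read a whitespace-separated file into a list of rows (lists of strings)."""
    with open(fname) as handle:
        return [line.split() for line in handle if line.strip()]


def combine_site(site_name, by_variable, variables=VARIABLES):
    """Write year, month, day plus the 4th column of each variable file."""
    # Raises KeyError if a site lacks one of the expected variables.
    tables = [read_rows(by_variable[var]) for var in variables]
    out_name = "XX_" + site_name + "_combined.txt"
    with open(out_name, "w") as out:
        # zip walks all tables in lockstep: one row from each file per step.
        for rows in zip(*tables):
            fields = rows[0][:3] + [row[3] for row in rows]
            out.write(" ".join(fields) + "\n")
    return out_name


if __name__ == "__main__":
    for site_name, by_variable in sorted(group_by_site().items()):
        print("Working on", site_name)
        combine_site(site_name, by_variable)
```

Keying the inner dictionary by variable name is what keeps the output columns in a known order; the order in which glob happens to return the files no longer matters.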

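You also need to remove '' (empty strings) from the list of file names.

Any thoughts on the code I've written? It doesn't use your code, but maybe you have some insight into this version. I've added some new code above to reflect your pattern. As for cutting and pasting columns: I have several study sites, and each study site has 5 text files (one for each of 5 variables), so for 5 study sites I would have 25 text files. The columns of every text file have the same format: year month day variable-value. I want to copy the dates from one file, and only the variable values from all the other files for each study site. So I would end up with just one text file per study site, with columns in the format: year month day Var1 Var2 Var3 Var4 Var5.

Don't forget that in this case glob.iglob('*.txt') creates an iterator and avoids building a list of values.

@hochl I think glob.iglob would be simpler if I used Method 2 (see my code at codepad.org/3mQEM75e), but I want to use Method 1... glob.glob works for both methods. By the way, how do I paste the updated code into my original question? I couldn't figure it out, so I provided a link to it (codepad.org/3mQEM75e).

Not sure what you mean, but you can simply edit your question, that's all. Linking to code can reduce the value of your post if the link goes dead, so it is usually best to include the relevant code directly in your post.

@hochl Got it! I've now edited my question to include the code instead of a link that might die in the future. Any idea why I would be losing random lines in Method 2 in the code above?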
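A likely cause of the lost lines in Method 2: `for row in reader0:` already fetches a row on each pass, and the extra `reader0.next()` inside the loop fetches another one, so every second row of the first file is discarded and the remaining rows are paired with the wrong rows of the other files. The sketch below reproduces the effect with plain lists standing in for the csv readers (Python 3, where `next(reader)` replaces `reader.next()`); `buggy_pairs` and `fixed_pairs` are hypothetical names.

```python
def buggy_pairs(first, second):
    """Mimic Method 2: loop over the first iterator AND call next() on it."""
    first, second = iter(first), iter(second)
    pairs = []
    for row in first:          # the loop itself consumes one row...
        row = next(first)      # ...and this consumes (and keeps) the following one
        pairs.append((row, next(second)))
    return pairs


def fixed_pairs(first, second):
    """Advance both iterators exactly once per step with zip."""
    return list(zip(first, second))


if __name__ == "__main__":
    a = ["a1", "a2", "a3", "a4"]
    b = ["b1", "b2", "b3", "b4"]
    print(buggy_pairs(a, b))   # [('a2', 'b1'), ('a4', 'b2')]
    print(fixed_pairs(a, b))   # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3'), ('a4', 'b4')]
```

In the question's code, the equivalent fix would be to drop the `row = reader0.next()` line, or to iterate over all five readers together with zip (itertools.izip on Python 2).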