
Parsing a badly-written CSV (tabs and spaces) in Python


So I wrote a script that downloads a file, saves it, then reads it back and stores the values in variables for further use. My problem is that the original file is not well formatted. I thought it was tab-delimited, but there also seem to be some stray spaces.

Here is a subset of my code:

for link in soup.find_all('a', href=re.compile('emailLogins_20130930')):
    parsedURL = str(link.get('href')).strip()
    fileURL = 'https://xxxxx.com'+parsedURL
    request = opener.open(fileURL)
    out = open(fileName, 'a')

    for row in request:
        if re.search('Source', row): #strip out the header on the original file before rewriting the file to disk
            continue
        else:
            if row.strip():
                for column in row:
                    out.write(column)
            else:
                continue
    out.close()
    time.sleep(4)


#This part removes some null lines (if they exist) I found previously in the files  
with open(fileName, 'rb') as f_origin:
    data = f_origin.read()

with open('cleanCSV.csv', 'wb') as f_clean: 
    f_clean.write(data.replace('\x00', ''))

#Attempting to remove the tabs and replace with commas.  My thought was that the spaces would just be included in the strings
#but it looks as though those are being converted to ',' as well.

in_txt = csv.reader(open('cleanCSV.csv', 'rb'), delimiter = '\t')
out_csv = csv.writer(open('new-csv-test.csv', 'wb'))
out_csv.writerows(in_txt)

filereader = open('new-csv-test.csv', 'rb')

reader = csv.reader(filereader, delimiter=',', quoting=csv.QUOTE_NONE)

for row in reader:
    rowlist = list(row)
    source = rowlist[0]
    print '0: ' + source
    #start_date = rowlist[1]
    #print '1: ' + start_date
    #start_time = rowlist[2]
    #print '2: ' + start_time
    #start = start_date + ' ' + start_time
    #print 'START: ' + start
    start = rowlist[1]
    print '1: ' + rowlist[1]
    start_dt = datetime.strptime(start, '%Y-%m-%d %H:%M:%S')
    start_ts = start_dt.strftime('%b %d %Y %H:%M:%S')
    upstreamIP = rowlist[2]
    print '2: ' + upstreamIP
    username = rowlist[3]
    print '3: ' + username
    emailLogins = rowlist[4]
    print '4: ' + emailLogins
    emailProvider = rowlist[5]
    print '5: ' + emailProvider + '\n'

    mergedEmail = emailLogins+'@'+emailProvider
Here is a sample of the original file:

11.111.111.111_vpn_ 2013-09-29 19:50:35     NULL    Pxxx    aol.com
11.111.111.111_vpn_ 2013-09-29 19:49:50     NULL    Dxxxxxxx    aol.com
11.111.111.111_vpn_ 2013-09-29 19:54:24     NULL    fxxxxxxx_governmentgrant    aol.com
11.111.111.111_vpn__parsed  2013-09-30 10:58:48 98506   mxxxxx05    hxxxxxyen   yahoo.com
mace3_vpn_11.11.111.111     2013-09-30 11:14:48     NULL    mxxxxxys00  aol.com
11.111.111.111_vpn__parsed  2013-09-30 11:10:08 98506   mxxxxx05    hhxxxxxen   yahoo.com
mace3_vpn_111.111.111.1     2013-09-30 11:38:57     NULL    Fndxxxxxa   aol.com
mace3_vpn_11.11.111.111     2013-09-30 11:24:49     NULL    myxxxxxx00  aol.com
mace3_vpn_11.11.111.111     2013-09-30 11:25:16     NULL    mxxxxxxxxxx01   yahoo.com
And here is what my code produces (note the double ',,' after the first column; I do expect the second set of double commas, since there is often a column with no data):

111.111.111.1_vpn_,2013-09-29 19:50:35,,NULL,Pxxxx0,aol.com
111.111.111.1_vpn_,2013-09-29 19:49:50,,NULL,Dxxxxxen,aol.com
111.111.111.1_vpn_,2013-09-29 19:54:24,,NULL,fxxxxxxk_governmentgrant,aol.com
111.111.111.1_vpn__parsed,2013-09-30 10:58:48,98506,mxxxxxx5,hxxxxxxen,yahoo.com
mace3_vpn_111.111.111.1,,2013-09-30 11:14:48,,NULL,mxxxxxxs00,aol.com
111.111.111.1_vpn__parsed,2013-09-30 11:10:08,98506,mxxxxxx5,hxxxxxxen,yahoo.com
mace3_vpn_111.111.111.1,,2013-09-30 11:38:57,,NULL,Fxxxxxxxa,aol.com
mace3_vpn_111.111.111.1,,2013-09-30 11:24:49,,NULL,mxxxxxxs00,aol.com
mace3_vpn_111.111.111.1,,2013-09-30 11:25:16,,NULL,mxxxxxxxxx1,yahoo.com
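
For reference, the doubled comma is just `csv.reader` (correctly) treating every tab as a delimiter, so two consecutive tabs become one empty field. A minimal Python 3 demonstration with one of the sample rows (the exact tab positions in the raw file are an assumption based on the output above):

```python
import csv
import io

# Two tabs after the first column turn into a spurious empty field,
# alongside the legitimate empty field after the timestamp.
raw = "mace3_vpn_11.11.111.111\t\t2013-09-30 11:14:48\t\tNULL\tmyxxxxxx00\taol.com\n"
rows = list(csv.reader(io.StringIO(raw), delimiter='\t'))
print(rows[0])
# ['mace3_vpn_11.11.111.111', '', '2013-09-30 11:14:48', '', 'NULL', 'myxxxxxx00', 'aol.com']
```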

I went back and looked at the original file in vi and checked the ASCII values of the whitespace between the first and second columns on the second row: it appears to be two tabs, not a space plus a tab. I'm not sure how to remove the extra tab here while keeping the extra tab that should appear when a column has no data.

It's not clear what your problem is. Could the input file contain empty fields? If not, you could simply preprocess it and replace any run of whitespace with a single tab. Otherwise, I'm not sure how to distinguish spurious whitespace from a legitimate empty field.

I thought all the fields were tab-delimited, but it seems some fields may actually be separated by a space plus a tab (or something like that), which makes Python think it should be converted to a comma. When that leaves two commas where there shouldn't be any, it throws off my variables when I process each row. And yes, the date field can also contain a legitimate space.

I also tried converting all the whitespace to commas, but ran into the same situation: a space followed by a tab produced two commas, which made the CSV parsing think more than one field was empty.
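
Building on the preprocessing idea, here is one possible sketch (Python 3): split each line on tabs only, strip stray spaces from each field, and drop surplus empty fields only when the row is wider than expected, so a legitimate empty column in a correctly-sized row survives. `EXPECTED_FIELDS = 6` and the "remove the leftmost spurious empty" heuristic are assumptions based on the sample rows shown above, not a definitive fix:

```python
EXPECTED_FIELDS = 6  # assumed from the sample: source, datetime, id, user, login, provider

def clean_row(raw_line):
    """Split a raw line on tabs and repair doubled-tab artifacts."""
    # Stray spaces around a field (e.g. 'NULL    ') are stripped; the
    # single space inside the datetime is untouched because we only
    # split on tabs, never on spaces.
    fields = [f.strip() for f in raw_line.rstrip('\n').split('\t')]
    # A doubled tab shows up as a spurious empty field. Remove empty
    # fields (leftmost first) only while the row is too wide, so an
    # intentional empty column in a correctly-sized row is preserved.
    while len(fields) > EXPECTED_FIELDS and '' in fields:
        fields.remove('')
    return fields

print(clean_row("mace3_vpn_11.11.111.111\t\t2013-09-30 11:14:48\t\tNULL\tmyxxxxxx00\taol.com"))
# ['mace3_vpn_11.11.111.111', '2013-09-30 11:14:48', '', 'NULL', 'myxxxxxx00', 'aol.com']
```

Rows that are already six fields wide pass through unchanged, so the empty column in the NULL rows keeps its position and the index-based lookups (`rowlist[0]` through `rowlist[5]`) stay consistent.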