Python: exporting selected data from a text file with mixed delimiters to a CSV file

python csv text

I want to export some of the information in a text file to an Excel sheet or a CSV file. The text file looks like this:

TT1  4444 | Drowsy | 9 19   bit drowsy
TT2  45888 | Blurred see - hazy | 29 50 little seeing vision
TT4  45933 | Excessive upper pain  | 62 78  pain problems

My expected CSV file looks like this:

Column 1    Column 2                      Column 3
4444        Drowsy                        bit drowsy
45888       Blurred see - hazy            little seeing vision
45933       Excessive upper pain          pain problems

As you can see, I do not need the information in the first, fourth and fifth columns of the text file.

Update to the question: some lines have the following structure:

TT6 112397013 | air | or 76948002|pain| 22 345  agony

The expected output for such a line is:

Column 1    Column 2                      Column 3
112397013     air                          agony
76948002      pain                         agony

Second update to the question: the text file contains another exception:

TT9 CONCEPT_LESS 336 344    mobility

For this line I only want the following output:

CONCEPT_LESS   mobility

Any suggestions? Thanks.

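I assume you can read the data in as a list of strings. The code uses regular expressions (re) to parse the lines into the desired output; it prints the result, which you would then write to a CSV file: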

import re

# Read the lines from a file with:
# lines = my_file.readlines()
lines = ["TT1  4444 | Drowsy | 9 19   bit drowsy",
         "TT2  45888 | Blurred see - hazy | 29 50 little seeing vision",
         "TT4  45933 | Excessive upper pain  | 62 78  pain problems"]

# Matches "TT", skips to whitespace, then captures the number that ends at the next whitespace
tt_num_pattern = r"TT.*\s([0-9].*?)\s"

# Captures everything from the first non-digit character that follows whitespace
describe_pattern = r"\s(\D.*)"

# Format the output lines
out_lines = []
for line in lines:
    split_line = line.split("|")
    tt_num = re.findall(tt_num_pattern, split_line[0])[0]

    state = split_line[1].strip()  # Just trim whitespace at the edges
    describe = re.findall(describe_pattern, split_line[2])[0]
    describe = describe.strip()

    out_line = tt_num + "," + state + "," + describe
    out_lines.append(out_line)

# Print them out (you would normally write them to a file after a header line)
for out_line in out_lines:
    print(out_line)

Output:

4444,Drowsy,bit drowsy
45888,Blurred see - hazy,little seeing vision
45933,Excessive upper pain,pain problems

Glad this helped. Here is the update you asked for. Honestly, this is not very good (flexible) code, but it works:

import re

# Read the lines from a file with:
# lines = my_file.readlines()
lines = ["TT1  4444 | Drowsy | 9 19   bit drowsy",
         "TT2  45888 | Blurred see - hazy | 29 50 little seeing vision",
         "TT4  45933 | Excessive upper pain  | 62 78  pain problems",
         "TT6 112397013 | air | or 76948002|pain| 22 345  agony"]

# Matches "TT", skips to whitespace, then captures the number that ends at the next whitespace
tt_num_pattern = r"TT.*\s([0-9].*?)\s"

# Captures everything from the first non-digit character that follows whitespace
describe_pattern = r"\s(\D.*)"

# Format the output lines
out_lines = []
for line in lines:

    split_line = line.split("|")

    # Five fields: the line contains an 'or' with a second code and term
    if len(split_line) == 5:
        # Second entry ("or 76948002|pain|"), which shares the trailing description
        tt_num = split_line[2].replace("or", "").strip()
        state = split_line[3].strip()
        describe = re.findall(describe_pattern, split_line[4])[0].strip()
        out_line = tt_num + "," + state + "," + describe
        out_lines.append(out_line)

        # First entry ("TT6 112397013 | air |")
        tt_num = re.findall(tt_num_pattern, split_line[0])[0]
        state = split_line[1].strip()
        out_line = tt_num + "," + state + "," + describe
        out_lines.append(out_line)

    # Three fields: there is no 'or'
    elif len(split_line) == 3:
        tt_num = re.findall(tt_num_pattern, split_line[0])[0]

        state = split_line[1].strip()  # Just trim whitespace at the edges
        describe = re.findall(describe_pattern, split_line[2])[0]
        describe = describe.strip()

        out_line = tt_num + "," + state + "," + describe
        out_lines.append(out_line)

# Print them out (you would normally write them to a file after a header line)
for out_line in out_lines:
    print(out_line)

Updated output:

4444,Drowsy,bit drowsy
45888,Blurred see - hazy,little seeing vision
45933,Excessive upper pain,pain problems
76948002,pain,agony
112397013,air,agony
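The snippets above build each output line by joining fields with commas, which would produce a malformed row if a field ever contained a comma itself. As a minimal sketch (not part of the original answer), the standard csv module can do the writing instead. The helper name rows_to_csv_text and the second sample row are made up for illustration, and io.StringIO stands in for a real file:

```python
import csv
import io


def rows_to_csv_text(rows, header=("Column 1", "Column 2", "Column 3")):
    """Serialise rows to CSV text; csv.writer quotes fields when needed."""
    buffer = io.StringIO()  # in-memory stand-in for an open file
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ("4444", "Drowsy", "bit drowsy"),
    # Hypothetical row whose second field contains a comma
    ("45888", "Blurred see, hazy", "little seeing vision"),
]
print(rows_to_csv_text(rows))
```

With a real file, the same writer calls work on a file object opened in text mode with newline="".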

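Since the input text file does not use one specific type of delimiter (pipe, space or comma), we need to read each line of the file as a string.

A regex is used to extract the required information.

The csv module is used to create the CSV file and write the data to it.

See the csv module documentation for more information.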

Contents of xyz.txt:

TT1  4444 | Drowsy | 9 19   bit drowsy
TT2  45888 | Blurred see - hazy | 29 50 little seeing vision
TT4  45933 | Excessive upper pain  | 62 78  pain problems
TT6 112397013 | air | or 76948002|pain| 22 345  agony
TT9 CONCEPT_LESS 336 344    mobility

Code (with inline comments):

import csv
import re


def extract_data(val):
    """Return (code, term, description) from three pipe-separated fields."""
    # The code is the last word of the first field, e.g. "TT1  4444" -> "4444"
    code = re.findall(r'.*\s+(\w+)', val[0].strip())[0]
    term = val[1].strip()
    # The description is the first run of non-digits that follows whitespace
    description = re.findall(r'\s+(\D+)', val[2].strip())[0]
    return (code, term, description)


# Open the CSV file in text mode; newline="" is what the csv module expects in Python 3
with open("demo.csv", "w", newline="") as csv_fh, open("xyz.txt", "r") as fh:
    writer = csv.writer(csv_fh)
    # Write the header row
    writer.writerow(('Column 1', 'Column 2', 'Column 3'))

    # Read the text file line by line
    for line in fh:
        if not line.strip():
            continue  # skip blank lines
        val_list = line.split('|')
        if len(val_list) == 5:
            # "code | term | or code | term | offsets description".
            # Counting fields avoids false hits on words that merely contain "or".
            val1 = val_list[:2] + val_list[-1:]  # first entry plus the shared description
            val2 = val_list[2:]
            for v in (val1, val2):
                writer.writerow(extract_data(v))
        elif len(val_list) == 3:
            # "code | term | offsets description"
            writer.writerow(extract_data(val_list))
        elif '|' not in line:
            # No pipes, e.g. "TT9 CONCEPT_LESS 336 344    mobility"
            val = line.split()
            writer.writerow([val[1], val[4], ''])
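The branching above can also be written as a pure function that turns one input line into zero or more rows, which makes the three line formats easy to check without touching the file system. This is only a sketch under the assumptions visible in the sample data (a "TTn" tag comes first, and two numeric offsets precede the description); parse_line and ENTRY are hypothetical names, not part of the answer:

```python
import re

# One "code | term |" entry; later entries on a line may be prefixed with "or"
ENTRY = re.compile(r"(?:or\s+)?(\w+)\s*\|\s*([^|]*?)\s*\|")


def parse_line(line):
    """Return a list of (code, term, description) rows for one input line."""
    line = line.strip()
    if not line:
        return []
    if "|" not in line:
        # e.g. "TT9 CONCEPT_LESS 336 344    mobility"
        parts = line.split(None, 4)
        return [(parts[1], parts[4], "")] if len(parts) == 5 else []
    # Drop the leading "TTn" tag (assumed present), then split off the
    # trailing "offsets description" field after the last pipe
    body = line.split(None, 1)[1]
    head, _, tail = body.rpartition("|")
    # tail looks like " 9 19   bit drowsy": remove the two numeric offsets
    description = re.sub(r"^\s*\d+\s+\d+\s*", "", tail).strip()
    # Every entry on the line shares the same description
    return [(code, term, description) for code, term in ENTRY.findall(head + "|")]


samples = [
    "TT1  4444 | Drowsy | 9 19   bit drowsy",
    "TT6 112397013 | air | or 76948002|pain| 22 345  agony",
    "TT9 CONCEPT_LESS 336 344    mobility",
]
for sample in samples:
    print(parse_line(sample))
```

Each returned tuple can be passed straight to writer.writerow, so the file loop reduces to calling parse_line on every line.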

How is your data stored in Python?

@depperm, are you asking about my previous question? If so, I will have access to the data on Wednesday and will test the solution. Thank you very much.

Thank you very much! There are two errors in the code: for the lines writer.writerow(('Column 1', 'Column 2', 'Column 3')) and writer.writerow([tmp1, tmp2, tmp3]) it says "TypeError: a bytes-like object is required, not 'str'".

@Mary - Please check the updated code. I have changed it so that it also works for your related update.

@Mary - I think you are using Python 3. That is why you get the error (in comment 2). You only need to change csv_fh = open("demo.csv", 'wb') to csv_fh = open("demo.csv", 'w'). I have changed this in the updated code.

Perfect code. It works great. Could you please check my second update?

@Mary - Updated the code. Please check.

Thank you very much for your answer. Could you please check the updated part of my question? I really appreciate your time.

There is another exception in the file. Could you please check it again?

Hi @Mary, looks like Dinesh has you covered :)

Could you please answer this question:
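The TypeError discussed in the comments comes from opening the output file in binary mode under Python 3, where csv.writer emits str rather than bytes. A minimal sketch of the Python 3 form follows; the temporary directory is an assumption made here so the example does not overwrite a real demo.csv:

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")
    # Text mode ("w"), not "wb"; newline="" lets the csv module control line endings
    with open(path, "w", newline="") as csv_fh:
        writer = csv.writer(csv_fh)
        writer.writerow(("Column 1", "Column 2", "Column 3"))
        writer.writerow(("4444", "Drowsy", "bit drowsy"))
    # Read the file back to confirm what was written
    with open(path, newline="") as csv_fh:
        written_rows = list(csv.reader(csv_fh))

print(written_rows)
```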