Python csv: read the column values that correspond to a value in another column

python csv parsing


I need to parse a csv file.

Input: a file and a name

Index | writer  | year | words
0     | Philipp | 1994 | this is first row
1     | Heinz   | 2000 | python is wonderful (new line) second line
2     | Thomas  | 1993 | i don't like this
3     | Heinz   | 1898 | this is another row
.     | .       | .    | .
.     | .       | .    | .
N     | Fritz   | 2014 | i hate man united

Output: a list of all the words that correspond to the name

l = ['python is wonderful second line', 'this is another row']

What have I tried?

import csv
import sys

class artist:
    def __init__(self, name, file):
        self.file = file 
        self.name = name
        self.list = [] 

    def extractText(self):
        with open(self.file, newline='') as f:  # text mode; Python 3's csv.reader rejects bytes
            reader = csv.reader(f)
            temp = list(reader)
        k = len(temp)
        for i in range(1, k):
            s = temp[i]
            if s[1] == self.name:
                self.list.append(str(s[3]))


if __name__ == '__main__':
    # arguments
    inputFile = str(sys.argv[1])
    Heinz = artist('Heinz', inputFile)
    Heinz.extractText()
    print(Heinz.list)

The output is:

["python is wonderful\r\nsecond line", 'this is another row']

For cells whose words span several lines, how can I get rid of the \r\n? Also, the loop is very slow; can it be improved?

You can simply use pandas to get the list:

import pandas
df = pandas.read_csv('test1.csv')
index = df[df['writer'] == "Heinz"].index.tolist()  # row positions of the given writer
l = list()
for i in index:
    # join the lines of the cell with a space so neither '\r\n' nor '\n' survives
    l.append(' '.join(df.iloc[i, 3].splitlines()))
l

Output:

['python is wonderful second line', 'this is another row']


To get rid of the newline in s[3]: I'd suggest ' '.join(s[3].splitlines()). See the documentation for str.splitlines and str.join for details.
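As a small illustration of that suggestion, the snippet below applies it to the multi-line cell from the question. The variable names `cell` and `flattened` are only placeholders for this sketch.

```python
cell = "python is wonderful\r\nsecond line"

# splitlines() breaks on \r\n, \n and \r alike, so no line-ending
# characters are left in the pieces that join() glues back together.
flattened = ' '.join(cell.splitlines())
print(flattened)
```

Because `splitlines()` recognises every common line ending, the same call works whether the file was written on Windows or on Unix.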

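Improving the loop:

def extractText(self):
    # newline='' lets the csv module handle line endings inside quoted cells
    with open(self.file, newline='') as f:
        for s in csv.reader(f):
            if s[1] == self.name:
                self.list.append(str(s[3]))

This saves one pass over the data.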
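A self-contained sketch of the one-pass approach, combined with the newline fix, might look like the following. The function name `words_for_writer` and the temporary sample file are assumptions made for this example; the rows mirror the table in the question.

```python
import csv
import os
import tempfile


def words_for_writer(path, name):
    """Collect the 'words' cell of every row whose writer equals name."""
    words = []
    # newline='' hands raw line endings to the csv module, so a quoted
    # cell that spans two lines arrives as a single field.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            if row[1] == name:
                words.append(' '.join(row[3].splitlines()))
    return words


# Build a small sample file shaped like the table in the question.
rows = [
    ['Index', 'writer', 'year', 'words'],
    ['0', 'Philipp', '1994', 'this is first row'],
    ['1', 'Heinz', '2000', 'python is wonderful\r\nsecond line'],
    ['2', 'Thomas', '1993', "i don't like this"],
    ['3', 'Heinz', '1898', 'this is another row'],
]
with tempfile.NamedTemporaryFile('w', newline='', suffix='.csv',
                                 delete=False) as tmp:
    csv.writer(tmp).writerows(rows)

result = words_for_writer(tmp.name, 'Heinz')
os.remove(tmp.name)
print(result)
```

Each row is inspected once as it is read, and only the matching cells are kept.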

But do consider the suggestion above and give pandas a try.


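This should at least be faster, since you parse while reading the file and then strip the unwanted carriage return and newline characters, if they are there.

with open(self.file, newline='') as csv_fh:
    for n in csv.reader(csv_fh):
        if n[1] == self.name:
            # newline='' keeps an embedded '\r\n' intact so it can be replaced here
            self.list.append(n[3].replace('\r\n', ' '))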


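To collapse multiple whitespace characters you can use a regular expression, and to speed things up, try a list comprehension:

import re

def extractText(self):
    RE_WHITESPACE = re.compile(r'[ \t\r\n]+')
    # newline='' replaces the deprecated 'rU' mode and leaves line endings to the csv module
    with open(self.file, newline='') as f:
        reader = csv.reader(f)

        # skip the first line
        next(reader)

        # put all of the words into a list if the artist matches
        self.list = [RE_WHITESPACE.sub(' ', s[3])
                     for s in reader if s[1] == self.name]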
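To see what the whitespace pattern does on its own, here is a minimal sketch that runs the same comprehension over an in-memory sample instead of a file. The sample text and the name `words` are assumptions for illustration; extra spaces and a tab were added to the multi-line cell to show the collapsing.

```python
import csv
import io
import re

RE_WHITESPACE = re.compile(r'[ \t\r\n]+')

# newline='' keeps the '\r\n' inside the quoted cell untouched.
sample = io.StringIO(
    'Index,writer,year,words\r\n'
    '1,Heinz,2000,"python   is wonderful\r\nsecond \t line"\r\n'
    '2,Thomas,1993,i don\'t like this\r\n'
    '3,Heinz,1898,this is another row\r\n',
    newline='',
)

reader = csv.reader(sample)
next(reader)  # skip the header row

# Every run of spaces, tabs and line breaks becomes a single space.
words = [RE_WHITESPACE.sub(' ', row[3])
         for row in reader if row[1] == 'Heinz']
print(words)
```

Unlike a plain `replace('\r\n', ' ')`, the pattern also covers a lone `\n` and repeated spaces.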


That's not what I want. I need the words of a specific writer/artist, not all of the words.

@TonyTannous Updated the answer to use a specific writer.

But I have to keep the whole text in each object before removing some rows, don't I? I need specific words, not all of them.

The original code copies the entire file contents into memory with temp = list(reader); here s[1] == self.name is checked for each row, and most rows are discarded.
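The memory point in the last comment can be sketched with a generator, which never holds more than the current row. The name `iter_words` and the in-memory sample are assumptions made for this example.

```python
import csv
import io


def iter_words(lines, name):
    """Yield the words cell of each matching row, one row at a time."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row if there is one
    for row in reader:
        if row[1] == name:
            yield ' '.join(row[3].splitlines())


sample = io.StringIO(
    'Index,writer,year,words\n'
    '0,Philipp,1994,this is first row\n'
    '1,Heinz,2000,"python is wonderful\nsecond line"\n'
    '3,Heinz,1898,this is another row\n'
)

gen = iter_words(sample, 'Heinz')
first = next(gen)      # only the rows up to the first match have been read
remaining = list(gen)  # the rest is consumed on demand
print(first, remaining)
```

Rows that do not match are dropped as soon as they are read, so memory use depends on the number of matches rather than on the size of the file.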