Regular expression with quotes in Python


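I am trying to create a regular expression pattern for strings like the ones below, which are stored in a file. The goal is to get any column for any row, where a row does not have to fit on a single line. For example, consider the following file: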

"column1a","column2a","column
  3a,",             #entity 1
"column\"this is, a test\"4a"
"column1b","colu
     mn2b,","column3b",             #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",             #entity 3
"column\"this is, a test\"4c"

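Each entity consists of four columns: column 4 of entity 2 is "column\"this is, a test\"4b", and column 2 of entity 3 is "column2c". Every column starts with a quote and ends with a quote, but you have to be careful because some columns contain escaped quotes. Thanks in advance.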

You can do it like this:

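  • Read the whole file.

  • Split the input on newlines that are not preceded by a comma.

  • Iterate over the split elements and split each one again on every comma (plus an optional newline after it) that has a double quote both before and after it.

  • Code: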
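    The steps above can be sketched as follows. This is a minimal sketch and not necessarily the answer's exact script: the names `split_entities` and `sample` are assumptions made for illustration, and the sample is inlined here instead of being read from the file f.

```python
import re

# Inlined copy of the sample input (an assumption; the answer reads a file).
sample = r'''"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''


def split_entities(text):
    """Split text into entities, then split each entity into its columns."""
    # An entity ends at a newline that is NOT preceded by a comma.
    entities = re.split(r'(?<!,)\n', text.strip())
    # Columns are separated by a comma (plus an optional newline) that sits
    # between a closing double quote and an opening double quote.
    return [re.split(r'(?<="),\n?(?=")', entity) for entity in entities]


for columns in split_entities(sample):
    print(columns)
```

    Note that this relies on every unfinished line ending with a comma. A newline inside a quoted column (as in the question's first entity) is not preceded by a comma, so it would be treated as the end of an entity.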

    Here is a check:

    $ cat f
    "column1a","column2a","column3a,",
    "column\"this is, a test\"4a"
    "column1b","column2b,","column3b",
    "column\"this is, a test\"4b"
    "column1c,","column2c","column3c",
    "column\"this is, a test\"4c"
    $ python3 f.py
    ['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
    ['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
    ['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
    

    f is the input file name, and f.py is the name of the file containing the Python script.

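    Your problem is very similar to one I have to deal with about three times a month :) Except that I don't use Python to solve it, but I can "translate" what I usually do: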

    text = r'''"column1a","column2a","column
      3a,",
    "column\"this is, a test\"4a"
    "column1a2","column2a2","column3a2","column4a2"
    "column1b","colu
         mn2b,","column3b",             
    "column\"this is, a test\"4b"
    "column1c,","column2c","column3c",
    "column\"this is, a test\"4c"'''
    
    import re
    
    # Number of columns one line is supposed to have
    columns = 4
    # Temporary variable to hold partial lines
    buffer = ""
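    # Our regex to check for each column: a double-quoted string that may
    # contain backslash escapes such as \". The inner alternative has no "*"
    # of its own: older re versions (e.g. Python 2.7) reject a repeated group
    # that can match the empty string with "nothing to repeat".
    check = re.compile(r'"(?:[^"\\]|\\.)*"')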
    
    # Read the file line by line
    for line in text.split("\n"):
        # If there's no stored partial line, this is a new line
        if buffer == "":
            # Check if we get 4 columns and print, if not, put the line
            # into buffer so we store a partial line for later
            matches = check.findall(line)
            if len(matches) == columns:
                print(matches)
            else:
                # use line.strip() if you need to trim whitespaces
                buffer = line
        else:
            # Update the variable (containing a partial line) with the
            # next line and recheck if we get 4 columns
            # use line.strip() if you need to trim whitespaces
            buffer = buffer + line
            # If we indeed get 4, our line is complete and print
            # We must not forget to empty buffer now that we got a whole line
            matches = check.findall(buffer)
            if len(matches) == columns:
                print(matches)
                buffer = ""
            # Optional; always good to have a safety backdoor though
            # If there is a problem with the csv itself like a weird unescaped
            # quote, you send it somewhere else
            elif len(check.findall(buffer)) > columns:
                print("Error: cannot parse line:\n" + buffer)
                buffer = ""
    

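    What have you tried so far?

    I don't know regular expressions at all, so I don't know where to start on a problem like this. The code around it shouldn't be an issue.

    What's wrong with the standard CSV reader library? IIRC it supports this variant of CSV, given the right options.

    Thanks Avinash, just one question: what if an entity spans multiple lines? Can we take that into account too?

    So what you're looking for is not the newline but the quote.

    Note: the method above works even if an entity (4 columns) ends up split across hundreds of lines.

    Thanks Jerry, it looks good, but I see this error when I try to run it: raise error, v # invalid expression

    Which version of Python are you using?

    Traceback (most recent call last):
      File "test.py", line 16, in <module>
        check = re.compile(r'([^"\]*.|\\)+"')
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 190, in compile
        return _compile(pattern, flags)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile
        raise error, v # invalid expression
    sre_constants.error: nothing to repeat

    I'm not sure what is meant by the character class, but my re version is also 2.2.1.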
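    The comment about the standard CSV reader can be illustrated with a short sketch. It is only one possible reading of that suggestion, under these assumptions: the input looks like the simplified file f (no trailing #entity remarks or padding after a comma), a trailing comma always means the entity continues on the next line, and no entity legitimately ends in an empty column. The names `read_entities` and `sample` are made up for illustration.

```python
import csv
import io

# Inlined copy of the simplified sample file f (an assumption for this sketch).
sample = r'''"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''


def read_entities(text):
    """Yield one list of columns per entity, merging continued lines."""
    # escapechar="\\" with doublequote=False makes \" an escaped quote.
    reader = csv.reader(io.StringIO(text), escapechar="\\", doublequote=False)
    pending = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        pending.extend(row)
        if pending[-1] == "":
            # A trailing comma produced an empty last field: the entity
            # continues on the next physical line, so keep accumulating.
            pending.pop()
            continue
        yield pending
        pending = []
    if pending:
        yield pending  # incomplete entity left at end of input


for columns in read_entities(sample):
    print(columns)
```

    Unlike the regex-based answers, the csv module strips the surrounding quotes and resolves the backslash escapes, so the columns come back as plain values. It also accepts newlines inside quoted columns.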