Conditionally splitting comma-separated values in a PySpark list


I am trying to run a job in PySpark. My data is in an RDD created using the PySpark context object (sc), like so:

directory_file = sc.textFile('directory.csv')

*I don't believe Python's csv module can be used on data inside an RDD.

This produces a list for each line of the csv. I know it's ugly, but here's a sample list (equivalent to one row of the original csv):

I want to split each list item on commas, except where a comma falls between double quotes (e.g. ",").

parsed=directory_file.map(lambda x:x.split(','))
This obviously doesn't handle commas between double quotes. Is there a way to do it? I've seen this question answered for plain csv files, but since in this case the csv is first loaded into a Spark RDD, I'm fairly sure the csv module doesn't apply here.
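To make the failure concrete, here is a minimal plain-Python illustration (no Spark needed, using a made-up row) of what the naive comma split does to a quoted field:

```python
# Made-up sample row; the middle field contains a comma inside quotes.
row = 'col1,"col2,blabla",col3'

# A bare split(',') tears the quoted field apart:
print(row.split(','))  # → ['col1', '"col2', 'blabla"', 'col3']
```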


Thanks.

Using your data, this should work:

# Note: 'csv' here is the raw input line, a single string.
new_csv = [""]
inside_quotes = False
pos = 0
for letter in csv:
    if letter == ",":
        if inside_quotes:
            new_csv[pos] += letter
        else:
            new_csv.append("")
            pos += 1
    elif letter == '"':
        inside_quotes = not inside_quotes  # Switch inside_quotes to True if False or vice versa.
    else:
        new_csv[pos] += letter

new_csv = [x for x in new_csv if x != '']  # Remove all empty strings.
print(new_csv)
Output

['14K685', 'El Puente Academy for Peace and Justice', 'Brooklyn', 'K778', '718-387-1125', '718-387-4229', '9', '12', 'B24, B39, B44, B44-SBS, B46, B48, B57, B60, B62, Q54, Q59', 'G to Broadway ; J, M to Hewes St ; Z to Marcy Ave', '250 Hooper Street', 'Brooklyn', 'NY', '11211', 'www.elpuente.us', '225', 'N/A', 'Consortium School', 'We are a small, innovative learning community that promotes comprehensive academic excellence for all students while inspiring and nurturing leadership for peace and justice. Our state-of-the-art facility allows for a creative and intellectually challenging environment where every student thrives. Our project-based curriculum is designed to prepare students to be active citizens and independent thinkers who share a passion for transforming their communities and the world into a better place. Our trimester system allows students to complete most of their high school credits by the 11th grade, opening opportunities for exciting internships and college courses during the school day in their senior year.', "Accelerated credit accumulation (up to 18 credits per year), iLearn, iZone 360, Year-long SAT (Scholastic Aptitude Test) preparatory course, Individualized college counseling, Early College Awareness & Preparatory Program (ECAPP). 
Visits to college campuses in NYC, Visits to colleges outside NYC in partnership with the El Puente Leadership Center, Internships, Community-based Projects, Portfolio Assessment, Integrated-Arts Projects, Before- and After-school Tutoring; Elective courses include: Drama, Dance (Men's and Women's Groups), Debate Team partnership with Tufts University, Guitar, Filmmaking, Architecture, Glee", 'Spanish', 'AM and PM Academic Support, B-Boy/B-Girl, Chorus, College and Vocational Counseling and Placement, College Prep, Community Development Project, Computers, Dance Level 1 and 2, Individual Drama; Education for Public Inquiry and International Citizenship (EPIIC), El Puente Leadership Center, Film, Fine Arts, Liberation, Media, Men’s and Women’s Groups, Movement Theater Level 1, Movement Theater Level 2, Music, Music Production, Pre-professional training in Dance, PSAT/SAT Prep, Spoken Word, Student Council, Teatro El Puente, Visual Art', 'Boys & Girls Basketball, Baseball, Softball, Volleyball', 'El Puente Williamsburg Leadership Center; The El Puente Bushwick Center; Leadership Center at Taylor-Wythe Houses; Beacon Leadership Center at MS50.', 'Woodhull Medical Center, Governor Hospital', 'Hunter College (CUNY), Eugene Lang College The New School for Liberal Arts, Pratt College of Design, Tufts University, and Touro College.', 'El Puente Leadership Center, El Puente Bushwick Center, Beacon Leadership Center at MS50, Leadership Center at Taylor-Wythe Houses, Center for Puerto Rican Studies, Hip- Hop Theatre Festival, Urban Word, and Summer Search.', 'Our school requires assessment of an Academic Portfolio for graduation.', '9:00 AM', '3:30 PM', 'This school will provide students with disabilities the supports and services indicated on their IEPs.', 'ESL', 'Not Functionally Accessible', '1', 'Priority to Brooklyn students or residents', 'Then to New York City residents', '250 Hooper Street']

How it works

  • Initialize a list new_csv containing a single empty string element. This will store our final output later.

  • Initialize a bool inside_quotes, which tells the program whether the letter being parsed is inside or outside quotes.

  • Initialize an int pos, which tells us our position in the new_csv list.

  • Iterate over each letter in the string.

  • Check whether the letter is a comma ",":

    • Check whether we are parsing a string inside quotes:

      • If True, we add the comma to the current string in new_csv.

      • If False, we don't add it; instead we append a new blank string and do pos += 1.

  • If not, check whether the letter is a double quote '"':

    • If True, we flip the bool inside_quotes using the handy not keyword (True becomes False and vice versa).

  • If it's any other character, we simply add that character to the current string in the list.

  • Do some cleanup and remove all blank strings '' from the list.

  • Print :)
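The steps above can be wrapped in a function so they can be applied to every RDD line with map(); split_row is a name made up for this sketch, and directory_file is the RDD assumed from the question:

```python
def split_row(line):
    """Split one CSV line on commas, ignoring commas inside double quotes.

    Same logic as the loop above; note the surrounding quotes are dropped.
    """
    fields = [""]
    inside_quotes = False
    pos = 0
    for letter in line:
        if letter == ",":
            if inside_quotes:
                fields[pos] += letter   # comma belongs to the field
            else:
                fields.append("")       # field boundary: start a new field
                pos += 1
        elif letter == '"':
            inside_quotes = not inside_quotes
        else:
            fields[pos] += letter
    return [x for x in fields if x != ""]

print(split_row('col1,"col2,blabla",col3'))  # → ['col1', 'col2,blabla', 'col3']

# With Spark (setup assumed from the question) this would be something like:
# parsed = directory_file.map(split_row)
```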


  • This is a common problem when reading tabular data. Thankfully, Python has a library that does this for you, so you don't have to do it by hand. You say the csv module can't work; why not? If it really can't, try the following code and comments:

    import csv
    
    # please note: KEEP YOUR FILE AS STRINGS when you read in your data.
    # Don't do anything to it to try to split it or something.
    my_rdd = sc.textFile("/your/file/location/*")
    split_with_quotes = my_rdd.map(lambda row: next(csv.reader(row.splitlines(), skipinitialspace=True)))
    
    You should note that the parser in the csv package has a string-length limit of 131072 characters, so if you have very long strings you'll need to do a bit more work.


    To check whether that's the case, run the following: my_rdd.filter(lambda x: len(x) >= 131072).count(). If the count is not 0, you have strings that are too long.
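    As a quick local sanity check of the csv-module approach above (no Spark required; the sample row is made up): csv.reader expects an iterable of lines, which is why the single string is wrapped in a list here, just as splitlines() does in the RDD version.

```python
import csv

row = 'col1,"col2,blabla",col3'
# Wrap the single string in a list so csv.reader gets an iterable of lines.
fields = next(csv.reader([row], skipinitialspace=True))
print(fields)  # → ['col1', 'col2,blabla', 'col3']
```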

    You can use a regular expression. It runs very fast in PySpark:

    import re
    rdd=sc.textFile("factbook.csv")
    
    # Get rid of those commas we do not need
    cleanedRdd = rdd.map(lambda x: re.match(r'(.*".*)(,)(.*".*)', x, re.M | re.I).group(1) + " " + re.match(r'(.*".*)(,)(.*".*)', x, re.M | re.I).group(3) if re.match(r'(.*".*)(,)(.*".*)', x, re.M | re.I) is not None else x)
    
    So for each line similar to this:

    col1,"col2,blabla",col3
    
    this code matches it against the regex pattern. If the pattern is found, it creates 3 groups:

    • Group 1: col1,"col2
    • Group 2: ,
    • Group 3: blabla",col3
    Finally, we concatenate group 1 and group 3 (joined by a space), and the output is:

    col1,"col2 blabla",col3
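    Here is the same regex applied locally, outside Spark, as a sketch. Note that this pattern rewrites only one quoted comma per line, so a field with several embedded commas would need repeated passes:

```python
import re

pattern = r'(.*".*)(,)(.*".*)'
row = 'col1,"col2,blabla",col3'

m = re.match(pattern, row)
# Join group 1 and group 3 with a space, as in the answer above.
result = m.group(1) + " " + m.group(3) if m is not None else row
print(result)  # → col1,"col2 blabla",col3
```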
    

    Thanks for the response, but this is a list specific to an RDD object, which isn't iterable, so this won't work on it. It's a pretty strange situation. Sorry, I'm not familiar with Spark. Is there a way to make it iterable? Also, I'm only iterating over a single str here. Thanks @Katya… I'll look into Spark-specific approaches first; Spark's serialization may prevent modules imported outside a function from working. Your solution would certainly work outside Spark. @dstar, I'm not sure what you're after, but I'll let you know that this does in fact work inside Spark; I've used this code on Spark RDDs many times. Are you worried about using external packages because you have no way to get them?