Converting a column from a CSV into a list using Python


I'm fairly new to Python, and although I've read a lot of posts I'm still not sure how to ignore the lines that start with '#'.

I need to:

  • Put four columns from this tsv file (col2-col5) into separate lists. (Since the Hawaii data is incomplete, only 49 data points are used; how do I skip that line?)

  • Then define a function Pearson(X, Y) that takes the two lists as arguments and returns the Pearson correlation coefficient. Let X = [x1, x2, …, xn] and Y = [y1, y2, …, yn]. The Pearson correlation coefficient between X and Y is given by:

  • r = (n∑xi·yi - ∑xi·∑yi) / (√(n∑xi^2 - (∑xi)^2) · √(n∑yi^2 - (∑yi)^2))

    How do I write out the ∑ symbol when defining the function? (See the short sketch after the data listing below.)

    #------------------------------------------------------------------------
    # Data from the CDC -- http://www.cdc.gov/ -- reports on prevalence of
    # smoking, incidence of lung cancer, and deaths attributed to smoking.
    # 
    # Col1: state
    # Col2: cases of lung cancer (per 100,000 inhabitants)
    # Col3: smoking among adults (%)
    # Col4: attempts to quit (%)
    # Col5: smoking-related deaths (per 100,000 inhabitants)
    #------------------------------------------------------------------------
    Alabama 107.1   24.9    47.1    321.1
    Alaska  89  24.9    54.2    296.2
    Arizona 63.4    18.6    49.4    248.9
    Arkansas    105 25.7    45.6    334.1
    California  64.4    14.8    51.4    261
    Colorado    56.9    20.1    42.4    252.7
    Connecticut 81.1    18.1    49  253.8
    Delaware    98.8    24.5    48.7    296
    District of Columbia    80.2    21  54.2    257.3
    Florida 85.5    20.4    44.2    275.5
    Georgia 98.3    20.1    54.8    312.3
    Hawaii  68.2    NA  NA  185.1
    Idaho   62.7    17.5    47.2    254.1
    Illinois    92  22.2    49.3    278.4
    Indiana 102.8   25  47.5    322.2
    Iowa    91.8    20.8    42.9    256.7
    Kansas  84.8    19.8    43.7    270.8
    Kentucky    132.6   27.6    47.6    378.1
    Louisiana   108 23.6    51.8    309.1
    Maine   99.3    21  55.3    303.8
    Maryland    80.1    19.7    51.1    279.5
    Massachusetts   83.3    18.5    52.5    258.6
    Michigan    90  23.4    55.6    296.3
    Minnesota   65  20.7    43.6    225.3
    Mississippi 115.4   24.6    48.9    343.2
    Missouri    103.9   24.1    43  325
    Montana 73.1    20.4    45.4    292.6
    Nebraska    82.8    20.3    46.7    251.9
    Nevada  82.7    23.2    41.4    370.4
    New Hampshire   80.6    21.8    53.2    294.8
    New Jersey  78.8    18.9    49.6    253.1
    New Mexico  57.6    20.3    45.6    250.8
    New York    76.7    20  51.5    259.6
    North Carolina  104.1   23.2    49.2    307
    North Dakota    71.5    19.9    43.9    233
    Ohio    97.4    25.9    41.3    310.6
    Oklahoma    102.6   26.1    45.1    321.7
    Oregon  77.6    20  46  277.5
    Pennsylvania    89.4    22.7    47.1    269.1
    Rhode Island    84.6    21.3    53.1    283
    South Carolina  99.4    24.5    49.1    303.3
    South Dakota    78.8    20.3    46.4    253.8
    Tennessee   111.1   26.1    46.6    333.6
    Texas   83.2    20.6    46.4    287.4
    Utah    33.1    10.5    53.7    144.9
    Vermont 90.2    20  55.4    272.2
    Virginia    86.7    20.9    44.8    288.7
    Washington  76.2    19.2    51.6    279.1
    West Virginia   120 26.9    46.2    361.6
    Wisconsin   75  22  47.7    258.2
    Wyoming 57.8    21.7    48.5    294.2
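
    (A note on the ∑ question: there is no ∑ character to write in Python. Each ∑ in the formula is just a sum over a list, i.e. the built-in sum() or a loop that accumulates a running total. A minimal sketch of the direct translation, with an illustrative function name, assuming X and Y are equal-length lists of floats:)

    # illustrative sketch only, not the final answer
    def pearson_sketch(X, Y):
        n = len(X)
        sum_x  = sum(X)                            # ∑xi
        sum_y  = sum(Y)                            # ∑yi
        sum_xy = sum(x * y for x, y in zip(X, Y))  # ∑xi*yi
        sum_x2 = sum(x * x for x in X)             # ∑xi^2
        sum_y2 = sum(y * y for y in Y)             # ∑yi^2
        return (n * sum_xy - sum_x * sum_y) / (
            (n * sum_x2 - sum_x ** 2) ** 0.5 * (n * sum_y2 - sum_y ** 2) ** 0.5)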
    
    This is what I have so far:

    import csv
    import operator
    import math
    import sys
    
    cases_lung_cancer = [] # list for Col2: cases of lung cancer
    
    smoking_adults = [] # list for Col3: smoking among adults
    
    attempts_quit = [] # list for Col4: attempts to quit
    
    smoking_deaths = [] # list for Col5: smoking-related deaths
    
    def Pearson(X, Y):
        pass  # TODO: compute the correlation coefficient here
    
    with open('cdc_data.tsv', newline='') as csv_f:
        for row in csv.DictReader(csv_f, delimiter='\t'):
            pass  # TODO: append each row's values to the lists above
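
    (One way to deal with the '#' lines while still using the csv module imported above: filter the comment lines out before handing the file to the reader. Since the file has no header row, plain csv.reader is used here instead of DictReader; the 'NA' check and float() conversions are illustrative additions, not part of the original attempt:)

    with open('cdc_data.tsv', newline='') as csv_f:
        # feed only non-comment, non-empty lines to the csv reader
        data_lines = (line for line in csv_f if line.strip() and not line.startswith('#'))
        for row in csv.reader(data_lines, delimiter='\t'):
            if len(row) < 5 or 'NA' in row:      # skip short or incomplete rows (e.g. Hawaii)
                continue
            cases_lung_cancer.append(float(row[1]))
            smoking_adults.append(float(row[2]))
            attempts_quit.append(float(row[3]))
            smoking_deaths.append(float(row[4]))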
    
    Put the four columns (col2-col5) from this tsv file into separate lists; I chose to ignore the Hawaii line because its data is incomplete, so 49 data points are used.

    col0 = []
    col1 = []
    col2 = []
    col3 = []
    col4 = []
    f = open('cdc_data.tsv', 'r')
    contents = f.read()
    lines = contents.split('\n')    # split the file into separate lines
    for line in lines:
        if (line[0:1] == '#'):      # filter out comments
            continue
        split_line = line.split('\t')   # split the line into fields separated by TAB
        if (len(split_line) < 5):   # drop any line that isn't 5 columns
            continue
        # assign each column to a separate list
        col0.append(split_line[0])
        col1.append(split_line[1])
        col2.append(split_line[2])
        col3.append(split_line[3])
        col4.append(split_line[4])
    
    I'll leave the Hawaii issue and your question #2 as an exercise for you to complete.

    import math
    
    col0 = []
    col1 = []
    col2 = []
    col3 = []
    col4 = []
    
    f = open('cdc_data.tsv', 'r')
    
    def Pearson(X, Y):
        n = len(X)        # number of data points
        sum_xy = 0        # ∑ xi*yi
        sum_x = 0         # ∑ xi
        sum_y = 0         # ∑ yi
        sum_x2 = 0        # ∑ xi^2
        sum_y2 = 0        # ∑ yi^2
        for i in range(n):
            sum_xy += X[i] * Y[i]
            sum_x += X[i]
            sum_y += Y[i]
            sum_x2 += X[i] ** 2
            sum_y2 += Y[i] ** 2
    
        return (n * sum_xy - sum_x * sum_y) / (math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)))
    
    
    contents = f.read()
    
    lines = contents.split('\n')    # split file into separate lines
    for line in lines:
        if (line[0:1] == '#'):   # filter out comments
           continue
        if (line[0:1] == 'H'): #filter out Hawaii
            continue
    
        split_line = line.split('\t')   # split line into separate fields, separated by TAB
    
        if (len(split_line) < 5): # drop any line that isn't 5 columns
           continue
    
        # assign each column to a separate list (inside the loop, one row at a time)
        col0.append(split_line[0])
        col1.append(float(split_line[1]))
        col2.append(float(split_line[2]))
        col3.append(float(split_line[3]))
        col4.append(float(split_line[4]))
    
    
    print("Correlation for col1 and col2: %.4f" %(Pearson(col1,col2)))
    print("Correlation for col1 and col3: %.4f" %(Pearson(col1,col3)))
    print("Correlation for col1 and col4: %.4f" %(Pearson(col1,col4)))
    print("Correlation for col2 and col3: %.4f" %(Pearson(col2,col3)))
    print("Correlation for col2 and col4: %.4f" %(Pearson(col2,col4)))
    print("Correlation for col3 and col4: %.4f" %(Pearson(col3,col4)))
    