Python: reading a chunked file into a data frame


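I'm fairly new to pandas/R, and I'm not quite sure how to read this data into pandas or R for analysis.

At the moment I think I could use readr's read_chunkwise or pandas's chunksize, but that may not be what I need. Is this really something that is easily solved with a for loop, or by using purrr to iterate over all the elements?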

Data:

wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856 
wine/variant: Red Rhone Blend 
wine/year: 1981 
review/points: 96   
review/time: 1160179200   
review/userId: 1 
review/userName: Eric 
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

wine/name: 1995 Château Pichon-Longueville Baron 
wine/wineId: 3495 wine/variant: Red Bordeaux Blend 
wine/year: 1995 
review/points: 93 
review/time: 1063929600 
review/userId: 1 
review/userName: Eric 
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
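
Currently, this is my function, but I'm running into an error:

> convertchunkfile …
>   # while the length of a line is not 0, process it with the following loop
>   while (nchar(df[[i]]) != 0) {
>     when(
>
>       # when the data at index x == wine/name, extract the data that follows that clause
>       # wine name parsing
>       cleandf$WineName[[i]] …
>       # wine ID parsing
>       cleandf$WineID[[i]] …
>       # the other attributes follow the same format
>     )
>   }
>  }
> }
Error in cleandf$BeerName[[i]] …

… cillartracker-iconv.txt
# check the number of lines in the file
wc -l …-iconv.txt
20259950 …-iconv.txt
# verify the new encoding of the file
file -I …-clean.txt

ReadEmAndWeep … %>%
      separate(text, c("var", "value"), ":", extra = "merge") %>%
      mutate(
        chunk_id = rep(1:(nrow(.) / 9), each = 9),
        value = trimws(value)
      ) %>%
      spread(var, value)
  }
  read_lines_chunked(file, DataFrameCallback$new(f), chunk_size = chunk_size)
}
# the final function call that reads in the file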

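Here is some code that reads these records into a pandas.DataFrame. The records are structured much like YAML records, so the code takes advantage of that fact. A blank line serves as the record separator.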

import pandas as pd
import collections
import yaml

def read_records(lines):
    # keep track of the columns in an ordered set
    columns = collections.OrderedDict()

    record = []
    records = []
    for line in lines:
        if line:
            # gather each line of text until a blank line
            record.append(line)

            # keep track of the columns seen in an ordered set
            columns[line.split(':')[0].strip()] = None

        # if the line is empty and we have a record, then convert it
        elif record:

            # use yaml to convert the lines into a dict; safe_load is used
            # because yaml.load requires an explicit Loader in current PyYAML
            records.append(yaml.safe_load('\n'.join(record)))
            record = []

    # record last record
    if record:
        records.append(yaml.safe_load('\n'.join(record)))

    # return a pandas dataframe from the list of dicts
    return pd.DataFrame(records, columns=list(columns.keys()))
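
As a hedged alternative to the YAML-based parser above: a YAML loader would most likely reject a value that itself contains a colon followed by a space (for example a review text such as "Note: improves with air"). The sketch below avoids that by splitting each line on its first colon only. The name iter_records and the sample lines are made up for illustration and are not part of the original answer.

```python
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per blank-line-separated record."""
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # a blank line closes the current record, if there is one
            if record:
                yield record
                record = {}
            continue
        # split on the first colon only, so colons inside the
        # review text stay part of the value
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    # the input may not end with a blank line
    if record:
        yield record


# hypothetical sample lines in the same "key: value" layout as the question
sample = [
    "wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape",
    "wine/year: 1981",
    "review/text: Note: improves with air",
    "",
    "wine/name: 1995 Château Pichon-Longueville Baron",
    "wine/year: 1995",
]
records = list(iter_records(sample))
```

All values stay strings here. Passing the list to pd.DataFrame(records) and converting the numeric columns afterwards (for example with pd.to_numeric) would be one way to get back to typed columns.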
Test code:

print(read_records(data))

Results:

                                           wine/name  wine/wineId  \
0  1981 Château de Beaucastel Châteaune...        18856   
1         1995 Château Pichon-Longueville Baron         3495   

         wine/variant  wine/year  review/points  review/time  review/userId  \
0     Red Rhone Blend       1981             96   1160179200              1   
1  Red Bordeaux Blend       1995             93   1063929600              1   

  review/userName                                        review/text  
0            Eric  Olive, horse sweat, dirty saddle, and smoke. T...  
1            Eric  A remarkably floral nose with violet and chamb...  

Test data:

data = [x.strip() for x in """
    wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
    wine/wineId: 18856
    wine/variant: Red Rhone Blend
    wine/year: 1981
    review/points: 96
    review/time: 1160179200
    review/userId: 1
    review/userName: Eric
    review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.

    wine/name: 1995 Château Pichon-Longueville Baron
    wine/wineId: 3495
    wine/variant: Red Bordeaux Blend
    wine/year: 1995
    review/points: 93
    review/time: 1063929600
    review/userId: 1
    review/userName: Eric
    review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
""".split('\n')[1:-1]]

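The question mentions a file of roughly 20 million lines, and pandas's chunksize option applies to delimited readers such as read_csv rather than to this blank-line-separated layout. A minimal sketch of reading the records in batches with only the standard library might look like this. The names iter_records and iter_batches, the batch size, and the in-memory sample are assumptions for illustration.

```python
import io
from itertools import groupby, islice
from typing import Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group consecutive non-blank lines into one dict per record."""
    stripped = (line.strip() for line in lines)
    # groupby alternates between runs of non-blank and blank lines
    for is_data, group in groupby(stripped, key=bool):
        if is_data:
            pairs = (line.partition(":") for line in group)
            yield {key.strip(): value.strip() for key, _, value in pairs}


def iter_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# io.StringIO stands in for an open file handle here
handle = io.StringIO(
    "wine/wineId: 18856\nreview/points: 96\n\n"
    "wine/wineId: 3495\nreview/points: 93\n\n"
    "wine/wineId: 1\nreview/points: 90\n"
)
batch_sizes = [len(batch) for batch in iter_batches(iter_records(handle), batch_size=2)]
```

Each batch is a plain list of dicts, so it could be handed to pd.DataFrame(batch) and written out or aggregated before the next batch is read, which keeps memory use bounded by the batch size.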
Here is a fairly idiomatic way to do it in R:

library(readr)
library(tidyr)
library(dplyr)

out <- data_frame(text = read_lines(the_text)) %>%
  filter(text != "") %>% 
  separate(text, c("var", "value"), ":", extra = "merge") %>% 
  mutate(
    chunk_id = rep(1:(nrow(.) / 9), each = 9),
    value    = trimws(value)
  ) %>% 
  spread(var, value)
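
To illustrate what the chunk_id line in the R pipeline above does: rep(1:(n / 9), each = 9) repeats each record number nine times, so rows 1 to 9 get id 1, rows 10 to 18 get id 2, and so on, and spread() then uses that id to decide which values share an output row. A rough Python analogue, assuming (as the R code does) that every record has exactly the same number of non-blank lines, could look like this. The function name pivot_fixed_width and the tiny demo input are made up for illustration.

```python
from typing import Dict, List

# assumption taken from the R code: every review has exactly 9 fields
LINES_PER_RECORD = 9


def pivot_fixed_width(lines: List[str], width: int = LINES_PER_RECORD) -> List[Dict[str, str]]:
    """Turn fixed-size groups of 'key: value' lines into one dict per group."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % width:
        raise ValueError("number of non-blank lines is not a multiple of the record width")
    records: Dict[int, Dict[str, str]] = {}
    for index, line in enumerate(rows):
        # same numbering as rep(1:(n / width), each = width): 1, 1, ..., 2, 2, ...
        chunk_id = index // width + 1
        key, _, value = line.partition(":")
        records.setdefault(chunk_id, {})[key.strip()] = value.strip()
    return [records[chunk_id] for chunk_id in sorted(records)]


# two records of two fields each, to keep the demo short
demo = pivot_fixed_width(["a: 1", "b: 2", "", "a: 3", "b: 4"], width=2)
```

The fixed width is the fragile part of this approach: a record with a missing or an extra line shifts every later chunk_id, which is why the sketch refuses input whose length is not a multiple of the width.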

Here is the approach I would suggest:

y <- readLines("your_file")
y <- unlist(strsplit(gsub("(wine\\/|review\\/)", "~~~\\1", y), "~~~", TRUE))

library(data.table)
dcast(fread(paste0(y[y != ""], collapse = "\n"), header = FALSE)[
  , rn := cumsum(V1 == "wine/name")], rn ~ V1, value.var = "V2")
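
Note that the second data set is missing wine/variant: for the first wine.

It would probably be better to do the gsub in awk or something similar, and then run fread directly on the result.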
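
For comparison, the two ideas in the data.table answer (put every wine/ or review/ key on its own line, then start a new record at each wine/name) can be sketched in Python roughly as follows. This is an illustrative analogue and not a translation of the R code. The regular expression and the name parse_records are assumptions, and like the gsub it would also split a review text that happens to contain something shaped like "wine/word:".

```python
import re
from typing import Dict, Iterable, List

# matches a key such as "wine/variant:" or "review/points:"
KEY = re.compile(r"\b((?:wine|review)/\w+:)")


def parse_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse 'key: value' lines, tolerating several keys on one line."""
    records: List[Dict[str, str]] = []
    for line in lines:
        # put every key on its own line, like the gsub/strsplit step
        for field in KEY.sub(r"\n\1", line).split("\n"):
            field = field.strip()
            if not field:
                continue
            key, _, value = field.partition(":")
            # a wine/name key starts a new record, like cumsum(V1 == "wine/name")
            if key == "wine/name" or not records:
                records.append({})
            records[-1][key] = value.strip()
    return records


parsed = parse_records([
    "wine/name: 1995 Château Pichon-Longueville Baron",
    "wine/wineId: 3495 wine/variant: Red Bordeaux Blend",
    "wine/year: 1995",
    "",
    "wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape",
    "review/points: 96",
])
```

Because records are delimited by the wine/name key and not by a fixed line count, a record with a missing field simply ends up with fewer keys, which pd.DataFrame would fill with NaN.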

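Thanks for the attempt! I tried the approach listed above in R earlier, but I'm starting to wonder whether pandas is really the better tool for this task... Let me know what you think of my code.

Sadly, I have never really dug into r, so I am of no value there. Cheers.

No worries, I think I may need to look into pandas, since this R code is already fairly complicated... Does pandas have a way to read chunks of data in a format like this? I have seen this in other data sources and assumed there would be a simpler way to handle it.

You seem to have some lines with multiple variables (e.g. wine/year:1981 review/points:96). Is that right?

Thanks for pointing that out, it was actually malformatted. I have updated the data in the question.

Thanks. Can you comment on what the chunk id line does here? I'm not sure I understand what that line of code is doing.

@petergensler The chunk_id variable is just there to create a unique record number for each review. Each review is a set of 9 lines of text, but the code turns each line into a column.

Why use rep() / 9? That part is exactly what bugs me.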