How do I convert one data frame into another data frame in R?

r dataframe

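I have downloaded a .txt file from Kenneth R. French's data library, which can be found via the following link.

I need to use these so-called SIC codes to split my sample into different portfolios by industry. The downloaded file looks like this: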

      1 Food  
      0100-0199 Agric production - crops
      0200-0299 Agric production - livestock
      0700-0799 Agricultural services
      0900-0999 Fishing, hunting & trapping
      2000-2009 Food and kindred products
      2010-2019 Meat products
      2020-2029 Dairy products
      2030-2039 Canned-preserved fruits-vegs
      2040-2046 Flour and other grain mill products
      2047-2047 Dog and cat food
      2048-2048 Prepared feeds for animals
      2050-2059 Bakery products
      2060-2063 Sugar and confectionery products
      2064-2068 Candy and other confectionery
      2070-2079 Fats and oils
      2080-2080 Beverages
      2082-2082 Malt beverages
      2083-2083 Malt
      2084-2084 Wine
      2085-2085 Distilled and blended liquors
      2086-2086 Bottled-canned soft drinks
      2087-2087 Flavoring syrup
      2090-2092 Misc food preps
      2095-2095 Roasted coffee
      2096-2096 Potato chips
      2097-2097 Manufactured ice
      2098-2099 Misc food preparations
      5140-5149 Wholesale - groceries & related prods
      5150-5159 Wholesale - farm products
      5180-5182 Wholesale - beer, wine
      5191-5191 Wholesale - farm supplies

      2 Mines 
      1000-1009 Metal mining
      1010-1019 Iron ores
      1020-1029 Copper ores
      1030-1039 Lead and zinc ores
      1040-1049 Gold & silver ores
      1060-1069 Ferroalloy ores
      1080-1089 Mining services
      1090-1099 Misc metal ores
      1200-1299 Bituminous coal
      1400-1499 Mining and quarrying non-metalic minerals
      5050-5052 Wholesale - metals and minerals

      3 Oil
      1300-1300 Oil and gas extraction
      1310-1319 Crude petroleum & natural gas
      1320-1329 Natural gas liquids
      1380-1380 Oil and gas field services
      1381-1381 Drilling oil & gas wells
      1382-1382 Oil-gas field exploration
      1389-1389 Oil and gas field services
      2900-2912 Petroleum refining
      5170-5172 Wholesale - petroleum and petro prods

      4 Clths 
      2200-2269 Textile mill products
      2270-2279 Floor covering mills
      2280-2284 Yarn and thread mills
      2290-2295 Misc textile goods
      2296-2296 Tire cord and fabric
      2297-2297 Nonwoven fabrics
      2298-2298 Cordage and twine
      2299-2299 Misc textile products
      2300-2390 Apparel and other finished products
      2391-2392 Curtains, home furnishings
      2393-2395 Textile bags, canvas products
      2396-2396 Auto trim
      2397-2399 Misc textile products
      3020-3021 Rubber and plastics footwear
      3100-3111 Leather tanning and finishing
      3130-3131 Boot, shoe cut stock, findings
      3140-3149 Footware except rubber
      3150-3151 Leather gloves and mittens
      3963-3965 Fasteners, buttons, needles, pins
      5130-5139 Wholesale - apparel

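What I want to do is create a data frame in which the first column gives the industry name (e.g. Food, Mines, etc.) and the second column gives all SIC codes (Standard Industrial Classification codes) associated with that industry. Most SIC codes are given as ranges such as 5130-5139, which makes this a bit difficult to do.

Such a data frame would make my analysis much easier to implement.

Any suggestions would be greatly appreciated.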

This produces a two-column data frame, df.new, whose second column contains the codes separated by commas:

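library(zoo)  # provides na.locf()

# Fixed-width columns: industry number, short name, description or code range
df <- read.fwf("Siccodes48.txt", widths = c(3, 7, 60), stringsAsFactors = FALSE)
df <- df[!is.na(df$V3), ]   # drop blank separator lines
df$V1 <- na.locf(df$V1)     # carry each industry number down to its code rows
l <- split(df, df$V1)       # one data frame per industry
l <- setNames(lapply(l, function(x) {
  m <- regexec("([0-9]{4})-([0-9]{4}) .*", x$V3[-1])  # omit headline
  r <- regmatches(x$V3[-1], m)
  fromTo <- t(sapply(r, "[", 2:3))                    # columns: range start, range end
  # expand every range into its individual codes, zero-padded to four digits
  codes <- unlist(mapply(":", fromTo[, 1], fromTo[, 2], SIMPLIFY = FALSE))
  paste(sprintf("%04d", codes), collapse = ", ")
}), sapply(l, "[", 1, 3))   # name each element after its headline description
df.new <- data.frame(name = names(l), sic = unlist(l))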
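
The same idea can be tried without the file or the zoo package. The following is only a minimal base-R sketch: the `lines` vector is a hand-typed excerpt of the sample in the question, the names `is_range`, `group` and `sic_by_industry` are made up for illustration, and it assumes every header line looks like "1 Food" while every code line starts with a "dddd-dddd" range.

```r
# Hand-typed excerpt in the layout shown in the question (an assumption
# about the real file, which may use different column positions).
lines <- c(
  "      1 Food  ",
  "      2047-2047 Dog and cat food",
  "      2060-2063 Sugar and confectionery products",
  "",
  "      2 Mines ",
  "      1000-1009 Metal mining",
  "      5050-5052 Wholesale - metals and minerals"
)

lines <- trimws(lines)
lines <- lines[nzchar(lines)]                    # drop blank separator lines

is_range <- grepl("^[0-9]{4}-[0-9]{4}", lines)   # TRUE for code-range rows
group    <- cumsum(!is_range)                    # each header row starts a new group
industry <- sub("^[0-9]+[[:space:]]+", "", lines[!is_range])  # "1 Food" -> "Food"

from <- as.integer(substr(lines[is_range], 1, 4))
to   <- as.integer(substr(lines[is_range], 6, 9))

# Expand every range into individual codes, zero-padded to four digits.
codes <- Map(function(a, b) sprintf("%04d", a:b), from, to)

# Collect the codes of each industry and collapse them into one string.
per_group <- split(codes, group[is_range])
sic <- vapply(per_group, function(x) paste(unlist(x), collapse = ", "),
              character(1))

sic_by_industry <- data.frame(
  name = industry[as.integer(names(per_group))],  # match names to group numbers
  sic  = unname(sic),
  stringsAsFactors = FALSE
)
print(sic_by_industry)
```

Detecting header rows with a regular expression instead of fixed column widths makes the sketch insensitive to indentation, at the cost of assuming that no industry name begins with a four-digit range.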

How about this?
df <- readLines("Siccodes48.txt")
df <- data.frame(t = df[df != ""], stringsAsFactors = FALSE)  # delete blanks and make data frame
df$prefix <- substr(df$t, 1, 10)              # break out the prefix (first 10 characters)
df$index <- cumsum(df$prefix != "          ") # make an index that increments at each title row
ind <- df[df$prefix != "          ", ]        # make an index table (title rows only)
ind$desc <- substring(ind$t, 11, 100)         # parse descriptions
final <- merge(ind[, c("index", "desc")],     # merge the index table
               df[df$prefix == "          ", c("index", "t")],  # with all non-title rows of the list
               by = "index")                                    # by index

head(final,10)

   index          desc                                                       t
1      1   Agriculture                      0100-0199 Agric production - crops
2      1   Agriculture                  0200-0299 Agric production - livestock
3      1   Agriculture                         0700-0799 Agricultural services
4      1   Agriculture                            0910-0919 Commercial fishing
5      1   Agriculture                    2048-2048 Prepared feeds for animals
6      2 Food Products                     2000-2009 Food and kindred products
7      2 Food Products                                 2010-2019 Meat products
8      2 Food Products                                2020-2029 Dairy products
9      2 Food Products                  2030-2039 Canned-preserved fruits-vegs
10     2 Food Products           2040-2046 Flour and other grain mill products
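
A long table like this, with one row per range, can also be used directly to assign an industry to each SIC code in a sample. The following is a rough sketch only: the three-row `final` is a hand-built stand-in for the merged result above, and `industry_of` and `sample_codes` are hypothetical names introduced for illustration.

```r
# Hand-built miniature of the merged result: one row per SIC range.
final <- data.frame(
  desc = c("Agriculture", "Agriculture", "Food Products"),
  t    = c("0100-0199 Agric production - crops",
           "2048-2048 Prepared feeds for animals",
           "2010-2019 Meat products"),
  stringsAsFactors = FALSE
)

# Split the "dddd-dddd" prefix into numeric range bounds.
rng <- trimws(final$t)
final$from <- as.integer(substr(rng, 1, 4))
final$to   <- as.integer(substr(rng, 6, 9))

# For each code, return the industry of the first range that contains it,
# or NA when no range matches.
industry_of <- function(code, ranges) {
  vapply(code, function(k) {
    hit <- which(ranges$from <= k & k <= ranges$to)
    if (length(hit) > 0) ranges$desc[hit[1]] else NA_character_
  }, character(1))
}

sample_codes <- c(150L, 2015L, 9999L)
print(data.frame(sic = sample_codes,
                 industry = industry_of(sample_codes, final)))
```

Keeping the ranges as `from`/`to` bounds avoids expanding them into thousands of individual codes; codes that fall outside every range come back as `NA` and can be put in a catch-all portfolio.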

I would consider a real data preprocessing tool such as Google Refine (offline and free). R is not really suited to this kind of task; you can do it in R, but with more pain.

I think it would be better to use another program to handle this, because your data does not look like a data frame (there are things like "4 Clths" in the middle). It is not a very efficient approach, but you could do it manually. I can see that all the SIC codes have the form xxxx-xxxx followed by a space. So if you read the file with sep=" ", the first column should be your SIC codes, the second column should be your industry names (I am not sure whether all the names are a single string; in your example they are), and the rest would be what they sell?

I'm amazed. It is very powerful. Thank you.

Thank you all very much for your time! It provided valuable insight.