Extracting what is between commas (and more) in a vector in R

r regex

I have a data frame read from a .csv file, containing 4 variables:

str(statementGS)
$ X                : int ...
$ statement_type_cd: Factor ...
$ statement_text   : Factor ...
$ serial_no        : int ...

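I need to work with the statement_text vector (9629704 rows).
I have been trying to use regex to extract every product name between commas into a new vector, without success (working on a subset of the data frame).
I think the sequence of regex steps should be something like this:
1. Delete every . at the end of a cell
2. Change every [, ], (( and )) to commas ,
3. Delete everything between * and *, and between { and }
4. Delete every namely or -namely
5. Delete and after every comma
6. If a ( starts with Based on, delete the () and everything inside the () itself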

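Now, looking at the vector: if a cell contains commas, copy what is between them into a new vector, but skip it if there is only whitespace between the commas (I don't know how to handle the first and last elements). If a cell contains no commas, copy the whole cell into the new vector.

(If an element is already in the new vector, it would be better not to copy it again, i.e. not to copy t-shirt 1000 times, but it may be easier to build the new vector first and then delete the cells that have the same characters as an earlier one.)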
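
The splitting, blank-skipping and de-duplication described above can be sketched in base R roughly as follows. The sample vector `cells` is made up for illustration; the real input would be the cleaned statement_text column. Empty pieces produced by a leading or trailing comma are dropped by the same filter that drops whitespace-only pieces, so the first and last elements need no special handling.

```r
# Hypothetical sample cells standing in for the cleaned statement_text column
cells <- c("CORDS, LINES, , TWINES", "pistols", " t-shirt ,t-shirt,", "CORDS")

# Split every cell at commas; a cell without commas stays as a single piece
pieces <- unlist(strsplit(cells, ",", fixed = TRUE))

# Trim surrounding whitespace, then drop pieces that are empty after trimming
pieces <- trimws(pieces)
pieces <- pieces[nzchar(pieces)]

# Keep each product once, in order of first appearance
products <- unique(pieces)
print(products)
```

No explicit if/else loop is needed here, because `strsplit`, `trimws` and `unique` all work on whole vectors.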

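I have read the documentation and, if I am not mistaken, the first 5 steps can be done with the gsub function, and then an if/else loop is needed to build the new vector.
Expected result: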
         Products
1        pistols
2        CORDS 
3        LINES
4        TWINES
5        ROPES
6        POCKET AND TABLE CUTLERY
7        Nail brushes
8        Lip brushes 
9        Make-up brushes
10       ICE CREAM FREEZERS
...
20000000 ADVANCED COMBAT SURVEILLANCE DROW (LOW ENDURANCE)
20000001 Health spa 
20000002 cosmetic body care services 
20000003 beauty salon
20000004 Contract workflows 
20000005 data analytics 
20000006 The SAAS feature technology for contracts

PS: I am new to R (and to programming), but I noticed that typeof returns integer for this vector. Isn't that odd?
typeof(statementGS$statement_text)
[1] "integer"
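
This is expected when the column is a factor, as the str output shows: R stores a factor as integer codes plus a table of labels (the levels), and `typeof` reports the underlying storage. A small sketch with a made-up factor `f`:

```r
# A made-up factor standing in for statement_text
f <- factor(c("pistols", "CORDS", "pistols"))

typeof(f)       # storage type of the codes: "integer"
class(f)        # "factor"
levels(f)       # the distinct labels: "CORDS" "pistols"
as.integer(f)   # the codes behind the labels: 2 1 2

# Convert to text before splitting; strsplit() requires a character vector
s <- as.character(f)
typeof(s)       # "character"
```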

Thanks for your help :)
I solved this a while ago but forgot to post the answer.
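# gsub() returns a new vector, so every result is assigned back.
# as.character() also covers a column that was read in as a factor.
statement_text <- as.character(statementGS$statement_text)
# Drop a period at the end of the cell (the lookahead needs perl = TRUE)
statement_text <- gsub("\\.(?=\\s*$)", "", statement_text, perl = TRUE)
# Turn the other separators into commas; fixed = TRUE treats them literally
statement_text <- gsub(";", ",", statement_text, fixed = TRUE)
statement_text <- gsub("((", ",", statement_text, fixed = TRUE)
statement_text <- gsub("))", ",", statement_text, fixed = TRUE)
statement_text <- gsub("[", ",", statement_text, fixed = TRUE)
statement_text <- gsub("]", ",", statement_text, fixed = TRUE)
# One pattern covers "namely", "-namely", "namely:" and "namely,"
statement_text <- gsub("-?namely[:,]?", "", statement_text, ignore.case = TRUE)
# Drop "and" right after a comma but keep the comma itself
# (";and" has already become ",and" at this point)
statement_text <- gsub(",\\s*and\\b", ",", statement_text, ignore.case = TRUE)
# Remove "(Based on ...)" up to the first closing parenthesis
statement_text <- gsub("\\(Based on[^)]*\\)", "", statement_text, ignore.case = TRUE)
# Remove everything between * and *, and between { and }.
# The posted patterns read ".*2"; a lazy ".*?" is assumed to be what was meant.
statement_text <- gsub("\\*.*?\\*", "", statement_text, perl = TRUE)
statement_text <- gsub("\\{.*?\\}", "", statement_text, perl = TRUE)
statement_text <- gsub("^ ", "", statement_text)
# Replace commas with new lines. Doing this on a data frame with X rows
# does not add new rows (a lot of info would be lost), so I did it with the
# Notepad++ find and replace function.
# If you know how to do this in R, please say so in the comments.
statement_text <- gsub(",", "\n", statement_text, fixed = TRUE)
statement_text <- gsub("\"", "", statement_text, fixed = TRUE)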
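
The code comments above ask how to do the comma-to-rows step in R itself. One possible approach, sketched here on a made-up two-row data frame `df`, is to split each cell with `strsplit` and build a new, longer data frame. The column names mirror the question; the values are invented.

```r
# Made-up stand-in for the cleaned data frame
df <- data.frame(
  serial_no      = c(101L, 102L),
  statement_text = c("CORDS, LINES, TWINES", "Nail brushes, Lip brushes"),
  stringsAsFactors = FALSE
)

# One character vector of pieces per original row
parts <- strsplit(df$statement_text, ",", fixed = TRUE)

# Repeat each serial_no once per piece so the rows stay aligned
long <- data.frame(
  serial_no = rep(df$serial_no, lengths(parts)),
  Products  = trimws(unlist(parts)),
  stringsAsFactors = FALSE
)
print(long)
```

Because the result is a new data frame rather than a modified column, the row count can grow freely, which is what the in-place `gsub` replacement could not do.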

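For a question like this, it helps if you provide a reproducible example. For instance, post a code snippet that someone can copy/paste into their R session to get the example data frame or vector you have.

1) Add stringsAsFactors=FALSE when reading the csv, so that the columns are treated as characters. 2) The stringr package has more powerful string functions. 3) For regex, look for up-to-date information; I suggest finding an online site that lets you test your regex.

@epi99 You are absolutely right, I did not know about the arguments of the read.table function; it works very well. I will use the site you suggested to solve my problem, thanks. Dan: are the examples of statement_text I wrote not good? I wanted to keep it simple.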
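
The effect of the stringsAsFactors suggestion can be tried without a file on disk by passing inline text to `read.csv`. The CSV content below is made up; only the `stringsAsFactors` argument matters. In R 4.0.0 and later, FALSE is already the default.

```r
# Made-up CSV content; with a real file, pass its path as the first argument
csv_text <- "serial_no,statement_text\n101,\"CORDS, LINES\"\n102,pistols"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_text   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$statement_text)  # "factor"
class(as_text$statement_text)    # "character"
```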