Formatting the digits of a result in R

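I created a document-term matrix to search for numbers from 100000 to 600000 as part of a data-mining problem, but I noticed that it does not return only the numbers I want. Instead, it joins any digits separated by a space or a decimal point into a 6-digit number and returns that as a single number.

Here is my code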

library(text2vec)

docs = c(doc1 = " letter ltetter (-è)  323.456 1  789 ",
         dc2 = "letters 123.45 1letters 100000  98 76 54  ",
         dc3 = "123456789  454321 letters 124 258 ")
# delete everything but digits and spaces
docs = gsub("[^0-9 ]", "", docs, perl = T)
# creating the dtm
itoken = itoken(docs, tokenizer = word_tokenizer, ids = names(docs))
vector = create_vocabulary(itoken)
vectorizer = vocab_vectorizer(vector)
dtm = create_dtm(itoken, vectorizer)

(dtm[, colnames(dtm) %in% 100000:600000])
3 x 4 sparse Matrix of class "dgCMatrix"
     100000  454321 323456
doc1      .     .      1
dc2       1     .      .
dc3       .     1      .
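The extracted 100000 is correct: it lies within the required range (100000 to 600000).

454321 is correct: it lies within the required range (100000 to 600000).

323456 is wrong: the number in the document is 323.456, which is not in the range, yet it was extracted.

How can I adjust this so that it returns only numbers from 100000 to 600000?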

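You can search for 6-digit numbers delimited by word boundaries (\\b) that start with a digit from 1 to 6 ([1-6]) and are followed by any 5 digits ([0-9]{5}). Apply this to the original docs, before the gsub step removes the decimal points.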

library(stringr)
docs_list <- lapply(docs, 
                   function(x){str_extract_all(x,"\\b[1-6][0-9]{5}\\b", simplify = TRUE)})

docs_list[sapply(docs_list, function(x) length(x)==0L)] <- NA

unlist(docs_list)
doc1      dc2      dc3 
  NA "100000" "454321" 
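One caveat with the pattern above: [1-6][0-9]{5} also accepts 600001 to 699999, and \\b treats a decimal point as a boundary, so the fractional digits of something like 0.123456 would match. A minimal base-R sketch of a stricter variant follows. It is an illustration rather than the answerer's method: the helper name extract_in_range and the lookaround pattern are assumptions introduced here, and the range check is done numerically instead of in the regex.

```r
# Same sample documents as in the question.
docs <- c(doc1 = " letter ltetter (-è)  323.456 1  789 ",
          dc2  = "letters 123.45 1letters 100000  98 76 54  ",
          dc3  = "123456789  454321 letters 124 258 ")

# Hypothetical helper (not from the original answers): return the standalone
# integers of each document that fall inside [lower, upper].
extract_in_range <- function(x, lower = 100000, upper = 600000) {
  # The lookbehind rejects digits preceded by a digit, "." or ",";
  # the lookahead rejects digits followed by a digit or by a decimal part.
  pattern <- "(?<![0-9.,])[0-9]+(?![0-9]|[.,][0-9])"
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  out <- lapply(hits, function(h) {
    n <- as.numeric(h)
    n[n >= lower & n <= upper]   # inclusive numeric range check
  })
  names(out) <- names(x)
  out
}

res <- extract_in_range(docs)
res
```

With the sample documents, doc1 yields no values, dc2 yields 100000 and dc3 yields 454321. Checking the range numerically keeps the bounds easy to change without rewriting the regular expression.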
If I understand your question correctly, you want to extract all the numbers from the documents, including the decimal point.

So you want to do something like

docs <- sapply(docs, function(doc) {
  nums <- regmatches(doc, gregexpr("[0-9]+\\.*[0-9]*", doc))
  paste(unlist(nums), collapse = " ")
})
docs
#                       doc1                        dc2 
#            "323.456 1 789" "123.45 1 100000 98 76 54" 
#                        dc3 
# "123456789 454321 124 258"
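Once the decimal points are preserved, the remaining step is to compare the values as numbers rather than as strings. The following is a rough base-R sketch of that step, assuming the cleaned docs shown in the output above; the variable names nums, in_range and found are illustrative and not part of the original answer.

```r
# Cleaned documents, as produced by the regmatches() step above.
docs <- c(doc1 = "323.456 1 789",
          dc2  = "123.45 1 100000 98 76 54",
          dc3  = "123456789 454321 124 258")

# Split each document on whitespace and convert the pieces to numbers.
nums <- lapply(strsplit(docs, " +"), as.numeric)

# Keep only values inside the inclusive range from the question.
in_range <- lapply(nums, function(n) n[n >= 100000 & n <= 600000])

# One row per (document, number) pair that was found.
found <- data.frame(
  Document = rep(names(in_range), lengths(in_range)),
  Number   = unlist(in_range, use.names = FALSE)
)
found
```

Because 323.456 and 123.45 stay intact as decimals, they fall below the lower bound and are dropped, leaving 100000 for dc2 and 454321 for dc3.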

You have to account for the decimal point in the gsub function.

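library(text2vec)

docs = c(doc1 = " letter ltetter (-è)  323.456 1  789 ",
         dc2 = "letters 123.45 1letters 100000  98 76 54  ",
         dc3 = "123456789  454321 letters 124 258 ")

# If you have decimal commas, convert them to points first
# (gsub replaces every comma, not only the first one in each document)
docs = gsub(",", ".", docs, fixed = TRUE)
# Here is what I've changed: keep digits, decimal points and spaces
docs = gsub("[^0-9. ]", "", docs, perl = TRUE)

# creating the dtm
itoken = itoken(docs, tokenizer = word_tokenizer, ids = names(docs))
vector = create_vocabulary(itoken)
vectorizer = vocab_vectorizer(vector)
dtm = create_dtm(itoken, vectorizer)
# the column names are the tokens; convert them to numbers for the range test
dtm_1 <- as.numeric(colnames(dtm))
# keep the columns inside the range from the question (100000 to 600000);
# which() skips tokens that do not parse as numbers, drop = FALSE keeps a matrix
table <- as.matrix(dtm[, which(dtm_1 >= 100000 & dtm_1 <= 600000), drop = FALSE])

library(reshape)
# long format: one row per (document, number) pair, keeping only non-zero counts
df_melted <- melt(table)
df_melted <- df_melted[which(df_melted$value != 0),]
colnames(df_melted) <- c("Document","Number Found","times")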
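To see why the character class matters, the small base-R sketch below compares a few cleaning patterns on a made-up sample string (the string x is an assumption chosen for illustration, not data from the question). Removing "." merges 323.456 into 323456, and extra carets inside the brackets are treated as literal characters to keep.

```r
# Made-up sample: a decimal point, a decimal comma and a stray caret.
x <- "323.456 1,5 7^2"

# Pattern from the question: the "." is deleted, so the digits merge.
only_digits <- gsub("[^0-9 ]", "", x)      # "323456 15 72"

# Carets after the first position are literal, so "^" survives the cleaning.
with_carets <- gsub("[^0-9^.^ ]", "", x)   # "323.456 15 7^2"

# Keep digits, decimal points and spaces only.
keep_points <- gsub("[^0-9. ]", "", x)     # "323.456 15 72"

# Convert decimal commas to points before cleaning so that 1,5 becomes 1.5.
commas_first <- gsub("[^0-9. ]", "", gsub(",", ".", x, fixed = TRUE))
# "323.456 1.5 72"

# sub() changes only the first match per string; gsub() changes all of them.
first_only <- sub(",", ".", "1,5 2,5", fixed = TRUE)   # "1.5 2,5"
```

The order of the two steps matters: if the commas are not converted first, the cleaning step deletes them and 1,5 collapses into 15.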

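Your question is unclear. Your code seems to return only numbers between 100000 and 600000. Also, you should replace create_dtm(it, vectorizer) with create_dtm(itoken, vectorizer).

I want to extract the numbers from 100000 to 600000, but the code returns every combination of 6 digits even when there is a comma or point in between. It returns 123.456 as a 6-digit number, but it is not one.

No sir, I want the opposite. I only want numbers between 100000 and 600000. The extracted decimal numbers are wrong (they are not in the required range). I want to fix the problem of the decimals being extracted.

Isn't that what I did? If you process docs as I showed and then run the rest of your code, you won't get any decimal numbers.

What if there are decimal commas? How do I add that to the code?

First, to handle decimal commas, I added a step before the gsub step. Also, we only compare integer values (decimal numbers are not included in the comparison), so I modified the code slightly. It should work correctly now.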
Output of the modified code:

  Document Number Found times
2      dc2       100000     1
6      dc3       454321     1