R: Splitting identifiers and method names when creating a source code corpus


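I am trying to create a corpus from Java source code.
I am following the preprocessing steps described in this paper.

According to Section [2.1], the following preprocessing should be applied:
- remove characters related to the syntax of the programming language [already done with removePunctuation]
- remove programming language keywords [already done with tm_map(dsc, removeWords, javaKeywords)]
- remove common English stopwords [already done with tm_map(dsc, removeWords, stopwords("english"))]
- apply word stemming [already done with tm_map(dsc, stemDocument)]

The remaining step is to split identifiers and method names into multiple parts based on common naming conventions.

For example, "firstName" should be split into "first" and "name".

Another example: "calculateAge" should be split into "calculate" and "age".
Can anyone help me with this?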

    library(tm)
    # pattern is a regular expression: escape the dot and anchor it to the end of the file name
    dd <- DirSource(pattern = "\\.java$", recursive = TRUE)
    javaKeywords <- c("abstract","continue","for","new","switch","assert","the","default","package",
                      "synchronized","boolean","do","if","private","this","break","double","implements",
                      "protected","throw","byte","else","the","null","NULL","TRUE","FALSE","true","false",
                      "import","public","throws","case","enum","instanceof","return","transient","catch",
                      "extends","int","short","try","char","final","interface","static","void","class",
                      "finally","long","volatile","const","float","native","super","while")
    dsc <- Corpus(dd)
    dsc <- tm_map(dsc, stripWhitespace)
    dsc <- tm_map(dsc, removePunctuation)
    dsc <- tm_map(dsc, removeNumbers)
    dsc <- tm_map(dsc, removeWords, stopwords("english"))
    dsc <- tm_map(dsc, removeWords, javaKeywords)
    dsc <- tm_map(dsc, stemDocument)
    # Term-frequency weighting; stopwords were already removed above
    dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE))
You can create a custom function that splits words at capital letters (vectorized here):

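    # gsub() puts a space at every lower-to-upper boundary, so "getFirstName"
    # yields "get", "first", "name" rather than splitting only at the last capital
    splitCapital <- function(x)
        unlist(strsplit(tolower(gsub('([a-z0-9])([A-Z])', '\\1 \\2', x)), ' '))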
You can then iterate over your corpus:

    corpus.split <- lapply(dsc, splitCapital)
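
One possible way to wire such a splitter into the tm pipeline from the question is to wrap it in content_transformer() and run it before lower-casing and word removal, so that the split parts still go through keyword and stopword filtering. The following is only a sketch under some assumptions: tm 0.6 or later is installed, splitCamel and punctToSpace are hypothetical helper names, and the two sample documents are made up. It also replaces punctuation with spaces rather than deleting it, because deleting it can glue neighbouring identifiers together.

```r
# Hypothetical helper: put a space at every lower-case/digit to upper-case boundary.
splitCamel <- function(x) gsub("([a-z0-9])([A-Z])", "\\1 \\2", x)

# Hypothetical helper: turn punctuation into spaces instead of deleting it, so that
# adjacent identifiers such as calculateAge(birthDate) are not glued together.
punctToSpace <- function(x) gsub("[[:punct:]]+", " ", x)

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # Made-up sample documents standing in for the Java sources.
  docs <- VCorpus(VectorSource(c(
    "String firstName = getFirstName();",
    "int age = calculateAge(birthDate);"
  )))
  docs <- tm_map(docs, content_transformer(punctToSpace))
  # Split before lower-casing, while the case boundaries are still visible.
  docs <- tm_map(docs, content_transformer(splitCamel))
  docs <- tm_map(docs, content_transformer(tolower))
  docs <- tm_map(docs, stripWhitespace)
  dtm <- DocumentTermMatrix(docs)
  inspect(dtm)
}
```

The removeWords and stemDocument steps from the question would then follow the lower-casing step as before.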

I have written a tool in Perl that performs various kinds of source code preprocessing, including identifier splitting:

The relevant code there is:

    =head2 tokenize
     Title    : tokenize
     Usage    : tokenize($wordsIn)
     Function : Splits words based on camelCase, under_scores, and dot.notation.
              : Leaves other words alone.
     Returns  : $wordsOut => string, the tokenized words
     Args     : named arguments:
              : $wordsIn => string, the white-space delimited words to process
    =cut
    sub tokenize{
        my $wordsIn  = shift;
        my $wordsOut = "";

        for my $w (split /\s+/, $wordsIn) {
            # Split up camel case: aaA ==> aa A
            $w =~ s/([a-z]+)([A-Z])/$1 $2/g;

            # Split up camel case: AAa ==> A Aa
            # Split up camel case: AAAAa ==> AAA Aa
            $w =~ s/([A-Z]{1,100})([A-Z])([a-z]+)/$1 $2$3/g;

            # Split up underscores 
            $w =~ s/_/ /g;

            # Split up dots
            $w =~ s/([a-zA-Z0-9])\.+([a-zA-Z0-9])/$1 $2/g;

            $wordsOut = "$wordsOut $w";
        }

        return removeDuplicateSpaces($wordsOut);
    }

The hacks above are based on my own experience preprocessing source code for textual analysis. Feel free to steal and modify.
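
For readers who would rather stay in R, the same substitution rules can be approximated with gsub(). This is a rough port for illustration only and is not part of the Perl tool: the name tokenizeIdentifiers is hypothetical, and the final whitespace clean-up merely stands in for removeDuplicateSpaces, whose exact behavior is not shown above.

```r
# Hypothetical R approximation of the Perl tokenize() rules above.
tokenizeIdentifiers <- function(x) {
  # camelCase: aaA ==> aa A
  x <- gsub("([a-z]+)([A-Z])", "\\1 \\2", x, perl = TRUE)
  # Acronym followed by a word: AAAAa ==> AAA Aa
  x <- gsub("([A-Z]+)([A-Z])([a-z]+)", "\\1 \\2\\3", x, perl = TRUE)
  # under_scores
  x <- gsub("_", " ", x, fixed = TRUE)
  # dot.notation (only dots that sit between alphanumeric characters)
  x <- gsub("([a-zA-Z0-9])\\.+([a-zA-Z0-9])", "\\1 \\2", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

print(tokenizeIdentifiers(c("calculateAge", "XMLHttpRequest", "max_value", "java.util.List")))
```

Because every rule is a vectorized gsub() call, the function can be applied to a whole character vector at once, or wrapped in content_transformer() for use with tm_map().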

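I realize this is an old question and the OP has either solved their problem or moved on, but in case anyone else comes across it while looking for an identifier splitting package, I would like to offer Spiral. It is written in Python, but it comes with a command-line utility that can read a file of identifiers (one per line) and split each of them.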

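Splitting identifiers is deceptively hard. It is actually a research-grade problem for which no perfect solution currently exists. Even when the input consists of identifiers that follow some convention, such as camel case, ambiguities can arise. And of course, things get much harder when the source code does not follow a consistent convention.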

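Spiral implements a number of identifier splitting algorithms, including a novel one called Ronin. It uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin can split identifiers that do not use camel case or other naming conventions, including cases such as splitting J2SEProjectTypeProfiler into [J2SE, Project, Type, Profiler], which requires the reader to recognize J2SE as a unit. Here are some more examples of what Ronin can split: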

    # spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
    mStartCData: ['m', 'Start', 'C', 'Data']
    nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
    getUtf8Octets: ['get', 'Utf8', 'Octets']
    GPSmodule: ['GPS', 'module']
    savefileas: ['save', 'file', 'as']
    nbrOfbugs: ['nbr', 'Of', 'bugs']

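If you want a simple strict camel-case splitter or one of the other simpler splitters, Spiral provides several of those as well. See the GitHub page for more information.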
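
To see why naming conventions alone are not enough, a purely case-based splitter can be run on the same identifiers. The sketch below uses only base R regular expressions; it is not Spiral's or Ronin's algorithm, and strictCamelSplit is a hypothetical name chosen for illustration.

```r
# Hypothetical case-only splitter, for comparison with the Ronin output above.
strictCamelSplit <- function(ids) {
  # Space at lower-case/digit to upper-case boundaries: getUtf8Octets -> get Utf8 Octets
  s <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", ids, perl = TRUE)
  # Space between a run of capitals and a following capitalized word: CData -> C Data
  s <- gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", s, perl = TRUE)
  strsplit(s, " ", fixed = TRUE)
}

ids <- c("mStartCData", "nonnegativedecimaltype", "getUtf8Octets",
         "GPSmodule", "savefileas", "nbrOfbugs")
print(setNames(strictCamelSplit(ids), ids))
```

With these two rules, mStartCData and getUtf8Octets come out the same as in the Ronin output, but nonnegativedecimaltype and savefileas are returned unsplit, nbrOfbugs becomes "nbr" and "Ofbugs", and GPSmodule is cut as "GP" and "Smodule". Cases like these are where dictionaries and token frequency tables are needed.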

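Simply putting the function call in the control, as in “dtm …”, should do it.

@Fawaz Out of curiosity, why are you text mining Java code? I mean, what is your objective, and how does Java differ from, say, C++ from a text-mining perspective?

Uh... I am doing some research. My main question is: "Can we explain source code evolution in terms of text evolution?" Source code can be seen as natural language or regular text. I hope I have satisfied your curiosity :) @agstudy

@Fawaz Thanks. Yes :) (I don't know what you mean by "evolution", but good luck.)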