Identifying the language of characters in a string in VB.NET
A function in my web service receives data in different languages: Russian, Romanian, English and Arabic.

I want to write a function that identifies which language the characters in a received string belong to. I have already found one for Arabic:
Public Function IsGenericArabic(ByVal Msg As String) As Boolean
    Dim ch As Char
    IsGenericArabic = False
    For Each ch In Msg
        Dim ch1 As Integer = CInt(AscW(ch))
        If ch1 >= &H621 AndAlso ch1 <= &H64A Then
            IsGenericArabic = True
            Exit For
        End If
    Next
End Function
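
The same range check can be extended to other scripts. The following is only a minimal sketch: the module name ScriptDetector, the function name DetectScript and the returned labels are made up for illustration, and the Cyrillic range used (U+0400 to U+04FF) is the basic Unicode Cyrillic block. Note that it identifies a script, not a language: English and Romanian both come back as "Latin".

```vbnet
Imports System

Public Module ScriptDetector
    ' Hypothetical helper: returns the script of the first letter that falls
    ' into one of the checked ranges, or "" if no such letter is found.
    Public Function DetectScript(ByVal Msg As String) As String
        For Each ch As Char In Msg
            Dim code As Integer = AscW(ch)
            If code >= &H621 AndAlso code <= &H64A Then
                Return "Arabic"      ' same range as IsGenericArabic
            ElseIf code >= &H400 AndAlso code <= &H4FF Then
                Return "Cyrillic"    ' basic Cyrillic block, covers Russian
            ElseIf (code >= &H41 AndAlso code <= &H5A) OrElse
                   (code >= &H61 AndAlso code <= &H7A) Then
                Return "Latin"       ' ASCII letters only; English and Romanian both land here
            End If
        Next
        Return ""
    End Function
End Module
```

Digits, punctuation and letters outside the three ranges are skipped, so a string without any matching letter yields an empty result.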
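
Comments:

There is a JavaScript library called franc that can help you.

Possible duplicate.

@Ammaroff, it is a web service, so I only work on the server side.

Answer:

For many languages, comparing single characters does not help. Think of all the languages that use the Latin alphabet. For those languages you have to detect words of the language. The problem is finding a list of the words most likely to occur in an input text. Full-text search algorithms usually exclude words that occur too frequently, because they appear in most sentences and are therefore not selective enough. These are words like "and", "the", "a" and "of". Lists of such words are called stop word lists. But this is exactly what we need here. Find stop word lists for all the languages you want to detect (Google helps).

The algorithm would then look like this (in pseudocode, i.e. some details are missing):
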
Imports System.Collections.Generic
Imports System.IO

'Holds the stop words of one language
Class LanguageInfo
    Public Property LanguageCode As String
    Public Property Words As HashSet(Of String)
End Class

Dim infoList = New List(Of LanguageInfo)()

'Prepare the language information
For Each language In { "rus", "rom", ... }
    'Assuming one stop word per line in a file named after the language code
    Dim stopWords() As String = File.ReadAllLines(language & ".txt")
    Dim info = New LanguageInfo()
    info.LanguageCode = language
    'Case-insensitive set, so "The" and "the" count as the same word
    info.Words = New HashSet(Of String)(stopWords, StringComparer.OrdinalIgnoreCase)
    infoList.Add(info)
Next

'Detect language of input
Dim bestLanguageGuess As String = ""
Dim maxWeight As Integer = 0
Dim inputWords() As String = SplitIntoSingleWords(input)
For Each info In infoList
    'Count how many input words are stop words of this language
    Dim weight As Integer = 0
    For Each w In inputWords
        If info.Words.Contains(w) Then
            weight += 1
        End If
    Next
    If weight > maxWeight Then
        bestLanguageGuess = info.LanguageCode
        maxWeight = weight
    End If
Next
If maxWeight > 0 Then
    'bestLanguageGuess is the language we are looking for
End If
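
The pseudocode above calls SplitIntoSingleWords without defining it. One possible way to fill that gap is sketched below. This is an assumption, not part of the original answer: the module name TextSplitter is invented, a "word" is taken to be any run of letters as reported by Char.IsLetter, and the words are lower-cased on the assumption that the stop word files are lower-case too.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Public Module TextSplitter
    ' Hypothetical helper: splits the input on every character that is not a
    ' letter and lower-cases the pieces so they can be compared with stop words.
    Public Function SplitIntoSingleWords(ByVal input As String) As String()
        Dim words As New List(Of String)()
        Dim current As New StringBuilder()
        For Each ch As Char In input
            If Char.IsLetter(ch) Then
                current.Append(Char.ToLowerInvariant(ch))
            ElseIf current.Length > 0 Then
                ' A non-letter ends the current word
                words.Add(current.ToString())
                current.Clear()
            End If
        Next
        ' Flush the last word if the input ends with a letter
        If current.Length > 0 Then words.Add(current.ToString())
        Return words.ToArray()
    End Function
End Module
```

Because every non-letter is treated as a separator, contractions such as "don't" are split into two pieces. If the stop word lists contain such forms, the splitting rule would need to keep the apostrophe.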