Parsing a DBpedia resource in Java

java sparql

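I have a DBpedia URI that I obtained from another tool.

I need to request this URI from Java so that it returns some JSON/XML, and then pick the information I need out of the response.

For example, for the URI mentioned above I need the values of dct:subject.

[Screenshot: the response as shown in the browser]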

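I'm not sure exactly which values you are looking for, but you should be able to get what you need by pulling it out of the page source. The four Java methods provided below should do it for you (one of them is a support method).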

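Getting the web page HTML source:

First, we get the web page's HTML source with the getWebPageSource() method. It fetches the entire HTML source of the page located at the supplied link string and returns it in a List<String>, one element per source line. Example usage: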

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
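
The line-by-line reading at the heart of getWebPageSource() can be tried without a network connection. The following is only a minimal sketch: the class name LineReaderSketch and the method readLines() are hypothetical names that do not appear in the answer, and a StringReader stands in for the HTTP response stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class LineReaderSketch {

    /** Reads everything from the reader and returns one list element per line. */
    public static List<String> readLines(Reader reader) {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical two-line "page" standing in for a real HTTP response.
        String fakePage = "<html>\n<body>Part-of-speech tagging</body>";
        for (String line : readLines(new StringReader(fakePage))) {
            System.out.println(line);
        }
    }
}
```

Keeping the source as a list of lines matters for the next step, because the link extraction works on one source line at a time.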

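Getting related links using a reference string:

Now that you have the web page source, you can find and retrieve the data you need. The next method is getRelatedLinks(). It retrieves every link found between two supplied string tags on any source line that contains a supplied reference string. In your case the reference string is "rel=\"dct:subject\"", the start tag is "href=\"", and the end tag is "\">". So every source line containing "rel=\"dct:subject\"" is examined, and if the start tag and the end tag are both found on that same line, the text between them is retrieved. Example usage: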
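
To see what the reference string and the two tags do on a single source line, here is a small sketch. The sample line is hypothetical (it only imitates the shape of a link on a DBpedia page), and the class TagExtractionSketch and its between() method are illustrative names rather than the answer's own code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagExtractionSketch {

    /** Returns every substring of line that sits between startTag and endTag. */
    public static List<String> between(String line, String startTag, String endTag) {
        List<String> found = new ArrayList<>();
        // Pattern.quote() treats the tags as literal text, so '"' and '>' need no escaping.
        Pattern pattern = Pattern.compile(
                Pattern.quote(startTag) + "(.*?)" + Pattern.quote(endTag));
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical source line; the real page markup may differ.
        String line = "<a class=\"uri\" rel=\"dct:subject\" "
                + "href=\"http://dbpedia.org/resource/Category:Markov_models\">"
                + "dbc:Markov_models</a>";
        // Only lines carrying the reference string are inspected.
        if (line.contains("rel=\"dct:subject\"")) {
            System.out.println(between(line, "href=\"", "\">"));
        }
    }
}
```

Because the match is purely textual, it depends on the attribute order and quoting used in the page source; a line without the end tag simply yields no links.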

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");

Printing each element of relatedLinksTo to the console shows:

http://dbpedia.org/resource/Category:Corpus_linguistics
http://dbpedia.org/resource/Category:Markov_models
http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing
http://dbpedia.org/resource/Category:Word-sense_disambiguation

If you only need the title that each link refers to, rather than the whole link string, you can do this:

// Display Related Links Titles...
for (int i = 0; i < relatedLinksTo.length; i++) {
    String rLink = relatedLinksTo[i].substring(relatedLinksTo[i].lastIndexOf(":") + 1);
    System.out.println(rLink);
}
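
The same substring step can be wrapped in a small helper. This is a sketch only: the class CategoryTitleSketch and the method titleOf() are hypothetical, and replacing underscores with spaces is an extra step that the answer does not perform.

```java
public class CategoryTitleSketch {

    /** Extracts the part after the last ':' and turns underscores into spaces. */
    public static String titleOf(String categoryUri) {
        // lastIndexOf returns -1 when there is no ':', so substring(0) keeps the whole string.
        String name = categoryUri.substring(categoryUri.lastIndexOf(':') + 1);
        // Assumption: a human-readable title is wanted, so '_' becomes ' '.
        return name.replace('_', ' ');
    }

    public static void main(String[] args) {
        String uri = "http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing";
        System.out.println(titleOf(uri)); // prints "Tasks of natural language processing"
    }
}
```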

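The getRelatedLinks() method relies on the support method named getBetween(), provided below.

Getting specific links from the related links list:

You may not want the entire list of related links, only the one or more links that point to a specific title, such as Tasks_of_natural_language_processing. To get those links, use the getFromRelatedLinksThatContain() method. Here is how you would do that: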

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");
String[] desiredLinks = getFromRelatedLinksThatContain(relatedLinksTo, "Tasks_of_natural_language_processing");

// Display the desired links...
for (int i = 0; i < desiredLinks.length; i++) {
    System.out.println(desiredLinks[i]);
}

In the console window you will see the following link string:

http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing

The methods:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

/**
 * Returns a List (ArrayList) containing the page source for the supplied
 * web page link.
 *
 * @param webLink (String) The URL address of the web page to process.
 *
 * @return (List of String) The page source, one element per source line,
 *         or null if the link is empty or the page could not be read.
 */
public List<String> getWebPageSource(String webLink) {
    if (webLink == null || webLink.isEmpty()) {
        return null;
    }
    try {
        URL url = new URL(webLink);
        URLConnection yc = url.openConnection();
        // If the URL is an SSL endpoint (https), send a browser-like
        // User-Agent with the request for page data...
        if (webLink.startsWith("https:")) {
            yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
            yc.connect();
        }

        InputStream inputStream = yc.getInputStream();
        InputStreamReader streamReader;
        // The Content-Encoding header may be absent, in which case this is null.
        String encoding = yc.getContentEncoding();
        if (encoding == null) {
            streamReader = new InputStreamReader(inputStream, "UTF-8");
        }
        else {
            switch (encoding.toLowerCase()) {
                case "gzip":
                    // Is compressed using GZip: wrap the stream.
                    inputStream = new GZIPInputStream(inputStream);
                    streamReader = new InputStreamReader(inputStream);
                    break;
                case "utf-16":
                    streamReader = new InputStreamReader(inputStream, "UTF-16");
                    break;
                default:
                    // "utf-8" and anything unrecognized: read as UTF-8 so
                    // that streamReader is never left null.
                    streamReader = new InputStreamReader(inputStream, "UTF-8");
                    break;
            }
        }

        List<String> sourceText = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(streamReader)) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                sourceText.add(inputLine);
            }
        }
        return sourceText;
    }
    catch (MalformedURLException ex) {
        // Do whatever you want with the exception.
        ex.printStackTrace();
    }
    catch (IOException ex) {
        // Do whatever you want with the exception.
        ex.printStackTrace();
    }
    return null;
}

/**
 * Retrieves all links contained between the supplied String tags on any
 * web page source line that contains the supplied reference string. A
 * start tag and an end tag are both required.
 *
 * So, if a web page source line contains the reference string:<pre>
 *
 *     "rel=\"dct:subject\""</pre>
 *
 * and the supplied start tag (ie: "href=\"") and the supplied end tag
 * (ie: "\">") are found on that same line, then the text between those
 * tags is retrieved.
 *
 * This method utilizes the support method named getBetween().
 *
 * @param referenceString (String) The reference string to look for on any
 * web page source line.
 *
 * @param pageSource (List of String) The List which contains all the HTML
 * web page source.
 *
 * @param desiredLinkStartTag (String) The start tag after which the desired
 * link or links may reside. This can be any string.
 *
 * @param desiredLinkEndTag (String) The end tag before which the desired
 * link or links may reside. This can be any string.
 *
 * @return (1D String Array) A String array containing the links found.
 *
 * @see #getBetween(java.lang.String, java.lang.String, java.lang.String, boolean...) getBetween()
 */
public String[] getRelatedLinks(String referenceString, List<String> pageSource,
        String desiredLinkStartTag, String desiredLinkEndTag) {
    List<String> links = new ArrayList<>();
    for (int i = 0; i < pageSource.size(); i++) {
        if (pageSource.get(i).contains(referenceString)) {
            String[] lnks = getBetween(pageSource.get(i), desiredLinkStartTag, desiredLinkEndTag);
            // getBetween() returns null when nothing is found between the tags.
            if (lnks != null) {
                links.addAll(Arrays.asList(lnks));
            }
        }
    }
    return links.toArray(new String[0]);
}

/**
 * Retrieves specific links from the related links array generated by the
 * getRelatedLinks() method.
 *
 * @param relatedArray (1D String Array) The array returned from the
 * getRelatedLinks() method.
 *
 * @param desiredStringInLink (String - letter case sensitive) The title
 * string contained within the link(s) to retrieve.
 *
 * @return (1D String Array) Containing any links found.
 *
 * @see #getRelatedLinks(java.lang.String, java.util.List, java.lang.String, java.lang.String) getRelatedLinks()
 */
public String[] getFromRelatedLinksThatContain(String[] relatedArray, String desiredStringInLink) {
    List<String> desiredLinks = new ArrayList<>();
    for (int i = 0; i < relatedArray.length; i++) {
        if (relatedArray[i].contains(desiredStringInLink)) {
            desiredLinks.add(relatedArray[i]);
        }
    }
    return desiredLinks.toArray(new String[0]);
}

/**
 * Retrieves every substring of the supplied input string that is located
 * between the supplied left string and the supplied right string.
 *
 * @param inputString (String) The string to look for substring(s) in.
 *
 * @param leftString (String) What may be to the left side of the substring
 * we want. If the substring sits at the very beginning of the input string
 * there is no left string available; pass an empty string ("") in that
 * case. Null can not be supplied and will generate a NullPointerException.
 *
 * @param rightString (String) What may be to the right side of the
 * substring we want. If the substring sits at the very end of the input
 * string there is no right string available; pass an empty string ("") in
 * that case. Null can not be supplied and will generate a
 * NullPointerException.
 *
 * @param options (Optional - Boolean - 2 parameters):<pre>
 *
 *     ignoreLetterCase - Default is false. Applies to the leftString and
 *                        rightString parameters. If true, letter case is
 *                        ignored when searching for those two strings.
 *
 *     trimFound        - Default is true. Leading and trailing white-space
 *                        is trimmed from each substring found. If false,
 *                        the substrings are returned untrimmed.</pre>
 *
 * @return (1D String Array) All the substrings found between the left
 * string and the right string, or null if none were found. You can shorten
 * this method a little by returning the List<String> directly and removing
 * the List-to-array conversion at the end, since the findings are stored
 * in a List anyway.
 */
public static String[] getBetween(String inputString, String leftString,
        String rightString, boolean... options) {
    // Return nothing if nothing was supplied.
    if (inputString.equals("") || (leftString.equals("") && rightString.equals(""))) {
        return null;
    }

    // Prepare optional parameters if any were supplied.
    // If none were supplied then use the defaults...
    boolean ignoreCase = false;  // Default.
    boolean trimFound = true;    // Default.
    if (options.length >= 1) {
        ignoreCase = options[0];
    }
    if (options.length >= 2) {
        trimFound = options[1];
    }

    // Remove any ASCII control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // A List to hold the substrings found between the
    // supplied left string and supplied right string.
    List<String> list = new ArrayList<>();

    // Use pattern matching to locate the possible
    // substrings within the supplied input string.
    String regEx = Pattern.quote(leftString)
            + (!rightString.equals("") ? "(.*?)" : "(.*)?")
            + Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }

    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substring to the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        list.add(found);
    }

    // Convert the List to a 1D String array if it contains
    // something, otherwise return null.
    if (list.isEmpty()) {
        return null;
    }
    return list.toArray(new String[0]);
}

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?subject WHERE {
  dbr:Part-of-speech_tagging dct:subject ?subject
} LIMIT 100
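
A query like the one above is normally sent to a SPARQL endpoint over HTTP. The sketch below only builds the request URL and does not contact any server. Several things in it are assumptions rather than part of the thread: the endpoint address https://dbpedia.org/sparql, the format parameter used to ask for JSON results, and the names SparqlUrlSketch and buildQueryUrl(). The PREFIX dct: line is added so that the query does not rely on prefixes predefined by the endpoint.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SparqlUrlSketch {

    // Assumption: the public DBpedia SPARQL endpoint.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    /** Builds a GET URL carrying the URL-encoded query and a requested result format. */
    public static String buildQueryUrl(String sparql) {
        try {
            return ENDPOINT
                    + "?query=" + URLEncoder.encode(sparql, "UTF-8")
                    // Assumption: the endpoint honors a "format" parameter.
                    + "&format=" + URLEncoder.encode("application/sparql-results+json", "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String query =
                "PREFIX dbr: <http://dbpedia.org/resource/>\n"
              + "PREFIX dct: <http://purl.org/dc/terms/>\n"
              + "SELECT DISTINCT ?subject WHERE {\n"
              + "  dbr:Part-of-speech_tagging dct:subject ?subject\n"
              + "} LIMIT 100";
        System.out.println(buildQueryUrl(query));
    }
}
```

The resulting URL could then be passed to a method such as getWebPageSource(), and the response parsed as JSON instead of scanning HTML lines for tags.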