DBpedia resource parsing in Java
I have obtained a DBpedia URI. I need to request this URI from Java so that it returns some JSON/XML, and then extract the necessary information from the response. For example, for the URI in question I need the dct:subject values.
Tags: java, sparql, dbpedia
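I'm not sure exactly which values you are looking for, but you should be able to get them by extracting what you need from the page's HTML source. The four Java methods provided below should cover your needs (one of them is a support method).
Get the web page HTML source:
First, we use the getWebPageSource() method to fetch the HTML source. This method retrieves the entire HTML source of the web page located at the supplied link string. The source is returned as a List<String>, one element per source line. Example usage: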
String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
Get related links using a reference string:
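Now that you have the page source, you can locate and extract the data you need. The next method is getRelatedLinks(). It retrieves all links found between a supplied start tag string and end tag string, on any source line that also contains the supplied reference string. In your case the reference string is "rel=\"dct:subject\"", the start tag string is "href=\"", and the end tag string is "\">". So every page source line that contains the reference string is examined, and if the start tag string and the end tag string are both found on that same line, the text between them is retrieved. Example usage: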
String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");
Printing each element of relatedLinksTo to the console gives:
http://dbpedia.org/resource/Category:Corpus_linguistics
http://dbpedia.org/resource/Category:Markov_models
http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing
http://dbpedia.org/resource/Category:Word-sense_disambiguation
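To see the line-matching idea in isolation, here is a minimal, self-contained sketch that runs the same kind of extraction on hard-coded lines instead of a live page. The class name LinkExtractDemo, the extract() helper, and the two sample HTML lines are assumptions made for illustration; the real DBpedia markup may be laid out differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it is not one of the answer's four methods.
public class LinkExtractDemo {

    // Returns the text found between startTag and endTag on every line
    // that also contains referenceString.
    public static List<String> extract(List<String> lines, String referenceString,
                                       String startTag, String endTag) {
        // Quote both tags so they are matched literally, not as regex syntax.
        Pattern p = Pattern.compile(
                Pattern.quote(startTag) + "(.*?)" + Pattern.quote(endTag));
        List<String> found = new ArrayList<>();
        for (String line : lines) {
            if (!line.contains(referenceString)) {
                continue; // unrelated line, skip it
            }
            Matcher m = p.matcher(line);
            while (m.find()) {
                found.add(m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for real page source.
        List<String> lines = new ArrayList<>();
        lines.add("<a rel=\"dct:subject\" href=\"http://dbpedia.org/resource/Category:Markov_models\">Markov models</a>");
        lines.add("<a rel=\"owl:sameAs\" href=\"http://example.org/other\">other</a>");

        for (String link : extract(lines, "rel=\"dct:subject\"", "href=\"", "\">")) {
            System.out.println(link);
        }
    }
}
```

Only the first sample line carries the reference string, so only its href value is printed. Matching line by line like this is simple but fragile: it stops working if the server puts the rel and href attributes on different lines.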
If you only need the title that goes with each link, rather than the whole link string, you can do this:
// Display Related Links Titles...
for (int i = 0; i < relatedLinksTo.length; i++) {
    String rLink = relatedLinksTo[i].substring(relatedLinksTo[i].lastIndexOf(":") + 1);
    System.out.println(rLink);
}
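The same title extraction can be wrapped in a small helper. The sketch below is only an illustration: the class name CategoryTitle and the method titleOf() are hypothetical, and replacing underscores with spaces is an extra step that the loop above does not perform.

```java
// Hypothetical helper class for illustration only.
public class CategoryTitle {

    // Returns the part of the link after the last ':' with underscores
    // replaced by spaces. If the link has no ':', the whole string is used.
    public static String titleOf(String link) {
        String title = link.substring(link.lastIndexOf(':') + 1);
        return title.replace('_', ' ');
    }

    public static void main(String[] args) {
        System.out.println(
                titleOf("http://dbpedia.org/resource/Category:Markov_models"));
    }
}
```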
The getRelatedLinks() method relies on a support method named getBetween(), which is provided below.
Get a specific link from the related links list:
You may not want the entire related links list but only the link or links for a specific title, such as Tasks_of_natural_language_processing. To get those links, use the getFromRelatedLinksThatContain() method. Here is how you would do that:
String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");
String[] desiredLinks = getFromRelatedLinksThatContain(relatedLinksTo, "Tasks_of_natural_language_processing");
If you print the contents of desiredLinks to the console:
// Display Desired Links...
for (int i = 0; i < desiredLinks.length; i++) {
    System.out.println(desiredLinks[i]);
}
you will see the following link string:
http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing
The methods:
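import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

/**
* Returns a List ArrayList containing the page source for the supplied web
* page link.<br><br>
*
* @param webLink (String) The URL address of the web page to process.<br>
*
* @return (List ArrayList) A List ArrayList containing the page source for
* the supplied web page link, or null if the link is empty or the page
* could not be read.
*/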
public List<String> getWebPageSource(String webLink) {
    if (webLink == null || webLink.isEmpty()) {
        return null;
    }
    try {
        URLConnection yc = new URL(webLink).openConnection();
        // Some servers reject requests that carry no User-Agent, so send
        // one for both http and https links.
        yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        yc.connect();

        InputStream inputStream = yc.getInputStream();
        // Content-Encoding describes compression (such as gzip), not the
        // character set, so gzip is the only value that needs handling.
        String contentEncoding = yc.getContentEncoding();
        if (contentEncoding != null && contentEncoding.toLowerCase().contains("gzip")) {
            inputStream = new GZIPInputStream(inputStream);
        }

        List<String> sourceText = new ArrayList<>();
        // Assumes the page text is UTF-8 encoded.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                sourceText.add(inputLine);
            }
        }
        return sourceText;
    }
    catch (MalformedURLException ex) {
        // Do whatever you want with exception.
        ex.printStackTrace();
    }
    catch (IOException ex) {
        // Do whatever you want with exception.
        ex.printStackTrace();
    }
    return null;
}
/**
* This method will retrieve all links which are contained between specifically
* supplied String Tags where the desired Links may reside between and are related
* to the supplied <b>Reference String</b>. A String Start Tag and a String End Tag
* would be required as well.<br><br>
*
* So, if any Web Page Source line that contains the Reference String of:<pre>
*
* "rel=\"dct:subject\""</pre><br>
*
* is looked at and if <i>on the same source line</i> the supplied Start Tag
* String (ie: "href=\"") and the supplied End Tag String (ie: "\">") are found then
* the text between those tags is retrieved.<br><br>
*
* This method utilizes the support method named <b>getBetween()</b>.<br><br>
*
* @param referenceString (String) The reference string to look for on any web
* page source line.<br>
*
* @param pageSource (List Interface of String) The List which contains all the
* HTML Web Page Source.<br>
*
* @param desiredLinkStartTag (String) The Start Tag String where the desired
* Link or links may reside after. This can be any string. Links are retrieved
* from between the Start Tag and the End Tag.<br>
*
* @param desiredLinkEndTag (String) The End Tag String where the desired
* Link or links may reside before. This can be any string. Links are retrieved
* from between the Start Tag and the End Tag.<br>
*
* @return (1D String Array) A String Array containing the Links Found.<br>
*
* @see #getBetween(java.lang.String, java.lang.String, java.lang.String, boolean...) getBetween()
*/
public String[] getRelatedLinks(String referenceString, List<String> pageSource,
        String desiredLinkStartTag, String desiredLinkEndTag) {
    List<String> links = new ArrayList<>();
    for (int i = 0; i < pageSource.size(); i++) {
        if (pageSource.get(i).contains(referenceString)) {
            String[] lnks = getBetween(pageSource.get(i), desiredLinkStartTag, desiredLinkEndTag);
            // Guard against a null result so Arrays.asList() cannot
            // throw a NullPointerException.
            if (lnks != null) {
                links.addAll(Arrays.asList(lnks));
            }
        }
    }
    return links.toArray(new String[0]);
}
/**
* Retrieves a specific Link from within the Related Links List generated by
* the <b>getRelatedLinks()</b> method.<br><br>
*
* This method requires the use of the <b>getRelatedLinks()</b> method.
*
* @param relatedArray (1D String Array) The array returned from the <b>getRelatedLinks()</b>
* method.<br>
*
* @param desiredStringInLink (String - Letter Case Sensitive) The string title
* contained within the link to retrieve.<br>
*
* @return (1D String Array) Containing any links found.<br>
*
* @see #getRelatedLinks(java.lang.String, java.util.List, java.lang.String, java.lang.String) getRelatedLinks()
*
*/
public String[] getFromRelatedLinksThatContain(String[] relatedArray, String desiredStringInLink) {
    List<String> desiredLinks = new ArrayList<>();
    for (int i = 0; i < relatedArray.length; i++) {
        if (relatedArray[i].contains(desiredStringInLink)) {
            desiredLinks.add(relatedArray[i]);
        }
    }
    return desiredLinks.toArray(new String[0]);
}
/**
* Retrieves any string data located between the supplied string leftString
* parameter and the supplied string rightString parameter.<br><br>
* This method will return all instances of a substring located between the
* supplied Left String and the supplied Right String which may be found
* within the supplied Input String.<br>
*
* @param inputString (String) The string to look for substring(s) in.
*
* @param leftString (String) What may be to the Left side of the substring
* we want within the main input string. Sometimes the
* substring you want may be contained at the very
* beginning of a string and therefore there is no
* Left-String available. In this case you would simply
* pass a Null String ("") to this parameter which
* basically informs the method of this fact. Null can
* not be supplied and will ultimately generate a
* NullPointerException.
*
* @param rightString (String) What may be to the Right side of the
* substring we want within the main input string.
* Sometimes the substring you want may be contained at
* the very end of a string and therefore there is no
* Right-String available. In this case you would simply
* pass a Null String ("") to this parameter which
* basically informs the method of this fact. Null can
* not be supplied and will ultimately generate a
* NullPointerException.
*
* @param options (Optional - Boolean - 2 Parameters):<pre>
*
* ignoreLetterCase - Default is false. This option works against the
* string supplied within the leftString parameter
* and the string supplied within the rightString
* parameter. If set to true then letter case is
* ignored when searching for strings supplied in
* these two parameters. If left at default false
* then letter case is not ignored.
*
* trimFound - Default is true. By default this method will trim
* off leading and trailing white-spaces from found
* sub-string items. General sentences which obviously
* contain spaces will almost always give you a white-
* space within an extracted sub-string. By setting
* this parameter to false, leading and trailing white-
* spaces are not trimmed off before they are placed
* into the returned Array.</pre>
*
* @return (1D String Array) Returns a Single Dimensional String Array
* containing all the sub-strings found within the supplied Input
* String which are between the supplied Left String and supplied
* Right String. You can shorten this method up a little by
* returning a List<String> ArrayList and removing the 'List
* to 1D Array' conversion code at the end of this method. This
* method initially stores its findings within a List object
* anyways.
*/
public static String[] getBetween(String inputString, String leftString, String rightString, boolean... options) {
    // Return an empty array if nothing usable was supplied.
    if (inputString.isEmpty() || (leftString.isEmpty() && rightString.isEmpty())) {
        return new String[0];
    }
    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = options.length >= 1 && options[0]; // Default is false.
    boolean trimFound = options.length < 2 || options[1];   // Default is true.

    // Remove any ASCII control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // Holds the substrings found between the supplied
    // Left String and the supplied Right String.
    List<String> list = new ArrayList<>();

    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    // With no Right String, capture everything after the Left String.
    String regEx = Pattern.quote(leftString)
            + (!rightString.isEmpty() ? "(.*?)" : "(.*)")
            + Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }
    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        list.add(found);
    }
    // Convert the List to a 1D String Array. The array is
    // empty (never null) when nothing was found.
    return list.toArray(new String[0]);
}
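Alternatively, you can skip the HTML and ask DBpedia for the dct:subject values directly with a SPARQL query:
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?subject
WHERE { dbr:Part-of-speech_tagging dct:subject ?subject }
LIMIT 100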
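A query like this can be sent from Java over plain HTTP. The following is a rough sketch rather than a definitive implementation. It assumes the public endpoint lives at https://dbpedia.org/sparql and honours an Accept header of application/sparql-results+json. The class name DbpediaSubjectQuery and its three methods are hypothetical. The regex-based extractValues() is a stand-in for a real JSON parser and does not handle escaped characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; names are not from the answers.
public class DbpediaSubjectQuery {

    // Assumed location of the public DBpedia SPARQL endpoint.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    public static final String QUERY =
            "PREFIX dbr: <http://dbpedia.org/resource/> "
          + "PREFIX dct: <http://purl.org/dc/terms/> "
          + "SELECT DISTINCT ?subject "
          + "WHERE { dbr:Part-of-speech_tagging dct:subject ?subject } "
          + "LIMIT 100";

    // Builds a GET URL carrying the URL-encoded query.
    public static String buildRequestUrl(String query) {
        try {
            return ENDPOINT + "?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(ex);
        }
    }

    // Performs the request and returns the response body as one string.
    public static String fetch(String requestUrl) throws IOException {
        HttpURLConnection con =
                (HttpURLConnection) new URL(requestUrl).openConnection();
        con.setRequestProperty("Accept", "application/sparql-results+json");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    // Pulls every "value": "..." entry out of a SPARQL JSON result.
    // Simplification: no JSON escape handling; prefer a JSON library.
    public static List<String> extractValues(String json) {
        Pattern p = Pattern.compile("\"value\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        List<String> values = new ArrayList<>();
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }
}
```

With network access, calling DbpediaSubjectQuery.extractValues(DbpediaSubjectQuery.fetch(DbpediaSubjectQuery.buildRequestUrl(DbpediaSubjectQuery.QUERY))) should yield the subject URIs as a list. This avoids depending on how the HTML page happens to be formatted.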