Python: searching for patent data on the Indian patent website


python python-2.7 web-scraping


I am trying to write a web scraper to fetch data about patents. This is the code I have so far:

#import the necessary modules
import urllib2
#import the beautifulsoup functions to parse the data
from bs4 import BeautifulSoup

#mention the website that you are trying to scrape
patentsite="http://ipindiaservices.gov.in/publicsearch/"

#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(patentsite)

#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "html.parser")

print soup

Unfortunately, either the Indian patent website is not robust, or I do not know how to take this any further.

This is the output of the code above:

<!-- 
################################################################### 
##                                                               ##
##                                                               ##
##           SIDDHAST.COM                                        ##            
##                                                               ##
##                                                               ##
################################################################### 
--><!DOCTYPE HTML>
<html>
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<title>:: InPASS - Indian Patent Advanced Search System ::</title>
<link href="resources/ipats-all.css" rel="stylesheet"/>
<script src="app.js" type="text/javascript"></script>
<link href="resources/app.css" rel="stylesheet"/>
</head>
<body></body>
</html>



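What I want is this: given a company name, the scraper should fetch all patents for that particular company. Once I get this part right, I would like to do other things, such as supplying a set of inputs that the scraper uses to look up patents. But I am stuck and cannot make any progress.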

Any hints on how to get this data would be greatly appreciated.

You can do this with requests alone. The POST goes to a URL with a single query parameter, _dc, which is a timestamp that we create with time.time.

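Each value in "field[]" must line up with the corresponding value in "fieldvalue[]", and in turn with the one in "operator[]", whether you choose AND, OR or NOT. The [] after each key specifies that we are passing an array of values; without it the request will not work: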

data = {
    "publication_type_published": "on",
    "publication_type_granted": "on",
    "fieldDate": "APD",
    "datefieldfrom": "19120101",
    "datefieldto": "20160906",
    "operatordate": " AND ",
    "field[]": ["PA"],  # field codes (claims, description, patent number, ...) go here
    "fieldvalue[]": ["chris*"],  # the value to match for each field code above
    "operator[]": [" AND "],  # the SQL-style operator paired with each entry above
    "page": "1",  # change this to get the next page of results
    "start": "0",  # not sure what effect this actually has
    "limit": "25"}  # not sure how this is used: len(r.json()[u'record']) stays 25 regardless

import requests
from time import time

post = "http://ipindiaservices.gov.in/publicsearch/resources/webservices/search.php?_dc={}".format(
    str(time()).replace(".", ""))

with requests.Session() as s:
    s.get("http://ipindiaservices.gov.in/publicsearch/")
    s.headers.update({"X-Requested-With": "XMLHttpRequest"})
    r = s.post(post, data=data)
    print(r.json())
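
Since the "page" key in the payload is what returns the next page of results, walking through all pages might look roughly like the sketch below. It is only an illustration under assumptions: fetch_page is a hypothetical callable that returns the decoded JSON for a given page number, and an empty "record" list is assumed to mark the end of the results, which the site does not document. A fake fetcher stands in for the real request so the sketch runs offline.

```python
def iter_records(fetch_page, max_pages=100):
    """Yield records page by page until a page comes back empty.

    fetch_page is a hypothetical callable: it takes a 1-based page number
    and returns the decoded JSON dict for that page.
    """
    for page_number in range(1, max_pages + 1):
        payload = fetch_page(page_number)
        records = payload.get("record") or []
        # Assumption: a failed response or an empty list means no more results.
        if not payload.get("success") or not records:
            break
        for record in records:
            yield record


# Offline stand-in for the real POST, so the sketch runs without network access.
def fake_fetch(page_number):
    pages = {1: [{"title": "A"}, {"title": "B"}], 2: [{"title": "C"}]}
    return {"success": True, "record": pages.get(page_number, [])}


titles = [rec["title"] for rec in iter_records(fake_fetch)]
print(titles)
```

With the session from the code above, fetch_page could be something like lambda n: s.post(post, data=dict(data, page=str(n))).json(). Whether "start" also has to change between pages is not known.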

The response to the POST looks like the following; I cannot include all of it because there is too much data to post:

{u'success': True, u'record': [{u'Publication_Status': u'Published', u'appDate': u'2016/06/16', u'pubDate': u'2016/08/31', u'title': u'ACTUATOR FOR DEPLOYABLE IMPLANT', u'sourceID': u'inpat', u'abstract': u'\n    Systems and methods are provided for usin.............

If you use the record key, you get a list of dicts like:

{u'Publication_Status': u'Published', u'appDate': u'2015/01/27', u'pubDate': u'2015/06/26', u'title': u'CORRUGATED PALLET', u'sourceID': u'inpat', u'abstract': u'\n    A corrugated paperboard pallet is produced from two flat blanks which comprise a pallet top and a pallet bottom. The two blanks are each folded to produce only two parallel vertically extending double thickness ribs&nbsp;three horizontal panels&nbsp;two vertical side walls and two horizontal flaps. The ribs of the pallet top and pallet bottom lock each other from opening in the center of the pallet by intersecting perpendicularly with notches in the ribs. The horizontal flaps lock the ribs from opening at the edges of the pallet by intersecting perpendicularly with notches&nbsp;and the vertical sidewalls include vertical flaps that open inward defining fork passages whereby the vertical flaps lock said horizontal flaps from opening.\n  ', u'Assignee': u'OLVEY Douglas A., SKETO James L., GUMBERT Sean G., DANKO Joseph J., GABRYS Christopher W., ', u'field_of_invention': u'FI10', u'publication_no': u'26/2015', u'patent_no': u'', u'application_no': u'642/DELNP/2015', u'UCID': u'WVJ4NVVIYzFLcUQvVnJsZGczcVRmSS96Vkh3NWsrS1h3Qk43S2xHczJ2WT0%3D', u'Publication_Type': u'A'}

That is your patent information.
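
The records come back with some residue in them: the abstract is padded with whitespace and contains HTML entities such as &nbsp;, and Assignee is a single comma-separated string with a trailing comma. A minimal cleanup sketch follows, written for Python 3 (html.unescape is not available on 2.7). tidy_record is a hypothetical helper, and splitting assignees on commas is an assumption that would break for names that themselves contain commas. The sample is a shortened version of the record shown above.

```python
from html import unescape  # Python 3 standard library


def tidy_record(record):
    """Return a small dict with cleaned-up fields from one search record."""
    # Decode entities such as &nbsp; and collapse all runs of whitespace.
    abstract = " ".join(unescape(record.get("abstract", "")).split())
    # Assumption: assignees are separated by commas; drop empty trailing pieces.
    assignees = [
        name.strip()
        for name in record.get("Assignee", "").split(",")
        if name.strip()
    ]
    return {
        "title": record.get("title", ""),
        "application_no": record.get("application_no", ""),
        "abstract": abstract,
        "assignees": assignees,
    }


# Shortened sample based on the record printed above.
sample = {
    "title": "CORRUGATED PALLET",
    "application_no": "642/DELNP/2015",
    "abstract": "\n    double thickness ribs&nbsp;three horizontal panels\n  ",
    "Assignee": "OLVEY Douglas A., SKETO James L., ",
}

tidy = tidy_record(sample)
print(tidy["abstract"])
print(tidy["assignees"])
```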

You can see that if we select a few values in the browser, the fieldvalue, field and operator values all line up. AND is the default, so you see it for every option.

So work out the codes, choose what you want, and post.
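
One way to keep "field[]", "fieldvalue[]" and "operator[]" aligned is to build them from (field, value, operator) triples instead of editing three lists by hand. The sketch below is illustrative only: build_payload is a hypothetical helper, the fixed keys and the "PA" code are copied from the payload shown earlier, and the second search value is made up for the example.

```python
def build_payload(criteria, date_from="19120101", date_to="20160906", page=1):
    """Build the POST form data from (field, value, operator) triples."""
    fields, values, operators = [], [], []
    for field, value, operator in criteria:
        fields.append(field)
        values.append(value)
        # The payload shown earlier pads operators with spaces, e.g. " AND ".
        operators.append(" {} ".format(operator.strip().upper()))
    return {
        "publication_type_published": "on",
        "publication_type_granted": "on",
        "fieldDate": "APD",
        "datefieldfrom": date_from,
        "datefieldto": date_to,
        "operatordate": " AND ",
        "field[]": fields,
        "fieldvalue[]": values,
        "operator[]": operators,
        "page": str(page),
        "start": "0",
        "limit": "25",
    }


# "smith*" is a made-up value used only to show a second criterion.
payload = build_payload([("PA", "chris*", "AND"), ("PA", "smith*", "or")])
print(payload["field[]"], payload["fieldvalue[]"], payload["operator[]"])
```

Because each triple is unpacked together, the three lists always end up the same length and in the same order.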

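You have already got the HTML you asked for. However, this page appears to be a webapp in which everything is handled through JavaScript (in app.js), so your approach most likely will not work. You may want to check whether the site offers an API you can use.

Yes, I did look for that kind of information. It does not seem to exist. I also tried some online web scrapers. Is there really no way I can scrape this site?

As I said, it is more of a webapp than a website (since it is driven entirely by JavaScript). You may be able to do something with Selenium, but I have never used it.

^ If Selenium is too complicated to use, try Casper.js or Phantom.js.

This is awesome! Thank you. I will write the code and then do more work with it. Many thanks.

No worries. It is just a matter of picking whichever values you want, making sure the values in the lists line up, and posting to the URL; you will get what you want in JSON format.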