How to extract data from a JavaScript function call with Python-like syntax
I am using Scrapy in Python to scrape data from a website. The data I need is inside a script tag, as follows:
<script type="text/javascript">
getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");
</script>
I then do:
item['lat'] = tree.xpath('//script[@type="text/javascript"]/text()').extract()[0].encode('utf-8')
item['long'] = tree.xpath('//script[@type="text/javascript"]/text()').extract()[0].encode('utf-8')
But how can I parse this content so that:
item['lat'] is equal to "41.8507029"
item['long'] is equal to "-87.8033709"
item['city'] is equal to "BERWYN"
item['state'] is equal to "IL"
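Can I get some suggestions on how to solve this?

Since this call is also valid Python syntax, we can use the ast module. In addition, the arguments are all string literals, which makes things simpler: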
import ast
line = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'
print([arg.value for arg in ast.parse(line).body[0].value.args])
Output:
['storePg', '564', 'Berwyn, IL', '7180 W CERMAK RD.', 'SPACE A1', '', 'BERWYN', 'IL', 'US', '60402', '(708) 788-5097', '{Monday-Saturday=10-9,sunday=11-6}', '41.8507029', '-87.8033709']
Explanation:
print([arg.value  # value of the string literal (arg.s before Python 3.8)
       for arg in
       ast.parse(line)
       .body    # module (list of statements)
       [0]      # first statement (an Expr node)
       .value   # expression (a Call)
       .args    # arguments to the function call
       ])
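To connect the argument list back to the fields asked for in the question, a minimal sketch could look like the following. The `FIELD_POSITIONS` mapping and the `call_args` helper are hypothetical names, and the positions are inferred from the single sample call shown in the question, so they would need checking against other pages of the site.

```python
import ast

# Sample call copied from the question.
line = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'

# Hypothetical mapping from field name to argument position,
# inferred from this one sample call only.
FIELD_POSITIONS = {"city": 6, "state": 7, "lat": 12, "long": 13}


def call_args(source):
    """Return the literal arguments of the first call statement in source."""
    call = ast.parse(source).body[0].value
    # literal_eval accepts an AST node and only evaluates plain literals.
    return [ast.literal_eval(arg) for arg in call.args]


args = call_args(line)
item = {name: args[pos] for name, pos in FIELD_POSITIONS.items()}
print(item)
```

In a spider, the same values could be assigned to the Scrapy item instead of a plain dict.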
Using re:
import re
temp_string = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'
# filter() returns an iterator on Python 3, so wrap it in list() before printing
split_list = list(filter(None, re.split(r'[, \-!?:"]+', temp_string)))
print(split_list)
This should produce the following output:
['getDetailsfrmBean(', 'storePg', '564', 'Berwyn', 'IL', '7180', 'W', 'CERMAK', 'RD.', 'SPACE', 'A1', 'BERWYN', 'IL', 'US', '60402', '(708)', '788', '5097', '{Monday', 'Saturday=10', '9', 'sunday=11', '6}', '41.8507029', '87.8033709', ');']
This approach is taken from another answer.
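Note that the character class above also splits on spaces and hyphens, so multi-word fields such as the address are broken apart and the minus sign of the longitude is lost. One possible alternative, sketched here under the assumption that no argument contains an escaped double quote, is to match each quoted literal instead of splitting:

```python
import re

temp_string = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'

# Capture whatever sits between each pair of double quotes.
# Assumption: arguments never contain an escaped quote (\").
quoted = re.findall(r'"([^"]*)"', temp_string)

print(len(quoted))  # one entry per argument, including the empty one
print(quoted[2])    # multi-word field stays intact
print(quoted[13])   # minus sign is preserved
```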
You can use a simple regular expression to extract just the comma-separated quoted strings part:
import re
line = 'getDetailsfrmBean("storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709");'
args_string = re.match(r'getDetailsfrmBean\((.+)\);$', line.strip()).group(1)
print(args_string)
Output:
"storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709"
Then there are multiple ways to parse a list of strings from such data:
import ast
import json
import csv
args_array = '[%s]' % args_string
assert (json.loads(args_array)
        == ast.literal_eval(args_array)
        == next(csv.reader([args_string]))
        == ['storePg', '564', 'Berwyn, IL', '7180 W CERMAK RD.', 'SPACE A1', '', 'BERWYN', 'IL', 'US', '60402',
            '(708) 788-5097', '{Monday-Saturday=10-9,sunday=11-6}', '41.8507029', '-87.8033709'])
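The three parsers agree here because every argument is a double-quoted string without escapes. They can diverge on other inputs: JSON, for instance, rejects single-quoted strings, which are legal in both JavaScript and Python. A rough sketch of a fallback, where `parse_args` is a hypothetical helper name and the short inputs are made up for illustration:

```python
import ast
import json


def parse_args(args_string):
    """Parse a comma-separated list of literals into a Python list."""
    wrapped = '[%s]' % args_string
    try:
        return json.loads(wrapped)
    except ValueError:
        # json.JSONDecodeError subclasses ValueError. JSON rejects
        # single-quoted strings, which Python literals allow.
        return ast.literal_eval(wrapped)


double_quoted = parse_args('"IL","41.8507029"')
single_quoted = parse_args("'IL','-87.8033709'")
print(double_quoted)
print(single_quoted)
```

Trying json first keeps the common case strict, and ast.literal_eval still only evaluates literals, so neither path executes code from the page.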
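I was about to write an answer covering both the ast and the re + json approaches, but @Alex Hall was faster with the ast approach, which is IMHO the preferable one. The other approach involves a simple regular expression and the json module; it also yields a list and can scan for multiple function calls in the same string: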
import re
import json
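# Raw string so the backslashes reach the regex engine unchanged
fn_cutter = re.compile(r"getDetailsfrmBean\((.+?)\);")
# item is assumed to map field names to the raw script text extracted earlier
for key in item:
    for i, match in enumerate(fn_cutter.findall(item[key])):
        print(key, i, ':', json.loads("[" + match + "]"))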
This saves you some time when converting JSON objects to Python structures and catches multiple method calls within the same value, but it will definitely not handle anotherMethod(args) or values that are themselves wrapped in JS method calls.
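To illustrate both the multi-call scanning and the limitation just mentioned, here is a small sketch. The `script_text` value is made up for the example (only the first call's shape comes from the question), and collecting unparseable matches in `skipped` is one possible way to handle failures, not something the answer prescribes.

```python
import json
import re

fn_cutter = re.compile(r'getDetailsfrmBean\((.+?)\);')

# Hypothetical script text: two plain calls and one call with a nested call.
script_text = (
    'getDetailsfrmBean("storePg","564","BERWYN","IL");\n'
    'getDetailsfrmBean("storePg","565","CHICAGO","IL");\n'
    'getDetailsfrmBean("storePg",anotherMethod("x"));\n'
)

parsed, skipped = [], []
for match in fn_cutter.findall(script_text):
    try:
        parsed.append(json.loads('[' + match + ']'))
    except ValueError:
        # A nested call is not valid JSON, so json.loads raises here.
        skipped.append(match)

print(parsed)
print(skipped)
```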
You should split the string on commas, which gives you an array of values. Use that array to get the values you need. Keep in mind that a value may be wrapped in double quotes, so you may need to strip those as well.
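A plain comma split needs some care with this particular data, because two of the arguments contain commas themselves. The sketch below compares it with splitting on the full `","` delimiter; the latter assumes every argument is a double-quoted string and that there is no whitespace between arguments, which holds for the sample call but is not guaranteed in general.

```python
args_string = '"storePg","564","Berwyn, IL","7180 W CERMAK RD.","SPACE A1","","BERWYN","IL","US","60402","(708) 788-5097","{Monday-Saturday=10-9,sunday=11-6}","41.8507029","-87.8033709"'

# Naive approach: split on every comma, then strip the quotes.
# "Berwyn, IL" and the opening-hours field are each cut in two,
# which shifts the position of every later value.
naive = [part.strip('"') for part in args_string.split(',')]

# Alternative: drop the outer quotes and split on the '","' delimiter.
# Assumption: all arguments are double-quoted, with no spaces between them.
by_delim = args_string[1:-1].split('","')

print(len(naive), len(by_delim))
print(by_delim[6], by_delim[7], by_delim[12], by_delim[13])
```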