Python BeautifulSoup does not return all the data

Tags: python, parsing, beautifulsoup, html-parsing

Today I tried to use Python's BeautifulSoup library to parse some moon phase data.

from bs4 import BeautifulSoup
import urllib2

moon_url = "http://www.moongiant.com/phase/today/"


try:
    rqest =  urllib2.urlopen(moon_url)
    moon_Soup = BeautifulSoup(rqest, 'lxml')
    moon_angle = 0
    moon_illumination = 0
    main_data = moon_Soup.find('div', {'id' : 'moonDetails'})
    print main_data

except urllib2.URLError:
    print "Error"
But instead of this output:

<div id="moonDetails">        
      Phase: <span>Waxing Crescent</span><br>Illumination: <span>36%
</span><br>Moon Age: <span>6.00 days</span><br>Moon Angle: <span>0.55</span><br>Moon Distance: <span>364,</span>434.78 km<br>Sun Angle: <span>0.53</span><br>Sun Distance: <span>149,</span>571,918.47 km<br>
</div>

Phase: Waxing Crescent
Illumination: 36%
Moon Age: 6.00 days
Moon Angle: 0.55
Moon Distance: 364,434.78 km
Sun Angle: 0.53
Sun Distance: 149,571,918.47 km

I only get:

<div id="moonDetails">
</div>


Any ideas?

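As RaminNietzsche said in the comments, the data is not in the div itself: you need to extract the text of the particular script tag that holds it. You can do that with, for example, regex or built-in string methods such as split(), strip() and replace().

Code: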

from bs4 import BeautifulSoup
import requests
import re
import json

moon_url = "http://www.moongiant.com/phase/today/"
html_source =  requests.get(moon_url).text

moon_soup = BeautifulSoup(html_source, 'html.parser')

data = moon_soup.find_all('script', {'type' : 'text/javascript'})

for d in data:
    d = d.text
    if 'var jArray=' in d:
        jArray = re.search(r'\{(.*?)\}', d).group()
        moon_data = json.loads(jArray)
        print(moon_data)

        #if you want mArray data too, you just have to:
        # 1. add `'var mArray=' in d` in the if clause, and
        # 2. uncomment the following lines
        #mArray = re.search(r'\[+(.*?)\];', d).group()
        #print(mArray)

Output:

{'3': ['<b>April 4</b>', '58%\n', 'Sun Angle: 0.53291621763825', 'Sun Distance: 149657950.85286', 'Moon Distance: 369697.55153449', 'Moon Age: 8.1316595947356', 'Moon Angle: 0.53870564539409', 'Waxing Gibbous', 'April 4'], '2': ["<span style='color:#c7b699'><b>April 3</b></span>", 'Illumination: <span>47%\n</span>', 'Sun Angle: <span>0.53', 'Sun Distance: <span>149,</span>614,</span>943.28', 'Moon Distance: <span>366,</span>585.35', 'Moon Age: <span>7.08', 'Moon Angle: <span>0.54', 'First Quarter', '<b>Monday, April 3, 2017</b>', 'April', 'Phase: <span>First Quarter</span>', 'April 3'], '1': ['<b>April 2</b>', '36%\n', 'Sun Angle: 0.53322274612254', 'Sun Distance: 149571918.46739', 'Moon Distance: 364434.77975454', 'Moon Age: 6.002888839693', 'Moon Angle: 0.54648504798072', 'Waxing Crescent', 'April 2'], '4': ['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153', 'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795', 'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703', 'Waxing Gibbous', 'April 5'], '0': ['<b>April 1</b>', '25%\n', 'Sun Angle: 0.53337618944887', 'Sun Distance: 149528889.15122', 'Moon Distance: 363387.67496992', 'Moon Age: 4.9078487808877', 'Moon Angle: 0.54805974945761', 'Waxing Crescent', 'April 1']}

The parsed dictionary can then be indexed, for example:

print(moon_data['4'])
print('-' * 5)
print(moon_data['4'][2])

Output:

['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153', 'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795', 'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703', 'Waxing Gibbous', 'April 5']
-----
Sun Angle: 0.53276322269153
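
The answer above also mentions split() and strip() as an alternative to the regex. A minimal sketch of that route follows. It is not the answerer's code: sample_script is a hypothetical, shortened stand-in for the page's inline script, and extract_js_object is an illustrative helper name. It assumes the value after the marker is valid JSON, which the json.loads call above suggests is true for jArray.

```python
import json

# Hypothetical, shortened stand-in for the inline <script> text on the page.
sample_script = r'''
var mArray=[["April 1"],["April 2"]];
var jArray={"0": ["<b>April 1</b>", "25%\n"], "1": ["<b>April 2</b>", "36%\n"]};
'''


def extract_js_object(script_text, marker):
    """Parse the JSON value that follows `marker` in a block of JavaScript.

    Raises IndexError if the marker is not present in the text.
    """
    # split() drops everything up to and including the marker;
    # lstrip() removes whitespace so the JSON value starts at index 0.
    remainder = script_text.split(marker, 1)[1].lstrip()
    # raw_decode() parses one JSON value and ignores whatever follows it
    # (here the trailing ";" and the rest of the script).
    value, _end = json.JSONDecoder().raw_decode(remainder)
    return value


moon_data = extract_js_object(sample_script, 'var jArray=')
print(moon_data['1'])
```

Because raw_decode() stops at the end of the first complete JSON value, this does not depend on where the closing brace sits, whereas the non-greedy regex stops at the first "}" it meets.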

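Here is another approach; I borrowed the gist of it from root's answer.

The idea is that you can use selenium together with lxml to access the DOM of the page after it has been loaded and processed by JavaScript.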

>>> moon_url = "http://www.moongiant.com/phase/today/"
>>> import selenium.webdriver as webdriver
>>> import lxml.html as html
>>> import lxml.html.clean as clean
>>> 
>>> browser = webdriver.Chrome()
>>> browser.get(moon_url)
>>> content = browser.page_source
>>> cleaner = clean.Cleaner()
>>> content = cleaner.clean_html(content)
>>> doc = html.fromstring(content)
>>> type(doc)
<class 'lxml.html.HtmlElement'>
>>> type(content)
<class 'str'>
>>> open('c:/scratch/content.htm','w').write(content)
27070
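
If the rendered source is already in hand (for instance the content string taken from browser.page_source), the moonDetails text can also be pulled out with the standard library alone. The sketch below is one possible way and is not part of the original answer: DivTextExtractor and sample_source are illustrative names, and the sample markup is a shortened copy of the div shown in the question.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside the <div> that has a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 = outside the target div, >0 = inside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.depth:
                self.depth += 1          # nested div inside the target
            elif dict(attrs).get('id') == self.target_id:
                self.depth = 1           # entered the target div
        elif tag == 'br' and self.depth:
            self.chunks.append('\n')     # treat <br> as a line break

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def lines(self):
        """Return the collected text as a list of non-empty, stripped lines."""
        text = ''.join(self.chunks)
        return [line.strip() for line in text.splitlines() if line.strip()]


# Shortened, hypothetical stand-in for the JavaScript-rendered page source.
sample_source = '''<html><body><div id="other">skip me</div>
<div id="moonDetails">
      Phase: <span>Waxing Crescent</span><br>Illumination: <span>36%
</span><br>Moon Distance: <span>364,</span>434.78 km<br>
</div></body></html>'''

extractor = DivTextExtractor('moonDetails')
extractor.feed(sample_source)
detail_lines = extractor.lines()
print(detail_lines)
```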

Actually, following RaminNietzsche's comment, I used the dryscrape library:

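from bs4 import BeautifulSoup
import urllib2
import dryscrape

moon_url = "http://www.moongiant.com/phase/today/"

try:
    # Raises URLError early if the page is unreachable
    rqest = urllib2.urlopen(moon_url)
    # dryscrape renders the page, including the JavaScript that fills the div
    session = dryscrape.Session()
    session.visit(moon_url)
    response = session.body()
    soup = BeautifulSoup(response, 'lxml')

    moon_data = soup.findAll('div', {'id':'moonDetails'})
    print moon_data

# Assumed handler, mirroring the except clause in the question's code
except urllib2.URLError:
    print "Error"

So now the output is: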

<div id="moonDetails">        
      Phase: <span>Waxing Crescent</span><br>Illumination: <span>36%
</span><br>Moon Age: <span>6.00 days</span><br>Moon Angle: <span>0.55</span><br>Moon Distance: <span>364,</span>434.78 km<br>Sun Angle: <span>0.53</span><br>Sun Distance: <span>149,</span>571,918.47 km<br>
</div>

Phase: Waxing Crescent
Illumination: 36%
Moon Age: 6.00 days
Moon Angle: 0.55
Moon Distance: 364,434.78 km
Sun Angle: 0.53
Sun Distance: 149,571,918.47 km
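
The question's code sets up moon_angle and moon_illumination, so a natural next step is turning that text into numbers. The following is a minimal sketch under stated assumptions, not something from the thread: details_text is the div's text copied by hand, one "Label: value" pair per line, and parse_details is a hypothetical helper name.

```python
import re

# Text of the moonDetails div, one "Label: value" pair per line.
details_text = """\
Phase: Waxing Crescent
Illumination: 36%
Moon Age: 6.00 days
Moon Angle: 0.55
Moon Distance: 364,434.78 km
Sun Angle: 0.53
Sun Distance: 149,571,918.47 km
"""


def parse_details(text):
    """Map each label to a float if its value starts with a number, else to the raw string."""
    parsed = {}
    for line in text.splitlines():
        if ':' not in line:
            continue
        label, value = (part.strip() for part in line.split(':', 1))
        # Leading number, optionally with thousands separators and a decimal part.
        match = re.match(r'[\d,]*\.?\d+', value)
        if match:
            parsed[label] = float(match.group().replace(',', ''))
        else:
            parsed[label] = value
    return parsed


details = parse_details(details_text)
moon_angle = details['Moon Angle']
moon_illumination = details['Illumination']
print(moon_angle, moon_illumination)
```

Units such as "%", "days" and "km" are dropped here; keep the raw string instead if they matter.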

Thank you all for your answers!

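The data is in var mArray, not in the moonDetails div.

Actually, it is in var jArray. How can I parse jArray with Python?

Read this.

Thanks a lot, it really helped!

It does not seem to be compatible with Windows? Installation is not mentioned here.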