Python: scraping data with beautifulsoup4


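I want to scrape some data from my university's website using requests and beautifulsoup4. I have tried to scrape it, but I am confused by all the elements here. Any suggestions?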

<div id="_27_1termCourses__8_1" style="">
    <h4 class="u_indent" id="anonymous_element_13">Courses where you are: Student</h4> 
    <ul class="portletList-img courseListing coursefakeclass u_indent">
        <li>
            <img alt="" src="/images/ci/icons/bookopen_li.gif" width="12" height="12">
            <a href=" /launcher?type=Course&amp;id=_65254_1&amp;url=" target="_top">1430101_10777_FALL2017-2018: Astro &amp; Space Sciences </a>
                <div class="courseInformation">
                    <span class="courseRole">
                    Instructor:
                    </span>
                    <span class="name">here is the name that i want to get it;&nbsp;&nbsp;</span>
            </div>
        </li>
        <li>
            <img alt="" src="/images/ci/icons/bookopen_li.gif" width="12" height="12">
            <a href=" /launcher?type=Course&amp;id=_65816_1&amp;url=" target="_top">0403201_12360_FALL2017-2018: Digital Logic Design </a>
                <div class="courseInformation">
                    <span class="courseRole">
                    Instructor:
                    </span>
                    <span class="name">here is the name that i want to get it ;&nbsp;&nbsp;</span>
                </div>
            </li>
    </ul>

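The problem is with the regular expression, not with requests or BeautifulSoup; more specifically, with the `'re.compile(" [_27_1termCourses]__\d+" )'` part. First, it is wrapped in quotes, and it should not be: inside quotes it is just a string, not a compiled pattern. Second, `[…]` matches any single one of the characters inside the brackets, not the whole sequence.

A suggestion on the BeautifulSoup usage: if each `div` tag filtered into `r1` contains only one `ul` tag, then looking up the `ul` inside the for loop is pointless, let alone passing the `attrs={'class': 'portletList-img courseListing coursefakeclass u_indent'}` argument. You can simply use `res.ul`, or drop the intermediary `ul` entirely and use `result.extend(res.findAll('li'))`.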
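As a rough illustration of the advice above, the sketch below parses a trimmed copy of the markup from the question and pulls out each course title with its instructor. The function name `extract_courses`, the placeholder instructor names, the `html.parser` backend, and the literal pattern `^_27_1termCourses__\d+` are all assumptions made for this example; the real page may need a different pattern.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question; instructor names are placeholders.
SAMPLE_HTML = """
<div id="_27_1termCourses__8_1">
  <h4 class="u_indent">Courses where you are: Student</h4>
  <ul class="portletList-img courseListing coursefakeclass u_indent">
    <li>
      <a href=" /launcher?type=Course&amp;id=_65254_1&amp;url=">1430101_10777_FALL2017-2018: Astro &amp; Space Sciences </a>
      <div class="courseInformation">
        <span class="courseRole">Instructor:</span>
        <span class="name">Instructor A;&nbsp;&nbsp;</span>
      </div>
    </li>
    <li>
      <a href=" /launcher?type=Course&amp;id=_65816_1&amp;url=">0403201_12360_FALL2017-2018: Digital Logic Design </a>
      <div class="courseInformation">
        <span class="courseRole">Instructor:</span>
        <span class="name">Instructor B ;&nbsp;&nbsp;</span>
      </div>
    </li>
  </ul>
</div>
"""


def extract_courses(html):
    """Return a list of (course title, instructor) tuples (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Pass the compiled pattern itself, not a string that contains "re.compile(...)".
    course_divs = soup.find_all("div", id=re.compile(r"^_27_1termCourses__\d+"))
    courses = []
    for div in course_divs:
        # No need to locate the <ul> first; search the <li> tags directly.
        for li in div.find_all("li"):
            link = li.find("a")
            name = li.find("span", class_="name")
            if link is None or name is None:
                continue  # skip list items that do not have the expected shape
            title = link.get_text(strip=True)
            # &nbsp; is parsed as "\xa0"; drop it along with the trailing ";".
            instructor = name.get_text().replace("\xa0", " ").strip().rstrip(";").strip()
            courses.append((title, instructor))
    return courses


courses = extract_courses(SAMPLE_HTML)
for title, instructor in courses:
    print(title, "->", instructor)
```

With a logged-in session, the same function could presumably be called on the text of the AJAX response instead of `SAMPLE_HTML`.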

@Mahesh Actually, the regex works fine when I print with the following code: `print([element.get_text(strip=True) for element in soup.find_all(id=r1)])`. I tried your solution, `result.extend(res.findAll('li'))`, but all I get is `[]`.

After you assign `r1`, print it and post the output.

`[_27_1termCourses]__\d+` will not capture any underscores ("_") at the end of the course name, because "[\d]" only captures "[0-9]". I believe a correct regex should be: 27_1termCourses[\d]+
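To make the character-class point concrete, here is a small standard-library sketch comparing the bracketed pattern with a literal one. The literal pattern and the test id `lessons__8` are assumptions chosen for illustration. Because `search()` scans the whole string, the bracketed pattern still finds a short match inside the real id, which may be why it appeared to work.

```python
import re

element_id = "_27_1termCourses__8_1"  # id taken from the markup in the question

# Bracketed version: ONE character from the set, then "__", then digits.
char_class = re.compile(r"[_27_1termCourses]__\d+")
# Literal version (assumed intent): the whole prefix, then "__", then digits.
literal = re.compile(r"_27_1termCourses__\d+")

# Both patterns find something in the real id, but not the same thing.
print(char_class.search(element_id).group())  # s__8
print(literal.search(element_id).group())     # _27_1termCourses__8

# The bracketed version also accepts unrelated ids that the literal one rejects.
unrelated_id = "lessons__8"  # hypothetical id, not from the original page
print(char_class.search(unrelated_id) is not None)  # True
print(literal.search(unrelated_id) is not None)     # False
```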

from bs4 import BeautifulSoup
import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0','Accept': 'application/json, text/javascript, */*; q=0.01','Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8','X-Requested-With': 'XMLHttpRequest'}

url = 'https://url/'
url_ajax = "url/webapps/portal/execute/tabs/tabAction"

#data
payload = {'user_id': '','password': ''}
course_data = {'action' : 'refreshAjaxModule','modId' : '_27_1','tabId' : '_1_1' ,'tab_tab_group_id' : '_1_1'}

#post data
session = requests.Session()
session.post(url,headers=headers,data=payload) #username and password
UrlAjx= session.post(url_ajax , headers=headers, data= course_data) #get the ajax call

#get the html elements
soup = BeautifulSoup(UrlAjx.text, 'lxml')
r1 = soup.find_all('div',  attrs={ 'id':'re.compile(" [_27_1termCourses]__\d+" )' } )
result = []
# I tried many variations here before posting this question, but nothing worked
for res in r1 :
    ul = res.find('ul',attrs={'class':'portletList-img courseListing coursefakeclass u_indent'})
    
    result.append((li_element))
print(result)
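The `r1` lookup in the script above passes a quoted string as the `id` filter, so BeautifulSoup compares ids against that exact text. The following minimal sketch, which uses a one-line stand-in for the response HTML and an assumed literal pattern, shows the difference between the string and a compiled pattern.

```python
import re
from bs4 import BeautifulSoup

# One-line stand-in for the AJAX response (assumption for this example).
html = '<div id="_27_1termCourses__8_1"><ul><li>course</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

# A plain string is matched literally, so no div has this id.
as_string = soup.find_all("div", attrs={"id": r're.compile(" [_27_1termCourses]__\d+" )'})

# A compiled pattern is matched with search() against each id attribute.
as_pattern = soup.find_all("div", attrs={"id": re.compile(r"_27_1termCourses__\d+")})

print(len(as_string))   # 0
print(len(as_pattern))  # 1
```

An empty `r1` means the for loop body never runs, which would leave `result` empty regardless of what the loop does.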