Problem with text data cleaning in Python

python web-crawler


I am developing a program that crawls Internet articles using web crawling.
The program starts once you enter the start page and end page of the website.

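The program works in the following order.

1. Crawl the article information (title, sort, time, content)

2. Remove special characters

3. Extract only the nouns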
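
The special-character step can be sketched as follows. The question does not show `text_cleaning`, so this body is an assumption: it keeps Hangul syllables, ASCII letters, digits and whitespace, and collapses everything else into single spaces.

```python
import re

# Anything that is not a digit, ASCII letter, Hangul syllable or whitespace.
_NON_TEXT = re.compile(r"[^0-9A-Za-z가-힣\s]")
_SPACES = re.compile(r"\s+")


def text_cleaning(text):
    """Hypothetical cleaner: replace special characters with spaces."""
    cleaned = _NON_TEXT.sub(" ", text)
    # Collapse the runs of whitespace left behind and trim the ends.
    return _SPACES.sub(" ", cleaned).strip()


print(text_cleaning("건설! 현장, 안전?"))
```

Replacing with a space rather than an empty string keeps words that were separated only by punctuation from being glued together.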

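The problem appears during noun extraction, while the article content is being cleaned. Everything works up to the point just before the nouns are extracted.

The error message is as follows:
ValueError: Length of passed values is 4, index implies 5
To work around this, I wrote the code so that the column is added with DataFrame append, but that does not solve the problem.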
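
The error can be reproduced without any crawling. This is a minimal sketch, assuming the time column is called `Time` (the question does not name it) and using `str.split` as a stand-in for the noun extractor. Once a fifth column exists, building a four-value `Series` against `news_info.columns` fails. Newer pandas versions may word the message differently, but it is still a `ValueError`.

```python
import pandas as pd

# One crawled article; "Time" is an assumed column name.
row = ["title", "sort", "2021-01-01 10:00", "some article text"]
news_info = pd.DataFrame([row], columns=["Title", "Sort", "Time", "Article"])

# End of the first pass through the loop: a derived fifth column is added.
# str.split stands in for the real noun extractor.
news_info["Nouns"] = news_info["Article"].apply(str.split)

# Start of the second pass: four values, but five column labels.
try:
    pd.Series(row, index=news_info.columns)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```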

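I am using KoNLPy (a Korean morpheme analyzer). Inside the loop over the website's pages, after the four existing columns of the Pandas DataFrame are set up, I use append to add the extracted nouns as a fifth column. My understanding was that this adds a column regardless of the index names. As the image link at the bottom shows, the first article is crawled and its result is displayed. On the next article it stops working and the error is raised.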
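
For reference, the noun step can be sketched like this. `get_nouns` is not shown in the question, so the signature below (an explicit `stopwords` argument and a pluggable `tokenize`) is hypothetical. `str.split` is only a placeholder so the snippet runs without KoNLPy; with KoNLPy installed one would presumably pass something like `Okt().nouns` instead.

```python
def load_stopwords(lines):
    """Strip each line and drop blanks; a set gives fast membership tests."""
    return {line.strip() for line in lines if line.strip()}


def get_nouns(text, stopwords, tokenize=str.split):
    """Hypothetical helper: tokenize the text, then drop stopwords.

    tokenize defaults to whitespace splitting as a stand-in; a real
    morpheme analyzer (for example konlpy.tag.Okt().nouns) could be
    passed in its place.
    """
    return [token for token in tokenize(text) if token not in stopwords]


# The stopword file would normally be read once, before any loop starts.
stopwords = load_stopwords(["을\n", "\n", " 를 \n"])
print(get_nouns("안전 점검 을 실시", stopwords))
```

Passing the stopwords in explicitly also makes it obvious that the file only needs to be loaded once rather than once per article.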

(Program error result)
(Korean stopword dictionary)

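I solved the problem. It came down to where the code sits relative to the for loop. I fixed it by repeatedly relocating the problematic part while leaving the code that already worked untouched. In the end, pressing backspace twice on the line below was enough: dedenting it by two levels moves it out of both loops, so it runs once after all articles have been collected.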

news_info['Nouns'] = news_info['Article'].apply(lambda x: get_nouns(x))
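
A sketch of the resulting structure, under the assumption that the crawl yields (title, sort, time, article) tuples: rows are collected in a plain list inside the loop, the DataFrame is built once, and the derived column is added once afterwards. `build_news_info` and `noun_extractor` are hypothetical names, and collecting rows in a list is a swapped-in alternative to `DataFrame.append`, which was removed in pandas 2.0.

```python
import pandas as pd

# "Time" is an assumed name for the third column.
COLUMNS = ["Title", "Sort", "Time", "Article"]


def build_news_info(articles, noun_extractor):
    """Collect rows first, then add the derived column a single time."""
    rows = []
    for title, sort, time, article in articles:
        # Only the four crawled fields are stored per article.
        rows.append((title, sort, time, article))

    news_info = pd.DataFrame(rows, columns=COLUMNS)
    # Runs once, outside the loop, so the column count never changes mid-crawl.
    news_info["Nouns"] = news_info["Article"].apply(noun_extractor)
    return news_info


sample = [
    ("t1", "s1", "2021-01-01", "건설 현장 안전"),
    ("t2", "s2", "2021-01-02", "안전 점검 실시"),
]
# str.split stands in for the real noun extractor.
result = build_news_info(sample, str.split)
print(result.shape)
```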

Please provide a minimal reproducible example, preferably together with the scraping code.

while startpage<lastpage + 1:
  url = f'http://www.koscaj.com/news/articleList.html?page={startpage}&total=72698&box_idxno=&sc_section_code=S1N2&view_type=sm'
  html = urllib.request.urlopen(url).read()
  soup = BeautifulSoup(html, 'html.parser')
  links = soup.find_all(class_='list-titles')

  print(f'-----{count}page result-----')
# Articles loop in the web-site page
  for link in links:
    news_url = "http://www.koscaj.com"+link.find('a')['href']
    news_link = urllib.request.urlopen(news_url).read()
    soup2 = BeautifulSoup(news_link, 'html.parser')

    # an article's title
    title = soup2.find('div', {'class':'article-head-title'})

    if title:
        title = soup2.find('div', {'class':'article-head-title'}).text
    else:
        title = ''
           
    # an article's sort
    sorts = soup2.find('nav', {'class':'article-head-nav auto-marbtm-10'})
    try:
        sorts2 = sorts.find_all('a')
        sort = sorts2[2].text
    except:
        sort =''
    
    # an article's time
    date = soup2.find('div',{'class':'info-text'})
    try:
        datetime = date.find('i', {'class':'fa fa-clock-o fa-fw'}).parent.text.strip()
        datetime = datetime.replace("승인", "")
    except:
        datetime = ''

    # an article's content
    article = soup2.find('div', {'id':'article-view-content-div'})
    if article:
        article = soup2.find('div', {'id':'article-view-content-div'}).text
        article = article.replace("\n", "")
        article = article.replace("\r", "")
        article = article.replace("\t", "")
        article = article.replace("[전문건설신문] koscaj@kosca.or.kr", "")
        article = article.replace("저작권자 © 대한전문건설신문 무단전재 및 재배포 금지", "")
        article = article.replace("전문건설신문", "")
        article = article.replace("다른기사 보기", "")

    else:
        article = ''

    # Remove special characters
    news_info['Title'] = news_info['Title'].apply(lambda x: text_cleaning(x))
    news_info['Sort'] = news_info['Sort'].apply(lambda x: text_cleaning(x))
    news_info['Article'] = news_info['Article'].apply(lambda x: text_cleaning(x))

    # Dataframe for storing after crawling individual articles
    row = [title, sort, datetime, article]
    series = pd.Series(row, index=news_info.columns)
    news_info = news_info.append(series, ignore_index=True)
    
    
    
    # Load Korean stopword dictionary file    
    path = "C:/Users/이바울/Desktop/이바울/코딩파일/stopwords-ko.txt"
    with open(path, encoding = 'utf-8') as f:
        stopwords = f.readlines()
    
    stopwords = [x.strip() for x in stopwords]

    news_info['Nouns'] = news_info['Article'].apply(lambda x: get_nouns(x))    


  startpage += 1
  count += 1

news_info.to_excel(f'processing{lastpage-int(1)}-{startpage-int(1)}.xlsx')

print('Complete')
