Getting started with Naver News scraping in Python

1 Overview

Getting started with scraping (crawling) Naver News in Python.

2 Example 1: Keyword search

Search for a given keyword and extract the title and URL of the first article.

import requests
from bs4 import BeautifulSoup

keyword = '코로나'
# Pass the query through params so that requests URL-encodes the keyword
r = requests.get('https://search.naver.com/search.naver',
                 params={'where': 'news', 'query': keyword})
soup = BeautifulSoup(r.text, 'html.parser')
articles = soup.select('ul.list_news > li')                                # one <li> per search result
title = articles[0].select_one('a.news_tit')['title']                      # title of the first article
url = articles[0].select_one('div.info_group > a:nth-of-type(2)')['href']  # second link in the info group
print('title=', title)
print('url=', url)
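
The request above sends the keyword as the `query` parameter of the search URL. A non-ASCII keyword has to be percent-encoded before it goes into a URL; the following is a minimal sketch of that step using only the standard library. The helper name `build_search_url` is hypothetical and not part of the original example.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://search.naver.com/search.naver'


def build_search_url(keyword):
    """Return the news search URL for a keyword (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes
    query = urlencode({'where': 'news', 'query': keyword})
    return f'{SEARCH_ENDPOINT}?{query}'


print(build_search_url('코로나'))
# prints https://search.naver.com/search.naver?where=news&query=%EC%BD%94%EB%A1%9C%EB%82%98
```

Passing a `params` dict to `requests.get` performs the same encoding, so building the URL by hand like this is mainly useful for logging or debugging.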

3 Example 2: No subtitle

Extract the title, subtitle, and body text from the web page at a given URL.

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=102&oid=028&aid=0002547311'
r = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')

title = soup.select_one('h3#articleTitle').text
content = soup.select_one('#articleBodyContents')
subtitle = content.select_one('strong')
if subtitle is not None: subtitle = subtitle.extract().text

for x in content.select('script'): x.extract()                         # remove <script>...</script>
for x in content(string=lambda s: isinstance(s, Comment)): x.extract() # remove <!-- comments -->
for x in content.select("br"): x.replace_with("\n")                    # replace <br> with \n
content = "".join([str(x) for x in content.contents])                  # drop the outermost tag (= take innerHTML)
content = content.strip()                                              # strip leading/trailing whitespace

print('title=', title)       # title
print('subtitle=', subtitle) # subtitle
print('content=', content)   # body

4 Example 3: With subtitle

Extract the title, subtitle, and body text from the web page at a given URL.

import requests
from bs4 import BeautifulSoup, Comment

url = 'https://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=100&oid=001&aid=0012439508'
r = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')

title = soup.select_one('h3#articleTitle').text
content = soup.select_one('#articleBodyContents')
subtitle = content.select_one('strong')
if subtitle is not None: subtitle = subtitle.extract().text

for x in content.select('script'): x.extract()                         # remove <script>...</script>
for x in content(string=lambda s: isinstance(s, Comment)): x.extract() # remove <!-- comments -->
for x in content.select("br"): x.replace_with("\n")                    # replace <br> with \n
content = "".join([str(x) for x in content.contents])                  # drop the outermost tag (= take innerHTML)
content = content.strip()                                              # strip leading/trailing whitespace

print('title=', title)       # title
print('subtitle=', subtitle) # subtitle
print('content=', content)   # body
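
Examples 2 and 3 run the same extraction steps on different URLs, so the steps can be collected into one function. The following is a minimal sketch under that assumption: `parse_article` is a hypothetical name, and the sample HTML is made up to mimic the `h3#articleTitle` and `#articleBodyContents` markup that the examples rely on, so it runs without a network request.

```python
from bs4 import BeautifulSoup, Comment


def parse_article(html):
    """Return (title, subtitle, body) from article HTML (hypothetical helper).

    Assumes the markup used in the examples above: the title in
    h3#articleTitle and the body in #articleBodyContents.
    """
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('h3#articleTitle').text
    content = soup.select_one('#articleBodyContents')
    subtitle = content.select_one('strong')
    if subtitle is not None:
        subtitle = subtitle.extract().text   # detach the subtitle from the body
    for x in content.select('script'):
        x.extract()                          # remove <script>...</script>
    for x in content(string=lambda s: isinstance(s, Comment)):
        x.extract()                          # remove <!-- comments -->
    for x in content.select('br'):
        x.replace_with('\n')                 # replace <br> with a newline
    body = ''.join(str(x) for x in content.contents).strip()
    return title, subtitle, body


# Made-up sample markup, not a real Naver page
sample = (
    '<h3 id="articleTitle">Sample title</h3>'
    '<div id="articleBodyContents">'
    '<script>var x = 1;</script>'
    '<!-- comment -->'
    '<strong>Sample subtitle</strong>'
    'First line<br/>Second line'
    '</div>'
)
title, subtitle, body = parse_article(sample)
print('title=', title)
print('subtitle=', subtitle)
print('body=', body)
```

When the body contains no `strong` element, as in Example 2, `subtitle` comes back as `None`.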

5 See also