Web Scraping Basics: BeautifulSoup

Key Concepts

Learn how to parse a web page's HTML and extract the information you need.

Main Content

BeautifulSoup Basics

# ⚠️ Use this code only in environments where you are authorized to do so.
# pip install beautifulsoup4 lxml requests
from bs4 import BeautifulSoup
import requests

resp = requests.get('https://example.com', timeout=5)
resp.raise_for_status()  # stop early on 4xx/5xx responses
html = resp.text
soup = BeautifulSoup(html, 'lxml')  # lxml is generally the fastest parser

# First match: find (returns None when nothing matches)
title = soup.find('title')
if title:
    print(title.text)  # "Example Domain"

# All matches: find_all
links = soup.find_all('a')
for a in links:
    print(a.get('href'), '→', a.text)

# CSS selectors: select / select_one (jQuery-style)
nav_links = soup.select('nav a.menu-item')
first_p = soup.select_one('article > p:first-child')
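
The same calls can be tried without any network access. The following is a minimal sketch that runs them against a small inline HTML string. The markup, the class names, and the `SAMPLE_HTML` variable are invented for illustration, and it uses Python's built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example only.
SAMPLE_HTML = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <nav>
      <a class="menu-item" href="/home">Home</a>
      <a class="menu-item" href="/docs">Docs</a>
      <a href="/login">Login</a>
    </nav>
    <article><p>First paragraph.</p><p>Second paragraph.</p></article>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first match, or None when nothing matches.
title = soup.find("title")
title_text = title.get_text(strip=True) if title else None

# find_all returns every match as a list.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# select / select_one accept CSS selectors.
menu_hrefs = [a["href"] for a in soup.select("nav a.menu-item")]
first_p = soup.select_one("article > p:first-child")

# A selector with no match yields None rather than raising.
missing = soup.select_one("footer")

print(title_text, all_hrefs, menu_hrefs, first_p.get_text(), missing)
```

Because `find` and `select_one` return `None` on a miss, checking the result before reading `.text` or an attribute avoids an `AttributeError` on pages that lack the expected element.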

Common Extraction Patterns

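import re
from urllib.parse import urlparse

# All links that have an href attribute
hrefs = [a['href'] for a in soup.find_all('a', href=True)]

# External links only: absolute http(s) URLs on a different host than the page
page_host = urlparse('https://example.com').hostname
external = [
    h for h in hrefs
    if re.match(r'^https?://', h) and urlparse(h).hostname != page_host
]

# Email addresses (a loose pattern, so expect some false positives)
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', html)

# Meta tag
description = soup.find('meta', attrs={'name': 'description'})
if description:
    print(description.get('content'))

# Form fields (locating a CSRF token)
form = soup.find('form', id='login')
if form:
    csrf = form.find('input', attrs={'name': 'csrf_token'})
    if csrf:
        print('CSRF token:', csrf.get('value'))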
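
A pattern such as `^https?://` only recognizes absolute URLs, so relative links and same-site absolute links need separate handling. One way to sort them is sketched below using only the standard library. `classify_links` is a hypothetical helper name, and treating "same hostname" as "internal" is a simplifying assumption (subdomains count as external here).

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url: str, hrefs: list[str]) -> dict[str, list[str]]:
    """Resolve hrefs against base_url and split them into internal/external.

    Hypothetical helper: "internal" means the resolved hostname equals the
    hostname of base_url. Non-HTTP schemes (mailto:, javascript:, ...) are
    skipped.
    """
    base_host = urlparse(base_url).hostname
    result: dict[str, list[str]] = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolves relative paths
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue
        key = "internal" if parts.hostname == base_host else "external"
        result[key].append(absolute)
    return result


links = classify_links(
    "https://example.com/blog/post",
    [
        "/about",
        "next",
        "https://example.com/contact",
        "https://other.example.org/x",
        "mailto:admin@example.com",
    ],
)
print(links)
```

Resolving with `urljoin` first means every link is compared in absolute form, so `/about` and `https://example.com/contact` land in the same bucket.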

Extracting Tables

def extract_tables(html: str) -> list[list[list[str]]]:
    """Extract every <table> as a 2D list of rows and cells (one 2D list per table)."""
    soup = BeautifulSoup(html, 'lxml')
    tables = []
    for table in soup.find_all('table'):
        rows = []
        for tr in table.find_all('tr'):
            cells = [c.get_text(strip=True) for c in tr.find_all(['td', 'th'])]
            if cells:
                rows.append(cells)
        if rows:
            tables.append(rows)
    return tables
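
The nested lists returned by `extract_tables` are often easier to work with as one dictionary per row. The sketch below assumes that the first row of a table is its header row, which is not guaranteed on real pages. `rows_to_dicts` is a hypothetical helper, and the sample data merely stands in for one table from `extract_tables`.

```python
def rows_to_dicts(rows: list[list[str]]) -> list[dict[str, str]]:
    """Turn [[header...], [cell...], ...] into a list of {header: cell} dicts.

    Assumes rows[0] is the header row. Short rows are padded with empty
    strings, and cells beyond the header width are dropped.
    """
    if not rows:
        return []
    header, *body = rows
    records = []
    for row in body:
        padded = row + [""] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records


# Made-up sample shaped like one table from extract_tables.
sample = [
    ["Port", "Service", "State"],
    ["22", "ssh", "open"],
    ["80", "http"],  # a ragged row with a missing cell
]
records = rows_to_dicts(sample)
print(records)
```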

robots.txt: Ethical Scraping

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = 'WhiteHat/1.0') -> bool:
    """Check the site's robots.txt policy: is fetching this URL allowed?"""
    rp = RobotFileParser()
    base = urljoin(url, '/robots.txt')
    rp.set_url(base)
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        return False  # if robots.txt cannot be read, refuse conservatively

# Always check first, then send the same User-Agent that was checked
if can_scrape('https://example.com/page'):
    r = requests.get('https://example.com/page', timeout=5,
                     headers={'User-Agent': 'WhiteHat/1.0'})
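
To see how `can_fetch` evaluates rules without contacting a server, `RobotFileParser.parse` accepts the lines of a robots.txt file directly. The sketch below uses invented rules, and the agent name `WhiteHat/1.0` simply mirrors the default used in `can_scrape`.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed_public = rp.can_fetch("WhiteHat/1.0", "https://example.com/page")
allowed_admin = rp.can_fetch("WhiteHat/1.0", "https://example.com/admin/users")
allowed_badbot = rp.can_fetch("BadBot", "https://example.com/page")

# crawl_delay returns the Crawl-delay value for the agent, or None if unset.
delay = rp.crawl_delay("WhiteHat/1.0")

print(allowed_public, allowed_admin, allowed_badbot, delay)
```

When a site declares a crawl delay, using it as the minimum pause between requests is a reasonable way to honor the site's stated preference.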

Exercise: Automatically Extracting Emails, Links, and Forms from a Page

from urllib.parse import urlparse

def page_recon(url: str) -> dict:
    """OSINT reconnaissance: extract public information only (ethical scope)."""
    if not can_scrape(url):
        return {'error': 'Disallowed by robots.txt: scraping not permitted'}

    r = requests.get(url, timeout=5, headers={'User-Agent': 'WhiteHat/1.0'})
    r.raise_for_status()  # do not parse error pages
    soup = BeautifulSoup(r.text, 'lxml')

    # 1. Emails (regex plus simple de-obfuscation)
    text = soup.get_text()
    # rstrip('.') drops a sentence-ending period that the pattern can absorb
    emails = {e.rstrip('.') for e in re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)}
    # Handle the "name [at] domain [dot] com" variant
    obfuscated = re.findall(r'(\w+)\s*\[at\]\s*([\w-]+)\s*\[dot\]\s*(\w+)', text, re.I)
    for u, d, t in obfuscated:
        emails.add(f'{u}@{d}.{t}')

    # 2. External hosts (absolute http(s) links to a host other than this page's)
    page_host = urlparse(r.url).netloc
    hosts = set()
    for a in soup.find_all('a', href=True):
        href = a['href']
        m = re.match(r'^https?://([^/]+)', href)
        if m and m.group(1) != page_host:
            hosts.add(m.group(1))

    # 3. Form details
    forms = []
    for form in soup.find_all('form'):
        forms.append({
            'action': form.get('action'),
            'method': form.get('method', 'GET').upper(),
            'inputs': [inp.get('name') for inp in form.find_all('input') if inp.get('name')],
        })

    # 4. Metadata
    meta = {}
    for tag in soup.find_all('meta'):
        name = tag.get('name') or tag.get('property')
        if name and tag.get('content'):
            meta[name] = tag['content']

    return {
        'url': r.url,
        'emails': sorted(emails),
        'external_hosts': sorted(hosts),
        'forms': forms,
        'meta': meta,
    }
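
The email step of `page_recon` can be exercised on plain text without fetching anything. Below is a small sketch of that step in isolation. `extract_emails` is a hypothetical name, the two regular expressions are the ones used in `page_recon`, and stripping trailing dots is an added assumption to cope with addresses that end a sentence.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
OBFUSCATED_RE = re.compile(
    r"(\w+)\s*\[at\]\s*([\w-]+)\s*\[dot\]\s*(\w+)", re.IGNORECASE
)


def extract_emails(text: str) -> list[str]:
    """Return sorted, de-duplicated addresses found in text."""
    # Plain addresses; rstrip drops a sentence-ending period the regex may absorb.
    found = {m.rstrip(".") for m in EMAIL_RE.findall(text)}
    # "name [at] domain [dot] com" style obfuscation.
    for user, domain, tld in OBFUSCATED_RE.findall(text):
        found.add(f"{user}@{domain}.{tld}")
    return sorted(found)


sample_text = (
    "Contact admin@example.com. Sales: sales [AT] example [dot] org, "
    "or again admin@example.com"
)
emails = extract_emails(sample_text)
print(emails)
```

Collecting into a set before sorting keeps repeated addresses from appearing more than once in the result.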

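⚠️ Ethical Guidelines

✅ Public information only; never bypass authentication
✅ Respect robots.txt
✅ Stay within rate limits (at most 2 requests per second)
✅ Identify yourself in the User-Agent (for example, with a contact email)
❌ Attempting to bypass logins
❌ Automated comments, spam, or DDoS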
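
The rate-limit and User-Agent guidelines can be enforced in code rather than by convention. The following is a minimal sketch assuming a single-threaded scraper. `RateLimiter`, `polite_headers`, the bot name, and the contact address are all hypothetical, and the 0.5-second interval corresponds to the ceiling of 2 requests per second listed above.

```python
import time
from typing import Optional


class RateLimiter:
    """Block so that successive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval: float = 0.5) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def polite_headers(bot_name: str, contact: str) -> dict[str, str]:
    """Build a User-Agent header that says who is scraping and how to reach them."""
    return {"User-Agent": f"{bot_name} (+mailto:{contact})"}


limiter = RateLimiter(min_interval=0.5)
headers = polite_headers("WhiteHatBot/1.0", "security@example.com")

start = time.monotonic()
for path in ["/a", "/b", "/c"]:
    limiter.wait()
    # A real scraper would call requests.get(url, headers=headers, timeout=5) here.
elapsed = time.monotonic() - start
print(headers, round(elapsed, 2))
```

The first call to `wait()` returns immediately, so three requests take roughly one second in total rather than one and a half.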

AI Prompts

🤖 How to Ask AI Well: Prompts by Model and Strategy

Claude

Free: Sonnet 4.6 / Pro $20/mo: Opus 4.6

Analyze my BeautifulSoup script for ethical problems,
such as ignoring robots.txt or missing rate limiting,
and for overall code quality.

ChatGPT

Free: GPT-5.5 / Plus $20/mo: GPT-5.5 Pro

Compare BeautifulSoup, Selenium, and Playwright with practical code,
showing the case each one fits best
(static HTML / JS rendering / login automation).

Gemini

Free: 2.5 Flash / Pro $19.99/mo: 3.1 Pro

Scrape every page of my own site and give me an overall assessment
of whether publicly visible emails, external links, or sensitive metadata
are overexposed.

Grok

Free: Grok 4.1 / SuperGrok $30/mo

The legal landscape for web scraping in 2026:
after the hiQ Labs v. LinkedIn ruling,
tell me candidly how far scraping is actually permitted in practice.

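⭐ Just Remember This

For Web Scraping Basics: BeautifulSoup, make sure you have these three points down

1. Combining BeautifulSoup, CSS selectors, and regular expressions lets you extract almost any information present in a page's static HTML

2. Respect robots.txt, stay within rate limits, and identify yourself in the User-Agent: these are the three principles of ethical scraping

3. The next chapter pulls together everything covered so far to build a CLI tool that automatically checks security headers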
