Table of Contents
1) Ethics, Legality & Ground Rules
2) Architecture at a Glance
3) Popular Libraries & When to Use Them
4) Hands-On: Respectful HTTP Scraper (Python)
5) Headless Browser Patterns
6) Request Hygiene
7) Parsing Patterns that Survive Redesigns
8) Storage: What to Keep and Why
9) Monitoring: Know When You’re Failing
10) Quick Starts: Go & Java
11) Production Checklists
1) Ethics, Legality & Ground Rules
Always respect site Terms of Service, honor robots.txt, and remain considerate of server load. When APIs exist, prefer them over scraping. Design your crawlers to be transparent and reversible—log what you collect and why.
- Check for an API first. An official API is usually faster, cleaner, and more stable than scraping HTML.
- Review robots.txt and sitemaps. Use them as guidance for allowed paths and crawl policies; a small sketch of reading them follows this list.
- Throttle requests. Use adaptive rate limits, backoff, and concurrency caps.
- Identify your client politely. Rotate realistic User‑Agents and include a contact email if appropriate.
- Store raw HTML and logs. Reproducibility helps debug issues and ensure accountability.
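As a small illustration of the robots.txt point above, Python's standard library can read the policy directly. This is only a sketch: the origin, paths, and the contact address in the User-Agent string are placeholders.

# Sketch: read robots.txt guidance with the standard library (origin and UA are placeholders).
from urllib import robotparser

UA = "PromptFuelBot/1.0 (+mailto:ops@example.com)"        # hypothetical self-identifying UA

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")              # robots.txt lives at the origin root
rp.read()                                                 # fetch and parse the rules

print(rp.can_fetch(UA, "https://example.com/products"))   # is this path allowed for our UA?
print(rp.crawl_delay(UA))                                 # Crawl-delay directive, if any
print(rp.request_rate(UA))                                # Request-rate directive, if any
print(rp.site_maps())                                     # Sitemap URLs listed in robots.txt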
2) Architecture at a Glance
Robust scrapers separate fetch, parse, and persist stages. Start simple, keep it modular, and instrument everything (metrics, logs, traces).
Queue (URLs & retries) → Fetcher (HTTP/Browser) → Parser (DOM/HTML/JSON) → Validator → Storage (DB/File) → Monitor (errors, success rates).
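A skeleton of that separation is sketched below. The stage names, signatures, and seed URL are placeholders, not a prescribed API; each stub would be filled in with the techniques from the later sections.

# Skeleton of the pipeline above; stage names and bodies are placeholders.
from collections import deque

def fetch(url: str) -> str: ...          # Fetcher: HTTP client or headless browser
def parse(html: str) -> dict: ...        # Parser: DOM/JSON extraction into fields
def validate(record: dict) -> dict: ...  # Validator: required fields, types, defaults
def store(record: dict) -> None: ...     # Storage: DB row plus raw snapshot

queue = deque(["https://example.com/products"])   # Queue: seed URLs and retries
metrics = {"ok": 0, "failed": 0}                  # Monitor: success / error counters

while queue:
    url = queue.popleft()
    try:
        store(validate(parse(fetch(url))))
        metrics["ok"] += 1
    except Exception:
        metrics["failed"] += 1                    # real code logs and re-queues with backoff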
3) Popular Libraries & When to Use Them
Python
| Category | Libraries | Notes |
|---|---|---|
| HTTP | httpx, requests | httpx adds async and HTTP/2; requests is synchronous |
| Parsing | BeautifulSoup, lxml, parsel | lxml is fast and C-backed; parsel adds CSS/XPath selectors |
| Crawling | Scrapy | Full framework with scheduling, middlewares, and item pipelines |
| Browser | Playwright, Selenium | Drive a real browser for JS-heavy pages |
Node.js
| Category | Libraries | Notes |
|---|---|---|
| HTTP | got, axios | Promise-based HTTP clients |
| Parsing | cheerio, jsdom | cheerio is a fast, jQuery-like parser; jsdom emulates a full DOM |
| Browser | Playwright, Puppeteer | Headless browser automation |
| Crawling | Crawlee, Apify SDK | Request queues, retries, and proxy rotation built in |
Go & Java
| Language | HTTP / Parser | Browser | Notes |
|---|---|---|---|
| Go | colly, goquery | chromedp, rod | colly handles crawling and throttling; goquery offers jQuery-like selection |
| Java | jsoup | Selenium | jsoup combines fetching and parsing in one API |
4) Hands‑On: Respectful HTTP Scraper (Python)
A minimal, production‑ready pattern that checks robots.txt, uses HTTP/2, throttles, retries, and parses safely.
# --- imports ---
import asyncio # async event loop (for concurrency patterns)
import time # sleep/backoff utilities
from urllib import robotparser # robots.txt parser
import httpx # modern async HTTP client (HTTP/2 support)
from bs4 import BeautifulSoup # HTML parser (CSS selectors)
# --- config ---
BASE = "https://example.com" # target origin
START = f"{BASE}/products" # seed URL to crawl
HEADERS = {  # polite, browser-like headers
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36 PromptFuelBot/1.0",  # identify client
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",  # accept common HTML types
}
MAX_CONCURRENCY = 3 # bound concurrency to be gentle to servers
MAX_RETRIES = 3 # retry transient network errors
DELAY_BASE = 0.75 # base delay (seconds) between requests
# --- robots.txt check ---
def allowed(url: str, ua: str = HEADERS["User-Agent"]) -> bool:  # check crawl permission for URL
    rp = robotparser.RobotFileParser()   # initialize parser
    rp.set_url(f"{BASE}/robots.txt")     # location of robots.txt
    try:
        rp.read()                        # fetch and parse rules
    except Exception:
        return False                     # if robots cannot be fetched, play it safe
    return rp.can_fetch(ua, url)         # consult policy for this UA and URL
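The remaining pieces promised above (bounded concurrency, throttling, retries with backoff, HTTP/2, and parsing) can be wired to the allowed() helper roughly as follows. This is a continuation sketch rather than a drop-in script: the .product-card, .product-title, and .product-price selectors are placeholders, and HTTP/2 in httpx requires the httpx[http2] extra.

# --- fetch + parse (continuation sketch; selectors below are placeholders) ---
sem = asyncio.Semaphore(MAX_CONCURRENCY)                        # bound in-flight requests

async def fetch(client: httpx.AsyncClient, url: str) -> str | None:
    if not allowed(url):                                        # honor robots.txt before fetching
        return None
    async with sem:                                             # respect the concurrency cap
        for attempt in range(MAX_RETRIES):
            try:
                resp = await client.get(url, headers=HEADERS, timeout=15.0)
                resp.raise_for_status()                         # treat 4xx/5xx as failures
                await asyncio.sleep(DELAY_BASE)                 # base politeness delay
                return resp.text
            except httpx.HTTPError:
                await asyncio.sleep(DELAY_BASE * 2 ** attempt)  # exponential backoff
    return None

def parse(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")                   # stdlib parser backend
    items = []
    for card in soup.select(".product-card"):                   # placeholder selector
        title = card.select_one(".product-title")               # placeholder selector
        price = card.select_one(".product-price")               # placeholder selector
        items.append({
            "title": title.get_text(strip=True) if title else None,  # default optional to None
            "price": price.get_text(strip=True) if price else None,
        })
    return items

async def main():
    async with httpx.AsyncClient(http2=True) as client:        # HTTP/2 (needs httpx[http2])
        html = await fetch(client, START)
        if html:
            for item in parse(html):
                print(item)

if __name__ == "__main__":
    asyncio.run(main())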
5) Headless Browser Patterns
Use a headless browser when you truly need JS execution, multi-step flows, or authenticated sessions. Keep concurrency bounded and monitor 429/403s.
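As a rough illustration, a Playwright session (Python, sync API) for a JS-rendered listing might look like the sketch below. The URL and the .product-card / .product-title selectors are placeholders, and blocking responses (403/429) are checked explicitly.

# Sketch: headless rendering with Playwright's sync API (URL and selectors are placeholders).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)                  # headless Chromium
    context = browser.new_context(
        user_agent="Mozilla/5.0 PromptFuelBot/1.0")             # identifiable UA
    page = context.new_page()
    resp = page.goto("https://example.com/products", wait_until="networkidle")
    if resp and resp.status in (403, 429):                      # watch for blocking / throttling
        print("blocked or rate limited:", resp.status)
    else:
        page.wait_for_selector(".product-card")                 # wait for JS-rendered content
        for card in page.query_selector_all(".product-card"):
            title = card.query_selector(".product-title")
            print(title.inner_text() if title else None)
    browser.close()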
6) Request Hygiene
- Rotate realistic User‑Agents, accept‑language, and connection headers.
- Respect cache and robots directives; add backoff with jitter (a small sketch follows this list).
- Persist cookies per origin; retry idempotently.
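A minimal sketch of backoff with jitter; the base and cap values are arbitrary, and try_fetch is a hypothetical helper standing in for your request function.

# Sketch: exponential backoff with full jitter (base/cap values are illustrative).
import random
import time

def backoff_sleep(attempt: int, base: float = 0.5, cap: float = 30.0) -> None:
    delay = min(cap, base * (2 ** attempt))   # exponential growth, capped
    time.sleep(random.uniform(0, delay))      # full jitter de-synchronizes retries

# Usage, assuming a hypothetical try_fetch(url) -> bool helper:
#     for attempt in range(MAX_RETRIES):
#         if try_fetch(url):
#             break
#         backoff_sleep(attempt)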
7) Parsing Patterns that Survive Redesigns
- Favor semantic anchors (ids, data‑attributes) over brittle classes.
- Use CSS/XPath carefully; assert required fields, default optional ones to null.
- Version your extractors; log HTML snapshots when extractors fail (a combined sketch follows this list).
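A sketch pulling these ideas together: a versioned extractor that anchors on data-attributes, raises on missing required fields so the caller can snapshot the HTML, and defaults optional fields to null. The attribute and field names are illustrative.

# Sketch: versioned extractor favoring data-attributes (attribute names are illustrative).
from bs4 import BeautifulSoup

EXTRACTOR_VERSION = "products-v2"                        # bump when selectors change

def extract_product(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    root = soup.select_one("[data-product-id]")          # semantic anchor, not a styling class
    if root is None:
        raise ValueError(f"{EXTRACTOR_VERSION}: product root not found")  # required
    title = root.select_one("[data-title]")
    if title is None:
        raise ValueError(f"{EXTRACTOR_VERSION}: title missing")           # required field
    rating = root.select_one("[data-rating]")            # optional field
    return {
        "extractor": EXTRACTOR_VERSION,
        "id": root["data-product-id"],
        "title": title.get_text(strip=True),
        "rating": rating.get_text(strip=True) if rating else None,  # default optional to null
    }

A caller would catch the ValueError, save the offending HTML snapshot, and alert on the extractor version that failed.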
8) Storage: What to Keep and Why
At minimum, store raw HTML, parsed JSON, and logs/metrics. Consider PostgreSQL for entities and object storage for raw snapshots.
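As a minimal illustration, raw HTML and parsed JSON can be written side by side, keyed by URL hash and timestamp. The directory layout is illustrative; object storage for snapshots and PostgreSQL for entities slot into the same shape.

# Sketch: persist raw HTML next to parsed JSON (paths and layout are illustrative).
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def persist(url: str, html: str, record: dict, root: str = "snapshots") -> None:
    key = hashlib.sha256(url.encode()).hexdigest()[:16]           # stable key per URL
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base = pathlib.Path(root) / f"{key}-{stamp}"
    base.parent.mkdir(parents=True, exist_ok=True)
    base.with_suffix(".html").write_text(html, encoding="utf-8")  # raw snapshot for audits
    base.with_suffix(".json").write_text(                         # parsed entity
        json.dumps({"url": url, "fetched_at": stamp, **record}, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )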
9) Monitoring: Know When You’re Failing
- Metrics: success rate, latency, bytes/page, error codes, retries, CAPTCHAs encountered.
- Logs: per‑URL trace with timing; store request/response headers.
- Alerts: spikes in 403/429, DOM changes (selector miss), growth in retries; a minimal counter sketch follows this list.
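A lightweight sketch of per-run counters plus a simple block-rate alert; the 5% threshold and metric names are illustrative, and a real deployment would export these to a metrics backend.

# Sketch: run metrics with a simple 403/429 spike check (threshold and names are illustrative).
from collections import Counter

metrics = Counter()                                  # e.g. metrics["status_200"], metrics["retry"]

def record_response(status: int, elapsed_s: float, size_bytes: int) -> None:
    metrics[f"status_{status}"] += 1
    metrics["pages"] += 1
    metrics["bytes"] += size_bytes
    metrics["latency_ms"] += int(elapsed_s * 1000)   # aggregate; divide by pages for the mean

def should_alert(block_rate_threshold: float = 0.05) -> bool:
    pages = metrics["pages"] or 1
    blocked = metrics["status_403"] + metrics["status_429"]
    return blocked / pages > block_rate_threshold    # alert when blocks exceed 5% of pages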
10) Quick Starts: Go & Java
// Basic Go scraper with colly (line-by-line commented)
package main // entry package
import ( // imports
    "fmt"  // console output
    "time" // durations for the rate limiter

    "github.com/gocolly/colly/v2" // Go crawling framework
)

func main() { // program entrypoint
    c := colly.NewCollector( // create a collector (crawler)
        colly.UserAgent("Mozilla/5.0 PromptFuelBot/1.0"), // set polite User-Agent
        colly.AllowedDomains("example.com"),              // restrict crawl scope to domain
    )
    c.Limit(&colly.LimitRule{ // set politeness (throttle)
        DomainGlob:  "*example.*",           // apply to matching domains
        Parallelism: 2,                      // small concurrency
        RandomDelay: 500 * time.Millisecond, // random jitter of up to 500 ms
    })
    c.OnHTML(".product-card", func(e *colly.HTMLElement) { // parse each product card
        title := e.ChildText(".product-title") // extract title text
        price := e.ChildText(".product-price") // extract price text
        url := e.ChildAttr("a", "href")        // extract product link
        fmt.Println(title, price, url)         // output fields
    })
    c.Visit("https://example.com/products") // start crawl at listing URL
}
// Minimal Java + jsoup example (line-by-line commented)
import org.jsoup.Jsoup;           // HTTP + HTML parsing
import org.jsoup.nodes.Document;  // DOM root node
import org.jsoup.nodes.Element;   // DOM element
import org.jsoup.select.Elements; // element collection

public class JsoupScraper { // class declaration
    public static void main(String[] args) throws Exception { // entry point
        Document doc = Jsoup.connect("https://example.com/products") // prepare request
                .userAgent("Mozilla/5.0 PromptFuelBot/1.0")          // set polite UA
                .timeout(15000)                                      // request timeout (ms)
                .get();                                              // perform GET and parse
        Elements cards = doc.select(".product-card"); // select product cards
        for (Element card : cards) { // iterate results
            String title = card.selectFirst(".product-title") != null
                    ? card.selectFirst(".product-title").text() : null; // extract title
            String price = card.selectFirst(".product-price") != null
                    ? card.selectFirst(".product-price").text() : null; // extract price
            String url = card.selectFirst("a") != null
                    ? card.selectFirst("a").attr("href") : null; // extract link
            System.out.println(title + " | " + price + " | " + url); // print fields
        }
    }
}
11) Production Checklists
Do
- Start with HTTP clients; escalate to headless only when needed.
- Implement retries, backoff, and bounded concurrency.
- Version your extractors and log raw responses.
- Respect robots.txt and ToS; prefer official APIs.
- Instrument metrics; alert on spikes in 403/429/5xx.
Don’t
- Hammer a single host or ignore crawl delays.
- Hardcode brittle selectors; avoid scraping PII without consent.
- Store only parsed data—keep raw HTML for audits.