1) Ethics, Legality & Ground Rules

Always respect site Terms of Service, honor robots.txt, and remain considerate of server load. When APIs exist, prefer them over scraping. Design your crawlers to be transparent and reversible—log what you collect and why.

  • Check for an API first. Usually faster, cleaner, and more stable than HTML.
  • Review robots.txt and sitemaps. Use them as guidance for allowed paths and crawl policies (see the sketch after this list).
  • Throttle requests. Use adaptive rate limits, backoff, and concurrency caps.
  • Identify your client politely. Rotate realistic User‑Agents and include a contact email if appropriate.
  • Store raw HTML and logs. Reproducibility helps debug issues and ensure accountability.
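
For example, a few lines of Python can read robots.txt once, test whether a path is allowed, and honor a declared Crawl-delay before fetching anything. The bot name, contact address, and URLs below are placeholders.

# Sketch: honor robots.txt permissions and crawl delay (bot name/URLs are placeholders)
from urllib import robotparser  # stdlib robots.txt parser

UA = "ExampleBot/1.0 (+mailto:ops@example.org)"  # honest UA with a contact address

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # robots.txt lives at the origin root
rp.read()  # fetch and parse the rules once, then reuse for every URL

if rp.can_fetch(UA, "https://example.com/products"):  # is this path allowed for our UA?
    delay = rp.crawl_delay(UA) or 1.0  # honor Crawl-delay if declared, else a default pause
    print(f"allowed; waiting {delay}s between requests")
else:
    print("disallowed by robots.txt; skip this path")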

2) Architecture at a Glance

Robust scrapers separate fetch, parse, and persist stages. Start simple, keep it modular, and instrument everything (metrics, logs, traces).

Queue (URLs & retries) → Fetcher (HTTP/Browser) → Parser (DOM/HTML/JSON) → Validator → Storage (DB/File) → Monitor (errors, success rates).
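
As a rough sketch of how those stages map onto code (the URL queue, selectors, and in-memory storage below are illustrative stand-ins, not a prescribed design):

# Sketch: pipeline stages as small, separately testable functions (selectors are illustrative)
from collections import deque  # Queue stage: simple in-memory URL queue

import httpx                    # Fetcher stage: HTTP client
from bs4 import BeautifulSoup   # Parser stage: HTML -> fields

def fetch(client, url):
    """Fetcher: one GET; failures are reported, not raised."""
    try:
        resp = client.get(url, timeout=10.0)
        resp.raise_for_status()
        return resp.text
    except httpx.HTTPError as exc:
        print(f"fetch failed for {url}: {exc}")  # the Monitor stage would record this
        return None

def parse(html):
    """Parser: extract fields, defaulting missing optional ones to None."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("h1")
    return {"title": title.get_text(strip=True) if title else None}

def crawl(seed_urls):
    queue = deque(seed_urls)  # retries would be re-enqueued here
    records = []              # Storage stand-in: keep validated records in memory
    with httpx.Client(headers={"User-Agent": "ExampleBot/1.0"}) as client:
        while queue:
            html = fetch(client, queue.popleft())
            if html is None:
                continue
            record = parse(html)
            if record["title"] is not None:  # Validator: require the key field
                records.append(record)
    return records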

3) Popular Libraries & When to Use Them

Python

HTTP: httpx, requests
  • Async + HTTP/2, timeouts, retries/backoff
  • Great with proxies and connection pooling
Parsing: BeautifulSoup, lxml, parsel
  • CSS/XPath selectors with robust fallbacks
  • Validate required fields, default optional to null
Crawling: Scrapy (see the spider sketch after this table)
  • Pipelines, middlewares, auto‑throttle, retries
  • Battle‑tested for production crawls
Browser: Playwright, Selenium
  • Use for dynamic JS, auth, multi‑step flows
  • Keep concurrency bounded; wait for stable selectors
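
A minimal Scrapy spider sketch for a hypothetical product listing (the start URL and CSS selectors are assumptions, not a real site):

# Sketch: minimal Scrapy spider with polite settings (URL and selectors are assumptions)
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,                # honor robots.txt
        "AUTOTHROTTLE_ENABLED": True,          # adapt delay to observed latency
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,   # small per-host concurrency
        "RETRY_TIMES": 3,                      # retry transient failures
    }

    def parse(self, response):
        for card in response.css(".product-card"):
            yield {
                "title": card.css(".product-title::text").get(),
                "price": card.css(".product-price::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get(default="")),
            }

Run it with scrapy runspider products_spider.py -o products.json; the settings above lean on Scrapy's built-in auto-throttle and retry middleware rather than hand-rolled delays.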

Node.js

HTTP: got, axios
  • Timeouts, retries/backoff; HTTP/2 where needed
  • Good proxy support and streaming
Parsing: cheerio, jsdom
  • Fast server‑side DOM parsing
  • Use semantic selectors; default on missing data
Browser: Playwright, Puppeteer
  • Network interception, cross‑browser automation
  • Headless only when required by JS
Crawling: Crawlee, Apify SDK
  • Queues, storages, autoscaling and retries
  • Great building blocks for resilient crawlers

Go & Java

Go: colly, goquery (HTTP/parsing); chromedp, rod (browser)
  • High‑throughput, low‑memory crawlers
  • Good for large‑scale concurrent fetching
Java: jsoup (HTTP/parsing); Selenium (browser)
  • Robust HTML parsing + enterprise ecosystems
  • Use browser only for complex, JS‑heavy flows

4) Hands‑On: Respectful HTTP Scraper (Python)

A minimal, production‑oriented pattern that checks robots.txt, uses HTTP/2, throttles, retries, and parses safely.

# --- imports ---
import asyncio  # async event loop (for concurrency patterns)
import time  # sleep/backoff utilities
from urllib import robotparser  # robots.txt parser
import httpx  # modern async HTTP client (HTTP/2 support)
from bs4 import BeautifulSoup  # HTML parser (CSS selectors)

# --- config ---
BASE = "https://example.com"  # target origin
START = f"{BASE}/products"  # seed URL to crawl
HEADERS = {  # polite, browser-like headers
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36 PromptFuelBot/1.0",  # identify client
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",  # accept common HTML types
}

MAX_CONCURRENCY = 3  # bound concurrency to be gentle to servers
MAX_RETRIES = 3  # retry transient network errors
DELAY_BASE = 0.75  # base delay (seconds) between requests

# --- robots.txt check ---
def allowed(url: str, ua: str = HEADERS["User-Agent"]) -> bool:  # check crawl permission for URL
    rp = robotparser.RobotFileParser()  # initialize parser
    rp.set_url(f"{BASE}/robots.txt")  # location of robots.txt
    try:
        rp.read()  # fetch and parse rules
    except Exception:
        return False  # if robots cannot be fetched, play it safe
    return rp.can_fetch(ua, url)  # consult policy for this UA and URL
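
Continuing the same snippet, here is a sketch of the fetch-and-parse half: bounded concurrency via a semaphore, retries with jittered backoff, and defensive parsing. The .product-card selectors are assumptions about the target markup, and http2=True requires httpx to be installed with the http2 extra.

# --- fetch with throttle + retries, then parse safely (continues the code above) ---
import random  # jitter for backoff delays

sem = asyncio.Semaphore(MAX_CONCURRENCY)  # bound in-flight requests

async def fetch(client: httpx.AsyncClient, url: str):  # fetch one page politely
    async with sem:  # respect the concurrency cap
        for attempt in range(MAX_RETRIES):
            try:
                resp = await client.get(url)  # GET via the shared HTTP/2 client
                resp.raise_for_status()  # treat 4xx/5xx as failures
                return resp.text
            except httpx.HTTPError:
                await asyncio.sleep(DELAY_BASE * 2 ** attempt + random.random())  # backoff + jitter
    return None  # give up after MAX_RETRIES

def parse(html: str):  # extract product fields defensively
    items = []
    for card in BeautifulSoup(html, "html.parser").select(".product-card"):  # selector is an assumption
        title = card.select_one(".product-title")  # required field
        price = card.select_one(".product-price")  # optional field
        items.append({
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return items

async def main():
    if not allowed(START):  # honor robots.txt before fetching anything
        return
    async with httpx.AsyncClient(http2=True, headers=HEADERS, timeout=15.0) as client:
        html = await fetch(client, START)
        for item in parse(html) if html else []:
            print(item)  # a real crawler would hand these to the storage stage

if __name__ == "__main__":
    asyncio.run(main())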

5) Headless Browser Patterns

Use a headless browser when you truly need JS execution, multi-step flows, or authenticated sessions. Keep concurrency bounded and monitor 429/403s.
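
A minimal Playwright (Python) sketch along these lines; the URL and the .product-card selector are placeholders, and the rendered HTML is handed back to the normal parsing pipeline:

# Sketch: bounded headless rendering with Playwright (URL/selector are placeholders)
import asyncio
from playwright.async_api import async_playwright  # pip install playwright && playwright install

SEM = asyncio.Semaphore(2)  # keep the number of open pages small

async def render(browser, url):
    async with SEM:  # bounded concurrency
        page = await browser.new_page()
        try:
            await page.goto(url, wait_until="domcontentloaded")  # load the document
            await page.wait_for_selector(".product-card", timeout=10_000)  # wait for a stable selector
            return await page.content()  # rendered HTML for the regular parser
        finally:
            await page.close()

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        html = await render(browser, "https://example.com/products")
        print(len(html), "bytes of rendered HTML")
        await browser.close()

asyncio.run(main())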

6) Request Hygiene

  • Rotate realistic User‑Agents, accept‑language, and connection headers.
  • Respect cache and robots directives; add backoff with jitter (see the sketch after this list).
  • Persist cookies per origin; retry idempotently.
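
A sketch of backoff with jitter plus cookie reuse through a shared per-origin client (the retry counts and status thresholds are arbitrary examples):

# Sketch: exponential backoff with full jitter; a shared client persists cookies per origin
import random
import time

import httpx

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Random delay in [0, min(cap, base * 2**attempt)]; grows with each attempt."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def get_with_retries(client, url, retries=4):
    for attempt in range(retries):
        try:
            resp = client.get(url)
            if resp.status_code in (429, 503):  # server is asking us to slow down
                raise httpx.HTTPStatusError("throttled", request=resp.request, response=resp)
            return resp
        except httpx.HTTPError:
            time.sleep(backoff_delay(attempt))  # wait longer (and randomly) each attempt
    return None  # caller decides whether to re-enqueue

# One client per origin keeps cookies and connections alive between requests
client = httpx.Client(headers={
    "User-Agent": "ExampleBot/1.0",
    "Accept-Language": "en-US,en;q=0.9",
})
resp = get_with_retries(client, "https://example.com/products")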

7) Parsing Patterns that Survive Redesigns

  • Favor semantic anchors (ids, data‑attributes) over brittle classes.
  • Use CSS/XPath carefully; assert required fields, default optional ones to null (see the extractor sketch below).
  • Version your extractors; log HTML snapshots when extractors fail.
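
A defensive, versioned extractor sketch (the selectors and version tag are illustrative):

# Sketch: versioned, defensive extractor (selectors and version tag are illustrative)
from bs4 import BeautifulSoup

EXTRACTOR_VERSION = "products-v3"  # bump when selectors change; store alongside output

class ExtractionError(Exception):
    """Raised when a required field is missing so the failure (and HTML) can be logged."""

def extract_product(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Prefer semantic anchors (ids, data-* attributes) over styling classes
    root = soup.select_one("[data-product-id]") or soup.select_one("#product")
    if root is None:
        raise ExtractionError("product root not found")          # required: fail loudly
    title = root.select_one("[itemprop=name], .product-title")
    if title is None:
        raise ExtractionError("title not found")                 # required: fail loudly
    price = root.select_one("[itemprop=price], .product-price")  # optional: default to null
    return {
        "extractor": EXTRACTOR_VERSION,                          # version travels with the data
        "product_id": root.get("data-product-id"),
        "title": title.get_text(strip=True),
        "price": price.get_text(strip=True) if price else None,
    }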

8) Storage: What to Keep and Why

At minimum, store raw HTML, parsed JSON, and logs/metrics. Consider PostgreSQL for entities and object storage for raw snapshots.
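
A sketch of that minimum: raw HTML snapshots on disk keyed by content hash, and parsed records in SQLite (local stand-ins for object storage and PostgreSQL):

# Sketch: raw snapshot + parsed record (files/SQLite stand in for object storage/Postgres)
import hashlib
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

RAW_DIR = Path("raw_html")         # snapshot store
RAW_DIR.mkdir(exist_ok=True)

db = sqlite3.connect("scrape.db")  # entity store
db.execute("""CREATE TABLE IF NOT EXISTS products
              (url TEXT, fetched_at TEXT, raw_sha256 TEXT, data TEXT)""")

def persist(url: str, html: str, record: dict) -> None:
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()  # content-addressed snapshot key
    (RAW_DIR / f"{digest}.html").write_text(html, encoding="utf-8")
    db.execute(
        "INSERT INTO products VALUES (?, ?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(), digest, json.dumps(record)),
    )
    db.commit()

persist("https://example.com/products/1", "<html>...</html>", {"title": "Example", "price": None})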

9) Monitoring: Know When You’re Failing

  • Metrics: success rate, latency, bytes/page, error codes, retries, CAPTCHAs encountered (see the counter sketch after this list).
  • Logs: per‑URL trace with timing; store request/response headers.
  • Alerts: spikes in 403/429, DOM changes (selector miss), growth in retries.
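
A sketch of per-request counters plus a structured per-URL log line that an alert rule could watch (field names and thresholds are illustrative):

# Sketch: per-request counters and structured logging (field names/thresholds are illustrative)
import json
import time
from collections import Counter

metrics = Counter()  # success/failure, retries, and per-status-code counts

def record_fetch(url: str, status: int, started: float, retries: int) -> None:
    metrics[f"status_{status}"] += 1
    metrics["retries"] += retries
    metrics["success" if 200 <= status < 300 else "failure"] += 1
    print(json.dumps({                 # one JSON object per URL, easy to ship to a log pipeline
        "url": url,
        "status": status,
        "latency_ms": round((time.monotonic() - started) * 1000, 1),
        "retries": retries,
    }))
    total = metrics["success"] + metrics["failure"]
    blocked = metrics["status_403"] + metrics["status_429"]
    if total >= 100 and blocked / total > 0.05:  # crude alert condition for blocking
        print("ALERT: more than 5% of requests returned 403/429")

started = time.monotonic()
record_fetch("https://example.com/products", 200, started, retries=0)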

10) Quick Starts: Go & Java

// Basic Go scraper with colly (line-by-line commented)
package main // entry package

import ( // imports
  "fmt"                                 // console output
  "time"                                // durations for the crawl delay
  "github.com/gocolly/colly/v2"         // Go crawling framework
)

func main() { // program entrypoint
  c := colly.NewCollector(               // create a collector (crawler)
    colly.UserAgent("Mozilla/5.0 PromptFuelBot/1.0"),  // set polite User-Agent
    colly.AllowedDomains("example.com"), // restrict crawl scope to domain
  )

  c.Limit(&colly.LimitRule{              // set politeness (throttle)
    DomainGlob:  "*example.*",           // apply to matching domains
    Parallelism: 2,                       // small concurrency
    RandomDelay: 500 * time.Millisecond, // random extra delay up to 500ms (jitter)
  })

  c.OnHTML(".product-card", func(e *colly.HTMLElement) { // parse each product card
    title := e.ChildText(".product-title") // extract title text
    price := e.ChildText(".product-price") // extract price text
    url := e.ChildAttr("a", "href")        // extract product link
    fmt.Println(title, price, url)          // output fields
  })

  c.Visit("https://example.com/products")  // start crawl at listing URL
}

// Minimal Java + jsoup example (line-by-line commented)
import org.jsoup.Jsoup;                      // HTTP + HTML parsing
import org.jsoup.nodes.Document;             // DOM root node
import org.jsoup.nodes.Element;              // DOM element
import org.jsoup.select.Elements;            // element collection

public class JsoupScraper {                  // class declaration
  public static void main(String[] args) throws Exception { // entry point
    Document doc = Jsoup.connect("https://example.com/products") // prepare request
       .userAgent("Mozilla/5.0 PromptFuelBot/1.0")              // set polite UA
       .timeout(15000)                                           // request timeout (ms)
       .get();                                                   // perform GET and parse

    Elements cards = doc.select(".product-card");               // select product cards
    for (Element card : cards) {                                 // iterate results
      String title = card.selectFirst(".product-title") != null
        ? card.selectFirst(".product-title").text() : null;     // extract title
      String price = card.selectFirst(".product-price") != null
        ? card.selectFirst(".product-price").text() : null;     // extract price
      String url = card.selectFirst("a") != null
        ? card.selectFirst("a").attr("href") : null;            // extract link
      System.out.println(title + " | " + price + " | " + url);   // print fields
    }
  }
}

11) Production Checklists

Do

  • Start with HTTP clients; escalate to headless only when needed.
  • Implement retries, backoff, and bounded concurrency.
  • Version your extractors and log raw responses.
  • Respect robots.txt and ToS; prefer official APIs.
  • Instrument metrics; alert on spikes in 403/429/5xx.

Don’t

  • Hammer a single host or ignore crawl delays.
  • Hardcode brittle selectors or scrape PII without consent.
  • Store only parsed data; keep raw HTML for audits.