Table of Contents
1) Ethics, Legality & Ground Rules
2) Architecture at a Glance
3) Popular Libraries & When to Use Them
4) Hands-On: Respectful HTTP Scraper (Python)
5) Headless Browser Patterns
6) Request Hygiene
7) Parsing Patterns that Survive Redesigns
8) Storage: What to Keep and Why
9) Monitoring: Know When You’re Failing
10) Quick Starts: Go & Java
11) Production Checklists
1) Ethics, Legality & Ground Rules
Always respect site Terms of Service, honor robots.txt, and remain considerate of server load. When APIs exist, prefer them over scraping. Design your crawlers to be transparent and reversible—log what you collect and why.
- Check for an API first. An official API is usually faster, cleaner, and more stable than scraping HTML.
- Review robots.txt and sitemaps. Use them as guidance for allowed paths and crawl policies; a small sketch of reading them follows this list.
- Throttle requests. Use adaptive rate limits, backoff, and concurrency caps.
- Identify your client politely. Rotate realistic User‑Agents and include a contact email if appropriate.
- Store raw HTML and logs. Reproducibility helps debug issues and ensure accountability.
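As a small illustration of the robots.txt point above, Python's standard library can read the policy directly. This is only a sketch: the origin, paths, and the contact address in the User-Agent string are placeholders.

# Sketch: read robots.txt guidance with the standard library (origin and UA are placeholders).
from urllib import robotparser

UA = "PromptFuelBot/1.0 (+mailto:ops@example.com)"        # hypothetical self-identifying UA

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")              # robots.txt lives at the origin root
rp.read()                                                 # fetch and parse the rules

print(rp.can_fetch(UA, "https://example.com/products"))   # is this path allowed for our UA?
print(rp.crawl_delay(UA))                                 # Crawl-delay directive, if any
print(rp.request_rate(UA))                                # Request-rate directive, if any
print(rp.site_maps())                                     # Sitemap URLs listed in robots.txt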
2) Architecture at a Glance
Robust scrapers separate fetch, parse, and persist stages. Start simple, keep it modular, and instrument everything (metrics, logs, traces).
Queue (URLs & retries) → Fetcher (HTTP/Browser) → Parser (DOM/HTML/JSON) → Validator → Storage (DB/File) → Monitor (errors, success rates).
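A skeleton of that separation is sketched below. The stage names, signatures, and seed URL are placeholders, not a prescribed API; each stub would be filled in with the techniques from the later sections.

# Skeleton of the pipeline above; stage names and bodies are placeholders.
from collections import deque

def fetch(url: str) -> str: ...          # Fetcher: HTTP client or headless browser
def parse(html: str) -> dict: ...        # Parser: DOM/JSON extraction into fields
def validate(record: dict) -> dict: ...  # Validator: required fields, types, defaults
def store(record: dict) -> None: ...     # Storage: DB row plus raw snapshot

queue = deque(["https://example.com/products"])   # Queue: seed URLs and retries
metrics = {"ok": 0, "failed": 0}                  # Monitor: success / error counters

while queue:
    url = queue.popleft()
    try:
        store(validate(parse(fetch(url))))
        metrics["ok"] += 1
    except Exception:
        metrics["failed"] += 1                    # real code logs and re-queues with backoff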
3) Popular Libraries & When to Use Them
Python
| Category | Libraries | Notes |
|---|---|---|
| HTTP | httpx, requests | httpx adds async and HTTP/2; requests is synchronous |
| Parsing | BeautifulSoup, lxml, parsel | lxml is fast and C-backed; parsel adds CSS/XPath selectors |
| Crawling | Scrapy | Full framework with scheduling, middlewares, and item pipelines |
| Browser | Playwright, Selenium | Drive a real browser for JS-heavy pages |
Node.js
| Category | Libraries | Notes |
|---|---|---|
| HTTP | got, axios | Promise-based HTTP clients |
| Parsing | cheerio, jsdom | cheerio is a fast, jQuery-like parser; jsdom emulates a full DOM |
| Browser | Playwright, Puppeteer | Headless browser automation |
| Crawling | Crawlee, Apify SDK | Request queues, retries, and proxy rotation built in |
Go & Java
| Language | HTTP / Parser | Browser | Notes |
|---|---|---|---|
| Go | colly, goquery | chromedp, rod | colly handles crawling and throttling; goquery offers jQuery-like selection |
| Java | jsoup | Selenium | jsoup combines fetching and parsing in one API |
4) Hands‑On: Respectful HTTP Scraper (Python)
A minimal, production‑ready pattern that checks robots.txt, uses HTTP/2, throttles, retries, and parses safely.
# --- imports ---
import asyncio # async event loop (for concurrency patterns)
import time # sleep/backoff utilities
from urllib import robotparser # robots.txt parser
import httpx # modern async HTTP client (HTTP/2 support)
from bs4 import BeautifulSoup # HTML parser (CSS selectors)
# --- config ---
BASE = "https://example.com" # target origin
START = f"{BASE}/products" # seed URL to crawl
HEADERS = {  # polite, browser-like headers
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36 PromptFuelBot/1.0",  # identify client
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",  # accept common HTML types
}
MAX_CONCURRENCY = 3 # bound concurrency to be gentle to servers
MAX_RETRIES = 3 # retry transient network errors
DELAY_BASE = 0.75 # base delay (seconds) between requests
# --- robots.txt check ---
def allowed(url: str, ua: str = HEADERS["User-Agent"]) -> bool:  # check crawl permission for URL
    rp = robotparser.RobotFileParser()   # initialize parser
    rp.set_url(f"{BASE}/robots.txt")     # location of robots.txt
    try:
        rp.read()                        # fetch and parse rules
    except Exception:
        return False                     # if robots cannot be fetched, play it safe
    return rp.can_fetch(ua, url)         # consult policy for this UA and URL
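The remaining pieces promised above (bounded concurrency, throttling, retries with backoff, HTTP/2, and parsing) can be wired to the allowed() helper roughly as follows. This is a continuation sketch rather than a drop-in script: the .product-card, .product-title, and .product-price selectors are placeholders, and HTTP/2 in httpx requires the httpx[http2] extra.

# --- fetch + parse (continuation sketch; selectors below are placeholders) ---
sem = asyncio.Semaphore(MAX_CONCURRENCY)                        # bound in-flight requests

async def fetch(client: httpx.AsyncClient, url: str) -> str | None:
    if not allowed(url):                                        # honor robots.txt before fetching
        return None
    async with sem:                                             # respect the concurrency cap
        for attempt in range(MAX_RETRIES):
            try:
                resp = await client.get(url, headers=HEADERS, timeout=15.0)
                resp.raise_for_status()                         # treat 4xx/5xx as failures
                await asyncio.sleep(DELAY_BASE)                 # base politeness delay
                return resp.text
            except httpx.HTTPError:
                await asyncio.sleep(DELAY_BASE * 2 ** attempt)  # exponential backoff
    return None

def parse(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")                   # stdlib parser backend
    items = []
    for card in soup.select(".product-card"):                   # placeholder selector
        title = card.select_one(".product-title")               # placeholder selector
        price = card.select_one(".product-price")               # placeholder selector
        items.append({
            "title": title.get_text(strip=True) if title else None,  # default optional to None
            "price": price.get_text(strip=True) if price else None,
        })
    return items

async def main():
    async with httpx.AsyncClient(http2=True) as client:        # HTTP/2 (needs httpx[http2])
        html = await fetch(client, START)
        if html:
            for item in parse(html):
                print(item)

if __name__ == "__main__":
    asyncio.run(main())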
5) Headless Browser Patterns
Use a headless browser when you truly need JS execution, multi-step flows, or authenticated sessions. Keep concurrency bounded and monitor 429/403s.
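As a rough illustration, a Playwright session (Python, sync API) for a JS-rendered listing might look like the sketch below. The URL and the .product-card / .product-title selectors are placeholders, and blocking responses (403/429) are checked explicitly.

# Sketch: headless rendering with Playwright's sync API (URL and selectors are placeholders).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)                  # headless Chromium
    context = browser.new_context(
        user_agent="Mozilla/5.0 PromptFuelBot/1.0")             # identifiable UA
    page = context.new_page()
    resp = page.goto("https://example.com/products", wait_until="networkidle")
    if resp and resp.status in (403, 429):                      # watch for blocking / throttling
        print("blocked or rate limited:", resp.status)
    else:
        page.wait_for_selector(".product-card")                 # wait for JS-rendered content
        for card in page.query_selector_all(".product-card"):
            title = card.query_selector(".product-title")
            print(title.inner_text() if title else None)
    browser.close()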
6) Request Hygiene
- Rotate realistic User‑Agents, accept‑language, and connection headers.
- Respect cache and robots directives; add backoff with jitter (a small sketch follows this list).
- Persist cookies per origin; retry idempotently.
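A minimal sketch of backoff with jitter; the base and cap values are arbitrary, and try_fetch is a hypothetical helper standing in for your request function.

# Sketch: exponential backoff with full jitter (base/cap values are illustrative).
import random
import time

def backoff_sleep(attempt: int, base: float = 0.5, cap: float = 30.0) -> None:
    delay = min(cap, base * (2 ** attempt))   # exponential growth, capped
    time.sleep(random.uniform(0, delay))      # full jitter de-synchronizes retries

# Usage, assuming a hypothetical try_fetch(url) -> bool helper:
#     for attempt in range(MAX_RETRIES):
#         if try_fetch(url):
#             break
#         backoff_sleep(attempt)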
7) Parsing Patterns that Survive Redesigns
- Favor semantic anchors (ids, data‑attributes) over brittle classes.
- Use CSS/XPath carefully; assert required fields, default optional ones to null.
- Version your extractors; log HTML snapshots when extractors fail (a combined sketch follows this list).
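A sketch pulling these ideas together: a versioned extractor that anchors on data-attributes, raises on missing required fields so the caller can snapshot the HTML, and defaults optional fields to null. The attribute and field names are illustrative.

# Sketch: versioned extractor favoring data-attributes (attribute names are illustrative).
from bs4 import BeautifulSoup

EXTRACTOR_VERSION = "products-v2"                        # bump when selectors change

def extract_product(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    root = soup.select_one("[data-product-id]")          # semantic anchor, not a styling class
    if root is None:
        raise ValueError(f"{EXTRACTOR_VERSION}: product root not found")  # required
    title = root.select_one("[data-title]")
    if title is None:
        raise ValueError(f"{EXTRACTOR_VERSION}: title missing")           # required field
    rating = root.select_one("[data-rating]")            # optional field
    return {
        "extractor": EXTRACTOR_VERSION,
        "id": root["data-product-id"],
        "title": title.get_text(strip=True),
        "rating": rating.get_text(strip=True) if rating else None,  # default optional to null
    }

A caller would catch the ValueError, save the offending HTML snapshot, and alert on the extractor version that failed.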
8) Storage: What to Keep and Why
At minimum, store raw HTML, parsed JSON, and logs/metrics. Consider PostgreSQL for entities and object storage for raw snapshots.
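As a minimal illustration, raw HTML and parsed JSON can be written side by side, keyed by URL hash and timestamp. The directory layout is illustrative; object storage for snapshots and PostgreSQL for entities slot into the same shape.

# Sketch: persist raw HTML next to parsed JSON (paths and layout are illustrative).
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def persist(url: str, html: str, record: dict, root: str = "snapshots") -> None:
    key = hashlib.sha256(url.encode()).hexdigest()[:16]           # stable key per URL
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base = pathlib.Path(root) / f"{key}-{stamp}"
    base.parent.mkdir(parents=True, exist_ok=True)
    base.with_suffix(".html").write_text(html, encoding="utf-8")  # raw snapshot for audits
    base.with_suffix(".json").write_text(                         # parsed entity
        json.dumps({"url": url, "fetched_at": stamp, **record}, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )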
9) Monitoring: Know When You’re Failing
- Metrics: success rate, latency, bytes/page, error codes, retries, CAPTCHAs encountered.
- Logs: per‑URL trace with timing; store request/response headers.
- Alerts: spikes in 403/429, DOM changes (selector miss), growth in retries; a minimal counter sketch follows this list.
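A lightweight sketch of per-run counters plus a simple block-rate alert; the 5% threshold and metric names are illustrative, and a real deployment would export these to a metrics backend.

# Sketch: run metrics with a simple 403/429 spike check (threshold and names are illustrative).
from collections import Counter

metrics = Counter()                                  # e.g. metrics["status_200"], metrics["retry"]

def record_response(status: int, elapsed_s: float, size_bytes: int) -> None:
    metrics[f"status_{status}"] += 1
    metrics["pages"] += 1
    metrics["bytes"] += size_bytes
    metrics["latency_ms"] += int(elapsed_s * 1000)   # aggregate; divide by pages for the mean

def should_alert(block_rate_threshold: float = 0.05) -> bool:
    pages = metrics["pages"] or 1
    blocked = metrics["status_403"] + metrics["status_429"]
    return blocked / pages > block_rate_threshold    # alert when blocks exceed 5% of pages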
10) Quick Starts: Go & Java
// Basic Go scraper with colly (line-by-line commented)
package main // entry package
import ( // imports
    "fmt"  // console output
    "time" // durations for the rate limiter

    "github.com/gocolly/colly/v2" // Go crawling framework
)

func main() { // program entrypoint
    c := colly.NewCollector( // create a collector (crawler)
        colly.UserAgent("Mozilla/5.0 PromptFuelBot/1.0"), // set polite User-Agent
        colly.AllowedDomains("example.com"),              // restrict crawl scope to domain
    )
    c.Limit(&colly.LimitRule{ // set politeness (throttle)
        DomainGlob:  "*example.*",           // apply to matching domains
        Parallelism: 2,                      // small concurrency
        RandomDelay: 500 * time.Millisecond, // random jitter of up to 500 ms
    })
    c.OnHTML(".product-card", func(e *colly.HTMLElement) { // parse each product card
        title := e.ChildText(".product-title") // extract title text
        price := e.ChildText(".product-price") // extract price text
        url := e.ChildAttr("a", "href")        // extract product link
        fmt.Println(title, price, url)         // output fields
    })
    c.Visit("https://example.com/products") // start crawl at listing URL
}
// Minimal Java + jsoup example (line-by-line commented)
import org.jsoup.Jsoup;           // HTTP + HTML parsing
import org.jsoup.nodes.Document;  // DOM root node
import org.jsoup.nodes.Element;   // DOM element
import org.jsoup.select.Elements; // element collection

public class JsoupScraper { // class declaration
    public static void main(String[] args) throws Exception { // entry point
        Document doc = Jsoup.connect("https://example.com/products") // prepare request
                .userAgent("Mozilla/5.0 PromptFuelBot/1.0")          // set polite UA
                .timeout(15000)                                      // request timeout (ms)
                .get();                                              // perform GET and parse
        Elements cards = doc.select(".product-card"); // select product cards
        for (Element card : cards) { // iterate results
            String title = card.selectFirst(".product-title") != null
                    ? card.selectFirst(".product-title").text() : null; // extract title
            String price = card.selectFirst(".product-price") != null
                    ? card.selectFirst(".product-price").text() : null; // extract price
            String url = card.selectFirst("a") != null
                    ? card.selectFirst("a").attr("href") : null; // extract link
            System.out.println(title + " | " + price + " | " + url); // print fields
        }
    }
}
11) Production Checklists
Do
- Start with HTTP clients; escalate to headless only when needed.
- Implement retries, backoff, and bounded concurrency.
- Version your extractors and log raw responses.
- Respect robots.txt and ToS; prefer official APIs.
- Instrument metrics; alert on spikes in 403/429/5xx.
Don’t
- Hammer a single host or ignore crawl delays.
- Hardcode brittle selectors; avoid scraping PII without consent.
- Store only parsed data—keep raw HTML for audits.