Introduction: why scrape the web with JavaScript?
JavaScript is the native language of the modern web. If your stack is Node.js, scraping with JS means fewer context switches, shared utilities, and easy deployment with serverless or containers. For static pages, a simple HTTP client plus an HTML parser is enough. For dynamic sites (React/Vue/Next) or bot-protected targets, you’ll want a real browser.
Rule of thumb: Try Axios + Cheerio first (fast & cheap). If the content needs JavaScript execution, move to Playwright or Puppeteer. Use Selenium when you need language parity across ecosystems or specific Grid features.
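If you are unsure which path a target needs, you can probe it cheaply first and fall back to a browser only when the plain HTTP response lacks the data. Here is a minimal sketch of that decision, assuming the placeholder URL and `.product-card` / `.product-title` selectors used throughout this guide:

```javascript
// decide-strategy.js — probe with HTTP first, fall back to a browser if the
// server-rendered markup is missing the elements we need (selectors are placeholders).
const axios = require('axios');
const cheerio = require('cheerio');
const { chromium } = require('playwright');

async function fetchTitles(url) {
  // 1) Cheap path: plain HTTP + Cheerio
  const { data: html } = await axios.get(url, { timeout: 10000 });
  const $ = cheerio.load(html);
  const cards = $('.product-card');
  if (cards.length > 0) {
    return cards.map((i, el) => $(el).find('.product-title').text().trim()).get();
  }
  // 2) Expensive path: content is rendered client-side, so use Playwright
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('.product-card', { timeout: 10000 });
    return await page.$$eval('.product-card', els =>
      els.map(e => e.querySelector('.product-title')?.textContent?.trim())
    );
  } finally {
    await browser.close();
  }
}
```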
Libraries overview (what to use, when)
These are the most common options for a JavaScript website scraper.
| Category | Library | Best for | Pros | Cons |
|---|---|---|---|---|
| HTTP + HTML | axios + cheerio | Static HTML, APIs, speed | Fast, light, cheap, easy to scale | No JS execution; cannot render SPA content |
| Browser | playwright | Dynamic pages, cross‑browser | Auto‑waiting; Firefox/WebKit; robust | Heavier infra; higher cost than HTTP |
| Browser | puppeteer | Chromium‑only, CDP power users | Fast; deep Chrome control; great ecosystem | Chromium focus; limited cross‑browser |
| Browser | selenium-webdriver | Legacy / multi‑language / Grid | Mature; many languages; Grid support | More verbose; slower than newer tools |
| Parsing | cheerio, jsdom | HTML traversal | jQuery‑like selectors; simple API | No real DOM events / rendering |
| Crawling | crawlee (Apify) | Full crawler framework | Queues, retries, autoscaled pools | More opinions; bigger learning curve |
Step 1 — HTTP scraping with Axios + Cheerio (line‑by‑line)
For pages that don’t require JS execution, axios (HTTP client) + cheerio (HTML parser) is the fastest, cheapest path. Below we scrape a product listing and extract the title/price.
# Initialize a Node.js project (generates package.json)
npm init -y
# Install dependencies for HTTP + HTML parsing
npm i axios cheerio
# (Optional) Dev helper that auto-restarts the script on file changes while you iterate
npm i -D nodemon
// ------------------------------
// scrape-static.js — line by line
// ------------------------------
// 1) Import required modules
const fs = require('fs'); // Filesystem for writing results
const path = require('path'); // Path utilities to build file paths
const axios = require('axios'); // HTTP client to fetch the HTML page
const cheerio = require('cheerio'); // jQuery-like HTML parser and selector engine
// 2) Configure the target and output locations
const START_URL = 'https://example.com/products'; // Page to scrape (replace with your target)
const OUT_DIR = path.join(__dirname, 'out'); // Directory for scraped output
const OUT_JSON = path.join(OUT_DIR, 'products.json'); // JSON file path
// 3) Utility: ensure the output directory exists
function ensureDir(dir){
if(!fs.existsSync(dir)) { // Create directory if missing
fs.mkdirSync(dir, { recursive: true });
}
}
// 4) Main scraping routine
async function scrapeProducts(){
try {
// 4a) Fetch the page HTML
const response = await axios.get(START_URL, {
headers: {
// Realistic UA reduces naive bot blocks
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36'
}
});
// 4b) Load HTML into Cheerio to traverse with CSS selectors
const $ = cheerio.load(response.data);
// 4c) Select and extract data into a structured array
const items = [];
$('.product-card').each((i, el) => {
const title = $(el).find('.product-title').text().trim(); // Product name
const price = $(el).find('.product-price').text().trim(); // Product price
const link = $(el).find('a').attr('href') || null; // Optional link
const image = $(el).find('img').attr('src') || null; // Optional image
if (title && price) items.push({ title, price, link, image });
});
// 4d) Persist results to disk
ensureDir(OUT_DIR);
fs.writeFileSync(OUT_JSON, JSON.stringify(items, null, 2));
console.log(`Saved ${items.length} items to ${OUT_JSON}`);
} catch (err) {
// 4e) Handle and report errors (non-zero exit for CI/cron)
console.error('Scrape failed:', err.message);
process.exitCode = 1;
}
}
// 5) Execute when run directly: `node scrape-static.js`
if (require.main === module) scrapeProducts();
Step 2 — Browser scraping with Playwright (line‑by‑line)
When the page renders data with client‑side JavaScript, use a real browser. playwright is our recommended default in 2025.
# Install Playwright and its browsers
npm i -D playwright
npx playwright install chromium
# (Optional) Also install firefox/webkit if needed:
# npx playwright install firefox webkit
// ------------------------------
// scrape-playwright.js — line by line
// ------------------------------
// 1) Imports
const fs = require('fs'); // Save screenshots/data
const path = require('path'); // Build output paths
const { chromium } = require('playwright'); // Headless browser automation
// 2) Output and target config
const START_URL = 'https://example.com/products';
const OUT_DIR = path.join(__dirname, 'out');
const SHOT = path.join(OUT_DIR, 'page.png');
const OUT_JSON = path.join(OUT_DIR, 'products.playwright.json');
function ensureDir(dir){ if(!fs.existsSync(dir)) fs.mkdirSync(dir,{recursive:true}); }
// 3) Helper: scroll page for lazy/infinite lists
async function autoScroll(page){
await page.evaluate(async () => {
await new Promise(resolve => {
let total = 0; const step = 600; // Pixels per step
const timer = setInterval(() => {
const { scrollHeight } = document.documentElement;
window.scrollBy(0, step); total += step; // Accumulate scroll
if (total >= scrollHeight - window.innerHeight - 50) {
clearInterval(timer); resolve(); // Stop near bottom
}
}, 180);
});
});
}
// 4) Main flow
async function run(){
ensureDir(OUT_DIR); // Prepare output dir
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36', // Realistic UA
viewport: { width: 1366, height: 900 }, // Typical laptop size
locale: 'en-US', // Locale sometimes affects content
});
const page = await context.newPage(); // New tab
await page.goto(START_URL, { waitUntil: 'domcontentloaded' });
await autoScroll(page); // Ensure content is loaded
// 4a) Extract items in the page context
const items = await page.$$eval('.product-card', cards =>
cards.map(c => ({
title: c.querySelector('.product-title')?.textContent?.trim(),
price: c.querySelector('.product-price')?.textContent?.trim(),
})).filter(x => x.title && x.price)
);
await page.screenshot({ path: SHOT, fullPage: true }); // Save screenshot
fs.writeFileSync(OUT_JSON, JSON.stringify(items, null, 2)); // Save data
await browser.close();
}
// 5) Run when executed directly
if (require.main === module) run().catch(e => { console.error(e); process.exitCode = 1; });
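One caveat with this flow: `domcontentloaded` can fire before client-side rendering finishes, so if the cards only appear after an API call, an explicit wait between the scroll and the extraction is safer. A small addition, using the same placeholder selector:

```javascript
// Wait for at least one product card to render before extracting (10 s timeout)
await page.waitForSelector('.product-card', { timeout: 10000 });
```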
Puppeteer stealth example
# Install Puppeteer with the stealth plugin
npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
// ------------------------------
// scrape-puppeteer.js (stealth) — line by line
// ------------------------------
const fs = require('fs'); // For writing files
const path = require('path'); // For building paths
const puppeteer = require('puppeteer-extra'); // Puppeteer wrapper
const StealthPlugin = require('puppeteer-extra-plugin-stealth'); // Anti-detection tweaks
puppeteer.use(StealthPlugin()); // Enable stealth plugin
(async () => {
const browser = await puppeteer.launch({ headless: true }); // Launch headless Chrome
const page = await browser.newPage(); // New tab
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36'); // Realistic UA
await page.goto('https://example.com/products', { waitUntil: 'domcontentloaded' }); // Navigate
// Extract items using DOM selectors
const items = await page.$$eval('.product-card', els => els.map(el => ({
title: el.querySelector('.product-title')?.textContent?.trim(),
price: el.querySelector('.product-price')?.textContent?.trim(),
})).filter(x => x.title && x.price));
// Persist results (create the output directory first so the write cannot fail)
const outDir = path.join(__dirname, 'out');
if (!fs.existsSync(outDir)) fs.mkdirSync(outDir, { recursive: true });
fs.writeFileSync(path.join(outDir, 'products.puppeteer.json'), JSON.stringify(items, null, 2));
await browser.close(); // Cleanup
})().catch(e => { console.error('Puppeteer scrape failed:', e); process.exitCode = 1; }); // Report failures with a non-zero exit
Selenium WebDriver example
// ------------------------------
// scrape-selenium.js — line by line
// ------------------------------
const fs = require('fs'); // File writes
const path = require('path'); // Paths
const { Builder, By, until } = require('selenium-webdriver'); // Selenium core
const chrome = require('selenium-webdriver/chrome'); // Chrome options
const START_URL = 'https://example.com/products'; // Target URL
const OUT_DIR = path.join(__dirname, 'out'); // Output dir
const OUT_JSON = path.join(OUT_DIR, 'products.selenium.json'); // Output file
function ensureDir(d){ if(!fs.existsSync(d)) fs.mkdirSync(d,{recursive:true}); }
async function run(){
ensureDir(OUT_DIR);
// Configure Chrome for headless scraping with fewer detection signals
const options = new chrome.Options()
.addArguments('--headless=new')
.addArguments('--disable-blink-features=AutomationControlled');
// Build a WebDriver instance
const driver = await new Builder().forBrowser('chrome').setChromeOptions(options).build();
try{
await driver.get(START_URL); // Navigate
await driver.wait(until.elementsLocated(By.css('.product-card')), 10000); // Wait for content
// Extract data from product cards
const cards = await driver.findElements(By.css('.product-card'));
const items = [];
for(const card of cards){
const title = (await card.findElement(By.css('.product-title')).getText()).trim();
const price = (await card.findElement(By.css('.product-price')).getText()).trim();
items.push({ title, price });
}
fs.writeFileSync(OUT_JSON, JSON.stringify(items, null, 2)); // Save results
} finally {
await driver.quit(); // Always close
}
}
if (require.main === module) run().catch(e => { console.error('Selenium scrape failed:', e); process.exitCode = 1; });
Concurrency & rate limiting (be fast — but polite)
Don’t hammer targets. Use a limiter so you respect sites and reduce bans. bottleneck works great with CommonJS.
// Install deps first: npm i bottleneck axios
const axios = require('axios'); // HTTP client
const Bottleneck = require('bottleneck'); // Concurrency + rate limiter
// Create limiter: at most 3 in flight, ~2 req/s
const limiter = new Bottleneck({ minTime: 500, maxConcurrent: 3 });
// Task to run through the limiter
async function fetchUrl(url){
const res = await axios.get(url, { timeout: 10000 });
return { url, status: res.status, bytes: res.data.length };
}
// Schedule multiple jobs with backpressure
(async () => {
const urls = ['https://example.com/1','https://example.com/2','https://example.com/3'];
const tasks = urls.map(u => limiter.schedule(() => fetchUrl(u)));
const results = await Promise.allSettled(tasks);
console.log(results);
})();
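To plug the limiter into the earlier Cheerio scraper, wrap each page fetch in `limiter.schedule`. A sketch, assuming hypothetical paginated listing URLs and the same placeholder selectors:

```javascript
// Rate-limited scraping of several listing pages with the same Bottleneck settings.
const axios = require('axios');
const cheerio = require('cheerio');
const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({ minTime: 500, maxConcurrent: 3 });

async function scrapePage(url) {
  const { data: html } = await axios.get(url, { timeout: 10000 });
  const $ = cheerio.load(html);
  return $('.product-card').map((i, el) => ({
    title: $(el).find('.product-title').text().trim(),
    price: $(el).find('.product-price').text().trim(),
  })).get();
}

(async () => {
  // Hypothetical paginated listing URLs; replace with your own target
  const pages = [1, 2, 3].map(n => `https://example.com/products?page=${n}`);
  const settled = await Promise.allSettled(
    pages.map(u => limiter.schedule(() => scrapePage(u)))
  );
  const items = settled.filter(r => r.status === 'fulfilled').flatMap(r => r.value);
  console.log(`Collected ${items.length} items from ${pages.length} pages`);
})();
```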
Store results (JSON + CSV)
JSON is perfect for pipelines. CSV helps analysts. Here’s a tiny utility you can reuse.
// Tiny persistence helpers — JSON and CSV
const fs = require('fs'); // Filesystem
const path = require('path'); // Paths
// Ensure a directory exists before writing files
function ensureDir(d){ if(!fs.existsSync(d)) fs.mkdirSync(d,{recursive:true}); }
// Convert array of objects to CSV (handles quotes)
function toCSV(rows){
if(!rows.length) return '';
const headers = Object.keys(rows[0]); // Use keys from first row
const esc = v => `"${String(v ?? '').replace(/"/g,'""')}"`; // Escape quotes
const lines = [headers.map(esc).join(',')]; // Header row
for(const r of rows){
lines.push(headers.map(h => esc(r[h])).join(',')); // Data rows
}
return lines.join('\n');
}
// Save both JSON and CSV variants to disk
function saveAll(outDir, base, rows){
ensureDir(outDir);
const jsonPath = path.join(outDir, `${base}.json`);
const csvPath = path.join(outDir, `${base}.csv`);
fs.writeFileSync(jsonPath, JSON.stringify(rows, null, 2)); // Pretty JSON
fs.writeFileSync(csvPath, toCSV(rows)); // CSV text
return { jsonPath, csvPath };
}
module.exports = { toCSV, saveAll };
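Usage, assuming you saved the helpers above as `save.js` (the filename and sample rows are just examples):

```javascript
// Example usage of the persistence helpers (assumes they live in ./save.js)
const path = require('path');
const { saveAll } = require('./save');

const rows = [
  { title: 'Widget', price: '$9.99' },
  { title: 'Gadget, deluxe "pro"', price: '$19.99' }, // Commas and quotes are escaped in the CSV
];

const { jsonPath, csvPath } = saveAll(path.join(__dirname, 'out'), 'products', rows);
console.log(`Wrote ${jsonPath} and ${csvPath}`);
```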
Anti‑bot & fingerprinting (practical tips)
- Behave like a user: realistic User‑Agent, viewport, timezone, and small jitters.
- Rotate networks: trusted proxies/residential IPs; avoid bursts; use bottleneck.
- Vary fingerprints: Playwright contexts; Puppeteer stealth.
- Detect blocks early: watch for 403/429, challenges, odd redirects; add retries/backoff (see the sketch below).
- CAPTCHA: consider human-in-the-loop or solver services when allowed; otherwise change strategy.
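For the retries/backoff point, here is a minimal sketch with axios; the status codes, delays, and retry count are illustrative defaults to tune per target:

```javascript
// Retry helper with exponential backoff for transient blocks (403/429/5xx) and network errors.
const axios = require('axios');

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function fetchWithRetry(url, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await axios.get(url, { timeout: 10000 });
    } catch (err) {
      const status = err.response?.status;
      const retryable = status === 403 || status === 429 || status >= 500 || !status;
      if (!retryable || attempt === retries) throw err;
      const delay = 1000 * 2 ** attempt + Math.random() * 250; // Exponential backoff plus jitter
      console.warn(`Attempt ${attempt + 1} failed (${status ?? err.code}); retrying in ${Math.round(delay)} ms`);
      await sleep(delay);
    }
  }
}
```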
Legal & ethics
Always follow the law and site terms. Respect robots.txt and robots meta tags, never scrape private or paywalled data you don’t have rights to, and avoid harming services (use rate limits, cache, and respectful intervals).
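A lightweight way to honor robots.txt is to check each URL before fetching it, for example with the robots-parser package. This is only a sketch; the bot name is a placeholder, and falling back to "allowed" when robots.txt is unreachable is a design choice you may want to tighten:

```javascript
// Check robots.txt before scraping a URL (npm i robots-parser)
const axios = require('axios');
const robotsParser = require('robots-parser');

async function isAllowed(url, userAgent = 'my-scraper-bot') { // Placeholder bot name
  const robotsUrl = new URL('/robots.txt', url).href;
  try {
    const { data } = await axios.get(robotsUrl, { timeout: 10000 });
    return robotsParser(robotsUrl, data).isAllowed(url, userAgent) !== false;
  } catch {
    // No robots.txt or it is unreachable: proceed cautiously and keep rate limits low
    return true;
  }
}

// Usage: only fetch when allowed
isAllowed('https://example.com/products').then(ok => console.log('Allowed:', ok));
```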
Summary & next steps
- Start with Axios + Cheerio (fast & cheap).
- Use Playwright for SPAs and dynamic content.
- Choose Puppeteer if you want pure Chromium + CDP.
- Use Selenium for Grid or language parity.
- Add bottleneck for polite concurrency; store data as JSON/CSV.