Introduction: why scrape the web with JavaScript?
JavaScript is the native language of the modern web. If your stack is Node.js, scraping with JS means fewer context switches, shared utilities, and easy deployment with serverless or containers. For static pages, a simple HTTP client plus an HTML parser is enough. For dynamic sites (React/Vue/Next) or bot-protected targets, you’ll want a real browser.
Rule of thumb: Try Axios + Cheerio first (fast & cheap). If the content needs JavaScript execution, move to Playwright or Puppeteer. Use Selenium when you need language parity across ecosystems or specific Grid features.
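If you are unsure which path a target needs, you can probe it cheaply first and fall back to a browser only when the plain HTTP response lacks the data. Here is a minimal sketch of that decision, assuming the placeholder URL and `.product-card` / `.product-title` selectors used throughout this guide:

```javascript
// decide-strategy.js — probe with HTTP first, fall back to a browser if the
// server-rendered markup is missing the elements we need (selectors are placeholders).
const axios = require('axios');
const cheerio = require('cheerio');
const { chromium } = require('playwright');

async function fetchTitles(url) {
  // 1) Cheap path: plain HTTP + Cheerio
  const { data: html } = await axios.get(url, { timeout: 10000 });
  const $ = cheerio.load(html);
  const cards = $('.product-card');
  if (cards.length > 0) {
    return cards.map((i, el) => $(el).find('.product-title').text().trim()).get();
  }
  // 2) Expensive path: content is rendered client-side, so use Playwright
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('.product-card', { timeout: 10000 });
    return await page.$$eval('.product-card', els =>
      els.map(e => e.querySelector('.product-title')?.textContent?.trim())
    );
  } finally {
    await browser.close();
  }
}
```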
Libraries overview (what to use, when)
These are the most common options for a JavaScript website scraper.
| Category | Library | Best for | Pros | Cons |
|---|---|---|---|---|
| HTTP + HTML | axios + cheerio | Static HTML, APIs, speed | Fast, light, cheap, easy to scale | No JS execution; cannot render SPA content |
| Browser | playwright | Dynamic pages, cross‑browser | Auto‑waiting; Firefox/WebKit; robust | Heavier infra; higher cost than HTTP |
| Browser | puppeteer | Chromium‑only, CDP power users | Fast; deep Chrome control; great ecosystem | Chromium focus; limited cross‑browser |
| Browser | selenium-webdriver | Legacy / multi‑language / Grid | Mature; many languages; Grid support | More verbose; slower than newer tools |
| Parsing | cheerio, jsdom | HTML traversal | jQuery‑like selectors; simple API | No real DOM events / rendering |
| Crawling | crawlee (Apify) | Full crawler framework | Queues, retries, autoscaled pools | More opinions; bigger learning curve |
Step 1 — HTTP scraping with Axios + Cheerio (line‑by‑line)
For pages that don’t require JS execution, axios (HTTP client) + cheerio (HTML parser) is the fastest, cheapest path. Below we scrape a product listing and extract the title/price.
# Initialize a Node.js project (generates package.json)
npm init -y
# Install dependencies for HTTP + HTML parsing
npm i axios cheerio
# (Optional) Dev helper that auto-restarts the script on file changes while you iterate
npm i -D nodemon
// ------------------------------
// scrape-static.js — line by line
// ------------------------------
// 1) Import required modules
const fs = require('fs'); // Filesystem for writing results
const path = require('path'); // Path utilities to build file paths
const axios = require('axios'); // HTTP client to fetch the HTML page
const cheerio = require('cheerio'); // jQuery-like HTML parser and selector engine
// 2) Configure the target and output locations
const START_URL = 'https://example.com/products'; // Page to scrape (replace with your target)
const OUT_DIR = path.join(__dirname, 'out'); // Directory for scraped output
const OUT_JSON = path.join(OUT_DIR, 'products.json'); // JSON file path
// 3) Utility: ensure the output directory exists
function ensureDir(dir){
if(!fs.existsSync(dir)) { // Create directory if missing
fs.mkdirSync(dir, { recursive: true });
}
}
// 4) Main scraping routine
async function scrapeProducts(){
try {
// 4a) Fetch the page HTML
const response = await axios.get(START_URL, {
headers: {
// Realistic UA reduces naive bot blocks
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36'
}
});
// 4b) Load HTML into Cheerio to traverse with CSS selectors
const $ = cheerio.load(response.data);
// 4c) Select and extract data into a structured array
const items = [];
$('.product-card').each((i, el) => {
const title = $(el).find('.product-title').text().trim(); // Product name
const price = $(el).find('.product-price').text().trim(); // Product price
const link = $(el).find('a').attr('href') || null; // Optional link
const image = $(el).find('img').attr('src') || null; // Optional image
if (title && price) items.push({ title, price, link, image });
});
// 4d) Persist results to disk
ensureDir(OUT_DIR);
fs.writeFileSync(OUT_JSON, JSON.stringify(items, null, 2));
console.log(`Saved ${items.length} items to ${OUT_JSON}`);
} catch (err) {
// 4e) Handle and report errors (non-zero exit for CI/cron)
console.error('Scrape failed:', err.message);
process.exitCode = 1;
}
}
// 5) Execute when run directly: `node scrape-static.js`
if (require.main === module) scrapeProducts();
Step 2 — Browser scraping with Playwright (line‑by‑line)
When the page renders data with client‑side JavaScript, use a real browser. playwright is our recommended default in 2025.
# Install Playwright and its browsers
npm i -D playwright
npx playwright install chromium
# (Optional) Also install firefox/webkit if needed:
# npx playwright install firefox webkit
// ------------------------------
// scrape-playwright.js — line by line
// ------------------------------
// 1) Imports
const fs = require('fs'); // Save screenshots/data
const path = require('path'); // Build output paths
const { chromium } = require('playwright'); // Headless browser automation
// 2) Output and target config
const START_URL = 'https://example.com/products';
const OUT_DIR = path.join(__dirname, 'out');
const SHOT = path.join(OUT_DIR, 'page.png');
const OUT_JSON = path.join(OUT_DIR, 'products.playwright.json');
function ensureDir(dir){ if(!fs.existsSync(dir)) fs.mkdirSync(dir,{recursive:true}); }
// 3) Helper: scroll page for lazy/infinite lists
async function autoScroll(page){
await page.evaluate(async () => {
await new Promise(resolve => {
let total = 0; const step = 600; // Pixels per step
const timer = setInterval(() => {
const { scrollHeight } = document.documentElement;
window.scrollBy(0, step); total += step; // Accumulate scroll
if (total >= scrollHeight - window.innerHeight - 50) {
clearInterval(timer); resolve(); // Stop near bottom
}
}, 180);
});
});
}
// 4) Main flow
async function run(){
ensureDir(OUT_DIR); // Prepare output dir
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36', // Realistic UA
viewport: { width: 1366, height: 900 }, // Typical laptop size
locale: 'en-US', // Locale sometimes affects content
});
const page = await context.newPage(); // New tab
await page.goto(START_URL, { waitUntil: 'domcontentloaded' });
await autoScroll(page); // Ensure content is loaded
// 4a) Extract items in the page context
const items = await page.$$eval('.product-card', cards =>
cards.map(c => ({
title: c.querySelector('.product-title')?.textContent?.trim(),
price: c.querySelector('.product-price')?.textContent?.trim(),
})).filter(x => x.title && x.price)
);
await page.screenshot({ path: SHOT, fullPage: true }); // Save screenshot
fs.writeFileSync(OUT_JSON, JSON.stringify(items, null, 2)); // Save data
await browser.close();
}
// 5) Run when executed directly
if (require.main === module) run().catch(e => { console.error(e); process.exitCode = 1; });
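One caveat with this flow: `domcontentloaded` can fire before client-side rendering finishes, so if the cards only appear after an API call, an explicit wait between the scroll and the extraction is safer. A small addition, using the same placeholder selector:

```javascript
// Wait for at least one product card to render before extracting (10 s timeout)
await page.waitForSelector('.product-card', { timeout: 10000 });
```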
Puppeteer stealth example
# Install Puppeteer with the stealth plugin
npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
// ------------------------------
// scrape-puppeteer.js (stealth) — line by line
// ------------------------------
const fs = require('fs'); // For writing files
const path = require('path'); // For building paths
const puppeteer = require('puppeteer-extra'); // Puppeteer wrapper
const StealthPlugin = require('puppeteer-extra-plugin-stealth'); // Anti-detection tweaks
puppeteer.use(StealthPlugin()); // Enable stealth plugin
(async () => {
const browser = await puppeteer.launch({ headless: true }); // Launch headless Chrome
const page = await browser.newPage(); // New tab
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36'); // Realistic UA
await page.goto('https://example.com/products', { waitUntil: 'domcontentloaded' }); // Navigate
// Extract items using DOM selectors
const items = await page.$$eval('.product-card', els => els.map(el => ({
title: el.querySelector('.product-title')?.textContent?.trim(),
price: el.querySelector('.product-price')?.textContent?.trim(),
})).filter(x => x.title && x.price));
// Persist results (create the output directory first so the write cannot fail)
const outDir = path.join(__dirname, 'out');
if (!fs.existsSync(outDir)) fs.mkdirSync(outDir, { recursive: true });
fs.writeFileSync(path.join(outDir, 'products.puppeteer.json'), JSON.stringify(items, null, 2));
await browser.close(); // Cleanup
})().catch(e => { console.error('Puppeteer scrape failed:', e); process.exitCode = 1; }); // Report failures with a non-zero exit
Selenium WebDriver example
// ------------------------------
// scrape-selenium.js — line by line
// ------------------------------
const fs = require('fs'); // File writes
const path = require('path'); // Paths
const { Builder, By, until } = require('selenium-webdriver'); // Selenium core
const chrome = require('selenium-webdriver/chrome'); // Chrome options
const START_URL = 'https://example.com/products'; // Target URL
const OUT_DIR = path.join(__dirname, 'out'); // Output dir
const OUT_JSON = path.join(OUT_DIR, 'products.selenium.json'); // Output file
function ensureDir(d){ if(!fs.existsSync(d)) fs.mkdirSync(d,{recursive:true}); }
async function run(){
ensureDir(OUT_DIR);
// Configure Chrome for headless scraping with fewer detection signals
const options = new chrome.Options()
.addArguments('--headless=new')
.addArguments('--disable-blink-features=AutomationControlled');
// Build a WebDriver instance
const driver = await new Builder().forBrowser('chrome').setChromeOptions(options).build();
try{
await driver.get(START_URL); // Navigate
await driver.wait(until.elementsLocated(By.css('.product-card')), 10000); // Wait for content
// Extract data from product cards
const cards = await driver.findElements(By.css('.product-card'));
const items = [];
for(const card of cards){
const title = (await card.findElement(By.css('.product-title')).getText()).trim();
const price = (await card.findElement(By.css('.product-price')).getText()).trim();
items.push({ title, price });
}
fs.writeFileSync(OUT_JSON, JSON.stringify(items, null, 2)); // Save results
} finally {
await driver.quit(); // Always close
}
}
if (require.main === module) run().catch(e => { console.error('Selenium scrape failed:', e); process.exitCode = 1; });
Concurrency & rate limiting (be fast — but polite)
Don’t hammer targets. Use a limiter so you respect sites and reduce bans. bottleneck works great with CommonJS.
// Install deps first: npm i bottleneck axios
const axios = require('axios'); // HTTP client
const Bottleneck = require('bottleneck'); // Concurrency + rate limiter
// Create limiter: at most 3 in flight, ~2 req/s
const limiter = new Bottleneck({ minTime: 500, maxConcurrent: 3 });
// Task to run through the limiter
async function fetchUrl(url){
const res = await axios.get(url, { timeout: 10000 });
return { url, status: res.status, bytes: res.data.length };
}
// Schedule multiple jobs with backpressure
(async () => {
const urls = ['https://example.com/1','https://example.com/2','https://example.com/3'];
const tasks = urls.map(u => limiter.schedule(() => fetchUrl(u)));
const results = await Promise.allSettled(tasks);
console.log(results);
})();
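To plug the limiter into the earlier Cheerio scraper, wrap each page fetch in `limiter.schedule`. A sketch, assuming hypothetical paginated listing URLs and the same placeholder selectors:

```javascript
// Rate-limited scraping of several listing pages with the same Bottleneck settings.
const axios = require('axios');
const cheerio = require('cheerio');
const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({ minTime: 500, maxConcurrent: 3 });

async function scrapePage(url) {
  const { data: html } = await axios.get(url, { timeout: 10000 });
  const $ = cheerio.load(html);
  return $('.product-card').map((i, el) => ({
    title: $(el).find('.product-title').text().trim(),
    price: $(el).find('.product-price').text().trim(),
  })).get();
}

(async () => {
  // Hypothetical paginated listing URLs; replace with your own target
  const pages = [1, 2, 3].map(n => `https://example.com/products?page=${n}`);
  const settled = await Promise.allSettled(
    pages.map(u => limiter.schedule(() => scrapePage(u)))
  );
  const items = settled.filter(r => r.status === 'fulfilled').flatMap(r => r.value);
  console.log(`Collected ${items.length} items from ${pages.length} pages`);
})();
```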
Store results (JSON + CSV)
JSON is perfect for pipelines. CSV helps analysts. Here’s a tiny utility you can reuse.
// Tiny persistence helpers — JSON and CSV
const fs = require('fs'); // Filesystem
const path = require('path'); // Paths
// Ensure a directory exists before writing files
function ensureDir(d){ if(!fs.existsSync(d)) fs.mkdirSync(d,{recursive:true}); }
// Convert array of objects to CSV (handles quotes)
function toCSV(rows){
if(!rows.length) return '';
const headers = Object.keys(rows[0]); // Use keys from first row
const esc = v => `"${String(v ?? '').replace(/"/g,'""')}"`; // Escape quotes
const lines = [headers.map(esc).join(',')]; // Header row
for(const r of rows){
lines.push(headers.map(h => esc(r[h])).join(',')); // Data rows
}
return lines.join('\n');
}
// Save both JSON and CSV variants to disk
function saveAll(outDir, base, rows){
ensureDir(outDir);
const jsonPath = path.join(outDir, `${base}.json`);
const csvPath = path.join(outDir, `${base}.csv`);
fs.writeFileSync(jsonPath, JSON.stringify(rows, null, 2)); // Pretty JSON
fs.writeFileSync(csvPath, toCSV(rows)); // CSV text
return { jsonPath, csvPath };
}
module.exports = { toCSV, saveAll };
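Usage, assuming you saved the helpers above as `save.js` (the filename and sample rows are just examples):

```javascript
// Example usage of the persistence helpers (assumes they live in ./save.js)
const path = require('path');
const { saveAll } = require('./save');

const rows = [
  { title: 'Widget', price: '$9.99' },
  { title: 'Gadget, deluxe "pro"', price: '$19.99' }, // Commas and quotes are escaped in the CSV
];

const { jsonPath, csvPath } = saveAll(path.join(__dirname, 'out'), 'products', rows);
console.log(`Wrote ${jsonPath} and ${csvPath}`);
```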
Anti‑bot & fingerprinting (practical tips)
- Behave like a user: realistic User‑Agent, viewport, timezone, and small jitters.
- Rotate networks: trusted proxies/residential IPs; avoid bursts; use bottleneck.
- Vary fingerprints: Playwright contexts; Puppeteer stealth.
- Detect blocks early: watch for 403/429, challenges, odd redirects; add retries/backoff (see the sketch below).
- CAPTCHA: consider human-in-the-loop or solver services when allowed; otherwise change strategy.
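For the retries/backoff point, here is a minimal sketch with axios; the status codes, delays, and retry count are illustrative defaults to tune per target:

```javascript
// Retry helper with exponential backoff for transient blocks (403/429/5xx) and network errors.
const axios = require('axios');

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function fetchWithRetry(url, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await axios.get(url, { timeout: 10000 });
    } catch (err) {
      const status = err.response?.status;
      const retryable = status === 403 || status === 429 || status >= 500 || !status;
      if (!retryable || attempt === retries) throw err;
      const delay = 1000 * 2 ** attempt + Math.random() * 250; // Exponential backoff plus jitter
      console.warn(`Attempt ${attempt + 1} failed (${status ?? err.code}); retrying in ${Math.round(delay)} ms`);
      await sleep(delay);
    }
  }
}
```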
Legal & ethics
Always follow the law and site terms. Respect robots.txt and robots meta tags, never scrape private or paywalled data you don’t have rights to, and avoid harming services (use rate limits, cache, and respectful intervals).
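A lightweight way to honor robots.txt is to check each URL before fetching it, for example with the robots-parser package. This is only a sketch; the bot name is a placeholder, and falling back to "allowed" when robots.txt is unreachable is a design choice you may want to tighten:

```javascript
// Check robots.txt before scraping a URL (npm i robots-parser)
const axios = require('axios');
const robotsParser = require('robots-parser');

async function isAllowed(url, userAgent = 'my-scraper-bot') { // Placeholder bot name
  const robotsUrl = new URL('/robots.txt', url).href;
  try {
    const { data } = await axios.get(robotsUrl, { timeout: 10000 });
    return robotsParser(robotsUrl, data).isAllowed(url, userAgent) !== false;
  } catch {
    // No robots.txt or it is unreachable: proceed cautiously and keep rate limits low
    return true;
  }
}

// Usage: only fetch when allowed
isAllowed('https://example.com/products').then(ok => console.log('Allowed:', ok));
```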
Summary & next steps
- Start with Axios + Cheerio (fast & cheap).
- Use Playwright for SPAs and dynamic content.
- Choose Puppeteer if you want pure Chromium + CDP.
- Use Selenium for Grid or language parity.
- Add bottleneck for polite concurrency; store data as JSON/CSV.