Web Scraping with Node.js
Web scraping is one of the most practical skills in a developer's toolkit — whether you're collecting pricing data, monitoring competitors, aggregating content, or building datasets for machine learning. Node.js is an exceptional platform for scraping because of its non-blocking I/O model, the maturity of its HTTP and DOM libraries, and the sheer speed you can achieve with asynchronous requests. In this guide we'll go from zero to production-ready scraper using the two most popular libraries: Cheerio for static sites and Puppeteer for JavaScript-heavy pages.
Always check for an official API before scraping. Many sites offer official APIs with free tiers (GitHub, OpenWeatherMap). Scraping is appropriate when no API exists, the API is cost-prohibitive, or you need data the API doesn't expose. Always review a site's robots.txt and Terms of Service first.
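You can automate that robots.txt check before crawling. Here is a minimal sketch using the robots-parser npm package; the crawler name is a placeholder and the fallback behaviour is an assumption you may want to tighten.
const axios = require('axios');
const robotsParser = require('robots-parser'); // npm install robots-parser
// Returns true if the given URL may be fetched by our crawler
async function isAllowed(targetUrl, userAgent = 'MyCrawler/1.0') {
  const robotsUrl = new URL('/robots.txt', targetUrl).href;
  const { data: robotsTxt } = await axios.get(robotsUrl, { timeout: 10_000 });
  const robots = robotsParser(robotsUrl, robotsTxt);
  // isAllowed() returns undefined when no rule matches; treat that as allowed
  return robots.isAllowed(targetUrl, userAgent) ?? true;
}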
Choosing the Right Tool
Node.js scraping falls into two broad categories depending on how the target page renders its content:
| Library | Approach | Best For | Speed |
|---|---|---|---|
| Cheerio | Parse raw HTML (server-rendered) | Static pages, blogs, news sites | Very fast (no browser) |
| Puppeteer | Headless Chrome (full browser) | SPAs, React/Vue apps, login-gated pages | Slower (real browser) |
| Playwright | Multi-browser headless automation | Cross-browser, complex auth flows | Moderate |
| axios + Cheerio | HTTP + DOM parsing | Simple HTML extraction at scale | Fastest |
Figure: Static vs dynamic scraping pipelines in Node.js
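Not sure which category a page falls into? A quick test is to fetch the raw HTML and check whether the content you need is already present; if the selector matches nothing, the page is rendered client-side and you'll want Puppeteer. A minimal sketch (the selector is whatever element you intend to scrape):
const axios = require('axios');
const cheerio = require('cheerio');
// Returns true if the selector exists in the server-rendered HTML
async function isServerRendered(url, selector) {
  const { data: html } = await axios.get(url, { timeout: 10_000 });
  const $ = cheerio.load(html);
  return $(selector).length > 0; // zero matches usually means client-side rendering
}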
Setting Up Your Project
Initialise the project
Create a new Node.js project and install the core dependencies you'll need for both static and dynamic scraping.
mkdir node-scraper && cd node-scraper
npm init -y
# Static scraping
npm install axios cheerio
# Dynamic scraping (downloads Chromium automatically)
npm install puppeteer
# Utilities
npm install fs-extra csv-writer p-limit@3   # p-limit v4+ is ESM-only; v3 works with require()
Configure project structure
Organise your scraper code into logical modules for reusability and maintainability.
node-scraper/
├── scrapers/
│ ├── cheerio-scraper.js # Static HTML scraping
│ └── puppeteer-scraper.js # Dynamic JS-rendered scraping
├── utils/
│ ├── http.js # Axios instance with retry logic
│ └── export.js # CSV / JSON export helpers
├── data/ # Output directory
└── index.js # Entry point
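With that layout, the entry point just wires the pieces together. A minimal index.js sketch, assuming scrapers/cheerio-scraper.js exports the scrapeArticles function shown in the next section and utils/export.js exports the saveJSON helper covered later:
// index.js: entry point wiring a scraper to the export helpers
const { scrapeArticles } = require('./scrapers/cheerio-scraper');
const { saveJSON } = require('./utils/export');
async function main() {
  // Placeholder URL: point this at a site you are allowed to scrape
  const articles = await scrapeArticles('https://example.com/blog');
  await saveJSON(articles, 'articles.json');
}
main().catch(console.error);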
Static Scraping with Cheerio
Cheerio implements jQuery's API on top of a fast HTML parser. It's the best tool for scraping server-rendered pages — news articles, documentation, e-commerce product listings, and anything returned as complete HTML from the server.
Basic Cheerio Scraper
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeArticles(url) {
const { data: html } = await axios.get(url, {
headers: {
'User-Agent':
'Mozilla/5.0 (compatible; MyCrawler/1.0; +https://example.com)',
},
timeout: 10_000,
});
const $ = cheerio.load(html);
const articles = [];
// Iterate each article card in the listing
$('article.post-card').each((_, el) => {
articles.push({
title: $(el).find('h2.post-title').text().trim(),
url: $(el).find('a.post-link').attr('href'),
date: $(el).find('time').attr('datetime'),
summary: $(el).find('p.excerpt').text().trim(),
tags: $(el)
.find('.tag')
.map((_, t) => $(t).text().trim())
.get(),
});
});
return articles;
}
module.exports = { scrapeArticles };
Handling Pagination
async function scrapeAllPages(baseUrl, maxPages = 10) {
const allItems = [];
let page = 1;
while (page <= maxPages) {
const url = `${baseUrl}?page=${page}`;
console.log(`Scraping page ${page}: ${url}`);
const { data: html } = await axios.get(url, { timeout: 10_000 });
const $ = cheerio.load(html);
// Collect items on this page
$('li.result-item').each((_, el) => {
allItems.push({
name: $(el).find('.name').text().trim(),
price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
sku: $(el).data('sku'),
});
});
// Stop if there is no "next" link
const hasNext = $('a[rel="next"]').length > 0;
if (!hasNext) break;
page++;
// Polite delay — avoid hammering the server
await new Promise(r => setTimeout(r, 1500));
}
return allItems;
}
Sending requests without delays can overload servers, get your IP blocked, and is considered abusive. Always add at least a 1–2 second delay between requests. For production scrapers, use a queue with concurrency limits.
Dynamic Scraping with Puppeteer
Puppeteer controls a real Chromium browser, which means it can execute JavaScript, handle login forms, click buttons, wait for network requests to complete, and scrape content that only appears after client-side rendering. It's the tool you reach for when Cheerio returns an empty DOM.
Core Puppeteer Patterns
const puppeteer = require('puppeteer');
async function scrapeWithPuppeteer(url) {
const browser = await puppeteer.launch({
headless: 'new', // use new headless mode
args: ['--no-sandbox'],
});
const page = await browser.newPage();
// Set a realistic viewport and User-Agent
await page.setViewport({ width: 1280, height: 800 });
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
);
await page.goto(url, { waitUntil: 'networkidle2' });
// Extract data in browser context
const products = await page.$$eval('.product-card', cards =>
cards.map(card => ({
name: card.querySelector('.name')?.textContent.trim(),
price: card.querySelector('.price')?.textContent.trim(),
rating: card.querySelector('.stars')?.dataset.rating,
}))
);
await browser.close();
return products;
}
// Wait for a selector to appear
await page.waitForSelector('.results-container', { timeout: 15_000 });
// Wait for a network request to finish
await page.waitForResponse(res =>
res.url().includes('/api/products') && res.status() === 200
);
// Wait for JavaScript variable to be set
await page.waitForFunction(
() => window.__NEXT_DATA__?.props?.pageProps?.items?.length > 0
);
// Wait for navigation after a click
await Promise.all([
page.waitForNavigation({ waitUntil: 'networkidle0' }),
page.click('button.load-more'),
]);
// Scroll to bottom to trigger infinite scroll
await page.evaluate(async () => {
await new Promise(resolve => {
let total = 0;
const step = 400;
const timer = setInterval(() => {
window.scrollBy(0, step);
total += step;
if (total >= document.body.scrollHeight) {
clearInterval(timer);
resolve();
}
}, 200);
});
});
async function loginAndScrape(url, credentials) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
// Navigate to login page
await page.goto('https://example.com/login');
// Fill in credentials
await page.type('#email', credentials.email, { delay: 50 });
await page.type('#password', credentials.password, { delay: 50 });
// Submit and wait for redirect
await Promise.all([
page.waitForNavigation({ waitUntil: 'networkidle0' }),
page.click('button[type="submit"]'),
]);
// Verify we are logged in
const loggedIn = await page.$('.user-dashboard');
if (!loggedIn) throw new Error('Login failed');
// Now navigate to the target URL
await page.goto(url, { waitUntil: 'networkidle2' });
const data = await page.$$eval('.protected-item', items =>
items.map(i => i.textContent.trim())
);
await browser.close();
return data;
}
// Intercept XHR/fetch responses to grab API data directly
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
// Collect API responses before navigation
const apiData = [];
page.on('response', async response => {
const url = response.url();
if (url.includes('/api/v2/listings') && response.status() === 200) {
try {
const json = await response.json();
apiData.push(...(json.results ?? []));
} catch {}
}
});
await page.goto('https://example.com/listings', {
waitUntil: 'networkidle0',
});
console.log('Captured API items:', apiData.length);
await browser.close();
Building a Production-Ready Scraper
A one-off script is easy, but production scrapers need reliability, retry logic, concurrency control, and proper data export. Here's how to build a robust scraper that handles failures gracefully.
HTTP Client with Retry Logic
const axios = require('axios');
const client = axios.create({
timeout: 15_000,
headers: {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept-Language': 'en-US,en;q=0.9',
Accept: 'text/html,application/xhtml+xml',
},
});
// Automatic retry on 429 / 5xx with exponential back-off
client.interceptors.response.use(null, async error => {
const config = error.config;
config._retryCount = config._retryCount ?? 0;
const retryableStatus = [429, 500, 502, 503, 504];
const shouldRetry =
config._retryCount < 3 &&
(!error.response || retryableStatus.includes(error.response.status));
if (!shouldRetry) return Promise.reject(error);
config._retryCount += 1;
const delay = 1000 * 2 ** config._retryCount; // 2s, 4s, 8s
console.log(`Retry ${config._retryCount} for ${config.url} in ${delay}ms`);
await new Promise(r => setTimeout(r, delay));
return client(config);
});
module.exports = client;
Concurrent Scraping with Rate Limiting
const pLimit = require('p-limit'); // works with p-limit v3 (v4+ is ESM-only)
const client = require('./utils/http');
const cheerio = require('cheerio');
// Max 3 concurrent requests at once
const limit = pLimit(3);
async function scrapeProduct(url) {
const { data: html } = await client.get(url);
const $ = cheerio.load(html);
return {
url,
title: $('h1.product-title').text().trim(),
price: $('[data-price]').data('price'),
description: $('div.product-description').text().trim().slice(0, 500),
images: $('img.product-image')
.map((_, el) => $(el).attr('src'))
.get(),
};
}
async function scrapeAll(urls) {
const tasks = urls.map(url =>
limit(async () => {
try {
return await scrapeProduct(url);
} catch (err) {
console.error(`Failed: ${url} — ${err.message}`);
return null;
}
})
);
const results = await Promise.all(tasks);
return results.filter(Boolean); // remove nulls from failures
}
module.exports = { scrapeAll };
Exporting Data to CSV and JSON
const { createObjectCsvWriter } = require('csv-writer');
const fs = require('fs-extra');
const path = require('path');
async function saveJSON(data, filename) {
await fs.ensureDir('./data');
const filePath = path.join('./data', filename);
await fs.writeJSON(filePath, data, { spaces: 2 });
console.log(`Saved ${data.length} records → ${filePath}`);
}
async function saveCSV(data, filename, headers) {
await fs.ensureDir('./data');
const writer = createObjectCsvWriter({
path: path.join('./data', filename),
header: headers, // [{ id: 'title', title: 'Title' }, ...]
});
await writer.writeRecords(data);
console.log(`CSV written → ./data/${filename}`);
}
module.exports = { saveJSON, saveCSV };
Anti-Detection and Ethical Practices
Modern websites increasingly deploy bot-detection measures. Understanding them helps you build scrapers that are less likely to be blocked, and reminds you why ethical scraping matters.
Common Anti-Detection Techniques
- Rotate User-Agents: Use a pool of real browser User-Agent strings and pick one randomly per request (see the sketch after this list).
- Respect robots.txt: Parse /robots.txt and honour Disallow paths. Use the robots-parser npm package.
- Use the puppeteer-extra stealth plugin: Patches dozens of Puppeteer fingerprinting vectors (navigator.webdriver, Chrome-specific globals, etc.).
- Add realistic delays: Randomise delays between 800ms and 3000ms to mimic human browsing.
- Handle CAPTCHAs gracefully: Detect "captcha" in the response URL or body, pause, and log for manual review, or integrate a CAPTCHA-solving service for automation.
- Use residential proxies for scale: Rotate IPs through a proxy pool to avoid IP-level rate limits.
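A small helper covers the first and fourth points: pick a random User-Agent per request and sleep a random interval between requests. This is a sketch with a deliberately short example pool; extend it with current, real User-Agent strings.
// Example pool of real browser User-Agent strings (keep these up to date)
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];
// Pick one User-Agent at random for each request
const randomUserAgent = () =>
  USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
// Wait a random interval between requests to mimic human browsing
const randomDelay = (min = 800, max = 3000) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));
// Usage: await randomDelay(); then pass { headers: { 'User-Agent': randomUserAgent() } } to axios
The stealth plugin below handles the browser-fingerprint side of the same problem: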
// npm install puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
// Randomise viewport to avoid static fingerprint
await page.setViewport({
width: 1280 + Math.floor(Math.random() * 100),
height: 800 + Math.floor(Math.random() * 100),
});
// Pass the navigator.webdriver check
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});
await page.goto('https://example.com', { waitUntil: 'networkidle2' });
Always read a website's Terms of Service. Scraping personal data may violate GDPR or CCPA. Never scrape at a rate that degrades site performance. Store only the data you need, and delete it when no longer required. For commercial use, consult a lawyer.
Real-World Example: Job Listing Aggregator
Let's put it all together with a practical example — scraping job listings from a static job board and saving the results.
const client = require('./utils/http');
const cheerio = require('cheerio');
const { saveJSON, saveCSV } = require('./utils/export');
const pLimit = require('p-limit');
const BASE = 'https://jobs.example.com';
const limit = pLimit(2);
async function getListingUrls(page = 1) {
const { data: html } = await client.get(`${BASE}/jobs?page=${page}`);
const $ = cheerio.load(html);
return $('a.job-link')
.map((_, el) => BASE + $(el).attr('href'))
.get();
}
async function scrapeJob(url) {
await new Promise(r => setTimeout(r, 800 + Math.random() * 1200));
const { data: html } = await client.get(url);
const $ = cheerio.load(html);
return {
title: $('h1.job-title').text().trim(),
company: $('span.company').text().trim(),
location: $('span.location').text().trim(),
salary: $('span.salary').text().trim() || 'Not disclosed',
posted: $('time').attr('datetime'),
description: $('div.job-description').text().trim().slice(0, 1000),
url,
};
}
async function run() {
console.log('Starting job scraper...');
const urls = [];
for (let p = 1; p <= 5; p++) {
const pageUrls = await getListingUrls(p);
urls.push(...pageUrls);
await new Promise(r => setTimeout(r, 1500));
}
console.log(`Found ${urls.length} job listings`);
const jobs = await Promise.all(
urls.map(url => limit(() => scrapeJob(url).catch(() => null)))
);
const valid = jobs.filter(Boolean);
await saveJSON(valid, 'jobs.json');
await saveCSV(valid, 'jobs.csv', [
{ id: 'title', title: 'Title' },
{ id: 'company', title: 'Company' },
{ id: 'location', title: 'Location' },
{ id: 'salary', title: 'Salary' },
{ id: 'posted', title: 'Posted' },
{ id: 'url', title: 'URL' },
]);
console.log(`Done! Scraped ${valid.length} jobs.`);
}
run().catch(console.error);
Debugging and Troubleshooting
Common Scraping Problems and Fixes
| Problem | Likely Cause | Fix |
|---|---|---|
| Empty Cheerio results | Content rendered by JavaScript | Switch to Puppeteer |
| 403 Forbidden | Missing User-Agent or blocked IP | Set realistic headers, rotate proxies |
| CAPTCHA page returned | Too many requests / bot detection | Slow down, use stealth plugin |
| Timeout errors | Slow server or waitUntil never fires | Increase timeout, use domcontentloaded |
| Selectors stop working | Site redesigned its DOM | Add monitoring, use resilient selectors (data-* attrs) |
| Memory leak in Puppeteer | Not closing pages/browsers | Always call browser.close() in finally |
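For the last row, the reliable fix is to wrap all page work in try/finally so the browser closes even when a selector times out or an evaluation throws. A minimal sketch:
const puppeteer = require('puppeteer');
async function scrapeSafely(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content(); // rendered HTML
  } finally {
    await browser.close(); // always runs, even if goto or content() throws
  }
}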
Debugging Puppeteer Visually
// Run in headed mode to watch what's happening
const browser = await puppeteer.launch({
headless: false, // show the browser window
slowMo: 250, // slow down each action by 250ms
devtools: true, // open DevTools automatically
});
// Take a screenshot to debug the current page state
await page.screenshot({ path: 'debug.png', fullPage: true });
// Dump the full rendered HTML
const html = await page.content();
require('fs').writeFileSync('debug.html', html);
Conclusion
Web scraping with Node.js is a powerful skill that unlocks access to data across the web. Here's a summary of what we've covered:
Key Takeaways
- Choose the right tool: Cheerio for static HTML (fast), Puppeteer for JS-rendered pages (flexible)
- Always be polite: Respect robots.txt, add delays, and limit concurrency
- Build for resilience: Retry logic, error handling, and null checks are non-negotiable
- Use stealth for dynamic scraping: puppeteer-extra-plugin-stealth bypasses most bot detectors
- Network interception is a superpower: Often faster and more reliable than DOM scraping
- Always close your browser: Puppeteer leaks memory if you forget browser.close()
"Good scraping is invisible scraping — polite, reliable, and targeted. Treat other people's servers the way you'd want yours treated."
With these patterns in your toolkit you can build scrapers that survive site redesigns, handle authentication, process thousands of pages per hour, and export clean structured data — all in Node.js. Start with Cheerio for simple projects, graduate to Puppeteer when JavaScript rendering is required, and always keep the ethical guidelines front of mind.