Backend Development

Web Scraping with Node.js

Mayur Dabhi
April 27, 2026
14 min read

Web scraping is one of the most practical skills in a developer's toolkit — whether you're collecting pricing data, monitoring competitors, aggregating content, or building datasets for machine learning. Node.js is an exceptional platform for scraping because of its non-blocking I/O model, the maturity of its HTTP and DOM libraries, and the sheer speed you can achieve with asynchronous requests. In this guide we'll go from zero to production-ready scraper using the two most popular libraries: Cheerio for static sites and Puppeteer for JavaScript-heavy pages.

When to Scrape vs When to Use an API

Always check for an official API before scraping: many sites offer one, often with a free tier (GitHub and OpenWeatherMap, for example). Scraping is appropriate when no API exists, the API is cost-prohibitive, or you need data the API doesn't expose. Either way, review the site's robots.txt and Terms of Service first.
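That robots.txt check can even be a quick programmatic step: fetch `/robots.txt` and test your target paths against its Disallow rules before crawling. The sketch below is deliberately simplified — it only reads the `User-agent: *` group and matches plain path prefixes, not wildcards or `Allow` precedence:

```javascript
// Minimal robots.txt check: does the `User-agent: *` group disallow this path?
// Simplified sketch — ignores wildcards, Allow rules, and agent-specific groups.
function isPathDisallowed(robotsTxt, path) {
  const lines = robotsTxt.split('\n').map(l => l.trim());
  let inStarGroup = false;
  const disallowed = [];

  for (const line of lines) {
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key)) {
      // Track whether we are inside the wildcard agent group
      inStarGroup = value === '*';
    } else if (inStarGroup && /^disallow$/i.test(key) && value) {
      disallowed.push(value);
    }
  }
  return disallowed.some(prefix => path.startsWith(prefix));
}

const robots = 'User-agent: *\nDisallow: /admin\nDisallow: /private/';
console.log(isPathDisallowed(robots, '/admin/users')); // true
console.log(isPathDisallowed(robots, '/blog/post-1')); // false
```

For production crawlers, a full parser such as the robots-parser package is a better choice than hand-rolling this logic.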

Choosing the Right Tool

Node.js scraping falls into two broad categories depending on how the target page renders its content:

Library | Approach | Best For | Speed
Cheerio | Parse raw HTML (server-rendered) | Static pages, blogs, news sites | Very fast (no browser)
Puppeteer | Headless Chrome (full browser) | SPAs, React/Vue apps, login-gated pages | Slower (real browser)
Playwright | Multi-browser headless automation | Cross-browser, complex auth flows | Moderate
axios + Cheerio | HTTP + DOM parsing | Simple HTML extraction at scale | Fastest
[Diagram] Static pipeline (Cheerio): Node.js script → axios.get() → web server returns static HTML → Cheerio $('selector') → JSON/CSV. Dynamic pipeline (Puppeteer): Node.js script → Puppeteer → headless Chrome renders the JS page → page.$$eval() DOM queries → JSON/CSV.

Static vs Dynamic scraping pipelines in Node.js

Setting Up Your Project

Step 1: Initialise the project

Create a new Node.js project and install the core dependencies you'll need for both static and dynamic scraping.

Terminal
mkdir node-scraper && cd node-scraper
npm init -y

# Static scraping
npm install axios cheerio

# Dynamic scraping (downloads Chromium automatically)
npm install puppeteer

# Utilities
npm install fs-extra csv-writer p-limit

Step 2: Configure project structure

Organise your scraper code into logical modules for reusability and maintainability.

Project Structure
node-scraper/
├── scrapers/
│   ├── cheerio-scraper.js   # Static HTML scraping
│   └── puppeteer-scraper.js # Dynamic JS-rendered scraping
├── utils/
│   ├── http.js              # Axios instance with retry logic
│   └── export.js            # CSV / JSON export helpers
├── data/                    # Output directory
└── index.js                 # Entry point

Static Scraping with Cheerio

Cheerio implements a subset of jQuery's API on top of a fast HTML parser. It's the best tool for scraping server-rendered pages: news articles, documentation, e-commerce product listings, and anything returned as complete HTML from the server.

Basic Cheerio Scraper

scrapers/cheerio-scraper.js
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeArticles(url) {
  const { data: html } = await axios.get(url, {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (compatible; MyCrawler/1.0; +https://example.com)',
    },
    timeout: 10_000,
  });

  const $ = cheerio.load(html);
  const articles = [];

  // Iterate each article card in the listing
  $('article.post-card').each((_, el) => {
    articles.push({
      title: $(el).find('h2.post-title').text().trim(),
      url: $(el).find('a.post-link').attr('href'),
      date: $(el).find('time').attr('datetime'),
      summary: $(el).find('p.excerpt').text().trim(),
      tags: $(el)
        .find('.tag')
        .map((_, t) => $(t).text().trim())
        .get(),
    });
  });

  return articles;
}

module.exports = { scrapeArticles };

Handling Pagination

Pagination Loop
async function scrapeAllPages(baseUrl, maxPages = 10) {
  const allItems = [];
  let page = 1;

  while (page <= maxPages) {
    const url = `${baseUrl}?page=${page}`;
    console.log(`Scraping page ${page}: ${url}`);

    const { data: html } = await axios.get(url, { timeout: 10_000 });
    const $ = cheerio.load(html);

    // Collect items on this page
    $('li.result-item').each((_, el) => {
      allItems.push({
        name: $(el).find('.name').text().trim(),
        price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
        sku: $(el).data('sku'),
      });
    });

    // Stop if there is no "next" link
    const hasNext = $('a[rel="next"]').length > 0;
    if (!hasNext) break;

    page++;
    // Polite delay — avoid hammering the server
    await new Promise(r => setTimeout(r, 1500));
  }

  return allItems;
}

Rate Limiting is Not Optional

Sending requests without delays can overload servers, get your IP blocked, and is considered abusive. Always add at least a 1–2 second delay between requests. For production scrapers, use a queue with concurrency limits.
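To see what a concurrency limit actually does, here is a dependency-free sketch of the pattern that packages like p-limit (used later in this guide) implement. The helper name `createLimiter` is mine, not a library API — assume nothing beyond Promises:

```javascript
// Minimal concurrency limiter — roughly what the p-limit package does.
// createLimiter(2) returns a function that runs at most 2 tasks at once,
// queueing the rest until a slot frees up.
function createLimiter(concurrency) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= concurrency || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    // Run the task, settle its outer promise, then free the slot
    task().then(resolve, reject).finally(() => {
      active--;
      next();
    });
  };

  return task =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}

// Usage: three tasks, but only two ever run concurrently
const limit = createLimiter(2);
const sleep = ms => new Promise(r => setTimeout(r, ms));
let peak = 0, running = 0;

const tasks = [1, 2, 3].map(id =>
  limit(async () => {
    running++; peak = Math.max(peak, running);
    await sleep(50);
    running--;
    return id;
  })
);

Promise.all(tasks).then(results => {
  console.log(results);      // results come back in input order
  console.log('peak:', peak); // peak: 2
});
```

Combine a limiter like this with a per-request delay and you get both bounded concurrency and polite pacing.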

Dynamic Scraping with Puppeteer

Puppeteer controls a real Chromium browser, which means it can execute JavaScript, handle login forms, click buttons, wait for network requests to complete, and scrape content that only appears after client-side rendering. It's the tool you reach for when Cheerio returns an empty DOM.

Core Puppeteer Patterns

const puppeteer = require('puppeteer');

async function scrapeWithPuppeteer(url) {
  const browser = await puppeteer.launch({
    headless: 'new',          // use new headless mode
    args: ['--no-sandbox'],
  });

  const page = await browser.newPage();

  // Set a realistic viewport and User-Agent
  await page.setViewport({ width: 1280, height: 800 });
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  );

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Extract data in browser context
  const products = await page.$$eval('.product-card', cards =>
    cards.map(card => ({
      name: card.querySelector('.name')?.textContent.trim(),
      price: card.querySelector('.price')?.textContent.trim(),
      rating: card.querySelector('.stars')?.dataset.rating,
    }))
  );

  await browser.close();
  return products;
}

Waiting Strategies
// Wait for a selector to appear
await page.waitForSelector('.results-container', { timeout: 15_000 });

// Wait for a network request to finish
await page.waitForResponse(res =>
  res.url().includes('/api/products') && res.status() === 200
);

// Wait for JavaScript variable to be set
await page.waitForFunction(
  () => window.__NEXT_DATA__?.props?.pageProps?.items?.length > 0
);

// Wait for navigation after a click
await Promise.all([
  page.waitForNavigation({ waitUntil: 'networkidle0' }),
  page.click('button.load-more'),
]);

// Scroll to bottom to trigger infinite scroll
await page.evaluate(async () => {
  await new Promise(resolve => {
    let total = 0;
    const step = 400;
    const timer = setInterval(() => {
      window.scrollBy(0, step);
      total += step;
      if (total >= document.body.scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 200);
  });
});

Login-Gated Scraping
async function loginAndScrape(url, credentials) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  // Navigate to login page
  await page.goto('https://example.com/login');

  // Fill in credentials
  await page.type('#email', credentials.email, { delay: 50 });
  await page.type('#password', credentials.password, { delay: 50 });

  // Submit and wait for redirect
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
    page.click('button[type="submit"]'),
  ]);

  // Verify we are logged in
  const loggedIn = await page.$('.user-dashboard');
  if (!loggedIn) throw new Error('Login failed');

  // Now navigate to the target URL
  await page.goto(url, { waitUntil: 'networkidle2' });
  const data = await page.$$eval('.protected-item', items =>
    items.map(i => i.textContent.trim())
  );

  await browser.close();
  return data;
}

Network Interception
// Intercept XHR/fetch responses to grab API data directly
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();

// Collect API responses before navigation
const apiData = [];
page.on('response', async response => {
  const url = response.url();
  if (url.includes('/api/v2/listings') && response.status() === 200) {
    try {
      const json = await response.json();
      apiData.push(...(json.results ?? []));
    } catch {}
  }
});

await page.goto('https://example.com/listings', {
  waitUntil: 'networkidle0',
});

console.log('Captured API items:', apiData.length);
await browser.close();

Building a Production-Ready Scraper

A one-off script is easy, but production scrapers need reliability, retry logic, concurrency control, and proper data export. Here's how to build a robust scraper that handles failures gracefully.

HTTP Client with Retry Logic

utils/http.js
const axios = require('axios');

const client = axios.create({
  timeout: 15_000,
  headers: {
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    Accept: 'text/html,application/xhtml+xml',
  },
});

// Automatic retry on 429 / 5xx with exponential back-off
client.interceptors.response.use(null, async error => {
  const config = error.config;
  config._retryCount = config._retryCount ?? 0;

  const retryableStatus = [429, 500, 502, 503, 504];
  const shouldRetry =
    config._retryCount < 3 &&
    (!error.response || retryableStatus.includes(error.response.status));

  if (!shouldRetry) return Promise.reject(error);

  config._retryCount += 1;
  const delay = 1000 * 2 ** config._retryCount; // 2s, 4s, 8s
  console.log(`Retry ${config._retryCount} for ${config.url} in ${delay}ms`);
  await new Promise(r => setTimeout(r, delay));
  return client(config);
});

module.exports = client;
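The interceptor above is tied to axios; the same back-off idea works as a standalone helper around any async function, with random jitter added so a fleet of retrying clients doesn't wake up in lockstep. A sketch (the helper name `withRetry` is mine, not a library API):

```javascript
// Generic retry helper with exponential back-off and full jitter.
// Library-agnostic version of the axios interceptor pattern.
async function withRetry(fn, { retries = 3, baseMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts — propagate
      // Full jitter: random delay in [0, baseMs * 2^attempt]
      const delay = Math.random() * baseMs * 2 ** attempt;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}

// Usage: a flaky function that fails twice, then succeeds
let calls = 0;
const flaky = async () => {
  calls++;
  if (calls < 3) throw new Error('temporary failure');
  return 'ok';
};

withRetry(flaky, { retries: 3, baseMs: 10 }).then(result => {
  console.log(result, 'after', calls, 'attempts'); // ok after 3 attempts
});
```

In a real scraper you would also inspect the error before retrying, as the interceptor does, so that a 404 fails fast instead of burning three attempts.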

Concurrent Scraping with Rate Limiting

Concurrent Batch Scraper
const pLimit = require('p-limit'); // p-limit v3 (CommonJS); v4+ is ESM-only
const client = require('./utils/http');
const cheerio = require('cheerio');

// Max 3 concurrent requests at once
const limit = pLimit(3);

async function scrapeProduct(url) {
  const { data: html } = await client.get(url);
  const $ = cheerio.load(html);
  return {
    url,
    title: $('h1.product-title').text().trim(),
    price: $('[data-price]').data('price'),
    description: $('div.product-description').text().trim().slice(0, 500),
    images: $('img.product-image')
      .map((_, el) => $(el).attr('src'))
      .get(),
  };
}

async function scrapeAll(urls) {
  const tasks = urls.map(url =>
    limit(async () => {
      try {
        return await scrapeProduct(url);
      } catch (err) {
        console.error(`Failed: ${url} — ${err.message}`);
        return null;
      }
    })
  );

  const results = await Promise.all(tasks);
  return results.filter(Boolean); // remove nulls from failures
}

module.exports = { scrapeAll };

Exporting Data to CSV and JSON

utils/export.js
const { createObjectCsvWriter } = require('csv-writer');
const fs = require('fs-extra');
const path = require('path');

async function saveJSON(data, filename) {
  await fs.ensureDir('./data');
  const filePath = path.join('./data', filename);
  await fs.writeJSON(filePath, data, { spaces: 2 });
  console.log(`Saved ${data.length} records → ${filePath}`);
}

async function saveCSV(data, filename, headers) {
  await fs.ensureDir('./data');
  const writer = createObjectCsvWriter({
    path: path.join('./data', filename),
    header: headers, // [{ id: 'title', title: 'Title' }, ...]
  });
  await writer.writeRecords(data);
  console.log(`CSV written → ./data/${filename}`);
}

module.exports = { saveJSON, saveCSV };
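csv-writer handles field quoting for you; if you ever need a zero-dependency fallback, RFC 4180-style escaping is only a few lines. A sketch (`toCSV` is a hypothetical helper, not part of any library used above) that accepts the same header shape as saveCSV:

```javascript
// Zero-dependency CSV serialiser (RFC 4180 quoting) — a fallback if you
// would rather not pull in csv-writer for simple exports.
function toCSV(rows, columns) {
  const escape = value => {
    const s = String(value ?? '');
    // Quote fields containing commas, quotes, or newlines; double inner quotes
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const header = columns.map(c => escape(c.title)).join(',');
  const lines = rows.map(row => columns.map(c => escape(row[c.id])).join(','));
  return [header, ...lines].join('\n');
}

const csv = toCSV(
  [{ title: 'Senior Dev, Remote', company: 'Acme "Labs"' }],
  [
    { id: 'title', title: 'Title' },
    { id: 'company', title: 'Company' },
  ]
);
console.log(csv);
// Title,Company
// "Senior Dev, Remote","Acme ""Labs"""
```

Scraped text is full of commas and quotes, so never build CSV rows by naive string concatenation.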

Anti-Detection and Ethical Practices

Modern websites increasingly deploy bot-detection measures. Understanding them helps you build scrapers that are less likely to be blocked, and reminds you why ethical scraping matters.

Common Anti-Detection Techniques

Puppeteer Stealth Setup
// npm install puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();

// Randomise viewport to avoid static fingerprint
await page.setViewport({
  width: 1280 + Math.floor(Math.random() * 100),
  height: 800 + Math.floor(Math.random() * 100),
});

// Pass the navigator.webdriver check
await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});

await page.goto('https://example.com', { waitUntil: 'networkidle2' });

Legal and Ethical Considerations

Always read a website's Terms of Service. Scraping personal data may violate GDPR or CCPA. Never scrape at a rate that degrades site performance. Store only the data you need, and delete it when no longer required. For commercial use, consult a lawyer.

Real-World Example: Job Listing Aggregator

Let's put it all together with a practical example — scraping job listings from a static job board and saving the results.

index.js — Full Job Scraper
const client = require('./utils/http');
const cheerio = require('cheerio');
const { saveJSON, saveCSV } = require('./utils/export');
const pLimit = require('p-limit');

const BASE = 'https://jobs.example.com';
const limit = pLimit(2);

async function getListingUrls(page = 1) {
  const { data: html } = await client.get(`${BASE}/jobs?page=${page}`);
  const $ = cheerio.load(html);
  return $('a.job-link')
    .map((_, el) => BASE + $(el).attr('href'))
    .get();
}

async function scrapeJob(url) {
  await new Promise(r => setTimeout(r, 800 + Math.random() * 1200));
  const { data: html } = await client.get(url);
  const $ = cheerio.load(html);
  return {
    title: $('h1.job-title').text().trim(),
    company: $('span.company').text().trim(),
    location: $('span.location').text().trim(),
    salary: $('span.salary').text().trim() || 'Not disclosed',
    posted: $('time').attr('datetime'),
    description: $('div.job-description').text().trim().slice(0, 1000),
    url,
  };
}

async function run() {
  console.log('Starting job scraper...');
  const urls = [];

  for (let p = 1; p <= 5; p++) {
    const pageUrls = await getListingUrls(p);
    urls.push(...pageUrls);
    await new Promise(r => setTimeout(r, 1500));
  }

  console.log(`Found ${urls.length} job listings`);

  const jobs = await Promise.all(
    urls.map(url => limit(() => scrapeJob(url).catch(() => null)))
  );
  const valid = jobs.filter(Boolean);

  await saveJSON(valid, 'jobs.json');
  await saveCSV(valid, 'jobs.csv', [
    { id: 'title', title: 'Title' },
    { id: 'company', title: 'Company' },
    { id: 'location', title: 'Location' },
    { id: 'salary', title: 'Salary' },
    { id: 'posted', title: 'Posted' },
    { id: 'url', title: 'URL' },
  ]);

  console.log(`Done! Scraped ${valid.length} jobs.`);
}

run().catch(console.error);

Debugging and Troubleshooting

Common Scraping Problems and Fixes

Problem | Likely Cause | Fix
Empty Cheerio results | Content rendered by JavaScript | Switch to Puppeteer
403 Forbidden | Missing User-Agent or blocked IP | Set realistic headers, rotate proxies
CAPTCHA page returned | Too many requests / bot detection | Slow down, use the stealth plugin
Timeout errors | Slow server or waitUntil never fires | Increase timeout, use domcontentloaded
Selectors stop working | Site redesigned its DOM | Add monitoring, use resilient selectors (data-* attrs)
Memory leak in Puppeteer | Not closing pages/browsers | Always call browser.close() in a finally block

Debugging Puppeteer Visually

Debug Mode
// Run in headed mode to watch what's happening
const browser = await puppeteer.launch({
  headless: false,        // show the browser window
  slowMo: 250,            // slow down each action by 250ms
  devtools: true,         // open DevTools automatically
});

// Take a screenshot to debug the current page state
await page.screenshot({ path: 'debug.png', fullPage: true });

// Dump the full rendered HTML
const html = await page.content();
require('fs').writeFileSync('debug.html', html);

Conclusion

Web scraping with Node.js is a powerful skill that unlocks access to data across the web. Here's a summary of what we've covered:

Key Takeaways

  • Choose the right tool: Cheerio for static HTML (fast), Puppeteer for JS-rendered pages (flexible)
  • Always be polite: Respect robots.txt, add delays, and limit concurrency
  • Build for resilience: Retry logic, error handling, and null checks are non-negotiable
  • Use stealth for dynamic scraping: puppeteer-extra-plugin-stealth bypasses most bot detectors
  • Network interception is a superpower: Often faster and more reliable than DOM scraping
  • Always close your browser: Puppeteer leaks memory if you forget browser.close()
"Good scraping is invisible scraping — polite, reliable, and targeted. Treat other people's servers the way you'd want yours treated."

With these patterns in your toolkit you can build scrapers that survive site redesigns, handle authentication, process thousands of pages per hour, and export clean structured data — all in Node.js. Start with Cheerio for simple projects, graduate to Puppeteer when JavaScript rendering is required, and always keep the ethical guidelines front of mind.

Mayur Dabhi

Full Stack Developer with 5+ years of experience building scalable web applications with Laravel, React, and Node.js.