Web Scraping with Node.js
Web scraping is one of the most practical skills in a developer's toolkit — whether you're collecting pricing data, monitoring competitors, aggregating content, or building datasets for machine learning. Node.js is an exceptional platform for scraping because of its non-blocking I/O model, the maturity of its HTTP and DOM libraries, and the sheer speed you can achieve with asynchronous requests. In this guide we'll go from zero to production-ready scraper using the two most popular libraries: Cheerio for static sites and Puppeteer for JavaScript-heavy pages.
Always check for an official API before scraping. Many sites offer official APIs with free tiers (GitHub, OpenWeatherMap). Scraping is appropriate when no API exists, the API is cost-prohibitive, or you need data the API doesn't expose. Always review a site's robots.txt and Terms of Service first.
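You can automate that robots.txt check before crawling. Here is a minimal sketch using the robots-parser npm package; the crawler name is a placeholder and the fallback behaviour is an assumption you may want to tighten.
const axios = require('axios');
const robotsParser = require('robots-parser'); // npm install robots-parser
// Returns true if the given URL may be fetched by our crawler
async function isAllowed(targetUrl, userAgent = 'MyCrawler/1.0') {
  const robotsUrl = new URL('/robots.txt', targetUrl).href;
  const { data: robotsTxt } = await axios.get(robotsUrl, { timeout: 10_000 });
  const robots = robotsParser(robotsUrl, robotsTxt);
  // isAllowed() returns undefined when no rule matches; treat that as allowed
  return robots.isAllowed(targetUrl, userAgent) ?? true;
}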
Choosing the Right Tool
Node.js scraping falls into two broad categories depending on how the target page renders its content:
| Library | Approach | Best For | Speed |
|---|---|---|---|
| Cheerio | Parse raw HTML (server-rendered) | Static pages, blogs, news sites | Very fast (no browser) |
| Puppeteer | Headless Chrome (full browser) | SPAs, React/Vue apps, login-gated pages | Slower (real browser) |
| Playwright | Multi-browser headless automation | Cross-browser, complex auth flows | Moderate |
| axios + Cheerio | HTTP + DOM parsing | Simple HTML extraction at scale | Fastest |
Figure: Static vs dynamic scraping pipelines in Node.js
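Not sure which category a page falls into? A quick test is to fetch the raw HTML and check whether the content you need is already present; if the selector matches nothing, the page is rendered client-side and you'll want Puppeteer. A minimal sketch (the selector is whatever element you intend to scrape):
const axios = require('axios');
const cheerio = require('cheerio');
// Returns true if the selector exists in the server-rendered HTML
async function isServerRendered(url, selector) {
  const { data: html } = await axios.get(url, { timeout: 10_000 });
  const $ = cheerio.load(html);
  return $(selector).length > 0; // zero matches usually means client-side rendering
}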
Setting Up Your Project
Initialise the project
Create a new Node.js project and install the core dependencies you'll need for both static and dynamic scraping.
mkdir node-scraper && cd node-scraper
npm init -y
# Static scraping
npm install axios cheerio
# Dynamic scraping (downloads Chromium automatically)
npm install puppeteer
# Utilities
npm install fs-extra csv-writer p-limit@3   # p-limit v4+ is ESM-only; v3 works with require()
Configure project structure
Organise your scraper code into logical modules for reusability and maintainability.
node-scraper/
├── scrapers/
│ ├── cheerio-scraper.js # Static HTML scraping
│ └── puppeteer-scraper.js # Dynamic JS-rendered scraping
├── utils/
│ ├── http.js # Axios instance with retry logic
│ └── export.js # CSV / JSON export helpers
├── data/ # Output directory
└── index.js # Entry point
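With that layout, the entry point just wires the pieces together. A minimal index.js sketch, assuming scrapers/cheerio-scraper.js exports the scrapeArticles function shown in the next section and utils/export.js exports the saveJSON helper covered later:
// index.js: entry point wiring a scraper to the export helpers
const { scrapeArticles } = require('./scrapers/cheerio-scraper');
const { saveJSON } = require('./utils/export');
async function main() {
  // Placeholder URL: point this at a site you are allowed to scrape
  const articles = await scrapeArticles('https://example.com/blog');
  await saveJSON(articles, 'articles.json');
}
main().catch(console.error);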
Static Scraping with Cheerio
Cheerio implements jQuery's API on top of a fast HTML parser. It's the best tool for scraping server-rendered pages — news articles, documentation, e-commerce product listings, and anything returned as complete HTML from the server.
Basic Cheerio Scraper
const axios = require('axios');
const cheerio = require('cheerio');
async function scrapeArticles(url) {
const { data: html } = await axios.get(url, {
headers: {
'User-Agent':
'Mozilla/5.0 (compatible; MyCrawler/1.0; +https://example.com)',
},
timeout: 10_000,
});
const $ = cheerio.load(html);
const articles = [];
// Iterate each article card in the listing
$('article.post-card').each((_, el) => {
articles.push({
title: $(el).find('h2.post-title').text().trim(),
url: $(el).find('a.post-link').attr('href'),
date: $(el).find('time').attr('datetime'),
summary: $(el).find('p.excerpt').text().trim(),
tags: $(el)
.find('.tag')
.map((_, t) => $(t).text().trim())
.get(),
});
});
return articles;
}
module.exports = { scrapeArticles };
Handling Pagination
async function scrapeAllPages(baseUrl, maxPages = 10) {
const allItems = [];
let page = 1;
while (page <= maxPages) {
const url = `${baseUrl}?page=${page}`;
console.log(`Scraping page ${page}: ${url}`);
const { data: html } = await axios.get(url, { timeout: 10_000 });
const $ = cheerio.load(html);
// Collect items on this page
$('li.result-item').each((_, el) => {
allItems.push({
name: $(el).find('.name').text().trim(),
price: parseFloat($(el).find('.price').text().replace(/[^0-9.]/g, '')),
sku: $(el).data('sku'),
});
});
// Stop if there is no "next" link
const hasNext = $('a[rel="next"]').length > 0;
if (!hasNext) break;
page++;
// Polite delay — avoid hammering the server
await new Promise(r => setTimeout(r, 1500));
}
return allItems;
}
Sending requests without delays can overload servers, get your IP blocked, and is considered abusive. Always add at least a 1–2 second delay between requests. For production scrapers, use a queue with concurrency limits.
Dynamic Scraping with Puppeteer
Puppeteer controls a real Chromium browser, which means it can execute JavaScript, handle login forms, click buttons, wait for network requests to complete, and scrape content that only appears after client-side rendering. It's the tool you reach for when Cheerio returns an empty DOM.
Core Puppeteer Patterns
const puppeteer = require('puppeteer');
async function scrapeWithPuppeteer(url) {
const browser = await puppeteer.launch({
headless: 'new', // use new headless mode
args: ['--no-sandbox'],
});
const page = await browser.newPage();
// Set a realistic viewport and User-Agent
await page.setViewport({ width: 1280, height: 800 });
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
);
await page.goto(url, { waitUntil: 'networkidle2' });
// Extract data in browser context
const products = await page.$$eval('.product-card', cards =>
cards.map(card => ({
name: card.querySelector('.name')?.textContent.trim(),
price: card.querySelector('.price')?.textContent.trim(),
rating: card.querySelector('.stars')?.dataset.rating,
}))
);
await browser.close();
return products;
}
// Wait for a selector to appear
await page.waitForSelector('.results-container', { timeout: 15_000 });
// Wait for a network request to finish
await page.waitForResponse(res =>
res.url().includes('/api/products') && res.status() === 200
);
// Wait for JavaScript variable to be set
await page.waitForFunction(
() => window.__NEXT_DATA__?.props?.pageProps?.items?.length > 0
);
// Wait for navigation after a click
await Promise.all([
page.waitForNavigation({ waitUntil: 'networkidle0' }),
page.click('button.load-more'),
]);
// Scroll to bottom to trigger infinite scroll
await page.evaluate(async () => {
await new Promise(resolve => {
let total = 0;
const step = 400;
const timer = setInterval(() => {
window.scrollBy(0, step);
total += step;
if (total >= document.body.scrollHeight) {
clearInterval(timer);
resolve();
}
}, 200);
});
});
async function loginAndScrape(url, credentials) {
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
// Navigate to login page
await page.goto('https://example.com/login');
// Fill in credentials
await page.type('#email', credentials.email, { delay: 50 });
await page.type('#password', credentials.password, { delay: 50 });
// Submit and wait for redirect
await Promise.all([
page.waitForNavigation({ waitUntil: 'networkidle0' }),
page.click('button[type="submit"]'),
]);
// Verify we are logged in
const loggedIn = await page.$('.user-dashboard');
if (!loggedIn) throw new Error('Login failed');
// Now navigate to the target URL
await page.goto(url, { waitUntil: 'networkidle2' });
const data = await page.$$eval('.protected-item', items =>
items.map(i => i.textContent.trim())
);
await browser.close();
return data;
}
// Intercept XHR/fetch responses to grab API data directly
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
// Collect API responses before navigation
const apiData = [];
page.on('response', async response => {
const url = response.url();
if (url.includes('/api/v2/listings') && response.status() === 200) {
try {
const json = await response.json();
apiData.push(...(json.results ?? []));
} catch {}
}
});
await page.goto('https://example.com/listings', {
waitUntil: 'networkidle0',
});
console.log('Captured API items:', apiData.length);
await browser.close();
Building a Production-Ready Scraper
A one-off script is easy, but production scrapers need reliability, retry logic, concurrency control, and proper data export. Here's how to build a robust scraper that handles failures gracefully.
HTTP Client with Retry Logic
const axios = require('axios');
const client = axios.create({
timeout: 15_000,
headers: {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept-Language': 'en-US,en;q=0.9',
Accept: 'text/html,application/xhtml+xml',
},
});
// Automatic retry on 429 / 5xx with exponential back-off
client.interceptors.response.use(null, async error => {
const config = error.config;
config._retryCount = config._retryCount ?? 0;
const retryableStatus = [429, 500, 502, 503, 504];
const shouldRetry =
config._retryCount < 3 &&
(!error.response || retryableStatus.includes(error.response.status));
if (!shouldRetry) return Promise.reject(error);
config._retryCount += 1;
const delay = 1000 * 2 ** config._retryCount; // 2s, 4s, 8s
console.log(`Retry ${config._retryCount} for ${config.url} in ${delay}ms`);
await new Promise(r => setTimeout(r, delay));
return client(config);
});
module.exports = client;
Concurrent Scraping with Rate Limiting
const pLimit = require('p-limit'); // works with p-limit v3 (v4+ is ESM-only)
const client = require('./utils/http');
const cheerio = require('cheerio');
// Max 3 concurrent requests at once
const limit = pLimit(3);
async function scrapeProduct(url) {
const { data: html } = await client.get(url);
const $ = cheerio.load(html);
return {
url,
title: $('h1.product-title').text().trim(),
price: $('[data-price]').data('price'),
description: $('div.product-description').text().trim().slice(0, 500),
images: $('img.product-image')
.map((_, el) => $(el).attr('src'))
.get(),
};
}
async function scrapeAll(urls) {
const tasks = urls.map(url =>
limit(async () => {
try {
return await scrapeProduct(url);
} catch (err) {
console.error(`Failed: ${url} — ${err.message}`);
return null;
}
})
);
const results = await Promise.all(tasks);
return results.filter(Boolean); // remove nulls from failures
}
module.exports = { scrapeAll };
Exporting Data to CSV and JSON
const { createObjectCsvWriter } = require('csv-writer');
const fs = require('fs-extra');
const path = require('path');
async function saveJSON(data, filename) {
await fs.ensureDir('./data');
const filePath = path.join('./data', filename);
await fs.writeJSON(filePath, data, { spaces: 2 });
console.log(`Saved ${data.length} records → ${filePath}`);
}
async function saveCSV(data, filename, headers) {
await fs.ensureDir('./data');
const writer = createObjectCsvWriter({
path: path.join('./data', filename),
header: headers, // [{ id: 'title', title: 'Title' }, ...]
});
await writer.writeRecords(data);
console.log(`CSV written → ./data/${filename}`);
}
module.exports = { saveJSON, saveCSV };
Anti-Detection and Ethical Practices
Modern websites increasingly deploy bot-detection measures. Understanding them helps you build scrapers that are less likely to be blocked, and reminds you why ethical scraping matters.
Common Anti-Detection Techniques
- Rotate User-Agents: Use a pool of real browser User-Agent strings and pick one randomly per request (see the sketch after this list).
- Respect robots.txt: Parse /robots.txt and honour Disallow paths. Use the robots-parser npm package.
- Use the puppeteer-extra stealth plugin: Patches dozens of Puppeteer fingerprinting vectors (navigator.webdriver, Chrome-specific globals, etc.).
- Add realistic delays: Randomise delays between 800ms and 3000ms to mimic human browsing.
- Handle CAPTCHAs gracefully: Detect "captcha" in the response URL or body, pause, and log for manual review, or integrate a CAPTCHA-solving service for automation.
- Use residential proxies for scale: Rotate IPs through a proxy pool to avoid IP-level rate limits.
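A small helper covers the first and fourth points: pick a random User-Agent per request and sleep a random interval between requests. This is a sketch with a deliberately short example pool; extend it with current, real User-Agent strings.
// Example pool of real browser User-Agent strings (keep these up to date)
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];
// Pick one User-Agent at random for each request
const randomUserAgent = () =>
  USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
// Wait a random interval between requests to mimic human browsing
const randomDelay = (min = 800, max = 3000) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));
// Usage: await randomDelay(); then pass { headers: { 'User-Agent': randomUserAgent() } } to axios
The stealth plugin below handles the browser-fingerprint side of the same problem: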
// npm install puppeteer-extra puppeteer-extra-plugin-stealth
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
// Randomise viewport to avoid static fingerprint
await page.setViewport({
width: 1280 + Math.floor(Math.random() * 100),
height: 800 + Math.floor(Math.random() * 100),
});
// Pass the navigator.webdriver check
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});
await page.goto('https://example.com', { waitUntil: 'networkidle2' });
Always read a website's Terms of Service. Scraping personal data may violate GDPR or CCPA. Never scrape at a rate that degrades site performance. Store only the data you need, and delete it when no longer required. For commercial use, consult a lawyer.
Real-World Example: Job Listing Aggregator
Let's put it all together with a practical example — scraping job listings from a static job board and saving the results.
const client = require('./utils/http');
const cheerio = require('cheerio');
const { saveJSON, saveCSV } = require('./utils/export');
const pLimit = require('p-limit');
const BASE = 'https://jobs.example.com';
const limit = pLimit(2);
async function getListingUrls(page = 1) {
const { data: html } = await client.get(`${BASE}/jobs?page=${page}`);
const $ = cheerio.load(html);
return $('a.job-link')
.map((_, el) => BASE + $(el).attr('href'))
.get();
}
async function scrapeJob(url) {
await new Promise(r => setTimeout(r, 800 + Math.random() * 1200));
const { data: html } = await client.get(url);
const $ = cheerio.load(html);
return {
title: $('h1.job-title').text().trim(),
company: $('span.company').text().trim(),
location: $('span.location').text().trim(),
salary: $('span.salary').text().trim() || 'Not disclosed',
posted: $('time').attr('datetime'),
description: $('div.job-description').text().trim().slice(0, 1000),
url,
};
}
async function run() {
console.log('Starting job scraper...');
const urls = [];
for (let p = 1; p <= 5; p++) {
const pageUrls = await getListingUrls(p);
urls.push(...pageUrls);
await new Promise(r => setTimeout(r, 1500));
}
console.log(`Found ${urls.length} job listings`);
const jobs = await Promise.all(
urls.map(url => limit(() => scrapeJob(url).catch(() => null)))
);
const valid = jobs.filter(Boolean);
await saveJSON(valid, 'jobs.json');
await saveCSV(valid, 'jobs.csv', [
{ id: 'title', title: 'Title' },
{ id: 'company', title: 'Company' },
{ id: 'location', title: 'Location' },
{ id: 'salary', title: 'Salary' },
{ id: 'posted', title: 'Posted' },
{ id: 'url', title: 'URL' },
]);
console.log(`Done! Scraped ${valid.length} jobs.`);
}
run().catch(console.error);
Debugging and Troubleshooting
Common Scraping Problems and Fixes
| Problem | Likely Cause | Fix |
|---|---|---|
| Empty Cheerio results | Content rendered by JavaScript | Switch to Puppeteer |
| 403 Forbidden | Missing User-Agent or blocked IP | Set realistic headers, rotate proxies |
| CAPTCHA page returned | Too many requests / bot detection | Slow down, use stealth plugin |
| Timeout errors | Slow server or waitUntil never fires | Increase timeout, use domcontentloaded |
| Selectors stop working | Site redesigned its DOM | Add monitoring, use resilient selectors (data-* attrs) |
| Memory leak in Puppeteer | Not closing pages/browsers | Always call browser.close() in finally |
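For the last row, the reliable fix is to wrap all page work in try/finally so the browser closes even when a selector times out or an evaluation throws. A minimal sketch:
const puppeteer = require('puppeteer');
async function scrapeSafely(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content(); // rendered HTML
  } finally {
    await browser.close(); // always runs, even if goto or content() throws
  }
}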
Debugging Puppeteer Visually
// Run in headed mode to watch what's happening
const browser = await puppeteer.launch({
headless: false, // show the browser window
slowMo: 250, // slow down each action by 250ms
devtools: true, // open DevTools automatically
});
// Take a screenshot to debug the current page state
await page.screenshot({ path: 'debug.png', fullPage: true });
// Dump the full rendered HTML
const html = await page.content();
require('fs').writeFileSync('debug.html', html);
Conclusion
Web scraping with Node.js is a powerful skill that unlocks access to data across the web. Here's a summary of what we've covered:
Key Takeaways
- Choose the right tool: Cheerio for static HTML (fast), Puppeteer for JS-rendered pages (flexible)
- Always be polite: Respect robots.txt, add delays, and limit concurrency
- Build for resilience: Retry logic, error handling, and null checks are non-negotiable
- Use stealth for dynamic scraping: puppeteer-extra-plugin-stealth bypasses most bot detectors
- Network interception is a superpower: Often faster and more reliable than DOM scraping
- Always close your browser: Puppeteer leaks memory if you forget browser.close()
"Good scraping is invisible scraping — polite, reliable, and targeted. Treat other people's servers the way you'd want yours treated."
With these patterns in your toolkit you can build scrapers that survive site redesigns, handle authentication, process thousands of pages per hour, and export clean structured data — all in Node.js. Start with Cheerio for simple projects, graduate to Puppeteer when JavaScript rendering is required, and always keep the ethical guidelines front of mind.