News Web Scraper - Next.js - Minhaz Irphan Mohamed

News Web Scraper – A Web Scraping Project with Next.js & Playwright

Overview

This project is a Next.js-based news aggregation platform that scrapes the latest news articles from Newswire.lk and presents them in a clean, modern UI. It features a dark/light mode toggle, a 30-minute cooldown between fetches to prevent abuse, and a responsive interface styled with Tailwind CSS.

The platform uses Playwright for web scraping. Scraped data is validated and processed before being displayed to users, and failed scraping attempts are retried up to 3 times to improve reliability.

Key Features

  • Web Scraping with Playwright: Extracts the latest 15 news articles from a specific section of Newswire.lk.
  • Next.js for Server-side Rendering (SSR): Ensures fast load times and SEO benefits.
  • Tailwind CSS for Styling: Enables a sleek, responsive UI with a dark mode toggle.
  • Session-based Rate Limiting (30-minute delay): Prevents excessive scraping requests and reduces load on the API.
  • Error Handling & Retry Mechanism: Retries failed scraping attempts up to 3 times for better reliability.
  • Optimized Performance: Uses efficient selectors and browser automation techniques to minimize scraping time.

Technologies Used

  • Next.js (Frontend & API Routes)
  • Tailwind CSS (Styling & UI components)
  • Playwright (Web Scraping Automation)
  • TypeScript (Strict type safety)
  • Session Storage (To enforce rate limiting on refresh requests)

How It Works

  1. Scraper Execution: When a user requests the latest news, a Playwright instance is launched in a serverless function.
  2. Data Extraction: The scraper targets the #frontpage-area_c_1 div, which contains the latest news articles, collects the title, image, publication date, and link from the first 15 articles, and filters out any invalid entries.
  3. Rate Limiting: Users must wait 30 minutes before fetching new articles again to prevent abuse.
  4. Data Presentation: The extracted news is displayed in a visually appealing layout, with dark mode support.
  5. Error Handling & Retries: If scraping fails, the system automatically retries up to 3 times before returning an error.
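
As a rough illustration of steps 2 and 5, the validation and retry logic could look like the sketch below. It is not the project's actual code: the Article field names, the "absolute http(s) link" rule, and the helper names cleanArticles and withRetries are assumptions made for this example, and "up to 3 times" is read here as three attempts in total.

```typescript
// Shape of one scraped entry. Field names are assumptions for this sketch.
interface Article {
  title: string;
  image: string;
  date: string;
  link: string;
}

const MAX_ARTICLES = 15; // "first 15 articles" from the steps above
const MAX_ATTEMPTS = 3; // assumed reading of "retries up to 3 times"

// An entry is treated as valid when all four fields are non-empty strings
// and the link is an absolute http(s) URL (an assumed validation rule).
function isValidArticle(a: Partial<Article>): a is Article {
  const fields = [a.title, a.image, a.date, a.link];
  const allPresent = fields.every(
    (f) => typeof f === "string" && f.trim().length > 0
  );
  return allPresent && /^https?:\/\//.test(a.link ?? "");
}

// Take the first 15 entries, then drop the invalid ones.
function cleanArticles(raw: Partial<Article>[]): Article[] {
  return raw.slice(0, MAX_ARTICLES).filter(isValidArticle);
}

// Run an async task, retrying on failure until maxAttempts is reached.
// The last error is rethrown if every attempt fails.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts: number = MAX_ATTEMPTS
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

In an API route, the scraping call would be wrapped as withRetries(() => scrape()), where scrape is whatever function drives Playwright, and its result passed through cleanArticles before being returned to the client.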

Challenges & Solutions

1. Slow Scraping Performance (~30 seconds per request)

  • Issue: Playwright’s browser instance launch time was causing delays.
  • Solution: Reduced unnecessary waits in page interactions and used waitUntil: 'load' so that navigation resolves as soon as the page's load event fires.

2. Handling Dynamic Content Loading

  • Issue: The target website loads content dynamically, which initially caused issues with missing articles.
  • Solution: Implemented page.waitForSelector with an extended timeout to ensure all elements are fully loaded before extraction.
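
A minimal sketch of how the navigation and waiting described in challenges 1 and 2 might fit together is shown below. Only the #frontpage-area_c_1 container, waitUntil: 'load', and the use of waitForSelector come from this write-up. The child selectors (article, h2, img, time, a), the 60-second timeout, and the PageLike and ElementLike interfaces are assumptions: the interfaces are hypothetical stand-ins for the subset of Playwright's Page API used here, so the sketch stays self-contained. In a real project you would pass a Page object from the playwright package instead.

```typescript
// Hypothetical structural stand-ins for the DOM and Playwright surface used below.
interface ElementLike {
  textContent: string | null;
  getAttribute(name: string): string | null;
  querySelector(selector: string): ElementLike | null;
}

interface PageLike {
  goto(
    url: string,
    options?: { waitUntil?: "load" | "domcontentloaded" | "networkidle" }
  ): Promise<unknown>;
  waitForSelector(selector: string, options?: { timeout?: number }): Promise<unknown>;
  $$eval<R>(
    selector: string,
    pageFunction: (elements: ElementLike[]) => R
  ): Promise<R>;
}

interface RawArticle {
  title: string;
  image: string;
  date: string;
  link: string;
}

const CONTAINER = "#frontpage-area_c_1"; // from the write-up
const ITEM = `${CONTAINER} article`; // assumed markup, adjust to the real page
const SELECTOR_TIMEOUT_MS = 60000; // assumed "extended" value; Playwright's default is 30 s

async function extractArticles(page: PageLike, url: string): Promise<RawArticle[]> {
  // Resolve navigation once the load event has fired.
  await page.goto(url, { waitUntil: "load" });

  // Dynamic content: wait until the article container exists before reading it.
  await page.waitForSelector(CONTAINER, { timeout: SELECTOR_TIMEOUT_MS });

  // This callback runs inside the browser, so it must not reference outer variables.
  return page.$$eval(ITEM, (items) =>
    items.slice(0, 15).map((item) => ({
      title: item.querySelector("h2")?.textContent?.trim() ?? "",
      image: item.querySelector("img")?.getAttribute("src") ?? "",
      date: item.querySelector("time")?.textContent?.trim() ?? "",
      link: item.querySelector("a")?.getAttribute("href") ?? "",
    }))
  );
}
```

Missing fields fall back to empty strings here, which leaves it to a later validation step to discard incomplete entries.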

3. Preventing API Abuse

  • Issue: Users could continuously refresh to trigger new scraping requests.
  • Solution: Implemented session storage-based rate limiting, requiring users to wait 30 minutes before new data can be fetched.
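
The 30-minute cooldown can be sketched as a small check around a timestamp kept in session storage. This is only an illustration: the key name lastFetchAt, the helper names, and the StorageLike interface are assumptions, introduced so the logic can be exercised outside a browser. In the browser, window.sessionStorage satisfies this interface.

```typescript
// Subset of the Web Storage API used here; window.sessionStorage matches it.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const COOLDOWN_MS = 30 * 60 * 1000; // 30 minutes, from the write-up
const STORAGE_KEY = "lastFetchAt"; // hypothetical key name

// Milliseconds left before another fetch is allowed (0 when allowed now).
function remainingCooldownMs(storage: StorageLike, now: number = Date.now()): number {
  const last = Number(storage.getItem(STORAGE_KEY));
  // A missing or corrupt timestamp means no fetch is on record yet.
  if (!Number.isFinite(last) || last <= 0) return 0;
  return Math.max(0, last + COOLDOWN_MS - now);
}

// Record a fetch if the cooldown has elapsed; return whether it may proceed.
function tryStartFetch(storage: StorageLike, now: number = Date.now()): boolean {
  if (remainingCooldownMs(storage, now) > 0) return false;
  storage.setItem(STORAGE_KEY, String(now));
  return true;
}
```

A refresh button could call tryStartFetch(sessionStorage) before requesting new articles and use remainingCooldownMs to show a countdown. Because session storage lives in the browser tab, a check like this gates the UI rather than the API route itself.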

Video Demo & Repository

This project showcases my ability to integrate frontend, backend, and automation seamlessly while optimizing performance, handling real-world scraping challenges, and ensuring an excellent user experience.

