
23/11/2024
Web Scraping with Jsoup in a Spring Boot Project
Practical guide to configure and use Jsoup in a Spring Boot project to fetch and manipulate HTML data.
Step 1: Create a Spring Boot Project
- Set Up the Project
Use Spring Initializr to create a new Spring Boot project.
- Language: Java + Maven
- Dependencies: Spring Web
- Add Jsoup to the Project
Add the following dependency in the pom.xml file:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version> <!-- check the latest version on Maven Central -->
</dependency>
Check the latest version here:
Maven Repository: org.jsoup » jsoup
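If you start from scratch, Spring Initializr generates an application entry point similar to the one below; the package and class names are assumptions matching the package layout used in the rest of this tutorial.
package com.example.scraper;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class ScraperApplication {

    public static void main(String[] args) {
        // Boots the embedded server and the Spring application context
        SpringApplication.run(ScraperApplication.class, args);
    }
}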
Step 2: Structure the Project
- Create a Service Class for Scraping
Create a service that uses Jsoup for web scraping.
package com.example.scraper.service;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Service;

import java.io.IOException;

@Service
public class ScrapingService {

    public String scrapeData(String url) {
        try {
            // Load the HTML page
            Document document = Jsoup.connect(url).get();

            // Extract specific elements (e.g., titles)
            Elements titles = document.select("h1, h2, h3");
            StringBuilder result = new StringBuilder("Found titles:\n");
            for (Element title : titles) {
                result.append(title.text()).append("\n");
            }
            return result.toString();
        } catch (IOException e) {
            return "Scraping error: " + e.getMessage();
        }
    }
}
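The same Document object can be queried for other elements. As a small sketch (not part of the service above), this is one way to collect the absolute URL of every link on the page:
// Select every anchor tag that has an href attribute
Elements links = document.select("a[href]");
for (Element link : links) {
    // abs:href resolves relative links against the page's base URL
    result.append(link.attr("abs:href")).append("\n");
}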
- Create a Controller to Expose the API
Add a controller to make the service accessible via a REST endpoint:
package com.example.scraper.controller;

import com.example.scraper.service.ScrapingService;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ScrapingController {

    private final ScrapingService scrapingService;

    // Constructor injection (no Lombok needed)
    public ScrapingController(ScrapingService scrapingService) {
        this.scrapingService = scrapingService;
    }

    @GetMapping("/scrape")
    public String scrapeWebsite(@RequestParam String url) {
        return scrapingService.scrapeData(url);
    }
}
Step 3: Test the Application
- Run the Application
Start the application using the following command:
mvn spring-boot:run
- Test the API with a Browser or Postman
Access the following URL:
http://localhost:8080/scrape?url=https://example.com
Step 4: Best Practices for Scraping
- Limit Request Frequency: Avoid sending too many requests in a short period to prevent being blocked by the target site (see the sketch after this list).
- Respect the Website's Rules: Check the robots.txt file to ensure scraping is allowed.
- Handle Exceptions: Anticipate network errors or changes in the website's HTML structure.
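As a minimal sketch of the first and last points, the helper below waits between requests and sets a connection timeout; the one-second delay, the ten-second timeout, and the method name fetchPolitely are illustrative assumptions, not part of the tutorial project.
// Hypothetical helper: fetch a page politely with a fixed delay and a timeout
public Document fetchPolitely(String url) throws IOException, InterruptedException {
    Thread.sleep(1000);              // crude rate limit: roughly one request per second
    return Jsoup.connect(url)
            .timeout(10000)          // give up after 10 seconds instead of hanging
            .get();
}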
Step 5: Additional Optimizations
- Set a Custom User-Agent: Some websites block requests that use Jsoup's default User-Agent.
Document document = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36")
        .get();
- Use Proxy Configuration if scraping from a network-restricted environment.
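A minimal sketch of the proxy option, assuming a proxy reachable at proxy.example.com on port 8080 (replace with your own host and port):
Document document = Jsoup.connect(url)
        .proxy("proxy.example.com", 8080) // hypothetical proxy host and port
        .get();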
Example Repository for Reference
To complement this tutorial, I've created a GitHub repository showcasing an example implementation of the basic scraping API described above. It includes:
- Setup instructions for running the project.
- Authentication via token for secure API access.
- Fully working scraping examples.
You can access the repository here: Basic Scraping API Example. As a taste of what it contains, the method below builds a paginated search URL, scrapes the result page, and collects the link of each matching book card:
public List<ResponseDTO> searchBooks(String token, String search, int page, String language) {
    // Simple token check to protect the endpoint
    if (!this.token.isEmpty() && !this.token.equals(token)) {
        throw new RuntimeException("Bad login");
    }
    List<String> urls = new ArrayList<>();
    try {
        // Build the search URL with pagination and an optional language filter
        String urlWithParam = url + "/s/" + search + "?page=" + page;
        if (language != null && !language.isEmpty()) {
            urlWithParam = urlWithParam + "&languages%5B%5D=" + language;
        }
        Document document = Jsoup.connect(urlWithParam)
                .userAgent("Mozilla")
                .get();
        // Collect the link of each book card found in the search result box
        Element element = document.getElementById("searchResultBox");
        Elements elements = element.getElementsByTag("bookcard");
        for (Element ads : elements) {
            if (!ads.getElementsByTag("bookcard").isEmpty()) {
                urls.add(ads.getElementsByTag("bookcard").attr("href"));
            }
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return getBooksDetail(urls);
}
Feel free to explore the repository and adapt the code for your use case! 🚀
Conclusion
With this tutorial, you have a working web scraping service using Jsoup, integrated into a Spring Boot project. You can now customize this service to extract specific data as per your requirements. 🚀