Java remains one of the most powerful languages for enterprise web scraping in 2025. This comprehensive guide covers everything from basic HTML parsing with JSoup to advanced JavaScript-heavy scraping with Selenium and HtmlUnit.
Whether you're building a data pipeline for business intelligence, monitoring competitor prices, or collecting research data, this tutorial will give you the skills to scrape any website efficiently and responsibly.
If you're considering other languages or tools, check out our comprehensive comparison of modern browser automation tools including Playwright and Puppeteer.
Java Web Scraping Libraries Compared
Choose the right tool for your scraping needs. Here's how the top Java web scraping libraries compare:
JSoup (best for static content)
Perfect for: Static HTML content, fast parsing, CSS selectors
- Speed: Very Fast
- Memory: Low usage
- JavaScript: No support
- Learning curve: Easy
- Best use case: News sites, blogs, static e-commerce
HtmlUnit (best for JavaScript-heavy sites)
Perfect for: JavaScript-heavy sites, headless browsing, AJAX
- Speed: Fast
- Memory: Medium usage
- JavaScript: Full support
- Learning curve: Moderate
- Best use case: SPAs, dynamic content, forms
Official Resources: Website | GitHub | Getting Started
Prerequisites & Setup
Before diving into web scraping, ensure you have the proper development environment configured:
What You'll Need
- JDK 17 or later (the pom.xml below targets Java 17)
- Maven 3.6+ for dependency management
- An IDE such as IntelliJ IDEA or Eclipse
- Basic familiarity with HTML and CSS selectors
Create a New Maven Project
Start by creating a new Maven project with this basic structure:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>java-web-scraper</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <!-- JSoup for HTML parsing -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.16.1</version>
        </dependency>
        <!-- HtmlUnit for JavaScript support (3.x releases are published under the org.htmlunit groupId) -->
        <dependency>
            <groupId>org.htmlunit</groupId>
            <artifactId>htmlunit</artifactId>
            <version>3.3.0</version>
        </dependency>
        <!-- Selenium WebDriver -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>4.15.0</version>
        </dependency>
        <!-- Chrome WebDriver Manager -->
        <dependency>
            <groupId>io.github.bonigarcia</groupId>
            <artifactId>webdrivermanager</artifactId>
            <version>5.6.2</version>
        </dependency>
    </dependencies>
</project>
Maven Dependencies & Resources
Find the latest versions and documentation for each library:
- JSoup: Maven Central | Download Options
- HtmlUnit: Maven Central | Dependencies Guide
- Selenium: Maven Central | Official Downloads
- WebDriverManager: Maven Central | GitHub Repository
JSoup: Static Content Scraping
JSoup is the go-to library for scraping static HTML content. It's fast, lightweight, and perfect for most web scraping tasks. Learn more in the official JSoup cookbook.
Basic JSoup Connection and Parsing
Start with a simple example that connects to a website and extracts basic information:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BasicJSoupScraper {
    public static void main(String[] args) {
        try {
            // Connect to the website with proper headers
            Document doc = Jsoup.connect("https://books.toscrape.com/")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                    .timeout(10000)
                    .get();

            System.out.println("Page title: " + doc.title());

            // Extract all book titles
            Elements bookTitles = doc.select("h3 a");
            System.out.println("Found " + bookTitles.size() + " books:");

            for (Element title : bookTitles) {
                System.out.println("- " + title.attr("title"));
            }
        } catch (Exception e) {
            System.err.println("Error scraping: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Advanced JSoup: Data Extraction and Cleaning
Extract structured data and handle edge cases with proper error handling:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdvancedJSoupScraper {

    public static class Book {
        private String title;
        private String price;
        private String availability;
        private String rating;

        // Constructor (getters and setters omitted for brevity)
        public Book(String title, String price, String availability, String rating) {
            this.title = title;
            this.price = price;
            this.availability = availability;
            this.rating = rating;
        }

        @Override
        public String toString() {
            return String.format("Book{title='%s', price='%s', rating='%s', availability='%s'}",
                    title, price, rating, availability);
        }
    }

    public static void main(String[] args) {
        List<Book> books = scrapeBooks("https://books.toscrape.com/");

        // Print first 5 books
        books.stream().limit(5).forEach(System.out::println);
        System.out.println("\nTotal books scraped: " + books.size());
    }

    public static List<Book> scrapeBooks(String url) {
        List<Book> books = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                    .timeout(10000)
                    .followRedirects(true)
                    .get();

            Elements bookContainers = doc.select("ol.row li article.product_pod");
            for (Element container : bookContainers) {
                try {
                    // Extract title with fallback
                    String title = container.select("h3 a").attr("title");
                    if (title.isEmpty()) {
                        title = container.select("h3 a").text();
                    }

                    // Extract price and clean it
                    String price = container.select("p.price_color").text();
                    price = price.replaceAll("[^\\d.]", ""); // Keep only digits and dots

                    // Extract availability
                    String availability = container.select("p.instock.availability").text();
                    availability = availability.replace("In stock (", "").replace(" available)", "");

                    // Extract rating from class name
                    String rating = "0";
                    Elements ratingElements = container.select("p.star-rating");
                    if (!ratingElements.isEmpty()) {
                        String ratingClass = ratingElements.first().className();
                        rating = extractRatingFromClass(ratingClass);
                    }

                    books.add(new Book(title, price, availability, rating));
                } catch (Exception e) {
                    System.err.println("Error processing book: " + e.getMessage());
                    continue; // Skip this book and continue with the next one
                }
            }
        } catch (Exception e) {
            System.err.println("Error connecting to website: " + e.getMessage());
        }
        return books;
    }

    private static String extractRatingFromClass(String className) {
        Map<String, String> ratingMap = new HashMap<>();
        ratingMap.put("One", "1");
        ratingMap.put("Two", "2");
        ratingMap.put("Three", "3");
        ratingMap.put("Four", "4");
        ratingMap.put("Five", "5");

        for (String key : ratingMap.keySet()) {
            if (className.contains(key)) {
                return ratingMap.get(key);
            }
        }
        return "0";
    }
}
HtmlUnit: JavaScript-Enabled Scraping
When websites rely heavily on JavaScript, HtmlUnit provides a headless browser that can execute JavaScript and handle dynamic content. Explore the getting started guide and GitHub repository for more examples.
When to Use HtmlUnit
- Content loaded via AJAX calls
- Single Page Applications (SPAs)
- Dynamic forms and interactions
- Sites that modify DOM with JavaScript
- When you need faster performance than Selenium
Basic HtmlUnit Setup
Configure HtmlUnit for JavaScript-heavy websites. Note that HtmlUnit 3.x lives in the org.htmlunit package (older tutorials use com.gargoylesoftware.htmlunit):
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlElement;
import org.htmlunit.html.HtmlPage;
import java.util.List;

public class HtmlUnitScraper {
    public static void main(String[] args) {
        // Configure WebClient
        try (final WebClient webClient = new WebClient()) {
            // Configure browser settings
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setTimeout(10000);

            // Set realistic user agent
            webClient.addRequestHeader("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

            // Load the page and wait for JavaScript
            final HtmlPage page = webClient.getPage("https://quotes.toscrape.com/js/");

            // Wait for JavaScript to load content (important!)
            webClient.waitForBackgroundJavaScript(3000);

            System.out.println("Page title: " + page.getTitleText());

            // Extract quotes using XPath
            List<HtmlElement> quotes = page.getByXPath("//div[@class='quote']");
            System.out.println("Found " + quotes.size() + " quotes:");

            for (HtmlElement quote : quotes) {
                // Assign to typed locals so the generic getFirstByXPath result can be used
                HtmlElement textElement = quote.getFirstByXPath(".//span[@class='text']");
                HtmlElement authorElement = quote.getFirstByXPath(".//small[@class='author']");
                String text = textElement.getTextContent();
                String author = authorElement.getTextContent();
                System.out.println("Quote: " + text);
                System.out.println("Author: " + author);
                System.out.println("---");
            }
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Advanced HtmlUnit: Forms and Interactions
Handle form submissions and interactive elements:
import org.htmlunit.WebClient;
import org.htmlunit.html.*;

public class AdvancedHtmlUnitScraper {
    public static void main(String[] args) {
        try (final WebClient webClient = new WebClient()) {
            // Configure for AJAX-heavy sites
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // Increase timeout for slow JavaScript
            webClient.getOptions().setTimeout(15000);
            webClient.setJavaScriptTimeout(10000);

            // Load initial page
            HtmlPage page = webClient.getPage("https://httpbin.org/forms/post");

            // Find the form
            HtmlForm form = page.getFirstByXPath("//form");

            // Fill the text fields (declared as HtmlInput so the tel/email inputs work too)
            HtmlInput customerNameField = form.getInputByName("custname");
            HtmlInput customerTelField = form.getInputByName("custtel");
            HtmlInput customerEmailField = form.getInputByName("custemail");
            HtmlTextArea commentsArea = form.getTextAreaByName("comments");

            customerNameField.type("John Doe");
            customerTelField.type("555-1234");
            customerEmailField.type("john@example.com");
            commentsArea.type("This is a test comment");

            // The pizza size on this form is a group of radio buttons, not a <select>
            for (HtmlRadioButtonInput sizeOption : form.getRadioButtonsByName("size")) {
                if ("large".equals(sizeOption.getValueAttribute())) {
                    sizeOption.setChecked(true);
                }
            }

            // Submit the form via its <button> element
            HtmlButton submitButton = form.getFirstByXPath(".//button");
            HtmlPage resultPage = submitButton.click();

            // Wait for response
            webClient.waitForBackgroundJavaScript(2000);

            System.out.println("Form submitted successfully!");
            System.out.println("Response: " + resultPage.asNormalizedText());
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Selenium: Full Browser Automation
For the most complex scraping tasks requiring full browser functionality, Selenium WebDriver provides complete control over a real browser. The WebDriver documentation covers all supported languages and browsers.
For a detailed comparison with other browser automation tools, see our Playwright vs Selenium vs Puppeteer guide.
Selenium Setup with Chrome
Configure Selenium WebDriver for robust browser automation:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;
import io.github.bonigarcia.wdm.WebDriverManager;
import java.time.Duration;
import java.util.List;

public class SeleniumScraper {
    public static void main(String[] args) {
        // Automatically manage ChromeDriver
        WebDriverManager.chromedriver().setup();

        // Configure Chrome options
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // Run in background
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-dev-shm-usage");
        options.addArguments("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");

        WebDriver driver = new ChromeDriver(options);
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));

        try {
            // Navigate to page
            driver.get("https://quotes.toscrape.com/");

            // Wait for quotes to load
            wait.until(ExpectedConditions.presenceOfElementLocated(By.className("quote")));

            // Find all quotes
            List<WebElement> quotes = driver.findElements(By.className("quote"));
            System.out.println("Found " + quotes.size() + " quotes:");

            for (WebElement quote : quotes) {
                String text = quote.findElement(By.className("text")).getText();
                String author = quote.findElement(By.className("author")).getText();
                System.out.println("Quote: " + text);
                System.out.println("Author: " + author);
                System.out.println("---");
            }

            // Navigate through pagination (the link is rendered as "Next →", so match on partial text)
            WebElement nextButton = driver.findElement(By.partialLinkText("Next"));
            if (nextButton.isEnabled()) {
                nextButton.click();
                // Wait and scrape next page...
            }
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        } finally {
            driver.quit();
        }
    }
}
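The pagination branch above is left as a stub. Here is one way it could be completed: a minimal sketch, assuming the same quotes.toscrape.com markup, that keeps clicking the "Next" link until it no longer appears. The class name PaginatedSeleniumScraper is just for illustration.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import io.github.bonigarcia.wdm.WebDriverManager;
import java.util.List;

public class PaginatedSeleniumScraper {
    public static void main(String[] args) {
        WebDriverManager.chromedriver().setup();
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://quotes.toscrape.com/");
            int page = 1;
            while (true) {
                // Scrape the quotes on the current page
                List<WebElement> quotes = driver.findElements(By.className("quote"));
                System.out.println("Page " + page + ": " + quotes.size() + " quotes");

                // The "Next" link is rendered as "Next →", so match on partial text
                List<WebElement> nextLinks = driver.findElements(By.partialLinkText("Next"));
                if (nextLinks.isEmpty()) {
                    break; // Last page reached
                }
                nextLinks.get(0).click();
                page++;
            }
        } finally {
            driver.quit();
        }
    }
}

Using findElements (plural) for the "Next" link avoids a NoSuchElementException on the last page, since it simply returns an empty list.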
Advanced Selenium: Handling Dynamic Content
Deal with infinite scroll, AJAX loading, and complex interactions:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;
import io.github.bonigarcia.wdm.WebDriverManager;
import java.time.Duration;
import java.util.List;
import java.util.Set;
import java.util.HashSet;

public class AdvancedSeleniumScraper {
    public static void main(String[] args) {
        WebDriverManager.chromedriver().setup();

        ChromeOptions options = new ChromeOptions();
        // options.addArguments("--headless"); // Comment out to see the browser
        options.addArguments("--disable-blink-features=AutomationControlled");
        options.setExperimentalOption("excludeSwitches", List.of("enable-automation"));
        options.setExperimentalOption("useAutomationExtension", false);

        WebDriver driver = new ChromeDriver(options);
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
        JavascriptExecutor js = (JavascriptExecutor) driver;

        // Remove webdriver property to avoid detection
        js.executeScript("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})");

        try {
            driver.get("https://infinite-scroll.com/");

            Set<String> uniqueItems = new HashSet<>();
            int lastCount = 0;
            int noChangeCount = 0;

            // Infinite scroll handling
            while (noChangeCount < 3) { // Stop if no new content for 3 attempts
                // Scroll to bottom
                js.executeScript("window.scrollTo(0, document.body.scrollHeight);");

                // Wait for new content to load
                Thread.sleep(2000);

                // Check for loading indicator
                try {
                    wait.until(ExpectedConditions.invisibilityOfElementLocated(
                            By.className("loading")));
                } catch (Exception e) {
                    // Loading indicator might not exist
                }

                // Collect current items
                List<WebElement> items = driver.findElements(By.className("post"));
                for (WebElement item : items) {
                    try {
                        String title = item.findElement(By.tagName("h2")).getText();
                        if (!title.isEmpty()) {
                            uniqueItems.add(title);
                        }
                    } catch (Exception e) {
                        // Skip problematic items
                        continue;
                    }
                }

                System.out.println("Current items collected: " + uniqueItems.size());

                // Check if we got new items
                if (uniqueItems.size() == lastCount) {
                    noChangeCount++;
                } else {
                    noChangeCount = 0;
                    lastCount = uniqueItems.size();
                }
            }

            System.out.println("\nFinal results:");
            System.out.println("Total unique items: " + uniqueItems.size());

            // Print first 10 items
            uniqueItems.stream().limit(10).forEach(item ->
                    System.out.println("- " + item));
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
        } finally {
            driver.quit();
        }
    }
}
Best Practices & Error Handling
Professional web scraping requires robust error handling, respect for websites, and efficient resource management.
Respect & Ethics
- Check robots.txt before scraping
- Add delays between requests (1-3 seconds)
- Use realistic User-Agent headers
- Don't overload servers with concurrent requests
- Respect rate limits and server responses
Error Handling
- Implement retry logic with exponential backoff
- Handle network timeouts gracefully
- Log errors for debugging
- Validate data before processing
- Use try-catch blocks around scraping operations
Performance
- Reuse connections when possible
- Close resources properly (WebDriver, WebClient)
- Use connection pooling for multiple requests
- Implement caching for repeated data (see the sketch after this list)
- Monitor memory usage in long-running scrapers
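To make the connection-reuse and caching points concrete, here is a minimal sketch. The CachingFetcher class is purely illustrative (not from any library) and assumes jsoup 1.14.1+ for the newSession() API: it keeps one configured session, caches fetched documents by URL, and enforces a polite delay before each real request.

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingFetcher {
    // Shared session: reuses cookies and connection settings across requests
    private final Connection session = Jsoup.newSession()
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
            .timeout(10_000);
    private final Map<String, Document> cache = new ConcurrentHashMap<>();
    private final long delayMs;

    public CachingFetcher(long delayMs) {
        this.delayMs = delayMs;
    }

    // Returns the cached document if this URL was already fetched, otherwise fetches politely
    public Document fetch(String url) throws IOException, InterruptedException {
        Document cached = cache.get(url);
        if (cached != null) {
            return cached;
        }
        Thread.sleep(delayMs); // polite delay before hitting the server
        Document doc = session.newRequest().url(url).get();
        cache.put(url, doc);
        return doc;
    }
}

In a long-running scraper you would also want to bound the cache size or expire entries, but the basic pattern stays the same.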
Production-Ready Scraper Template
Here's a robust template that implements all best practices:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;
import java.util.logging.Level;

public class ProductionScraper {
    private static final Logger logger = Logger.getLogger(ProductionScraper.class.getName());
    private static final int MAX_RETRIES = 3;
    private static final long BASE_DELAY_MS = 1000;

    public static void main(String[] args) {
        ProductionScraper scraper = new ProductionScraper();
        scraper.scrapeWithRetry("https://example.com");
    }

    public Document scrapeWithRetry(String url) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                logger.info("Scraping attempt " + attempt + " for: " + url);

                Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                        .timeout(10000)
                        .followRedirects(true)
                        .ignoreHttpErrors(false) // Will throw an exception on HTTP errors
                        .get();

                logger.info("Successfully scraped: " + url);

                // Add polite delay
                addDelay(BASE_DELAY_MS);

                return doc;
            } catch (Exception e) {
                logger.log(Level.WARNING,
                        "Attempt " + attempt + " failed for " + url + ": " + e.getMessage());

                if (attempt == MAX_RETRIES) {
                    logger.log(Level.SEVERE, "All attempts failed for: " + url);
                    throw new RuntimeException("Failed to scrape after " + MAX_RETRIES + " attempts", e);
                }

                // Exponential backoff
                long delay = BASE_DELAY_MS * (long) Math.pow(2, attempt - 1);
                addDelay(delay);
            }
        }
        return null;
    }

    private void addDelay(long delayMs) {
        try {
            TimeUnit.MILLISECONDS.sleep(delayMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Thread interrupted", e);
        }
    }

    public void checkRobotsTxt(String domain) {
        try {
            Document robotsTxt = Jsoup.connect(domain + "/robots.txt")
                    .ignoreContentType(true) // robots.txt is served as text/plain
                    .timeout(5000)
                    .get();

            String content = robotsTxt.text();
            if (content.contains("Disallow: /")) {
                logger.warning("robots.txt contains restrictions for: " + domain);
                // Implement your logic to respect robots.txt
            }
        } catch (Exception e) {
            logger.info("Could not fetch robots.txt for: " + domain);
        }
    }
}
Enterprise Web Scraping Solutions
While building your own scrapers is educational and works for small projects, enterprise applications often benefit from dedicated web scraping services that handle the complexity for you.
| Approach          | Setup Time    | Maintenance | Scalability | Success Rate | Best For                 |
|-------------------|---------------|-------------|-------------|--------------|--------------------------|
| DIY Java Scraping | Days to Weeks | High        | Limited     | 60-80%       | Learning, small projects |
| Prompt Fuel API   | Minutes       | None        | Unlimited   | 99.9%        | Production applications  |
| Proxy + DIY       | Weeks         | Very High   | Medium      | 85-95%       | Custom requirements      |
Java Integration with Web Scraping APIs
For production applications, consider using a web scraping API. Here's how to integrate with Prompt Fuel:
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.URI;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.JsonNode;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Requires com.fasterxml.jackson.core:jackson-databind on the classpath for JSON parsing
public class PromptFuelIntegration {
    private static final String API_KEY = "your-api-key-here";
    private static final String BASE_URL = "https://api.promptfuel.io/scrape";

    public static void main(String[] args) {
        PromptFuelIntegration scraper = new PromptFuelIntegration();
        try {
            String result = scraper.scrapeWithAPI("https://quotes.toscrape.com/");
            System.out.println("Scraped content length: " + result.length());

            // Parse with JSoup for a familiar API
            Document doc = Jsoup.parse(result);
            System.out.println("Page title: " + doc.title());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public String scrapeWithAPI(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Build request body
        String requestBody = String.format("""
                {
                    "url": "%s",
                    "render": true,
                    "format": "html",
                    "premium_proxy": true
                }
                """, url);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL))
                .header("Authorization", "Bearer " + API_KEY)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody))
                .build();

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 200) {
            // Parse JSON response
            ObjectMapper mapper = new ObjectMapper();
            JsonNode jsonResponse = mapper.readTree(response.body());
            return jsonResponse.get("html").asText();
        } else {
            throw new RuntimeException("API request failed: " + response.statusCode());
        }
    }
}
Why Choose API Over DIY?
- 99.9% success rate vs 60-80% with DIY solutions
- Built-in proxy rotation - no IP blocking issues
- JavaScript rendering - handles SPAs automatically
- CAPTCHA solving - bypass anti-bot measures
- Zero maintenance - focus on your business logic
- Scalable - handle millions of requests
Getting Started with Java Web Scraping
You now have a complete toolkit for Java web scraping in 2025. Here's how to choose the right approach:
For Learning & Small Projects
Start with JSoup for static content, then graduate to HtmlUnit for JavaScript-heavy sites. This gives you a solid foundation in web scraping concepts.
For Production Applications
Consider web scraping APIs like Prompt Fuel that handle the complexity, provide better success rates, and let you focus on business logic instead of infrastructure.
Next Steps
- Choose your library: JSoup for static, HtmlUnit for dynamic, Selenium for complex
- Build a simple scraper: Start with the examples in this guide
- Add error handling: Implement retries, logging, and graceful failures
- Scale responsibly: Add delays, respect robots.txt, monitor performance
- Consider APIs: For production apps, evaluate if a service makes more sense
Remember: the best scraper is one that works reliably in production with minimal maintenance. Choose your approach based on your specific needs, timeline, and resources.