library(tidyverse)
library(rvest)
library(tidytext)
library(wordcloud)
Week 9_Web Scraping
Introduction to Web Scraping
👩‍💻 What is Web Scraping?
This tutorial introduces you to the basics of web scraping using rvest, an R package within the tidyverse, designed for easy web scraping.
Web scraping is the process of automating the extraction of data from websites. Think of it this way: instead of manually clicking around a website, highlighting text, and copy-pasting numbers into an Excel sheet, you write a script to build a “data bot.” This bot systematically reads the website’s underlying source code, targets the exact information you ask for, and neatly packs it away into an R dataframe.
⚙️ Prerequisites & Toolkit
Before we can build our data bot, we need to give R the right tools. We will use a combination of four core packages to gather, clean, and visualize our data.
⚠️ The Legal & Ethical Disclaimer
Before we start scraping, we must talk about data ethics.
Some websites hate web scrapers. If you write a loop that hits a small business website 10,000 times a second, your script acts like a cyberattack and will likely crash their server, disrupt their business, and get your IP address permanently banned.
The Golden Rules of Scraping:
Public vs. Private:
If you have to log in to a website with a username and password to see data, it is private. Scraping behind a login screen almost always violates a site’s Terms of Service and can cross legal lines.
🛑 Look up the infamous OkCupid disaster of 2016, where researchers scraped and publicly released the deeply personal profiles of 70,000 users without their consent or anonymization. Remember, “publicly accessible” does not always mean “free to scrape and exploit.”
Facts vs. Copyright:
Under intellectual property law, you cannot copyright raw facts. For example, a list of food ingredients or a textbook’s table of contents cannot be copyrighted. You can legally scrape those facts. However, creative expressions are copyrighted. You cannot legally scrape, analyze, and/or re-publish copyrighted content without permission.
Read the Website’s Permission Slip:
Most major websites have a publicly accessible text file called robots.txt that tells automated bots what they are allowed to look at. You can view it by adding /robots.txt to the end of the site’s root address (for example, https://www.example.com/robots.txt).
A standard robots.txt looks like this:
User-agent: *
Disallow: /products/
Allow: /blog/
If a website explicitly writes Disallow: /products/ in this file, your bot should respect that rule and not read anything inside the /products/ folder of that website.
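To make the rule concrete, here is a minimal sketch in base R that checks a path against the Disallow lines of the example file above. The helper name is_disallowed() is hypothetical, and the matching is deliberately simplified: it ignores Allow overrides, wildcards, and bot-specific sections, so treat it as an illustration rather than a complete parser.

```r
# The example robots.txt from above, stored as lines of text
robots_lines <- c(
  "User-agent: *",
  "Disallow: /products/",
  "Allow: /blog/"
)

# Hypothetical helper: TRUE if `path` starts with any Disallow rule.
# Simplified on purpose: ignores Allow overrides and wildcards.
is_disallowed <- function(path, robots_lines) {
  rules <- grep("^Disallow:", robots_lines, value = TRUE)
  blocked <- trimws(sub("^Disallow:", "", rules))
  blocked <- blocked[nzchar(blocked)]  # an empty Disallow blocks nothing
  any(startsWith(path, blocked))
}

is_disallowed("/products/shoes.html", robots_lines)  # TRUE
is_disallowed("/blog/post-1.html", robots_lines)     # FALSE
```

For real projects, a dedicated package such as robotstxt or polite is a safer choice than a hand-written check like this one.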
Be Polite:
When scraping multiple pages, always build artificial pauses into your code using functions like Sys.sleep() or packages like polite. This spaces out your requests, ensuring you don’t overwhelm the website’s host server.
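As a rough sketch of what such a pause looks like in practice, the loop below waits between requests. The helper fetch_politely() and the example.com URLs are hypothetical, and the default fetch function is a stand-in so the sketch runs offline; for real scraping you would pass rvest::read_html instead.

```r
# Hypothetical helper: apply `fetch` to each URL with a pause in between.
# The default `fetch` is a stand-in so the sketch runs without a network.
fetch_politely <- function(urls, fetch = function(u) paste("fetched", u), delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[i])
    # Pause before the next request (no pause needed after the last one)
    if (i < length(urls)) Sys.sleep(delay)
  }
  results
}

# Placeholder URLs, used only for illustration
demo_urls <- paste0("https://example.com/page-", 1:3, ".html")
pages <- fetch_politely(demo_urls, delay = 0.2)
length(pages)  # 3
```

The right delay depends on the site; a second or more per request is a common conservative choice for small servers.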
Real-World Note
Finally, some websites have systems in place to identify scraping and/or make it harder. For example, Amazon, eBay, and StubHub use highly advanced anti-bot protections to actively block scrapers. They do this to protect user privacy and prevent competitors from monitoring their real-time pricing.
HTML & CSS Structure
Before we can scrape data, we need to first look at how websites work. When you look at a webpage, you see a polished, beautiful screen. But your scraping robot doesn’t care about aesthetics. It just sees a massive, chaotic wall of text called HTML (HyperText Markup Language) and CSS (Cascading Style Sheets).
HTML (The Skeleton): You can think of HTML as the skeleton of a webpage; it provides the framework that holds all the content together. It says, “Here is a paragraph of text, here is a button, and here is an image.”
CSS (The Clothes/Skin): CSS handles the visual design. It says, “Take that paragraph, make the font Helvetica, dye it blue, and move it to the right side of the screen.”
As a data scraper, you are searching for the HTML because that’s where the actual data (the words and numbers) lives. However, you will use the CSS classes and IDs as your map to find it.
A basic piece of HTML looks like this:
<html>
<head>
<title>Page title</title>
</head>
<body>
<h1 id='first'>Heading</h1>
<p>Paragraph</p>
<p>Another paragraph with <b>some bold text.</b></p>
<p class="text">Hi! My name is <b>Fatemeh</b>.</p>
</body>
</html>
Let’s zoom in on that last line of code to understand the anatomy of what your R script is looking at:
<p class="text">Hi! My name is <b>Fatemeh</b>.</p>
Tags: <p> and <b> are tags. They define what the content is. <p> means paragraph, and <b> means bold text. Most tags open (<p>) and close (</p>) to wrap around data.
Elements: An element is the whole package. Everything from the opening <p> to the closing </p> tag, including the text inside, is considered an element.
Attributes: class="text" is an attribute. Attributes live inside the opening tag and act like metadata labels stuck on a box. We will use these exact labels to tell R which box to open.
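The three pieces can be inspected directly in R. The following is a small sketch that parses the example line with rvest’s minimal_html() helper (used here only so the example runs without a network connection) and pulls out the tag name, the attribute, and the text.

```r
library(rvest)

# Parse the example line as a tiny HTML document
snippet <- minimal_html('<p class="text">Hi! My name is <b>Fatemeh</b>.</p>')

# Grab the <p> element (opening tag, contents, and closing tag)
para <- html_element(snippet, "p")

html_name(para)           # the tag name: "p"
html_attr(para, "class")  # the attribute value: "text"
html_text2(para)          # the text inside: "Hi! My name is Fatemeh."
```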
Extracting Data with CSS Selectors
Okay, we know how websites work. But how do we actually tell R where to look?
HTML isn’t a flat text document; it’s a giant filing cabinet. We don’t want all the information on a webpage; we just want to open specific drawers so we don’t scramble our data. To target these drawers with pinpoint accuracy, we turn the HTML components we just learned into CSS Selectors (our digital addresses).
Here is how you translate HTML components into a selector address:
Tags (e.g., p or h1): Targets a general category. For example, asking R for the p selector will grab every single paragraph on the page.
Classes (e.g., .text): Targets a specific style group. If a page has 50 different items labeled with the same class, hunting for that class will grab all 50 items at once.
Notice the dot (.)! To target an HTML attribute like class="text", you must put a dot in front of the class name when writing your R code (i.e., .text).
IDs (e.g., #first): Targets a unique element. An ID should appear only once per page, so R will extract that single, specific item.
Notice the hashtag (#)! To target an HTML attribute like id="first", you must put a hashtag in front of the ID in R (i.e., #first).
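To see the three selector types side by side, here is a short sketch that runs each one against the sample HTML from the previous section. The page is built inline with minimal_html() so nothing is downloaded.

```r
library(rvest)

page <- minimal_html('
  <h1 id="first">Heading</h1>
  <p>Paragraph</p>
  <p>Another paragraph with <b>some bold text.</b></p>
  <p class="text">Hi! My name is <b>Fatemeh</b>.</p>
')

# Tag selector: every <p> on the page
all_paragraphs <- page %>% html_elements("p") %>% html_text2()
length(all_paragraphs)  # 3

# Class selector: only elements with class="text"
class_text <- page %>% html_elements(".text") %>% html_text2()
class_text              # "Hi! My name is Fatemeh."

# ID selector: the single element with id="first"
id_text <- page %>% html_elements("#first") %>% html_text2()
id_text                 # "Heading"
```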
🛠️ You don’t have to guess these CSS selectors by digging through raw HTML code all day. Professional developers use a couple of incredible free shortcuts:
- SelectorGadget: This is a free Chrome extension. You simply click on the text or table you want directly on your web browser screen, and it automatically generates the exact CSS selector address for your R code. You can install it here.
There is also a free browser game where you use CSS selectors to select items on a dinner table: Play CSS Diner here.
The Core Functions of rvest
Now that you know how to find a web address using CSS selectors, let’s look at the actual tools we use in R to retrieve that data. The rvest package relies on five core functions that cover most everyday web scraping tasks.
Think of these functions as the sequential steps in your data collection pipeline:
Connecting to the Page
read_html(): downloads the entire raw HTML structure of the page and loads it into R so your script can read it.
Targeting the Data
html_elements(): Finds all matches of a specific selector and returns them as a list. We use this to grab repeated items, like an entire list of movie titles or product cards.
html_element(): Finds the first match or, when applied to a list of elements, returns exactly one result per element. We use this singular version to search inside our collected lists. This ensures that if a specific item is missing from a product card, R safely records it as an NA instead of skipping it and throwing your columns out of alignment.
Extracting the Content
html_text2(): Extracts the clean text inside the HTML tags. In other words, this function strips away all of the structural code, leaving you with only the clean, human-readable words visible on the screen.
html_attr(): Extracts data hidden inside the HTML attributes. Sometimes the information you need isn’t displayed as regular text on the screen. Instead, it is embedded inside the HTML tag itself, such as a web address link (href) or an image source file (src). This function allows you to extract those hidden details.
Note: Both html_text2() and html_attr() are text-extraction tools, meaning they always return data as text strings (characters). If you extract a price like "£51.77" or a year like "2025", R will initially treat them as plain words rather than numbers. You need to use standard data wrangling tools to convert them into actual numeric values before doing any math or plotting.
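A minimal sketch of that conversion in base R is shown below. The strings stand in for values returned by html_text2() or html_attr(); the variable names are only for illustration.

```r
# Example strings as html_text2() or html_attr() would return them
prices <- c("£51.77", "£53.74", "£50.10")
years  <- c("2024", "2025")

# Strip everything except digits and the decimal point, then convert
price_numeric <- as.numeric(gsub("[^0-9.]", "", prices))
year_numeric  <- as.integer(years)

class(prices)         # "character"
class(price_numeric)  # "numeric"
mean(price_numeric)   # arithmetic now works
```

Within a tidyverse pipeline, the same step is usually written inside mutate(), as in the capstone example later in this tutorial.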
Guided Practical Applications
Now that we have reviewed our core functions, we will progress through five practical examples. We will start with an easy example of text extraction, move on to capturing simple lists, and ultimately automate our scripts using loops and custom functions.
Example 1. Scraping Unstructured Text
🎯 Our Mission: Go to the University of Oregon’s Wikipedia page and extract the main descriptive text.
To scrape, you first need to read the HTML for the page into R. This gives you a similar structural output to the HTML examples we looked at above.
# Request and download the web page code
wiki_html <- read_html("https://en.wikipedia.org/wiki/University_of_Oregon")
# Print the object
wiki_html
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available skin-thumbsize-clientpref-standard" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...
Next, you need to find the CSS selector for the information you want to scrape. For this, you have two primary options:
Right-Click & Inspect: Go to the website, right-click on the text you want, and select “Inspect”. This opens a developer sidebar filled with HTML.
💡 Pro-tip: Click the small “Arrow/Cursor” icon at the top left of the Inspect sidebar. Now, when you hover your mouse over different parts of the webpage, it will automatically highlight the corresponding HTML element in the sidebar!
SelectorGadget: Use the free Chrome extension we mentioned earlier to point and click on the elements you want.
For Wikipedia, all standard body text is wrapped in paragraph <p> tags. Once we know that, we can use html_text2() to extract the clean text inside those tags.
💡 Note: Any messy HTML escape characters are automatically cleaned up by rvest behind the scenes. For instance, raw source code containing &amp; will be automatically decoded into a standard ampersand (&) by the time it reaches your script.
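This decoding is easy to confirm on a one-line example. The snippet below is a small offline sketch; the paragraph text is made up for illustration.

```r
library(rvest)

# Raw HTML with an escaped ampersand, as it appears in page source
raw_html <- "<p>Research &amp; Innovation</p>"

decoded <- minimal_html(raw_html) %>%
  html_element("p") %>%
  html_text2()

decoded  # "Research & Innovation"
```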
# Extract text from all paragraph tags (<p>)
wiki_paragraphs <- wiki_html %>%
html_elements("p") %>%
html_text2()
Now you have your data! In the next step, you can clean it and use it in whatever way works best for your analysis.
For example, we can make a Word Cloud:
# Convert the text vector into a data frame and tokenize into individual words
wiki_dataframe <- data.frame(text_content = wiki_paragraphs)
tokenized_words <- wiki_dataframe %>%
unnest_tokens(output = word, input = text_content)
# Create custom stop words (removing 'university' and 'oregon' since they are obvious)
custom_filters <- bind_rows(
tibble(word = c("university", "oregon"), lexicon = c("custom")),
stop_words
)
# Build the word cloud
tokenized_words %>%
count(word) %>%
anti_join(custom_filters, by = join_by(word)) %>%
with(wordcloud(
words = word,
freq = n,
color = "forestgreen",
rot.per = 0,
scale = c(2.8, 0.25)
))
Or, we can build a Sentiment Profile to see the emotional language used on the page:
# Join with the Bing sentiment lexicon and count frequencies
sentiment_scores <- tokenized_words %>%
inner_join(get_sentiments("bing"), by = "word") %>%
count(word, sentiment, sort = TRUE) %>%
anti_join(custom_filters, by = "word") %>%
ungroup()
# Visualize the emotional profile
sentiment_scores %>%
group_by(sentiment) %>%
slice_max(n, n = 7, with_ties = FALSE) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = n, y = word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
theme_minimal() +
facet_wrap(~sentiment, scales = "free_y") +
scale_fill_manual(values = c("negative" = "firebrick", "positive" = "forestgreen")) +
labs(
title = "Wikipedia Sentiment Profile: University of Oregon",
subtitle = "Analysis of positive vs. negative emotional descriptors in body text",
x = "Word Frequency Count",
y = NULL
)
Example 2. Extracting Structured Tables
🎯 Our Mission: Extract the Major League Baseball “Team Standard Batting” table for the 2025 season.
In our Wikipedia example, targeting every single paragraph ("p") worked perfectly because it was unstructured text. But on most modern websites, data is scattered across thousands of nested tags. Requesting a broad tag like table might pull a mountain of junk data or sidebar menus.
To get pinpoint accuracy, we can target a specific CSS ID. Since IDs are completely unique, we can grab the exact table element we want, and then use html_table() to instantly convert that HTML grid into a clean R data frame.
# Define URL
baseball_url <- "https://www.baseball-reference.com/leagues/majors/2025.shtml"
# Download the page source
page <- read_html(baseball_url)
# Isolate the table by its ID, parse it, and pull the first data frame
batting_2025 <- page %>%
html_elements("#teams_standard_batting") %>%
html_table() %>% pluck(1)
# Preview the scraped table
head(batting_2025)
# A tibble: 6 × 29
Tm `#Bat` BatAge `R/G` G PA AB R H `2B` `3B` HR
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Arizona D… 65 27.8 4.88 162 6210 5480 791 1377 277 38 214
2 Athletics 58 26.1 4.52 162 6151 5547 733 1403 296 16 219
3 Atlanta B… 71 28.3 4.47 162 6186 5508 724 1349 243 19 190
4 Baltimore… 70 26.5 4.18 162 6020 5416 677 1273 251 19 191
5 Boston Re… 56 27.6 4.85 162 6206 5562 786 1414 324 24 186
6 Chicago C… 52 28.4 4.90 162 6162 5495 793 1371 267 29 223
# ℹ 17 more variables: RBI <chr>, SB <chr>, CS <chr>, BB <chr>, SO <chr>,
# BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>, TB <chr>,
# GDP <chr>, HBP <chr>, SH <chr>, SF <chr>, IBB <chr>, LOB <chr>
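The same pattern can be tried offline on a tiny mock page. In the sketch below, the table contents and the sidebar table are made up for illustration; only the ID selector and the html_table() call mirror the real example.

```r
library(rvest)

# A mock page with two tables; only one has the id we want.
# The team names and numbers are made up for illustration.
page <- minimal_html('
  <table id="sidebar"><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
  <table id="teams_standard_batting">
    <tr><th>Tm</th><th>G</th><th>HR</th></tr>
    <tr><td>Team A</td><td>162</td><td>214</td></tr>
    <tr><td>Team B</td><td>162</td><td>190</td></tr>
  </table>
')

# The ID selector skips the sidebar table and returns only the target
batting <- page %>%
  html_element("#teams_standard_batting") %>%
  html_table()

batting
```

Note that in the real output above every column arrived as <chr>, so numeric columns such as HR would still need a conversion step (for example as.numeric()) before any arithmetic.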
Example 2.b
So far, we have learned how to harvest individual pieces of information from a page (like a block of paragraphs or a single table). While this is powerful, you could technically achieve the same result by manually copying and pasting the information into an Excel file.
The real magic of web scraping happens when we automate! What if you want to analyze baseball data across multiple seasons? Instead of manually navigating to and copy-pasting seven different web tables, we can write a for loop to build a “crawler” that auto-navigates across years and automatically stacks the datasets together.
🎯 Our Mission: Build a loop that scrapes Team Standard Batting data from 2020 through 2026, appends the correct calendar year, and merges everything into one giant dataset.
# Make an empty data frame to act as our master repository
all_batting <- data.frame()
# Loop through each year in our target range
for(year in 2020:2026){
# Dynamically construct the URL for the current year
sr_url <- paste0("https://www.baseball-reference.com/leagues/majors/", year,".shtml")
# Read the page HTML
page <- read_html(sr_url)
# Scrape the table and append a column tracking the calendar year
batting <- page %>%
html_elements("#all_teams_standard_batting") %>%
html_table() %>% pluck(1) %>%
mutate(year = year)
# Stack the current year's data onto our master data frame
all_batting <- rbind(all_batting, batting)
}
# Preview the combined historical dataset
head(all_batting)
# A tibble: 6 × 30
Tm `#Bat` BatAge `R/G` G PA AB R H `2B` `3B` HR
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Arizona D… 45 29.1 4.48 60 2238 1997 269 482 101 12 58
2 Atlanta B… 48 28.2 5.80 60 2344 2074 348 556 130 3 103
3 Baltimore… 45 26.3 4.57 60 2242 2026 274 523 102 7 77
4 Boston Re… 47 27.0 4.87 60 2304 2083 292 552 118 7 81
5 Chicago C… 47 27.9 4.42 60 2214 1918 265 422 82 8 74
6 Chicago W… 48 27.6 5.10 60 2267 2047 306 534 94 6 96
# ℹ 18 more variables: RBI <chr>, SB <chr>, CS <chr>, BB <chr>, SO <chr>,
# BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>, TB <chr>,
# GDP <chr>, HBP <chr>, SH <chr>, SF <chr>, IBB <chr>, LOB <chr>, year <int>
Example 3. Extracting Repeated Items (Plural vs. Singular)
🎯 Our mission: Capture a list of the most popular movies currently playing in theaters.
For this example, we will collect the titles of trending films from the Rotten Tomatoes website using a CSS class selector. This exercise teaches a vital lesson about how rvest handles data structures depending on whether you use a plural or singular function.
Part A. The Wide Net (html_elements)
Here, our goal is to grab a complete list of all the popular movies on the page. For this, we use the plural html_elements() function. This acts like a massive dragnet, pulling every single matching card container labeled with the class .js-tile-link.
# Establish the target URL
rt_url <- "https://www.rottentomatoes.com/browse/movies_in_theaters/sort:popular"
# Read the HTML structure
rt_html <- read_html(rt_url)
# Extract text using html_elements function
popular_movies <- rt_html %>%
html_elements(".js-tile-link") %>%
html_text2()
# Convert to a clean table format
rt_df <- as_tibble(popular_movies)
# View the first 5 rows
head(rt_df, 5)
# A tibble: 5 × 1
value
<chr>
1 62% 88% Star Wars: The Mandalorian and Grogu Opened May 22, 2026
2 47% 53% Passenger Opened May 22, 2026
3 92% 72% I Love Boosters Opened May 22, 2026
4 85% Backrooms Opens May 29, 2026
5 94% 92% Tuner Opened May 22, 2026
📊 Notice that rt_df is a long, multi-row data frame. R found dozens of elements sharing that class and packed them into a single character vector, which as_tibble() then turned into one row per movie.
Part B: The Singular Target (html_element)
Now, let’s observe what happens if we accidentally make a tiny typo and leave off the “s”, changing the function to the singular html_element().
# Extract text using html_element function
popular_movies <- rt_html %>%
html_element(".js-tile-link") %>%
html_text2()
# Convert to a clean table format for viewing
rt_df <- as_tibble(popular_movies)
head(rt_df, 5)
# A tibble: 1 × 1
value
<chr>
1 62% 88% Star Wars: The Mandalorian and Grogu Opened May 22, 2026
Here, instead of a long list of movies, your data frame has exactly one row containing a single movie title (the very first movie on the page). When applied directly to a webpage’s main HTML body, the singular html_element() function stops hunting the moment it hits its very first match.
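The difference is easy to reproduce without touching the live site. The sketch below uses a mock page with three tiles; the movie titles are placeholders.

```r
library(rvest)

# Mock listing with three tiles; titles are placeholders
page <- minimal_html('
  <a class="js-tile-link">Movie One</a>
  <a class="js-tile-link">Movie Two</a>
  <a class="js-tile-link">Movie Three</a>
')

plural   <- page %>% html_elements(".js-tile-link") %>% html_text2()
singular <- page %>% html_element(".js-tile-link") %>% html_text2()

length(plural)    # 3
length(singular)  # 1
singular          # "Movie One"
```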
Example 4. Scaling Up with Custom Functions
🎯 Our mission: Build a reusable R function that loops through a list of Google Scholar profile URLs, extracts faculty names and their core citation metrics (Total Citations, h-index, and i10-index), and aggregates them into a master database.
Writing a standalone for loop works great for a one-off task. However, if you plan to scrape multiple departments or monitor faculty metrics over time, wrapping your scraping pipeline inside a custom function makes your code clean, modular, and easy to reuse.
When we look at a Google Scholar profile, the citation metrics are tucked inside a neat visual table. Under the hood, however, html_elements("td") strips away the table structure and flattens all cells into a single linear text vector.
To pull out the numbers we want, we use positional indexing based on how Google Scholar organizes its rows:
stats[1] = "Citations" (row header) → stats[2] = total citations
stats[4] = "h-index" (row header) → stats[5] = total h-index
stats[7] = "i10-index" (row header) → stats[8] = total i10-index
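The indexing can be checked on a mock table with the same layout. In this sketch the numbers are illustrative and the third cell in each row is a placeholder for the extra column on the profile page. It also shows a label-based lookup as one possible alternative, which keeps working if the cell positions ever shift.

```r
library(rvest)

# Mock of the metrics table layout described above (values are illustrative)
page <- minimal_html('
  <table>
    <tr><td>Citations</td><td>2667</td><td>1900</td></tr>
    <tr><td>h-index</td><td>21</td><td>18</td></tr>
    <tr><td>i10-index</td><td>23</td><td>22</td></tr>
  </table>
')

# Flatten every cell into one character vector
stats <- page %>% html_elements("td") %>% html_text2()

# Positional lookup, as described in the text
by_position <- as.numeric(stats[c(2, 5, 8)])
by_position  # 2667 21 23

# Label-based lookup: each value sits one cell after its row header
labels <- c("Citations", "h-index", "i10-index")
totals <- as.numeric(stats[match(labels, stats) + 1])
names(totals) <- labels
totals
```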
Here is how we assemble this logic into an automated function:
# Define a custom web scraping function
gs_stats <- function(url){
# Make an empty master data frame
df_out <- data.frame()
# Loop through each profile URL
for(link in url){
# Download the HTML
page <- read_html(link)
# Extract the scholar's name
person <- page %>%
html_elements("#gsc_prf_in") %>%
html_text2()
# Get the table cells for total citation stats
stats <- page %>%
html_elements("td") %>%
html_text2()
# Make data frame by hard-coding indexing 'stats'
df_tmp <- data.frame(name = person,
citation = as.numeric(stats[2]),
h_index = as.numeric(stats[5]),
i10_index = as.numeric(stats[8]))
# Append the row onto our master repository
df_out <- rbind(df_out, df_tmp)
}
# Return the finalized, complete dataset
return(df_out)
}
Now that the function is ready, let’s try it on a list of UO Psychology faculty members:
# Get list of some faculty Google Scholar URLs
uo_urls <- c("https://scholar.google.com/citations?user=vHEQPGUAAAAJ&hl=en",
"https://scholar.google.com/citations?user=4X4X4xkAAAAJ&hl=en",
"https://scholar.google.com/citations?user=jCxd8-UAAAAJ&hl=en",
"https://scholar.google.com/citations?hl=en&user=WOAdX44AAAAJ",
"https://scholar.google.com/citations?hl=en&user=hZ-YQ3AAAAAJ",
"https://scholar.google.com/citations?hl=en&user=vkhJVbkAAAAJ",
"https://scholar.google.com/citations?hl=en&user=RG9lc0QAAAAJ",
"https://scholar.google.com/citations?hl=en&user=Y7S8ybkAAAAJ",
"https://scholar.google.com/citations?hl=en&user=RewU7lQAAAAJ",
"https://scholar.google.com/citations?hl=en&user=1bzbVeYAAAAJ",
"https://scholar.google.com/citations?hl=en&user=PsFofkQAAAAJ",
"https://scholar.google.com/citations?hl=en&user=181SoTAAAAAJ",
"https://scholar.google.com/citations?hl=en&user=Wy5XKmEAAAAJ",
"https://scholar.google.com/citations?hl=en&user=SRSfyf4AAAAJ",
"https://scholar.google.com/citations?hl=en&user=Jap8Z-cAAAAJ",
"https://scholar.google.com/citations?hl=en&user=hC6IzXMAAAAJ",
"https://scholar.google.com/citations?hl=en&user=krj_eKwAAAAJ",
"https://scholar.google.com/citations?hl=en&user=oqLsBt0AAAAJ"
)
# Run the function on our list of URLs
citations <- gs_stats(uo_urls)
head(citations)
name citation h_index i10_index
1 Robert S. Chavez 2667 21 23
2 Michael I. Posner 100881 157 446
3 Elliot T Berkman 11143 50 90
4 Sara J Weston 2328 26 39
5 Kate Celis Mills 12648 42 65
6 Chanel Meyers 676 10 10
Example 5. Bringing It All Together
Now, let’s practice everything we have learned on a real-world sandbox. The web community maintains a live, public website called Books to Scrape specifically designed for developers to practice web scraping safely, legally, and ethically.
🎯 Our mission: Build a master data frame containing every single book’s Title, Price, Stock Status, Numeric Star Rating, and Detail Link, and then crawl inside to extract its Product Description.
To make this capstone project manageable, we will break it down into 3 progressive steps:
Step 1: Capture the available information on the first page: Title, Price, Stock Status, Star Rating, and Link.
Step 2: Wrap our blueprint inside a loop to glide through all 50 directory pages.
Step 3: Write a custom function to step inside those book links and extract the hidden product descriptions.
Step 1. Build the Single-Page Blueprint
Before writing any code, let’s look at the layout of the Books to Scrape homepage. The website displays a clean, repeating grid of 20 books. Visually, each book sits inside its own self-contained “card” or box that holds its specific title, cover image, rating, price, and stock status.
Your first instinct might be to cast three independent nets across the whole page: one net to grab all 20 titles, one for all 20 prices, and one for all 20 stock statuses.
This works fine on a perfect webpage, but it introduces a massive real-world risk. If just one book on the page is missing a price, R will return 20 titles but only 19 prices. When you try to force these uneven lists into a data frame, R will either crash with a dimensional error, or worse, it will silently misalign your rows, attaching the wrong price to the wrong book for the entire rest of your dataset!
To prevent this, we mimic the visual layout of the page using a Parent-Child strategy:
The Parent Container: We use the plural html_elements() to cut the webpage into 20 individual product “cards” (the parents).
The Child Elements: We then look inside each individual card using the singular html_element() to extract the title, price, and stock status (the children).
Now, if a child element (like a price) is missing inside a card, R will automatically insert an NA, preserving the row alignment of our data.
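Here is a small offline sketch of that failure mode and the fix. The mock page has three cards, and the second one is deliberately missing its price; the titles and prices are made up.

```r
library(rvest)

# Mock page: the second card has no price (titles and prices are made up)
page <- minimal_html('
  <article class="product_pod"><h3>Book A</h3><p class="price_color">£10.00</p></article>
  <article class="product_pod"><h3>Book B</h3></article>
  <article class="product_pod"><h3>Book C</h3><p class="price_color">£12.50</p></article>
')

# Independent nets: the lengths no longer match
titles_all <- page %>% html_elements("h3") %>% html_text2()
prices_all <- page %>% html_elements(".price_color") %>% html_text2()
c(length(titles_all), length(prices_all))  # 3 2

# Parent-child: one result per card, NA where the child is missing
cards <- page %>% html_elements(".product_pod")
catalog <- data.frame(
  title = cards %>% html_element("h3") %>% html_text2(),
  price = cards %>% html_element(".price_color") %>% html_text2()
)
catalog
```

The second row keeps its title and gets NA for the price, so every price stays attached to the right book.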
# Ingest the homepage HTML
books_url <- "http://books.toscrape.com/"
books_html <- read_html(books_url)
# Isolate the parent card containers for all 20 books on the page
book_cards <- books_html %>%
html_elements(".product_pod")
# Extract base attributes row-by-row
book_catalog <- tibble(
# Title (Attribute lookup prevents text truncation)
title = book_cards %>%
html_element("h3 a") %>%
html_attr("title"),
# Price
price = book_cards %>%
html_element(".price_color") %>%
html_text2(),
# Availability
stock_status = book_cards %>%
html_element(".instock.availability") %>%
html_text2(),
# Star Rating
rating = book_cards %>%
html_element(".star-rating") %>%
html_attr("class") %>%
# remove the repetitive text prefix
str_remove("star-rating "),
# The relative URL destination
link = book_cards %>%
html_element("h3 a") %>%
html_attr("href")
)
# Preview our table
head(book_catalog, 5)
# A tibble: 5 × 5
title price stock_status rating link
<chr> <chr> <chr> <chr> <chr>
1 A Light in the Attic £51.77 In stock Three catalogue/a-…
2 Tipping the Velvet £53.74 In stock One catalogue/ti…
3 Soumission £50.10 In stock One catalogue/so…
4 Sharp Objects £47.82 In stock Four catalogue/sh…
5 Sapiens: A Brief History of Humankind £54.23 In stock Five catalogue/sa…
🔍 Deep Dive: Text vs. Attributes
To build this blueprint, we had to rely on two completely different harvesting tools: html_text2() and html_attr().
html_text2(): Extracts the literal text humans can read on the screen.
- Example: If the HTML is <p>Price: £10</p>, using html_text2() gives you "Price: £10".
html_attr(): Scans the hidden metadata written directly inside the HTML tag itself. Humans don’t see this text on the browser page, but the computer needs it to function.
If we right-click a book card and select Inspect using our browser’s Developer Tools, we can see exactly how this background HTML structure is built:
- Bypassing Text Truncation (title). Look closely at the text between the anchor tags: >A Light in the ...</a>. Because the title is long, the webpage visually cuts it off with an ellipsis. If we used html_text2(), we would scrape that broken, incomplete name. However, notice the hidden attribute right before it: title="A Light in the Attic". By calling html_attr("title"), we bypass the screen limitations and extract the full, uncut title text.
- Gathering Hyperlinks for Future Navigation (href). Look at the anchor tag’s web path attribute: href="catalogue/a-light-in-the-attic_1000/index.html". A link’s destination is usually not spelled out as readable text on the screen; it is stored in the href attribute. We use html_attr("href") to harvest this path, which will eventually allow our scraper to navigate directly into this book’s unique sub-page.
- Reading Star Ratings from Graphic Icons (class). Finally, on the screen, star ratings are just visual graphic icons. There are no words like “3 stars” printed on the page for a standard text-reader to scan. Instead, notice how the rating is embedded directly inside the paragraph tag’s class attribute: <p class="star-rating Three">. Because there is no actual plain text written between the opening <p> and closing </p> tags, running html_text2() here would return an empty, blank string. By targeting html_attr("class"), we pull the raw text “star-rating Three”, allowing us to cleanly strip away the repetitive prefix using str_remove().
💡 Summary Rule: Whenever the data you want isn’t written out in plain English on the browser window, look inside the HTML tag attributes. Web links (href), image sources (src), and styling categories (class) are all structural barcodes waiting to be scanned by html_attr().
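The three cases can be reproduced on a mock book card. The sketch below rebuilds a simplified card from the attributes quoted above (the real card contains more markup) and compares what html_text2() and html_attr() return.

```r
library(rvest)

# Simplified mock book card built from the attributes quoted above
card <- minimal_html('
  <article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
           title="A Light in the Attic">A Light in the ...</a></h3>
  </article>
')

link <- html_element(card, "h3 a")
shown_title <- html_text2(link)          # the truncated on-screen text
full_title  <- html_attr(link, "title")  # "A Light in the Attic"
book_path   <- html_attr(link, "href")   # the relative link

# The rating lives in the class attribute, not in the visible text
stars  <- html_element(card, ".star-rating")
rating <- sub("star-rating ", "", html_attr(stars, "class"), fixed = TRUE)
rating  # "Three"
```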
Step 2. Get the information for other pages (For Loop)
Now that our single-page code works perfectly, we can bring this logic into an automation engine. We will write a loop that constructs the URLs for all 50 directory pages, harvests their item grids using our blueprint, and stacks them together.
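Note: Here we initialize an empty data frame and append each page’s rows with bind_rows(). This is simple and works well for 50 pages. For much larger jobs, collecting each page’s table in a list and binding once at the end is usually faster, because growing a data frame inside a loop copies it on every iteration.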
# Make an empty master data frame warehouse
all_books_catalog <- tibble()
# Begin the automation engine loop across all 50 directory pages
for (page_num in 1:50) {
# Construct the target URL path
page_url <- paste0("http://books.toscrape.com/catalogue/page-", page_num, ".html")
# Read the HTML source of the current page
page_html <- read_html(page_url)
# The code we wrote in the previous step, now applied to the current page
book_cards <- page_html %>%
html_elements(".product_pod")
page_table <- tibble(
title = book_cards %>%
html_element("h3 a") %>%
html_attr("title"),
price = book_cards %>%
html_element(".price_color") %>%
html_text2(),
stock_status = book_cards %>%
html_element(".instock.availability") %>%
html_text2(),
rating = book_cards %>%
html_element(".star-rating") %>%
html_attr("class") %>%
str_remove("star-rating "),
link = book_cards %>%
html_element("h3 a") %>%
html_attr("href")
) %>%
# Clean the data
mutate(
price_numeric = as.numeric(str_remove(price, "£")),
# Look up each rating word in a named vector; unmatched values become NA
rating_numeric = unname(c(One = 1, Two = 2, Three = 3, Four = 4, Five = 5)[rating])
) %>%
select(-c(price, rating))
# Add this page's rows to the master data frame
all_books_catalog <- bind_rows(all_books_catalog, page_table)
# Optional: print a progress message to the R console
# message(paste("Successfully processed catalog page:", page_num, "/ 50"))
# Optional: pause between requests to stay polite to the server
# Sys.sleep(0.1)
}
# Verify that the row count matches 1,000 items exactly
print(dim(all_books_catalog))
[1] 1000 5
# Output
head(all_books_catalog)
# A tibble: 6 × 5
title stock_status link price_numeric rating_numeric
<chr> <chr> <chr> <dbl> <dbl>
1 A Light in the Attic In stock cata… 51.8 3
2 Tipping the Velvet In stock cata… 53.7 1
3 Soumission In stock cata… 50.1 1
4 Sharp Objects In stock cata… 47.8 4
5 Sapiens: A Brief History of H… In stock cata… 54.2 5
6 The Requiem Red In stock cata… 22.6 1
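For larger crawls, one common alternative to growing a data frame inside the loop is to collect each page’s table in a list and combine everything once at the end. The sketch below illustrates the pattern offline; scrape_page() is a hypothetical stand-in that returns a small made-up table instead of downloading anything.

```r
# Hypothetical stand-in for "scrape one page": returns a small data frame.
# In a real scraper this would wrap read_html() and the blueprint code.
scrape_page <- function(page_num) {
  data.frame(page = page_num, title = paste("Book", page_num, 1:2))
}

# Collect each page's table in a list, then combine once at the end
page_tables <- lapply(1:5, scrape_page)
all_pages <- do.call(rbind, page_tables)

dim(all_pages)  # 10 2
```

With dplyr loaded, bind_rows(page_tables) does the same job as do.call(rbind, page_tables).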
Step 3: Get the Product Description (Deep Crawling)
If you click on an individual book, you will find a rich paragraph containing its Product Description. Because this text block lives exclusively inside each item’s unique detail sub-page, our scraper must act like a “deep crawler”: it needs to grab the hyperlinks we collected in Step 1, travel inside each sub-page link, extract the description text, and bring it back to our master data frame.
When navigating between pages, you will run into a subtle pathing bug. On the primary homepage, the book links naturally include a folder prefix: http://books.toscrape.com/catalogue/relative_url. However, on the subsequent catalog pages (pages 2 through 50), the links drop that prefix and look like this: http://books.toscrape.com/relative_url.
If we blindly paste a base domain onto these inconsistent paths, half of our URLs will break! To build a bulletproof crawling engine, we will write a custom function that uses str_replace() to strip away any redundant "catalogue/" strings and cleanly build an absolute URL path every single time.
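Before touching the network, the path logic can be checked offline. The sketch below is illustrative only: the two hrefs are hypothetical examples of the two styles described above, and `build_book_url()` is a made-up helper name. It also shows an alternative that may be worth knowing, `xml2::url_absolute()`, which resolves a relative href against the URL of the page it was scraped from.

```r
library(stringr)
library(xml2)

# Hypothetical hrefs in the two styles described above
href_from_homepage <- "catalogue/a-light-in-the-attic_1000/index.html"
href_from_page_2   <- "a-light-in-the-attic_1000/index.html"

# Approach 1: strip an optional leading "catalogue/" and add it back exactly once
build_book_url <- function(href) {
  paste0("http://books.toscrape.com/catalogue/", str_remove(href, "^catalogue/"))
}

url_a <- build_book_url(href_from_homepage)
url_b <- build_book_url(href_from_page_2)

# Approach 2: resolve each href against the page it came from
url_c <- url_absolute(href_from_homepage,
                      base = "http://books.toscrape.com/index.html")
url_d <- url_absolute(href_from_page_2,
                      base = "http://books.toscrape.com/catalogue/page-2.html")

# All four should point at the same detail page
c(url_a, url_b, url_c, url_d)
```

The first approach is the one used in this tutorial. The second needs you to keep track of which page each link came from, but it does not hard-code the folder name.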
# Define our deep-crawling subpage extraction function
scrape_book_description <- function(relative_url) {
# Clean up the relative URL path to ensure we don't double-paste "catalogue/"
clean_path <- str_replace(relative_url, "^catalogue/", "")
full_url <- paste0("http://books.toscrape.com/catalogue/", clean_path)
# Ingest page source code safely; return NA if a network connection error occurs
detail_html <- tryCatch(read_html(full_url), error = function(e) return(NULL))
if (is.null(detail_html)) return(NA)
# Target the unique descriptive paragraph block inside the page article element
description_text <- detail_html %>%
html_element("article.product_page > p") %>%
html_text2()
return(description_text)
}
# --- Test the Deep Crawl Loop on the first 5 books ---
descriptions_vector <- c()
# Pull links directly from our page-one blueprint data frame
target_links <- book_catalog$link[1:5]
for (link in target_links) {
# Run our custom function recipe
current_desc <- scrape_book_description(link)
descriptions_vector <- c(descriptions_vector, current_desc)
# Optional: pause half a second between requests for polite pacing (uncomment to enable)
# Sys.sleep(0.5)
}
# Bind our freshly scraped descriptions onto a subset of our data frame
deep_catalog_sample <- book_catalog %>%
slice(1:5) %>%
mutate(product_description = descriptions_vector)
# View our data
head(deep_catalog_sample)
# A tibble: 5 × 6
title price stock_status rating link product_description
<chr> <chr> <chr> <chr> <chr> <chr>
1 A Light in the Attic £51.… In stock Three cata… "It's hard to imag…
2 Tipping the Velvet £53.… In stock One cata… "\"Erotic and abso…
3 Soumission £50.… In stock One cata… "Dans une France a…
4 Sharp Objects £47.… In stock Four cata… "WICKED above her …
5 Sapiens: A Brief History … £54.… In stock Five cata… "From a renowned h…
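The for loop above grows a vector one element at a time. A common alternative, sketched here under the assumption that you are comfortable with purrr, is `map_chr()` combined with `possibly()`, which turns any error into a fallback value. To keep the sketch runnable offline, `fake_scraper()` and `fake_descriptions` are hypothetical stand-ins for `scrape_book_description()` and the real detail pages.

```r
library(purrr)

# Hypothetical stand-in for the real scraper: looks descriptions up in a
# small named vector and raises an error for unknown links
fake_descriptions <- c(
  "book-a/index.html" = "Description of book A",
  "book-b/index.html" = "Description of book B"
)
fake_scraper <- function(link) {
  if (!link %in% names(fake_descriptions)) stop("page not found: ", link)
  fake_descriptions[[link]]
}

# possibly() wraps a function so that any error becomes a fallback value
safe_scraper <- possibly(fake_scraper, otherwise = NA_character_)

links <- c("book-a/index.html", "missing/index.html", "book-b/index.html")

# map_chr() calls the scraper once per link and returns a character vector
descriptions <- map_chr(links, safe_scraper)
descriptions
```

With the real function you would pass `scrape_book_description` to `possibly()` instead, and add a pause inside the wrapped function if you want to pace your requests.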
Final Note: Static vs. Dynamic Websites
Every example we built today was on a static website. This means the data is baked directly into the raw HTML code, allowing rvest::read_html() to catch it easily.
As you venture out to scrape other websites, you will eventually encounter dynamic websites (like modern e-commerce sites or social media feeds). These sites use JavaScript to load their data on the fly. Because read_html() downloads the source code without running any JavaScript, it may return a page that lacks the data you see in your browser. To scrape those sites, you have to learn a separate set of advanced tools, such as browser automation (chromote or RSelenium), which allow R to drive a live browser and let the page finish loading before harvesting.
For now, celebrate your win! You have learned the core foundation of static HTML scraping, which covers a large portion of the open web.
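The difference is easy to see with two tiny hand-written pages. This is a minimal sketch with made-up HTML, not a real website: in the first page the list items are already in the source, while in the second they would only appear after a browser runs the script.

```r
library(rvest)

# Hypothetical static page: the book titles are already in the HTML source
static_page <- read_html('
  <html><body>
    <ul id="books">
      <li class="book">Book A</li>
      <li class="book">Book B</li>
    </ul>
  </body></html>')

# Hypothetical dynamic page: the list stays empty until a browser runs the script
dynamic_page <- read_html('
  <html><body>
    <ul id="books"></ul>
    <script>
      var list = document.getElementById("books");
      ["Book A", "Book B"].forEach(function (title) {
        var item = document.createElement("li");
        item.className = "book";
        item.textContent = title;
        list.appendChild(item);
      });
    </script>
  </body></html>')

# Count the book items that the parser can actually see in each page
n_static  <- length(html_elements(static_page, "li.book"))
n_dynamic <- length(html_elements(dynamic_page, "li.book"))

c(static = n_static, dynamic = n_dynamic)
```

If a selector that works in your browser's inspector returns nothing in R, this is a likely cause: the parser only sees the source code as delivered, before any script has run.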
Mini Hacks
Mini Hack #1:
Go to the Books to Scrape homepage. Look at the left-hand side of the screen. There is a list of book Categories (Travel, Mystery, Historical Fiction, etc.). Write a short script to scrape these category names into a clean list or data frame.
base_url <- "http://books.toscrape.com/"
Mini Hack #2:
Click on the first book on the website (“A Light in the Attic”) to go to its detail page. Scroll to the bottom of the page. You will see a structured Product Information table containing the UPC, Product Type, Price, and Tax. Extract this entire table into an R data frame.
detail_url <- "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
Mini Hack #3:
Find a public website and extract some interesting data. It can be a table of numbers, a list of items, or paragraphs of text.
Clean your extracted data and load it into a neat R data frame. Print the head() of your data frame.
Run some simple descriptive statistics (like averages or frequency counts). You can also use ggplot2 or wordcloud to create a visualization of the trends you discovered.
your_url <- ""