Week 9_Web Scraping

Author

Fatemeh Ghazavi Khorasgani

Introduction to Web Scraping

👩‍💻 What is Web Scraping

This tutorial introduces you to the basics of web scraping using rvest, an R package within the tidyverse, designed for easy web scraping.

Web scraping is the process of automating the extraction of data from websites. Think of it this way: instead of manually clicking around a website, highlighting text, and copy-pasting numbers into an Excel sheet, you write a script to build a “data bot.” This bot systematically reads the website’s underlying source code, targets the exact information you ask for, and neatly packs it away into an R dataframe.

⚙️ Prerequisites & Toolkit

Before we can build our data bot, we need to give R the right tools. We will use a combination of four core packages to gather, clean, and visualize our data.

library(tidyverse)
library(rvest)
library(tidytext)
library(wordcloud)

⚠️ The Legal & Ethical Disclaimer

Before we start scraping, we must talk about data ethics.

Some websites hate web scrapers. If you write a loop that hits a small business website 10,000 times a second, your script acts like a cyberattack and will likely crash their server, disrupt their business, and get your IP address permanently banned.

The Golden Rules of Scraping:

Public vs. Private:

If you have to log in to a website with a username and password to see data, it is private. Scraping behind a login screen almost always violates a site’s Terms of Service and can cross legal lines.

🛑 Look up the infamous OkCupid disaster of 2016, where researchers scraped and publicly released the deeply personal profiles of 70,000 users without their consent or anonymization. Remember, “publicly accessible” does not always mean “free to scrape and exploit.”

Facts vs. Copyright:

Under intellectual property law, you cannot copyright raw facts. For example, a list of food ingredients or a textbook’s table of contents cannot be copyrighted. You can legally scrape those facts. However, creative expressions are copyrighted. You cannot legally scrape, analyze, and/or re-publish copyrighted content without permission.

Read the Website’s Permission Slip:

Most major websites publish a plain-text file called robots.txt that tells automated bots which parts of the site they are allowed to look at. You can view it by adding /robots.txt to the site's root address (for example, example.com/robots.txt).

A standard robots.txt looks like this:

User-agent: *
Disallow: /products/
Allow: /blog/

If a website explicitly writes Disallow: /products/ in this file, your bot should respect that rule and not read anything inside the /products/ folder of that website.
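As a rough illustration of how such a rule could be checked from R, the sketch below tests a path against the Disallow lines of the example file above. The helper name is_disallowed() is hypothetical, and the logic is deliberately simplified: it ignores User-agent groups, Allow overrides, and wildcard patterns, so treat it as a teaching aid and not a full robots.txt parser.

```r
# The example robots.txt from above, one rule per element
robots_txt <- c("User-agent: *", "Disallow: /products/", "Allow: /blog/")

# Hypothetical helper: TRUE when `path` starts with any Disallow prefix.
# Simplification: ignores User-agent groups, Allow overrides, and wildcards.
is_disallowed <- function(path, rules) {
  disallow <- sub("^Disallow:\\s*", "", grep("^Disallow:", rules, value = TRUE))
  disallow <- disallow[nzchar(disallow)]  # an empty Disallow blocks nothing
  any(startsWith(path, disallow))
}

is_disallowed("/products/shoes.html", robots_txt)
is_disallowed("/blog/post-1.html", robots_txt)
```

The first call reports that the path is off-limits, the second that it is not.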

Be Polite:

When scraping multiple pages, always build artificial pauses into your code using functions like Sys.sleep() or packages like polite. This spaces out your requests, ensuring you don’t overwhelm the website’s host server.
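A minimal sketch of the pausing idea, assuming a hypothetical helper called fetch_politely() and a stand-in download function so that nothing is actually requested from the web. In real code, the fetch argument could be something like rvest::read_html.

```r
# Hypothetical helper: apply `fetch` to each URL, pausing between requests
fetch_politely <- function(urls, fetch, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[i])
    # Pause before the next request, but not after the last one
    if (i < length(urls)) Sys.sleep(delay)
  }
  results
}

# Stand-in for a real download; no network access happens here
fake_fetch <- function(url) paste("fetched", url)

demo_urls <- paste0("https://example.com/page-", 1:3, ".html")
pages <- fetch_politely(demo_urls, fake_fetch, delay = 0.2)
length(pages)
```

The delay value is a judgment call: a sandbox site can tolerate short pauses, while a small production site deserves longer ones.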

Real-World Note

Finally, some websites have systems in place to identify scraping and/or make it harder. For example, Amazon, eBay, and StubHub use highly advanced anti-bot protections to actively block scrapers. They do this to protect user privacy and prevent competitors from monitoring their real-time pricing.

HTML & CSS Structure

Before we can scrape data, we need to first look at how websites work. When you look at a webpage, you see a polished, beautiful screen. But your scraping robot doesn’t care about aesthetics. It just sees a massive, chaotic wall of text called HTML (HyperText Markup Language) and CSS (Cascading Style Sheets).

HTML (The Skeleton): You can think of HTML as the skeleton of a webpage; it provides the framework that holds all the content together. It says, “Here is a paragraph of text, here is a button, and here is an image.”
CSS (The Clothes/Skin): CSS handles the visual design. It says, “Take that paragraph, make the font Helvetica, dye it blue, and move it to the right side of the screen.”

As a data scraper, you are searching for the HTML because that’s where the actual data (the words and numbers) lives. However, you will use the CSS classes and IDs as your map to find it.

A basic piece of HTML looks like this:

<html>

  <head>
    <title>Page title</title>
  </head>

  <body>
    <h1 id='first'>Heading</h1>
    <p>Paragraph</p>
    <p>Another paragraph with <b>some bold text.</b></p>
    <p class="text">Hi! My name is <b>Fatemeh</b>.</p>
  </body>
  
</html>

Let’s zoom in on that last line of code to understand the anatomy of what your R script is looking at:

<p class="text"> Hi! My name is <b>Fatemeh</b>. </p>

Tags: <p> and <b> are tags. They define what the content is. <p> means paragraph, and <b> means bold text. Tags always open (<p>) and close (</p>) to wrap around data.
Elements: An element is the whole package. Everything from the opening <p> to the closing </p> tag, including the text inside, is considered an element.
Attributes: class="text" is an attribute. Attributes live inside the opening tag and act like metadata labels stuck on a box. We will use these exact labels to tell R which box to open.
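To see these three pieces from R's point of view, here is a small offline sketch that parses the same line with rvest's minimal_html() (which builds a tiny page from a string) and asks for the tag name, the attribute, and the text. The object names snippet and paragraph are made up for this illustration.

```r
library(rvest)

# Parse the single line of HTML from above; no website is contacted
snippet <- minimal_html('<p class="text">Hi! My name is <b>Fatemeh</b>.</p>')

# Grab the paragraph element
paragraph <- html_element(snippet, "p")

html_name(paragraph)           # the tag
html_attr(paragraph, "class")  # the value of the class attribute
html_text2(paragraph)          # the text of the whole element
```

Note that html_text2() returns the text of the entire element, including the words inside the nested <b> tag.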

Extracting Data with CSS Selectors

Okay, we know how websites work. But how do we actually tell R where to look?

HTML isn’t a flat text document; it’s a giant filing cabinet. We don’t want all the information on a webpage; we just want to open specific drawers so we don’t scramble our data. To target these drawers with pinpoint accuracy, we turn the HTML components we just learned into CSS Selectors (our digital addresses).

Here is how you translate HTML components into a selector address:

Tags (e.g., p or h1): Targets a general category. For example, asking R for the p selector will grab every single paragraph on the entire website.
Classes (e.g., .text): Targets a specific style group. If a website has 50 different items labeled with the same class, hunting for that class will instantly grab all 50 items at once.
Notice the dot (.)! To target an HTML attribute like class="text", you must put a dot in front of the class name when writing your R code (i.e., .text).
IDs (e.g., #first): Targets a unique element. An ID should appear only once per page, so R will extract that single, specific item.
Notice the hash symbol (#)! To target an HTML attribute like id="first", you must put a hash in front of the ID name in R (i.e., #first).
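The three selector types can be tried out offline on the sample page from the previous section. This is only a sketch: the page is rebuilt from a string with minimal_html(), and the variable names by_tag, by_class, and by_id are invented for the demo.

```r
library(rvest)

# Rebuild the body of the sample page from a string
page <- minimal_html('
  <h1 id="first">Heading</h1>
  <p>Paragraph</p>
  <p>Another paragraph with <b>some bold text.</b></p>
  <p class="text">Hi! My name is <b>Fatemeh</b>.</p>
')

by_tag   <- page %>% html_elements("p") %>% html_text2()       # every paragraph
by_class <- page %>% html_elements(".text") %>% html_text2()   # class="text" only
by_id    <- page %>% html_elements("#first") %>% html_text2()  # id="first" only

length(by_tag)
by_class
by_id
```

The tag selector matches all three paragraphs, while the class and ID selectors each match a single element.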

🛠️ You don’t have to guess these CSS selectors by digging through raw HTML code all day. Professional developers use a couple of incredible free shortcuts:

SelectorGadget: This is a free Chrome extension. You simply click on the text or table you want directly on your web browser screen, and it generates a CSS selector for your R code. You can install it from the Chrome Web Store.
CSS Diner: This is a free browser game where you use CSS selectors to select items on a dinner table.

The Core Functions of rvest

Now that you know how to find a web address using CSS selectors, let’s look at the actual tools we use in R to retrieve that data. The rvest package relies on five core functions that cover most everyday scraping tasks.

Think of these functions as the sequential steps in your data collection pipeline:

Connecting to the Page
- read_html(): downloads the entire raw HTML structure of the page and loads it into R so your script can read it.
Targeting the Data
- html_elements(): Finds all matches of a specific selector and returns them as a list. We use this to grab repeated items, like an entire list of movie titles or product cards.
- html_element(): Finds only the first match. When applied to a set of elements, it returns exactly one result per element, so rows stay matched up. We use this singular version to search inside our collected lists. This ensures that if a specific item is missing from a product card, R safely records it as NA instead of skipping it and throwing your columns out of alignment.
Extracting the Content
- html_text2(): Extracts the clean text inside the HTML tags. In other words, this function strips away the HTML markup, leaving you with only the human-readable words visible on the screen.
- html_attr(): Extracts data hidden inside the HTML attributes. Sometimes the information you need isn’t displayed as regular text on the screen. Instead, it is embedded inside the HTML tag itself, such as a web address link (href) or an image source file (src). This function allows you to extract those hidden details.

Note: Both html_text2() and html_attr() are text-extraction tools, meaning they always return data as text strings (characters). If you extract a price like "£51.77" or a year like "2025", R will initially treat them as plain words rather than numbers. You need to use standard data wrangling tools to convert them into actual numeric values before doing any math or plotting.
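A short sketch of that conversion step, using readr::parse_number() (part of the tidyverse) for the price and as.integer() for the year. The input strings are the ones mentioned in the note; the variable names are invented.

```r
library(readr)

raw_prices <- c("£51.77", "£53.74")  # text, as a scraper would return it
raw_year   <- "2025"

prices <- parse_number(raw_prices)  # drops the currency symbol, returns numbers
year   <- as.integer(raw_year)

mean(prices)  # arithmetic now works
year + 1L
```

Calling mean() on raw_prices directly would only produce a warning and NA, because the values are still character strings.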

Guided Practical Applications

Now that we have reviewed our core functions, we will progress through five practical examples. We will start with an easy example of text extraction, move on to capturing simple lists, and ultimately automate our scripts using loops and custom functions.

Example 1. Scraping Unstructured Text

🎯 Our Mission: Go to the University of Oregon’s Wikipedia page and extract the main descriptive text.

To scrape, you first need to read the HTML for the page into R. This gives you a similar structural output to the HTML examples we looked at above.

# Request and download the web page code
wiki_html <- read_html("https://en.wikipedia.org/wiki/University_of_Oregon")
# Print the object
wiki_html

{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available skin-thumbsize-clientpref-standard" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

Next, you need to find the CSS selector for the information you want to scrape. For this, you have two primary options:

Right-Click & Inspect: Go to the website, right-click on the text you want, and select “Inspect”. This opens a developer sidebar filled with HTML.

💡 Pro-tip: Click the small “Arrow/Cursor” icon at the top left of the Inspect sidebar. Now, when you hover your mouse over different parts of the webpage, it will automatically highlight the corresponding HTML element in the sidebar!
SelectorGadget: Use the free Chrome extension we mentioned earlier to point and click on the elements you want.

For Wikipedia, all standard body text is wrapped in paragraph (<p>) tags. Once we know that, we can use html_text2() to extract the clean text inside those tags.

💡 Note: Any messy HTML escape characters are automatically cleaned up by rvest behind the scenes. For instance, raw source code containing &amp; will be automatically decoded into a standard ampersand (&) by the time it reaches your script.
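A quick offline check of this decoding, using a made-up one-line page built with minimal_html():

```r
library(rvest)

# The source contains the escaped form of an ampersand
escaped <- minimal_html("<p>Research &amp; Innovation</p>")

decoded <- escaped %>%
  html_element("p") %>%
  html_text2()

decoded
```

The extracted string contains a plain ampersand, not the escape sequence.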

# Extract text from all paragraph tags (<p>)
wiki_paragraphs <- wiki_html %>%
  html_elements("p") %>%
  html_text2()

Now you have your data! In the next step, you can clean it and use it in whatever way works best for your analysis.

For example, we can make a Word Cloud:

# Convert the text vector into a data frame and tokenize into individual words
wiki_dataframe <- data.frame(text_content = wiki_paragraphs)

tokenized_words <- wiki_dataframe %>% 
  unnest_tokens(output = word, input = text_content) 

# Create custom stop words (removing 'university' and 'oregon' since they are obvious)
custom_filters <- bind_rows(
  tibble(word = c("university", "oregon"), lexicon = c("custom")),
  stop_words
)

# Build the word cloud
tokenized_words %>% 
  count(word) %>% 
  anti_join(custom_filters, by = join_by(word)) %>%  
  with(wordcloud(
    words = word, 
    freq = n, 
    colors = "forestgreen", 
    rot.per = 0,
    scale = c(2.8, 0.25)
  ))

Or, we can build a Sentiment Profile to see the emotional language used on the page:

# Join with the Bing sentiment lexicon and count frequencies
sentiment_scores <- tokenized_words %>% 
  inner_join(get_sentiments("bing"), by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  anti_join(custom_filters, by = "word") %>% 
  ungroup()

# Visualize the emotional profile
sentiment_scores %>% 
  group_by(sentiment) %>% 
  slice_max(n, n = 7, with_ties = FALSE) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(x = n, y = word, fill = sentiment)) +
  geom_col(show.legend = FALSE) + 
  theme_minimal() +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_fill_manual(values = c("negative" = "firebrick", "positive" = "forestgreen")) +
  labs(
    title = "Wikipedia Sentiment Profile: University of Oregon", 
    subtitle = "Analysis of positive vs. negative emotional descriptors in body text",
    x = "Word Frequency Count",
    y = NULL
  )

Example 2. Extracting Structured Tables

🎯 Our Mission: Extract the Major League Baseball “Team Standard Batting” table for the 2025 season.

In our Wikipedia example, targeting every single paragraph ("p") worked perfectly because it was unstructured text. But on most modern websites, data is scattered across thousands of nested tags. Requesting a broad tag like table might pull a mountain of junk data or sidebar menus.

To get pinpoint accuracy, we can target a specific CSS ID. Since IDs are completely unique, we can grab the exact table element we want, and then use html_table() to instantly convert that HTML grid into a clean R data frame.
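Before pointing this at a live site, the idea can be tried on a tiny invented page that contains two tables. Everything in the sketch below (the IDs, team names, and numbers) is made up for illustration.

```r
library(rvest)

# A toy page with two tables; only the first one is wanted
page <- minimal_html('
  <table id="team_stats">
    <tr><th>Tm</th><th>G</th><th>HR</th></tr>
    <tr><td>Team A</td><td>162</td><td>214</td></tr>
    <tr><td>Team B</td><td>162</td><td>190</td></tr>
  </table>
  <table id="other">
    <tr><th>x</th></tr>
    <tr><td>1</td></tr>
  </table>
')

# Select the table by its ID, then convert it to a tibble
team_stats <- page %>%
  html_element("#team_stats") %>%
  html_table()

team_stats
```

With the singular html_element(), html_table() returns one tibble directly. With the plural html_elements(), it returns a list of tibbles, which is why a pipeline built on the plural version needs a step such as pluck(1) to pull out the first one.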

# Define URL
baseball_url <- "https://www.baseball-reference.com/leagues/majors/2025.shtml"

# Download the page source
page <- read_html(baseball_url)

# Isolate the table by its ID, parse it, and pull the first data frame
batting_2025 <- page %>% 
  html_elements("#teams_standard_batting") %>% 
  html_table() %>% pluck(1) 

# Preview the scraped table
head(batting_2025)

# A tibble: 6 × 29
  Tm         `#Bat` BatAge `R/G` G     PA    AB    R     H     `2B`  `3B`  HR   
  <chr>      <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Arizona D… 65     27.8   4.88  162   6210  5480  791   1377  277   38    214  
2 Athletics  58     26.1   4.52  162   6151  5547  733   1403  296   16    219  
3 Atlanta B… 71     28.3   4.47  162   6186  5508  724   1349  243   19    190  
4 Baltimore… 70     26.5   4.18  162   6020  5416  677   1273  251   19    191  
5 Boston Re… 56     27.6   4.85  162   6206  5562  786   1414  324   24    186  
6 Chicago C… 52     28.4   4.90  162   6162  5495  793   1371  267   29    223  
# ℹ 17 more variables: RBI <chr>, SB <chr>, CS <chr>, BB <chr>, SO <chr>,
#   BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>, TB <chr>,
#   GDP <chr>, HBP <chr>, SH <chr>, SF <chr>, IBB <chr>, LOB <chr>

Example 2.b

So far, we have learned how to harvest individual pieces of information from a page (like a block of paragraphs or a single table). While this is powerful, you could technically achieve the same result by manually copying and pasting the information into an Excel file.

The real magic of web scraping happens when we automate! What if you want to analyze baseball data across multiple seasons? Instead of manually navigating to and copy-pasting seven different web tables, we can write a for loop to build a “crawler” that auto-navigates across years and automatically stacks the datasets together.

🎯 Our Mission: Build a loop that scrapes Team Standard Batting data from 2020 through 2026, appends the correct calendar year, and merges everything into one giant dataset.

# Make an empty data frame to act as our master repository
all_batting <- data.frame()

# Loop through each year in our target range
for (year in 2020:2026) {
  
  # Dynamically construct the URL for the current year
  sr_url <- paste0("https://www.baseball-reference.com/leagues/majors/", year, ".shtml")
  
  # Read the page HTML
  page <- read_html(sr_url)
  
  # Scrape the table and append a column tracking the calendar year
  batting <- page %>% 
    html_elements("#all_teams_standard_batting") %>% 
    html_table() %>% pluck(1) %>% 
    mutate(year = year)
  
  # Stack the current year's data onto our master data frame
  all_batting <- rbind(all_batting, batting)
  
  # Be polite: pause a few seconds before requesting the next page
  Sys.sleep(3)
}

# Preview the combined historical dataset
head(all_batting)

# A tibble: 6 × 30
  Tm         `#Bat` BatAge `R/G` G     PA    AB    R     H     `2B`  `3B`  HR   
  <chr>      <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Arizona D… 45     29.1   4.48  60    2238  1997  269   482   101   12    58   
2 Atlanta B… 48     28.2   5.80  60    2344  2074  348   556   130   3     103  
3 Baltimore… 45     26.3   4.57  60    2242  2026  274   523   102   7     77   
4 Boston Re… 47     27.0   4.87  60    2304  2083  292   552   118   7     81   
5 Chicago C… 47     27.9   4.42  60    2214  1918  265   422   82    8     74   
6 Chicago W… 48     27.6   5.10  60    2267  2047  306   534   94    6     96   
# ℹ 18 more variables: RBI <chr>, SB <chr>, CS <chr>, BB <chr>, SO <chr>,
#   BA <chr>, OBP <chr>, SLG <chr>, OPS <chr>, `OPS+` <chr>, TB <chr>,
#   GDP <chr>, HBP <chr>, SH <chr>, SF <chr>, IBB <chr>, LOB <chr>, year <int>

Example 3. Extracting Repeated Items (Plural vs. Singular)

🎯 Our mission: Capture a list of the most popular movies currently playing in theaters.

For this example, we will collect the titles of trending films from the Rotten Tomatoes website using a CSS class selector. This exercise teaches a vital lesson about how rvest handles data structures depending on whether you use a plural or singular function.

Part A. The Wide Net (`html_elements`)

Here, our goal is to grab a complete list of all the popular movies on the page. For this, we use the plural html_elements() function. This acts like a massive dragnet, pulling every single matching card container labeled with the class .js-tile-link.

# Establish the target URL
rt_url <- "https://www.rottentomatoes.com/browse/movies_in_theaters/sort:popular"

# Read the HTML structure
rt_html <- read_html(rt_url)

# Extract text using html_elements function
popular_movies <- rt_html %>% 
  html_elements(".js-tile-link") %>% 
  html_text2()

# Convert to a clean table format
rt_df <- as_tibble(popular_movies)

# View the first 5 rows
head(rt_df, 5)

# A tibble: 5 × 1
  value                                                           
  <chr>                                                           
1 62% 88% Star Wars: The Mandalorian and Grogu Opened May 22, 2026
2 47% 53% Passenger Opened May 22, 2026                           
3 92% 72% I Love Boosters Opened May 22, 2026                     
4 85% Backrooms Opens May 29, 2026                                
5 94% 92% Tuner Opened May 22, 2026

📊 Notice that rt_df is a long, multi-row data frame. R found dozens of elements sharing that class, and html_text2() returned their text as a single character vector with one entry per movie.

Part B. The Singular Target (`html_element`)

Now, let’s observe what happens if we accidentally make a tiny typo and leave off the “s”, changing the function to the singular html_element().

# Extract text using html_element function
popular_movies <- rt_html %>% 
  html_element(".js-tile-link") %>% 
  html_text2()

# Convert to a clean table format for viewing
rt_df <- as_tibble(popular_movies)

head(rt_df, 5)

# A tibble: 1 × 1
  value                                                           
  <chr>                                                           
1 62% 88% Star Wars: The Mandalorian and Grogu Opened May 22, 2026

Here, instead of a long list of movies, your data frame has exactly one row containing a single movie title (the very first movie on the page). When applied directly to a webpage’s main HTML body, the singular html_element() function stops hunting the moment it hits its very first match.
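Because the live page changes from day to day, the same contrast can be reproduced offline on a small invented page. The class name tile and the movie titles below are placeholders, not the real Rotten Tomatoes markup.

```r
library(rvest)

# Three invented "movie tiles" sharing one class
page <- minimal_html('
  <a class="tile">Movie A</a>
  <a class="tile">Movie B</a>
  <a class="tile">Movie C</a>
')

# Plural: one result per match
all_titles <- page %>% html_elements(".tile") %>% html_text2()

# Singular: only the first match
first_title <- page %>% html_element(".tile") %>% html_text2()

length(all_titles)
first_title
```

The plural call returns all three titles, while the singular call returns only the first.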

Example 4. Scaling Up with Custom Functions

🎯 Our mission: Build a reusable R function that loops through a list of Google Scholar profile URLs, extracts faculty names and their core citation metrics (Total Citations, h-index, and i10-index), and aggregates them into a master database.

Writing a standalone for loop works great for a one-off task. However, if you plan to scrape multiple departments or monitor faculty metrics over time, wrapping your scraping pipeline inside a custom function makes your code clean, modular, and easy to reuse.

When we look at a Google Scholar profile, the citation metrics are tucked inside a neat visual table. Under the hood, however, html_elements("td") strips away the table structure and flattens all cells into a single linear text vector.

To pull out the numbers we want, we use positional indexing based on how Google Scholar organizes its rows:

stats[1] = "Citations" (Row Header) → stats[2] = Total Citations
stats[4] = "h-index" (Row Header) → stats[5] = Total h-index
stats[7] = "i10-index" (Row Header) → stats[8] = Total i10-index
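The flattening can be seen on a toy table laid out the way the indices above imply: a label followed by two numeric cells per row. That three-column layout is an assumption for this sketch, and all the numbers are invented.

```r
library(rvest)

# Invented stand-in for the metrics table (label, total, second column)
page <- minimal_html('
  <table>
    <tr><td>Citations</td><td>1200</td><td>800</td></tr>
    <tr><td>h-index</td><td>15</td><td>12</td></tr>
    <tr><td>i10-index</td><td>20</td><td>16</td></tr>
  </table>
')

# Selecting every cell flattens the 3 x 3 grid into one vector of 9 strings
stats <- page %>% html_elements("td") %>% html_text2()
stats

# Positions 2, 5, and 8 hold the totals
totals <- as.numeric(stats[c(2, 5, 8)])
totals
```

Positional indexing like this is fragile: if the site adds or removes a column, the positions shift, so it is worth checking that stats[1], stats[4], and stats[7] still hold the expected labels.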

Here is how we assemble this logic into a robust, automated function:

# Define a custom web scraping function
# urls:  character vector of profile URLs
# pause: seconds to wait between requests
gs_stats <- function(urls, pause = 1){
  # Make an empty master data frame
  df_out <- data.frame()
  
  # Loop through each profile URL
  for(link in urls){
    
    # Download the HTML
    page <- read_html(link)
    
    # Extract the scholar's name
    person <- page %>% 
      html_elements("#gsc_prf_in") %>% 
      html_text2()
    
    # Get the table cells for total citation stats
    stats <- page %>% 
      html_elements("td") %>% 
      html_text2()
    
    # Build a one-row data frame using the fixed positions in 'stats'
    df_tmp <- data.frame(name = person, 
                         citation = as.numeric(stats[2]),
                         h_index = as.numeric(stats[5]),
                         i10_index = as.numeric(stats[8]))
    
    # Append the row onto our master repository
    df_out <- rbind(df_out, df_tmp) 
    
    # Be polite: wait before requesting the next profile
    Sys.sleep(pause)
  }
  
  # Return the finalized, complete dataset
  return(df_out)
}

Now that the function is ready, let’s try it on a list of UO Psychology faculty members:

# Get list of some faculty Google Scholar URLs
uo_urls <- c("https://scholar.google.com/citations?user=vHEQPGUAAAAJ&hl=en",
          "https://scholar.google.com/citations?user=4X4X4xkAAAAJ&hl=en",
          "https://scholar.google.com/citations?user=jCxd8-UAAAAJ&hl=en",
          "https://scholar.google.com/citations?hl=en&user=WOAdX44AAAAJ",
          "https://scholar.google.com/citations?hl=en&user=hZ-YQ3AAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=vkhJVbkAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=RG9lc0QAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=Y7S8ybkAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=RewU7lQAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=1bzbVeYAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=PsFofkQAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=181SoTAAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=Wy5XKmEAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=SRSfyf4AAAAJ",
          "https://scholar.google.com/citations?hl=en&user=Jap8Z-cAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=hC6IzXMAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=krj_eKwAAAAJ",
          "https://scholar.google.com/citations?hl=en&user=oqLsBt0AAAAJ"
)

# Run the function on every profile URL
citations <- gs_stats(uo_urls)

head(citations)

               name citation h_index i10_index
1  Robert S. Chavez     2667      21        23
2 Michael I. Posner   100881     157       446
3  Elliot T Berkman    11143      50        90
4     Sara J Weston     2328      26        39
5  Kate Celis Mills    12648      42        65
6     Chanel Meyers      676      10        10

Example 5. Bringing It All Together

Now, let’s practice everything we have learned on a real-world sandbox. The web community maintains a live, public website called Books to Scrape specifically designed for developers to practice web scraping safely, legally, and ethically.

🎯 Our mission: Build a master data frame containing every single book’s Title, Price, Stock Status, Numeric Star Rating, and Detail Link, and then crawl inside to extract its Product Description.

To make this capstone project manageable, we will break it down into 3 progressive steps:

Step 1: Capture the available information on the first page: Title, Price, Stock Status, Star Rating, and Link.
Step 2: Wrap our blueprint inside a loop to glide through all 50 directory pages.
Step 3: Write a custom function to step inside those book links and extract the hidden product descriptions.

Step 1. Build the Single-Page Blueprint

Before writing any code, let’s look at the layout of the Books to Scrape homepage. The website displays a clean, repeating grid of 20 books. Visually, each book sits inside its own self-contained “card” or box that holds its specific title, cover image, rating, price, and stock status.

Your first instinct might be to cast three independent nets across the whole page: one net to grab all 20 titles, one for all 20 prices, and one for all 20 stock statuses.

This works fine on a perfect webpage, but it introduces a massive real-world risk. If just one book on the page is missing a price, R will return 20 titles but only 19 prices. When you try to force these uneven lists into a data frame, R will either crash with a dimensional error, or worse, it will silently misalign your rows, attaching the wrong price to the wrong book for the entire rest of your dataset!

To prevent this, we mimic the visual layout of the page using a Parent-Child strategy:

The Parent Container: We use the plural html_elements() to cut the webpage into 20 individual product “cards” (the parents).
The Child Elements: We then loop inside each individual card one-by-one using the singular html_element() to extract the title, price, and stock status (the children).

Now, if a child element (like a price) is missing inside a card, R will automatically insert an NA, preserving the structural alignment of our data rows.
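The difference between the two approaches shows up clearly on a small invented page where one card has no price. The markup, class names, and values below are made up for this sketch.

```r
library(rvest)
library(tibble)

# Three invented cards; the second one is missing its price
page <- minimal_html('
  <div class="card"><h3>Book A</h3><p class="price">£10.00</p></div>
  <div class="card"><h3>Book B</h3></div>
  <div class="card"><h3>Book C</h3><p class="price">£12.50</p></div>
')

# Independent nets: the two vectors end up with different lengths
titles_all <- page %>% html_elements("h3") %>% html_text2()
prices_all <- page %>% html_elements(".price") %>% html_text2()
c(length(titles_all), length(prices_all))

# Parent-child: one result per card, with NA where the child is missing
cards <- page %>% html_elements(".card")
catalog <- tibble(
  title = cards %>% html_element("h3") %>% html_text2(),
  price = cards %>% html_element(".price") %>% html_text2()
)
catalog
```

The independent nets return three titles but only two prices, while the parent-child version keeps three rows and records the missing price as NA.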

# Ingest the homepage HTML
books_url  <- "http://books.toscrape.com/"
books_html <- read_html(books_url)

# Isolate the parent card containers for all 20 books on the page
book_cards <- books_html %>% 
  html_elements(".product_pod")

# Extract base attributes row-by-row
book_catalog <- tibble(
  # Title (Attribute lookup prevents text truncation)
  title = book_cards %>% 
    html_element("h3 a") %>% 
    html_attr("title"), 
  
  # Price
  price = book_cards %>% 
    html_element(".price_color") %>% 
    html_text2(),
  
  # Availability
  stock_status = book_cards %>% 
    html_element(".instock.availability") %>% 
    html_text2(),
  
  # Star Rating
  rating = book_cards %>% 
    html_element(".star-rating") %>% 
    html_attr("class") %>% 
    # remove the repetitive text prefix
    str_remove("star-rating "),
  
  # The relative URL destination
  link = book_cards %>% 
    html_element("h3 a") %>% 
    html_attr("href")
)

# Preview our table
head(book_catalog, 5)

# A tibble: 5 × 5
  title                                 price  stock_status rating link         
  <chr>                                 <chr>  <chr>        <chr>  <chr>        
1 A Light in the Attic                  £51.77 In stock     Three  catalogue/a-…
2 Tipping the Velvet                    £53.74 In stock     One    catalogue/ti…
3 Soumission                            £50.10 In stock     One    catalogue/so…
4 Sharp Objects                         £47.82 In stock     Four   catalogue/sh…
5 Sapiens: A Brief History of Humankind £54.23 In stock     Five   catalogue/sa…

🔍 Deep Dive: Text vs. Attributes

To build this blueprint, we had to rely on two completely different harvesting tools: html_text2() and html_attr().

html_text2(): Extracts the literal text humans can read on the screen.
- Example: If the HTML is <p>Price: £10</p>, using html_text2() gives you "Price: £10".
html_attr(): Scans the hidden metadata written directly inside the HTML tag itself. Humans don’t see this text on the browser page, but the computer needs it to function.

If we right-click a book card and select Inspect using our browser’s Developer Tools, we can see exactly how this background HTML structure is built:

Bypassing Text Truncation (title). Look closely at the text between the anchor tags >A Light in the ...</a>. Because the title is long, the webpage visually cuts it off with an ellipsis. If we used html_text2(), we would scrape that broken, incomplete name. However, notice the hidden attribute box right before it: title="A Light in the Attic". By calling html_attr("title"), we bypass the screen limitations and extract the full, uncut title text.
Gathering Hyperlinks for Future Navigation (href). Look at the anchor tag’s web path attribute: href="catalogue/a-light-in-the-attic_1000/index.html". The destination of a link is not part of the visible text; it is stored in the tag’s href attribute. We use html_attr("href") to harvest this path, which will eventually allow our scraper to navigate directly into this book’s unique sub-page.
Reading Star Ratings from Graphic Icons (class). Finally, on the screen, star ratings are just visual graphic icons. There are no words like “3 stars” printed on the page for a standard text-reader to scan. Instead, notice how the rating is embedded directly inside the paragraph tag’s class attribute: <p class="star-rating Three">. Because there is no actual plain text written between the opening <p> and closing </p> tags, running html_text2() here would return an empty, blank string. By targeting html_attr("class"), we pull the raw text “star-rating Three”, allowing us to cleanly strip away the repetitive prefix using str_remove().

💡 Summary Rule: Whenever the data you want isn’t written out in plain English on the browser window, look inside the HTML tag attributes. Web links (href), image sources (src), and styling categories (class) are all structural barcodes waiting to be scanned by html_attr().
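The sketch below replays these three cases offline. The card is a simplified, hand-written approximation of one book card, built only from the pieces quoted above (the truncated link text, the title and href attributes, and the star-rating class); the real page contains more markup.

```r
library(rvest)

# Simplified approximation of one book card
card <- minimal_html('
  <article class="product_pod">
    <p class="star-rating Three"><i class="icon-star"></i></p>
    <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
           title="A Light in the Attic">A Light in the ...</a></h3>
  </article>
')

link <- card %>% html_element("h3 a")

shown_text <- html_text2(link)          # what the screen shows (truncated)
full_title <- html_attr(link, "title")  # the full title stored in the attribute
detail_url <- html_attr(link, "href")   # the link destination

rating_class <- card %>%
  html_element(".star-rating") %>%
  html_attr("class")

shown_text
full_title
detail_url
rating_class
```

Only the attribute lookups recover the full title, the link, and the rating; the visible text alone gives the truncated title.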

Step 2. Get the information for other pages (For Loop)

Now that our single-page code works perfectly, we can bring this logic into an automation engine. We will write a loop that constructs the URLs for all 50 directory pages, harvests their item grids using our blueprint, and stacks them together.

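Note: For a job of this size (50 pages), starting from an empty tibble and appending each page with bind_rows() is simple and works well. For much larger jobs, it is usually faster to store each page’s table in a list and call bind_rows() once at the end, because appending inside the loop copies the growing data frame on every iteration.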
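An alternative pattern worth knowing for bigger crawls is to collect each page's table in a list and bind everything once at the end. The sketch below uses a hypothetical stand-in function, scrape_page(), that returns invented rows so the pattern can run without any network access.

```r
library(dplyr)

# Hypothetical stand-in for scraping one page: returns a small tibble
scrape_page <- function(page_num) {
  tibble(page = page_num, item = paste0("item-", page_num, "-", 1:2))
}

# Pre-allocate a list with one slot per page
page_tables <- vector("list", 3)

for (page_num in 1:3) {
  page_tables[[page_num]] <- scrape_page(page_num)
}

# Stack all pages in a single step
all_items <- bind_rows(page_tables)
dim(all_items)
```

Both patterns produce the same final table; the list version simply avoids re-copying the accumulated rows on each pass.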

# Make an empty master data frame warehouse
all_books_catalog <- tibble()

# Begin the automation engine loop across all 50 directory pages
for (page_num in 1:50) {
  
  # Construct the target URL path
  page_url <- paste0("http://books.toscrape.com/catalogue/page-", page_num, ".html")
  
  # Read the page HTML code source structure
  homepage_html <- read_html(page_url)
  
  # The code we wrote in the previous step
  book_cards <- books_html %>% 
    html_elements(".product_pod")
  
  page_table <- tibble(
  
    title = book_cards %>% 
      html_element("h3 a") %>% 
      html_attr("title"), 
  
    price = book_cards %>% 
      html_element(".price_color") %>% 
      html_text2(),
    
    stock_status = book_cards %>% 
      html_element(".instock.availability") %>% 
      html_text2(),
    
    rating = book_cards %>% 
      html_element(".star-rating") %>% 
      html_attr("class") %>% 
      str_remove("star-rating "),
    
    link = book_cards %>% 
      html_element("h3 a") %>% 
      html_attr("href")
    ) %>% 
    # Clean the data
      mutate(
        price_numeric = as.numeric(str_remove(price, "£")),
        rating_numeric = case_match(
          rating,  # the "star-rating " prefix was already removed above
          "One"   ~ 1, "Two" ~ 2, "Three" ~ 3, "Four" ~ 4, "Five" ~ 5,
          .default = NA
        )
    ) %>% 
    select(-c(price, rating))
  
  # Append this page's rows to the master data frame
  all_books_catalog <- bind_rows(all_books_catalog, page_table)
  
  # Optional: uncomment to print progress to the R console
  # message(paste("Successfully processed catalog page:", page_num, "/ 50"))
  
  # Server courtesy: uncomment to pause briefly between requests
  # Sys.sleep(0.1)
}

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `rating_numeric = case_match(...)`.
Caused by warning:
! `case_match()` was deprecated in dplyr 1.2.0.
ℹ Please use `recode_values()` instead.
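The warning above is harmless, but if you would rather avoid the deprecated function, one option that does not depend on any particular dplyr version is a named lookup vector. This is only a sketch: rating_lookup and raw_classes are hypothetical names, and the sample strings are made up to mimic what html_attr("class") returns.

# Named lookup vector: names are the scraped words, values are the numbers
rating_lookup <- c(One = 1, Two = 2, Three = 3, Four = 4, Five = 5)

# Made-up sample input, including one value that is not in the lookup
raw_classes <- c("star-rating Three", "star-rating One",
                 "star-rating Five", "star-rating Zero")

# Strip the prefix, then index the lookup vector by name
rating_words   <- sub("^star-rating ", "", raw_classes)
rating_numeric <- unname(rating_lookup[rating_words])

Any word that is missing from the lookup vector comes back as NA, which plays the same role as .default = NA in the loop.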

# Verify that the catalog has exactly 1,000 rows
print(dim(all_books_catalog))

[1] 1000    5

# Output
head(all_books_catalog)

# A tibble: 6 × 5
  title                          stock_status link  price_numeric rating_numeric
  <chr>                          <chr>        <chr>         <dbl>          <dbl>
1 A Light in the Attic           In stock     cata…          51.8              3
2 Tipping the Velvet             In stock     cata…          53.7              1
3 Soumission                     In stock     cata…          50.1              1
4 Sharp Objects                  In stock     cata…          47.8              4
5 Sapiens: A Brief History of H… In stock     cata…          54.2              5
6 The Requiem Red                In stock     cata…          22.6              1
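The loop above grows the master data frame one page at a time. An alternative pattern is to collect every page's table in a list and bind them in a single step. The sketch below shows only the pattern: scrape_one_page() is a hypothetical stand-in that fabricates two placeholder rows per page instead of reading a real URL.

library(dplyr)
library(purrr)

# Hypothetical stand-in for the per-page scraping code: returns one small tibble
scrape_one_page <- function(page_num) {
  tibble(
    page  = page_num,
    title = paste("Placeholder book", 1:2, "on page", page_num)
  )
}

# Collect one tibble per page in a list, then bind them all at once
page_tables <- map(1:3, scrape_one_page)
all_pages   <- bind_rows(page_tables)

In a real scrape, the body of scrape_one_page() would hold the read_html() call and the tibble() code from the loop.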

Step 3: Get the Product Description (Deep Crawling)

If you click on an individual book, you will find a rich paragraph containing its Product Description. Because this text block lives exclusively inside each item’s unique detail sub-page, our scraper must act like a “deep crawler”: it needs to grab the hyperlinks we collected in Step 1, travel inside each sub-page link, extract the description text, and bring it back to our master data frame.

When navigating between pages, you will run into a subtle pathing bug. On the homepage, each book link's href includes a folder prefix: catalogue/relative_url. However, on the numbered catalog pages (such as page-2.html), which already live inside the catalogue/ folder, the href drops that prefix and is simply relative_url. In both cases the book's real address is http://books.toscrape.com/catalogue/relative_url.

If we blindly paste a base domain onto these inconsistent paths, half of our URLs will break! To build a bulletproof crawling engine, we will write a custom function that uses str_replace() to strip away any redundant "catalogue/" strings and cleanly build an absolute URL path every single time.
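Before wrapping it in the full function, the path-cleaning step can be tried on its own. The sketch below uses two hand-typed versions of the same link, and build_book_url() is a hypothetical helper name used only for this illustration.

library(stringr)

# The same book link in the two forms described above
link_from_homepage  <- "catalogue/a-light-in-the-attic_1000/index.html"
link_from_catalogue <- "a-light-in-the-attic_1000/index.html"

# Hypothetical helper: drop a leading "catalogue/" if present, then add it back once
build_book_url <- function(relative_url) {
  clean_path <- str_replace(relative_url, "^catalogue/", "")
  paste0("http://books.toscrape.com/catalogue/", clean_path)
}

url_a <- build_book_url(link_from_homepage)
url_b <- build_book_url(link_from_catalogue)

Both calls produce the same absolute URL. The ^ anchor matters: it removes the prefix only at the start of the string, so a book whose folder name happened to contain "catalogue/" further along would be left alone. Another option is xml2::url_absolute(), which resolves a relative link against the address of the page it was found on.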

# Define our deep-crawling subpage extraction function
scrape_book_description <- function(relative_url) {
  
  # Clean up the relative URL path to ensure we don't double-paste "catalogue/"
  clean_path <- str_replace(relative_url, "^catalogue/", "")
  full_url   <- paste0("http://books.toscrape.com/catalogue/", clean_path)
  
  # Ingest page source code safely; return NA if a network connection error occurs
  detail_html <- tryCatch(read_html(full_url), error = function(e) return(NULL))
  if (is.null(detail_html)) return(NA)
  
  # Target the unique descriptive paragraph block inside the page article element
  description_text <- detail_html %>% 
    html_element("article.product_page > p") %>% 
    html_text2()
  
  return(description_text)
}

# --- Test the Deep Crawl Loop on the first 5 books ---
descriptions_vector <- c()

# Pull links directly from our page-one blueprint data frame
target_links <- book_catalog$link[1:5]

for (link in target_links) {
  # Run our custom function recipe
  current_desc <- scrape_book_description(link)
  descriptions_vector <- c(descriptions_vector, current_desc)
  
  # Polite pacing: uncomment to pause for half a second between requests
  # Sys.sleep(0.5)
}

# Bind our freshly scraped descriptions onto a subset of our data frame
deep_catalog_sample <- book_catalog %>% 
  slice(1:5) %>% 
  mutate(product_description = descriptions_vector)

# View our data
head(deep_catalog_sample)

# A tibble: 5 × 6
  title                      price stock_status rating link  product_description
  <chr>                      <chr> <chr>        <chr>  <chr> <chr>              
1 A Light in the Attic       £51.… In stock     Three  cata… "It's hard to imag…
2 Tipping the Velvet         £53.… In stock     One    cata… "\"Erotic and abso…
3 Soumission                 £50.… In stock     One    cata… "Dans une France a…
4 Sharp Objects              £47.… In stock     Four   cata… "WICKED above her …
5 Sapiens: A Brief History … £54.… In stock     Five   cata… "From a renowned h…

Final Note: Static vs. Dynamic Websites

Every example we built today was on a static website. This means the data is baked directly into the raw HTML code, allowing rvest::read_html() to catch it easily.

As you venture out to scrape other websites, you will eventually encounter dynamic websites (like modern e-commerce sites or social media feeds). These sites use JavaScript to load their data on the fly, so the HTML that read_html() fetches may not contain the data you see in your browser. To scrape those sites, you have to learn a separate set of advanced tools, such as browser automation (chromote or RSelenium), which allow R to drive a live browser and let the page finish loading before harvesting.
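The difference is easy to demonstrate with a hand-written page. In the sketch below, the script tag would fill the #feed box in a real browser, but rvest only parses the HTML and never runs JavaScript, so the box stays empty. The markup is invented for this illustration.

library(rvest)

# Invented page: the content is supposed to be inserted by JavaScript
page <- minimal_html('
  <div id="feed"></div>
  <script>
    document.getElementById("feed").textContent = "Loaded by JavaScript";
  </script>
')

# rvest sees the page as delivered, before any script has run
feed_text <- page %>% html_element("#feed") %>% html_text2()

If a selector that works in your browser's inspector returns nothing in R, this is a likely cause. Recent versions of rvest also offer read_html_live(), which relies on chromote and an installed Chrome browser to load the page before you query it.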

For now, celebrate your win! You have learned the core foundation of HTML web scraping, which handles a large portion of the open web.

Mini Hacks

Mini Hack #1:

Go to the Books to Scrape homepage. Look at the left-hand side of the screen. There is a list of book Categories (Travel, Mystery, Historical Fiction, etc.). Write a short script to scrape these category names into a clean list or data frame.

base_url <- "http://books.toscrape.com/"

Mini Hack #2:

Click on the first book on the website (“A Light in the Attic”) to go to its detail page. Scroll to the bottom of the page. You will see a structured Product Information table containing the UPC, Product Type, Price, and Tax. Extract this entire table into an R data frame.

detail_url <- "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

Mini Hack #3:

Find a public website and extract some interesting data. It can be a table of numbers, a list of items, or paragraphs of text.

Clean your extracted data and load it into a neat R data frame. Print the head() of your data frame.

Run some simple descriptive statistics (like averages or frequency counts). You can also use ggplot2 or wordcloud to create a visualization of the trends you discovered.

your_url <- ""