How to Automate Web Scraping with R and Google Searches Using Selenium and Docker

Introduction to Web Scraping with R and Google Searches

Web scraping, the process of extracting data from websites, is a valuable skill in today’s digital age. With the rise of big data and machine learning, knowing how to collect data from a variety of sources has become important in many industries. In this blog post, we will explore how to scrape Google search results with R, focusing on overcoming common challenges such as cookie consent pages and unstable HTML tags.

Prerequisites

To follow along with this tutorial, you will need:

  • R (version 3.6 or later)
  • The RSelenium package for interacting with Selenium
  • The rvest package for HTML parsing
  • The stringr package for text manipulation
  • The xml2 package for XML parsing
  • Docker installed on your system

Installing Required Packages

Before we begin, make sure to install the required packages using R:

# Install required packages
install.packages("RSelenium")
install.packages("rvest")
install.packages("stringr")
install.packages("xml2")

library(RSelenium)
library(rvest)
library(stringr)
library(xml2)

Creating a Docker Container for Selenium

To interact with the web using R, we need to use a headless browser. We can achieve this by running a Selenium container in Docker:

# Pull the pre-built standalone Firefox image from Docker Hub
docker pull selenium/standalone-firefox

# Run the container, exposing the Selenium server on host port 4445
docker run -d -p 4445:4444 selenium/standalone-firefox

This starts a detached container running standalone Firefox, with the Selenium server reachable on port 4445 of the host. We will use this container to drive the browser and interact with Google.
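
With the container running, we can connect to it from R using RSelenium. Below is a minimal sketch, assuming the Selenium server is reachable on localhost at the host port 4445 mapped above; the search query in first_Page is a placeholder, so replace it with your own. Depending on your region, Google may first show a cookie consent page that has to be dismissed (for example by locating the consent button with remDr$findElement() and clicking it) before the results become available.

# Connect to the Selenium server running in the Docker container
# (host port 4445 maps to port 4444 inside the container)
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
remDr$open()

# Navigate to the first results page for a placeholder query
first_Page <- "https://www.google.com/search?q=web+scraping+with+r"
remDr$navigate(first_Page)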

Extracting Data from Google

To extract data from Google, we can create a function that takes the page content as input and returns the desired data. Here’s an example of such a function:

extract_Info_By_Page <- function(page_Content) {
  # Parse the raw HTML source using rvest
  page <- read_html(page_Content)
  
  # Extract the H3 tags nested inside anchor links (the result titles)
  h3 <- page %>% html_nodes("a h3")
  
  # Extract the href attribute from the anchors that wrap the H3 tags
  h3_links <- page %>% html_nodes(xpath = "//a[h3]") %>% html_attr("href")
  
  # Extract the span and div elements (snippets and surrounding text)
  basic <- page %>% html_nodes("span span")
  paragraphs <- page %>% html_nodes("div div")
  
  # Return a list containing the extracted data
  return(list(page = page, h3 = h3, h3_links = h3_links, basic = basic, paragraphs = paragraphs))
}

This function uses rvest to parse the HTML content and extract the desired pieces: the result titles, their links, and the surrounding text snippets. We will apply it to the first results page below, and reuse it in the next section when we loop over all result pages.
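
As a quick sanity check, here is a small sketch that applies the function to the first results page opened earlier. The remDr and first_Page objects come from the connection step above, and page_Content, the raw HTML source of the first page, is reused in the next section to find the links to the remaining result pages.

# Grab the HTML source of the current (first) results page
page_Content <- remDr$getPageSource()[[1]]

# Extract titles, links and snippets from the first page
first_Page_Info <- extract_Info_By_Page(page_Content)

# Peek at the result titles as plain text
first_Page_Info$h3 %>% html_text()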

Extracting Data from Multiple Pages

To extract data from multiple pages, we first pull the links to the other result pages out of the first page’s source, then loop over all the pages and call the extract_Info_By_Page function on each one. Here’s an example:

links_Other_Pages <- str_extract_all(page_Content, pattern = "\\<a.aria-label=(.*)\\</a\\>")
links_Other_Pages <- str_split(links_Other_Pages, pattern = "\\<a.aria-label")
links_Other_Pages <- str_extract_all(links_Other_Pages, 'href[^<]*\\<span class\\=\\\\')[[1]]
links_Other_Pages <- str_remove_all(links_Other_Pages, '\\<span class')
links_Other_Pages <- str_remove_all(links_Other_Pages, 'id|pnnext|style|text|align|left')
links_Other_Pages <- unlist(str_extract_all(links_Other_Pages, 'search.*[:alnum:]'))

links_Pages <- c(first_Page, links_Other_Pages)
nb_Pages <- length(links_Pages)
result_By_Page <- list()

for (i in 1:nb_Pages) {
  print(i)
  remDr$navigate(links_Pages[i])
  page_Content <- remDr$getPageSource()[[1]]
  result_By_Page[[i]] <- extract_Info_By_Page(page_Content)
}

This code uses stringr functions such as str_extract_all to pull the pagination links out of the first page’s source. The loop then navigates to each link with Selenium, calls the extract_Info_By_Page function on the page source, and stores the results in a list.
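
Once the loop has finished, result_By_Page holds one list of extracted nodes per result page. As an illustration (a sketch that assumes the selectors above matched exactly one title and one link per result on your pages), the titles and URLs can be flattened into a single data frame:

# Combine titles and links from all pages into one data frame
all_Results <- do.call(rbind, lapply(result_By_Page, function(res) {
  data.frame(
    title = res$h3 %>% html_text(),
    link  = res$h3_links,
    stringsAsFactors = FALSE
  )
}))

head(all_Results)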

Putting it all Together

To summarize, we have:

  1. Created a Docker container for Selenium
  2. Built a function to extract data from HTML content
  3. Extracted pagination links from the page content using stringr
  4. Navigated to each link using Selenium
  5. Called the extract_Info_By_Page function for each page and stored the results in a list

We can now use this code as a starting point for our own webscraping projects.

Conclusion

In conclusion, we have learned how to scrape Google search results with R, overcoming common challenges such as cookie consent pages and unstable tags. We created a Docker container for Selenium, built a function to extract data from HTML content, pulled the pagination links out of the page source with stringr, navigated to each page with Selenium, and collected the results of extract_Info_By_Page for every page in a list.

This is just a basic example of scraping Google search results with R. There are many other techniques and strategies that can be used, depending on the specific use case.


Last modified on 2023-09-01