Introduction to Webscraping with R and Google Searches
Webscraping, the process of extracting data from websites, is a valuable skill in today’s digital age. With the rise of big data and machine learning, knowing how to collect data from a variety of sources has become crucial in many industries. In this blog post, we will explore how to scrape Google search results with R, focusing on overcoming common challenges such as cookie-consent dialogs and unstable HTML tags.
Prerequisites
To follow along with this tutorial, you will need:
- R (version 3.6 or later)
- The RSelenium package for interacting with Selenium
- The rvest package for HTML parsing
- The stringr package for text manipulation
- The xml2 package for XML parsing
- Docker installed on your system
Installing Required Packages
Before we begin, make sure to install the required packages using R:
# Install required packages
install.packages("RSelenium")
install.packages("rvest")
install.packages("stringr")
install.packages("xml2")
# Load the packages into the session
library(RSelenium)
library(rvest)
library(stringr)
library(xml2)
Creating a Docker Container for Selenium
To interact with the web from R, we need a browser that Selenium can drive. The simplest way to get one is to run Selenium's standalone Firefox image in Docker:
# Pull the standalone Firefox image, which bundles a Selenium server
docker pull selenium/standalone-firefox
# Run the container, mapping port 4445 on the host to Selenium's port 4444
docker run -d -p 4445:4444 selenium/standalone-firefox
This starts a container running a Selenium server with a standalone Firefox browser, reachable on port 4445 of the host. We will use this container to interact with Google.
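Next, we connect R to the container with RSelenium and deal with the cookie-consent dialog mentioned in the introduction. The sketch below assumes the container is reachable on localhost:4445 as configured above; the consent button selector ("#L2AGLb") is an assumption that depends on locale and may change over time, so the click is wrapped in tryCatch and simply skipped if the element is not found.
# Connect to the Selenium server running in the container
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
remDr$open()
# Navigate to Google and try to dismiss the cookie-consent dialog.
# The "#L2AGLb" id is an assumption and may differ by locale or change over time.
remDr$navigate("https://www.google.com")
tryCatch({
  consent_Button <- remDr$findElement(using = "css selector", value = "#L2AGLb")
  consent_Button$clickElement()
}, error = function(e) NULL)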
Extracting Data from Google
To extract data from Google, we can create a function that takes the page content as input and returns the desired data. Here’s an example of such a function:
extract_Info_By_Page <- function(page_Content) {
  # Parse the HTML content using rvest
  page <- read_html(page_Content)
  # Extract the H3 tags nested inside anchor links (the result titles)
  h3 <- page %>% html_nodes("a h3")
  # The href attribute lives on the parent <a>, not the <h3> itself,
  # so select the anchors that contain an <h3> to get the result URLs
  h3_links <- page %>% html_nodes(xpath = "//a[h3]") %>% html_attr("href")
  # Extract the span and div elements (descriptions and snippets)
  basic <- page %>% html_nodes("span span")
  paragraphs <- page %>% html_nodes("div div")
  # Return a list containing the extracted data
  return(list(page = page, h3 = h3, h3_links = h3_links, basic = basic, paragraphs = paragraphs))
}
This function uses rvest to parse the HTML content and extracts the desired data. We will use this function in the next section to extract data from Google.
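Before looping over result pages, we need a first search page to work with. The sketch below uses a placeholder query, which you should replace with your own; it defines the first_Page URL and the page_Content object that the pagination code in the next section builds on.
# Build a Google search URL (the query term here is only a placeholder)
search_Term <- "web scraping with R"
first_Page <- paste0("https://www.google.com/search?q=", URLencode(search_Term))
# Load the first results page and grab its HTML source
remDr$navigate(first_Page)
Sys.sleep(2)  # give the page a moment to render
page_Content <- remDr$getPageSource()[[1]]
# Try the extraction function on the first page
first_Page_Info <- extract_Info_By_Page(page_Content)
first_Page_Info$h3 %>% html_text()  # titles of the organic results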
Extracting Data from Multiple Pages
To extract data from multiple result pages, we first pull the pagination links out of the first page's source (the page_Content object created above) and then loop over them, calling the extract_Info_By_Page function for each page. Here’s an example:
# Pull the pagination anchors (identified by their aria-label) out of the raw HTML
links_Other_Pages <- str_extract_all(page_Content, pattern = "\\<a.aria-label=(.*)\\</a\\>")
links_Other_Pages <- str_split(links_Other_Pages, pattern = "\\<a.aria-label")
# Keep only the href part of each anchor and strip the surrounding markup
links_Other_Pages <- str_extract_all(links_Other_Pages, 'href[^<]*\\<span class\\=\\\\')[[1]]
links_Other_Pages <- str_remove_all(links_Other_Pages, '\\<span class')
links_Other_Pages <- str_remove_all(links_Other_Pages, 'id|pnnext|style|text|align|left')
# What remains are the relative "search?..." paths of the other result pages
links_Other_Pages <- unlist(str_extract_all(links_Other_Pages, 'search.*[:alnum:]'))
# Combine the first page with the other result pages
links_Pages <- c(first_Page, links_Other_Pages)
nb_Pages <- length(links_Pages)
result_By_Page <- list()
for(i in 1:nb_Pages) {
  print(i)
  # The extracted pagination links are relative paths, so prepend the
  # Google domain before navigating (the first page is already a full URL)
  page_Url <- links_Pages[i]
  if(!str_detect(page_Url, "^http")) {
    page_Url <- paste0("https://www.google.com/", page_Url)
  }
  remDr$navigate(page_Url)
  page_Content <- remDr$getPageSource()[[1]]
  result_By_Page[[i]] <- extract_Info_By_Page(page_Content)
}
This code uses stringr to pull the pagination links out of the first page's source, navigates to each result page with Selenium, calls the extract_Info_By_Page function for every page, and stores the results in a list.
Putting it all Together
To summarize, we have:
- Created a Docker container for Selenium
- Built a function to extract data from HTML content
- Extracted pagination links from the page content using stringr
- Navigated to each link using Selenium
- Called the extract_Info_By_Page function for each page and stored the results in a list
We can now use this code as a starting point for our own webscraping projects.
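As a quick sanity check before adapting the code, here is one way to inspect what was collected; the field names match the list returned by extract_Info_By_Page.
# Quick look at the titles and links collected from each result page
for(i in seq_along(result_By_Page)) {
  cat("Results page", i, "\n")
  print(result_By_Page[[i]]$h3 %>% html_text())  # result titles
  print(result_By_Page[[i]]$h3_links)            # result URLs
}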
Conclusion
In conclusion, we have learned how to scrape Google search results with R while dealing with common challenges such as cookie-consent dialogs and unstable tags. We created a Docker container for Selenium, built a function to extract data from HTML content, pulled the pagination links out of the page source with stringr, navigated to each result page with Selenium, and stored the extracted data for every page in a list.
This is just a basic example of how to webscrape with R on Google searches. There are many other techniques and strategies that can be used depending on the specific use case.
Last modified on 2023-09-01