Webscraping With Selenium



In part 1 of this blog post series we covered the most common approach to web scraping and its issues. We also built a small example showing how to start web scraping with C#, Selenium and QueryStorm in Excel. Now we’ll expand on the example from part 1 and create a more useful web scraper.

Navigating to and scraping paginated items

It’s time to kick the web scraping up a notch. For instance, let’s scrape the names and prices of the top items on the home page, navigate to the laptops category and scrape all of the laptops as well.

Preparing the table

We should delete the current table rows as they are irrelevant. We can use ResultsTable.Clear() to delete all current table entries instead of deleting them by hand.

In addition, we should also edit the ResultsTable by renaming the Results column to Product Name and by adding a new column named Price.


Getting the price

To get the price along with the title of the top items, our script needs only minor modifications. First, we find all of the items by their CSS selector (div.thumbnail). Then we find the name and the price of the items by finding their respective elements (name CSS selector – h4 > a, price CSS selector – div.caption > h4.pull-right.price) inside of the parent item element.
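
As a rough sketch of this step (not the original script), the lookup could be written as below. The three CSS selectors come from the text above; the class name TopItemsScraper, the method name ReadItems and the tuple shape are illustrative assumptions, and writing the rows into the results table is deliberately left out.

```csharp
using System.Collections.Generic;
using OpenQA.Selenium;

public static class TopItemsScraper
{
    // Hypothetical helper: returns the name and price of every item on the current page.
    public static List<(string Name, string Price)> ReadItems(IWebDriver driver)
    {
        var results = new List<(string Name, string Price)>();

        // Each product card on the page is a div.thumbnail element.
        foreach (IWebElement item in driver.FindElements(By.CssSelector("div.thumbnail")))
        {
            // Search inside the parent item element rather than the whole page,
            // so each name is paired with the price from the same card.
            string name = item.FindElement(By.CssSelector("h4 > a")).Text;
            string price = item.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text;
            results.Add((name, price));
        }

        return results;
    }
}
```

Returning plain tuples keeps the scraping logic separate from whatever code writes the rows into the Product Name and Price columns.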


Preventing CSS selector issues

Just a heads up – without any changes to the default driver initializer, the browser will open as a small window. That means that there’s a chance that the page will have a mobile/tablet layout so your CSS selectors (that are copied from the DevTools of a maximized browser window) will be invalid. To prevent this issue, we start the driver with some options where we specify that the browser should start maximized.
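
A minimal sketch of such an initializer, assuming Chrome is the browser in use (the text does not say which driver the script targets) and that a matching chromedriver is available. DriverFactory and CreateMaximizedDriver are made-up names for illustration.

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public static class DriverFactory
{
    // Assumption: Chrome. Other browsers have their own options classes.
    public static IWebDriver CreateMaximizedDriver()
    {
        var options = new ChromeOptions();

        // Chrome command-line switch that asks the browser to open maximized.
        options.AddArgument("--start-maximized");

        return new ChromeDriver(options);
    }
}
```

If the switch has no effect on a given platform, calling driver.Manage().Window.Maximize() right after creating the driver is another way to get the same result.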


Page navigation

The next step is navigating – first to the Computers page and then to the Laptops page.

We can navigate by clicking on the Computers menu item and waiting for the Computers page to load. Subsequently, we should click on the Laptops menu item and wait for the Laptops page to load.

Note: We could just navigate to the URL https://webscraper.io/test-sites/e-commerce/ajax/computers/laptops instead of clicking on the side menu items, but I feel it’s better to demonstrate how to click and wait for the page to load as it is a pretty common problem in web scraping.

Clicking the button

Clicking is easy – we find the element and call its Click method.


Waiting for a page to load


Waiting itself is not an issue: the WebDriverWait class lets us wait up to a given amount of time until an arbitrary condition is met. Defining that condition, however, can prove to be a problem.


In our case, the condition is that the new page has loaded. To check that, we need to determine when exactly the old page has unloaded and the new page has loaded. The most robust way to achieve this is to wait for an element on the old page to go “stale” (no longer attached to the DOM). We also have to wait for an element on the new page to be displayed.

As a sort of a helping hand, we could install and use the DotNetSeleniumExtras.WaitHelpers NuGet package to check if a new page has loaded. However, the project is no longer maintained and the relevant code isn’t complicated, so we can write the code for the conditions ourselves.

The Until method of WebDriverWait has a parameter of type Func<IWebDriver, TResult>. Therefore, we have to create a NewPageLoaded method that returns a delegate of that type, which we then pass to the Until method. The code can look something like this…
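
A possible skeleton, with the two checks left as placeholders. The parameter names, the PageLoadSkeleton class, the WaitForNewPage helper and the 10-second timeout are assumptions made for this sketch.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class PageLoadSkeleton
{
    // Returns the delegate that WebDriverWait.Until will poll.
    public static Func<IWebDriver, bool> NewPageLoaded(IWebElement oldPageElement, By newPageElementLocator)
    {
        return driver =>
        {
            bool oldPageUnloaded = false;  // ... staleness check on oldPageElement goes here
            bool newPageDisplayed = false; // ... visibility check for newPageElementLocator goes here
            return oldPageUnloaded && newPageDisplayed;
        };
    }

    // Usage: Until keeps calling the delegate until it returns true or the timeout expires.
    // With the placeholders above it would always time out.
    public static void WaitForNewPage(IWebDriver driver, IWebElement oldPageElement, By newPageElementLocator)
    {
        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
        wait.Until(NewPageLoaded(oldPageElement, newPageElementLocator));
    }
}
```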


To complete the NewPageLoaded method, we need to replace the dots with concrete staleness and visibility checks. These checks can also return a delegate so they can be used as regular methods and by the Until method. So, let’s define the methods to check for staleness and visibility.

Element staleness

An element is stale if any of these conditions are met:

  • The element is disabled
  • The element is missing (null)
  • Accessing the element throws a StaleElementReferenceException
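
The three conditions above can be sketched as a single check. The names StalenessCheck and StalenessOf are our own choice for this sketch; the logic simply mirrors the list.

```csharp
using System;
using OpenQA.Selenium;

public static class StalenessCheck
{
    // Returns a delegate so the check can be passed straight to WebDriverWait.Until.
    public static Func<IWebDriver, bool> StalenessOf(IWebElement element)
    {
        return driver =>
        {
            try
            {
                // Reading Enabled queries the browser, so a detached element
                // throws here instead of returning a value.
                return element == null || !element.Enabled;
            }
            catch (StaleElementReferenceException)
            {
                // The element is no longer attached to the DOM.
                return true;
            }
        };
    }
}
```

The driver argument is unused here; it is only present so the delegate matches the shape that Until expects.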

Element visibility

An element is visible if both of these conditions are met:

  • The driver can find the element
  • The element is displayed
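
A sketch of the visibility check along the same lines. VisibilityCheck and ElementIsVisible are names chosen for this example; returning the element itself (or null) instead of a bool is a design choice, not something the text prescribes.

```csharp
using System;
using OpenQA.Selenium;

public static class VisibilityCheck
{
    // Returns the element once it can be found and is displayed, otherwise null.
    // WebDriverWait.Until keeps polling while the result is null.
    public static Func<IWebDriver, IWebElement> ElementIsVisible(By locator)
    {
        return driver =>
        {
            try
            {
                IWebElement element = driver.FindElement(locator);
                return element.Displayed ? element : null;
            }
            catch (NoSuchElementException)
            {
                return null; // not in the DOM yet
            }
            catch (StaleElementReferenceException)
            {
                return null; // found, but replaced before we could inspect it
            }
        };
    }
}
```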


Page loaded condition and navigation

Finally, the NewPageLoaded method looks like this:
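
One way to put the two checks together, offered as a sketch rather than the exact original listing. Here NewPageLoaded receives the staleness and visibility checks as delegates; that signature and the PageLoadCondition class name are assumptions made to keep the example self-contained.

```csharp
using System;
using OpenQA.Selenium;

public static class PageLoadCondition
{
    // Combines "an old-page element went stale" with "a new-page element is visible".
    public static Func<IWebDriver, bool> NewPageLoaded(
        Func<IWebDriver, bool> oldElementIsStale,
        Func<IWebDriver, IWebElement> newElementIsVisible)
    {
        // && short-circuits, so the new page is only probed once the old one is gone.
        return driver => oldElementIsStale(driver) && newElementIsVisible(driver) != null;
    }
}
```

The returned delegate has the Func<IWebDriver, bool> shape, so it can be handed directly to the Until method.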


And once we decide what elements on the pages we’re going to use to identify if a new page has loaded, we’re ready to navigate to the Computers page and the Laptops page. I chose the following:


Now we can finally perform navigation:
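
A hedged sketch of a click-and-wait step. ClickAndWait, the Navigation class and the default 10-second timeout are hypothetical; the locator of the menu item and the page-loaded condition are supplied by the caller, since the concrete selectors are not shown here.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class Navigation
{
    // Clicks the element found by linkLocator, then blocks until pageLoaded returns true.
    // Throws WebDriverTimeoutException if the condition is not met in time.
    public static void ClickAndWait(
        IWebDriver driver,
        By linkLocator,
        Func<IWebDriver, bool> pageLoaded,
        int timeoutSeconds = 10)
    {
        driver.FindElement(linkLocator).Click();

        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(timeoutSeconds));
        wait.Until(pageLoaded);
    }
}
```

Because the condition is passed in as an argument, any reference to an old-page element is necessarily captured before the click happens, which is what makes the staleness check meaningful. Navigating to Computers and then to Laptops is two calls to this method with different locators and conditions.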


Scraping paginated laptop items

Since we’ve navigated to the Laptops page, we can now scrape the laptop items. We need a couple of things to do that.

First of all, we need a reference to the “Next” button element – by clicking it we can load the items, page by page (button CSS selector – button.btn.btn-default.next).

The second thing to keep in mind is that we have to wait until the next page of items is loaded. Luckily, we’ve made a method to check the staleness of elements, so we can infer that a new page of items has loaded when the laptop items from the current page go stale.

And lastly, we should check whether the “Next” button is enabled or disabled, so we know if we’ve reached the last items page or not.
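
The three points above could be combined into a loop like the following sketch. It assumes the laptop items use the same item, name and price selectors as the home page items, and that a disabled “Next” button reports Enabled as false; LaptopScraper, ScrapeAllPages and the 10-second timeout are illustrative choices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class LaptopScraper
{
    public static List<(string Name, string Price)> ScrapeAllPages(IWebDriver driver)
    {
        var results = new List<(string Name, string Price)>();
        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

        while (true)
        {
            // Read every item on the current page of results.
            var items = driver.FindElements(By.CssSelector("div.thumbnail"));
            foreach (IWebElement item in items)
            {
                results.Add((
                    item.FindElement(By.CssSelector("h4 > a")).Text,
                    item.FindElement(By.CssSelector("div.caption > h4.pull-right.price")).Text));
            }

            // Stop on an empty page or when "Next" is disabled (last page).
            IWebElement nextButton = driver.FindElement(By.CssSelector("button.btn.btn-default.next"));
            if (items.Count == 0 || !nextButton.Enabled)
            {
                break;
            }

            // Keep a handle to a current item, click, and wait for that item to go stale.
            IWebElement firstItem = items.First();
            nextButton.Click();
            wait.Until(d => IsStale(firstItem));
        }

        return results;
    }

    // Same staleness rule as described earlier: null, disabled, or detached.
    private static bool IsStale(IWebElement element)
    {
        try { return element == null || !element.Enabled; }
        catch (StaleElementReferenceException) { return true; }
    }
}
```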


We are almost done with our scraper! Let’s run the script with F5 and wait a couple of seconds. As a result of running the script, we can see 120 scraped products in our table. However, we should do one more thing – refactor the code a bit.

Finishing steps

First of all, the code for saving home items and laptop items is the same. Therefore, we can extract a method for saving items.

We can also extract a method for page navigation. We just need to pass different CSS selectors when calling the method.

And lastly, to keep the main part of the script nice and readable, we can do two things. We can create a new method just for scraping laptop items. Also, we can create a new method for direct navigation to the Laptops page.

We’re done!

Finally, here’s the full code for the tutorial:


In the next and final part of this web scraping tutorial, we’ll turn our script into a shareable workbook-application that any user with the QueryStorm Runtime can execute.