JavaScript Supported Web Scraping using Perl and Selenium

This article explains how to scrape web pages that depend on JavaScript, using Perl and Selenium::Remote::Driver. Selenium::Remote::Driver is a Perl module for Selenium. Selenium provides APIs that control a real web browser, so pages rendered by JavaScript can be scraped.

Download and Install Perl

See the following articles about how to download and install Perl. Windows, Mac, and Linux/UNIX are supported.

Download and Install Google Chrome

This section explains how to install the web browser.

Windows and Mac

You can download and install Google Chrome from the following page.

Linux

On Linux, you can usually install Google Chrome using a package manager. I have tried installing Google Chrome on the following Linux distributions.

Ubuntu
CentOS

Check the version of Google Chrome

You can check the version of Google Chrome with the following command.

google-chrome --version

An example of the output:

Google Chrome 87.0.4280.88

Download and Install ChromeDriver

ChromeDriver is a web driver. Web drivers are tools that control web browsers from programs.

Ubuntu
CentOS

Check the version of ChromeDriver

You can check the version of ChromeDriver with the following command.

chromedriver --version

An example of the output:

ChromeDriver 87.0.4280.88 (89e2380a3e36c3464b5dd1302349b1382549290d-refs/branch-heads/4280@{#1761})

To prevent problems, the version of ChromeDriver should be the same as the version of Google Chrome.
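The version check can also be done in a small script. The following is a minimal sketch that compares only the major version numbers. The helper name major_version is hypothetical, and the sample strings are copied from the example outputs above. In a real script you could capture them with backticks instead.

```perl
use strict;
use warnings;

# Extract the major version (the digits before the first dot) from a
# "--version" output line such as "Google Chrome 87.0.4280.88".
# major_version is a hypothetical helper name used only in this sketch.
sub major_version {
  my ($version_output) = @_;
  return $version_output =~ /(\d+)\.\d+\.\d+\.\d+/ ? $1 : undef;
}

# Sample strings taken from the example outputs in this article.
my $chrome_output = 'Google Chrome 87.0.4280.88';
my $driver_output = 'ChromeDriver 87.0.4280.88 (89e2380a3e36c3464b5dd1302349b1382549290d-refs/branch-heads/4280@{#1761})';

my $chrome_major = major_version($chrome_output);
my $driver_major = major_version($driver_output);

if (defined $chrome_major && defined $driver_major && $chrome_major == $driver_major) {
  print "OK: both are major version $chrome_major\n";
}
else {
  print "Mismatch: check your Chrome and ChromeDriver versions\n";
}
```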

Install Selenium::Remote::Driver

Let's install Selenium::Remote::Driver using cpanm or cpan. Selenium::Remote::Driver is a Perl module for Selenium.

# cpanm
cpanm Selenium::Remote::Driver

# cpan
cpan Selenium::Remote::Driver

Get a Web Page using Selenium

This section shows how to get a web page using Selenium.

Load Selenium::Chrome

Let's load Selenium::Chrome. Selenium::Chrome is a subclass of Selenium::Remote::Driver for Google Chrome.

use Selenium::Chrome;

Create a Selenium::Chrome Object

Create a Selenium::Chrome object with headless options.

my $driver = Selenium::Chrome->new(
  extra_capabilities => {
    'goog:chromeOptions' => {
      args => ['headless', 'disable-gpu', 'window-size=1920,1080', 'no-sandbox' ]
    }
  }
);

The new method is an ordinary constructor in Perl object-oriented programming. Options are specified using hash references and array references.

These options are needed for headless execution. Headless execution means that you get a page without showing the GUI of Google Chrome.

If you omit these options on a machine that has no GUI display, you will see the following error message.

Could not create new session: unknown error: Chrome failed to start: exited abnormally.

(unknown error: DevToolsActivePort file doesn't exist) (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.) at access.pl line 79.

Shut Down a Selenium::Chrome Object

When you have finished using Selenium::Chrome, you need to shut it down.

$driver->shutdown_binary;

Get a Web Page and the Title

You can get a web page using the get method.

$driver->get('http://www.google.com');

It is a good idea to wait a few seconds, because the page may not be rendered yet.

sleep 3;
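A fixed sleep can be too short or longer than necessary. As an alternative, here is a minimal sketch of a polling loop that checks a condition repeatedly until it becomes true or a timeout passes. The helper name wait_until and its default values are my own assumptions, not part of Selenium::Remote::Driver. A counter stands in for the page so that the sketch runs on its own.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Call $condition repeatedly until it returns a true value or
# $timeout seconds have passed. Returns that value, or undef on timeout.
# wait_until is a hypothetical helper, not a Selenium method.
sub wait_until {
  my ($condition, $timeout, $interval) = @_;
  $timeout  //= 10;
  $interval //= 0.5;

  my $deadline = time() + $timeout;
  while (1) {
    # An exception inside the condition is treated as "not ready yet".
    my $result = eval { $condition->() };
    return $result if $result;
    return if time() >= $deadline;
    sleep($interval);
  }
}

# With a real driver, the condition could be something like:
#   wait_until(sub { $driver->get_title ne '' }, 10);
# Here a counter stands in for the page.
my $calls = 0;
my $ready = wait_until(sub { ++$calls >= 3 }, 5, 0.1);
print "ready after $calls checks\n" if $ready;
```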

You can get the title using the get_title method.

my $title = $driver->get_title;

Examples:

This is an example to get a page and the title.

use Selenium::Chrome;

# Create a Selenium::Chrome object
my $driver = Selenium::Chrome->new(
  extra_capabilities => {
    'goog:chromeOptions' => {
      args => ['headless', 'disable-gpu', 'window-size=1920,1080', 'no-sandbox' ]
    }
  }
);

# Get a page
$driver->get('http://www.google.com');

sleep 3;

# Get the title
my $title = $driver->get_title;

print "$title\n";

# Shutdown
$driver->shutdown_binary;

The output:

Google

Get the Screenshot

You can get a screenshot using the capture_screenshot method.

$driver->capture_screenshot("screenshot.png");

If you want to save the screenshot in the directory that contains the program, you can use the FindBin module.

use FindBin;

$driver->capture_screenshot("$FindBin::Bin/screenshot.png");

If you want to serve the screenshot file and see it on your web browser, you can use Mojolicious::Plugin::Directory or Mojolicious::Plugin::Directory::Stylish.

Mojolicious::Plugin::Directory

# Install
cpanm Mojolicious::Plugin::Directory

# Serve the current directory from the command line
perl -Mojo -E 'a->plugin("Directory")->start' daemon

Mojolicious::Plugin::Directory::Stylish

# Install
cpanm Mojolicious::Plugin::Directory::Stylish

# Serve the current directory from the command line
perl -Mojo -E 'a->plugin("Directory::Stylish")->start' daemon

Install Fonts

If characters are garbled in your screenshots, you need to install fonts.

This section explains how to install fonts on some operating systems.

Ubuntu

CentOS

CSS Selector

If you are familiar with CSS selectors, you can make the CSS selector the default finder.

# Make the CSS selector the default finder
$driver->default_finder('css');

On this page, I use CSS selectors to select HTML elements.

Get An Element

You can get an element using the find_element method with a CSS selector. The return value is a Selenium::Remote::WebElement object.

# Get an element using id CSS selector
my $element = $driver->find_element("#myid");

# Get an element using class CSS selector
my $element = $driver->find_element(".myclass");

# Get an element using attribute CSS selector
my $element = $driver->find_element("[name=id]");

Note that the find_element method throws an exception when the element is not found.

If you want to avoid the exception, you need to wrap the call in an eval block.

my $element;
eval {
  $element = $driver->find_element("#myid");
};
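The eval pattern can be wrapped in a small helper so that the calling code only has to test for undef. This is a minimal sketch. The name find_element_or_undef is hypothetical, and note that eval hides every kind of exception here, not only "element not found".

```perl
use strict;
use warnings;

# Return the first element matching $selector, or undef when the lookup fails.
# $driver is expected to be a Selenium::Chrome (Selenium::Remote::Driver) object.
# find_element_or_undef is a hypothetical helper name.
sub find_element_or_undef {
  my ($driver, $selector) = @_;
  # The second argument selects the CSS finder explicitly, so this works
  # even when default_finder has not been changed.
  my $element = eval { $driver->find_element($selector, 'css') };
  return $element;
}

# Usage, assuming $driver is a Selenium::Chrome object:
#   if (my $element = find_element_or_undef($driver, '#myid')) {
#     print $element->get_text, "\n";
#   }
```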

Commonly Used CSS Selectors

Here is a commonly used CSS selector.

# id starts with "myid_"
my $selects_first = $driver->find_element('[id^=myid_]');

Get a Child Element

If you want to get a child element of an element, you can use the find_child_element method.

# Get a child element
my $child_element = $driver->find_child_element($element, ".myclass");

Get Multiple Elements

You can get multiple elements using the find_elements method with a CSS selector. The return value is an array reference that contains Selenium::Remote::WebElement objects.

In this example, I find the button whose text contains "Signin" using a for statement, an if statement, and a regular expression, and then click it.

# Get Multiple Elements
my $buttons = $driver->find_elements('button');
for my $button (@$buttons) {
  my $text = $button->get_text;
  if ($text =~ /Signin/) {
    $button->click;
  }
}

Note that the find_elements method throws an exception when no elements are found.

If you want to avoid the exception, you need to wrap the call in an eval block.

my $buttons;
eval {
  $buttons = $driver->find_elements('button');
};

Get Child Elements

If you want to get child elements of an element, you can use the find_child_elements method.

# Get child elements
my $child_elements = $driver->find_child_elements($element, ".myclass");

Text Fields, Password Fields, Text Area

This section explains how to manipulate text fields, password fields, and text areas.

Input Text

You can input text into text fields, password fields, and text areas using the send_keys method of Selenium::Remote::WebElement. The text is an ordinary Perl string.

# Input text
$input_element->send_keys("perlclub");

The send_keys method supports Unicode. You need to write "use utf8;" and save the source code as UTF-8.

# Input unicode text
use strict;
use warnings;
use utf8;

# ...

$input_element->send_keys("あいうえお");
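To see what "use utf8;" changes, the following sketch counts the same literal as characters and as UTF-8 bytes. It does not use Selenium. It only illustrates that, with "use utf8;", Perl treats the literal as a character string rather than as raw bytes. The file itself must be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# With "use utf8;", this literal is a string of 5 characters.
my $text = "あいうえお";

my $char_count = length $text;
my $byte_count = length encode('UTF-8', $text);

# Encode the output so that printing wide characters does not warn.
binmode STDOUT, ':encoding(UTF-8)';
print "$text: $char_count characters, $byte_count bytes in UTF-8\n";
```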

Input Keys from Keyboard

This section explains the details of inputting keys from the keyboard.

Input Characters

You can input keys to an element using the send_keys method of Selenium::Remote::WebElement.

# Input text
$element->send_keys("perlclub");

Input Unicode Characters

The send_keys method supports Unicode. You need to write "use utf8;" and save the source code as UTF-8.

# Input unicode text
use strict;
use warnings;
use utf8;

# ...

$element->send_keys("あいうえお");

Input Special Keys

You can input special keys such as "space" and "enter" using the send_keys method with the KEYS constant of Selenium::Remote::WDKeys.

use Selenium::Remote::WDKeys;

# ...

# Input space
$element->send_keys(KEYS->{'space'});

# Input enter
$element->send_keys(KEYS->{'enter'});

# Input "up arrow" x 10
$element->send_keys(KEYS->{'up_arrow'} x 10);

# Input "down arrow" x 5
$element->send_keys(KEYS->{'down_arrow'} x 5);

The following is the list of key names that are available in KEYS.

null
cancel
help
backspace
tab
clear
return
enter
shift
control
alt
pause
escape
space
page_up
page_down
end
home
left_arrow
up_arrow
right_arrow
down_arrow
insert
delete
semicolon
equals
numpad_0
numpad_1
numpad_2
numpad_3
numpad_4
numpad_5
numpad_6
numpad_7
numpad_8
numpad_9
multiply
add
separator
subtract
decimal
divide
f1
f2
f3
f4
f5
f6
f7
f8
f9
f10
f11
f12
command_meta
ZenkakuHankaku

Checkbox and Radio Button

This section explains how to check whether checkboxes and radio buttons are selected, and how to select them.

Check Whether Checkboxes and Radio Buttons Are Selected

You can check whether a checkbox or radio button is selected using the is_selected method of Selenium::Remote::WebElement.

# Check whether a checkbox or radio button is selected
my $is_selected = $element->is_selected;

Select Checkboxes or Radio Buttons

You can select a checkbox or radio button using the set_selected method of Selenium::Remote::WebElement.

# Select a checkbox or radio button
$element->set_selected;

You may see the following error: "Other element would receive the click".

Error while executing command: element click intercepted: element click intercepted: Element <input type="checkbox" class="myclass" value="true" id="myid"> is not clickable at point (1009, 210). Other element would receive the click: <label class="mylabel" for="myid">...</label>
  (Session info: headless chrome=87.0.4280.88) at /home/perlclub/perl5/perlbrew/perls/perl-5.20.3/lib/site_perl/5.20.3/Selenium/Remote/Driver.pm line 410.
 at /home/perlclub/perl5/perlbrew/perls/perl-5.20.3/lib/site_perl/5.20.3/Selenium/Remote/Driver.pm line 361.

This happens because the label element, not the checkbox, would receive the click that the set_selected method performs.

You can write the following code to click the label instead. The label element is a sibling of #myid.

my $checkbox = $driver->find_element("#myid");
unless ($checkbox->is_selected) {
  my $checkbox_label = $driver->find_element("#myid ~ label");
  # Click the label, which toggles the checkbox it belongs to
  $checkbox_label->click;
}

Click Buttons

You can click buttons using the click method of Selenium::Remote::WebElement.

# Click Buttons
$button->click;

You may see the following error: "is not clickable".

...<button>...</button> is not clickable ... at /home/perlclub/perl5/perlbrew/perls/perl-5.20.3/lib/site_perl/5.20.3/Selenium/Remote/Driver.pm line 410.
 at /home/perlclub/perl5/perlbrew/perls/perl-5.20.3/lib/site_perl/5.20.3/Selenium/Remote/Driver.pm line 361.

Note that a button must be displayed in the window before it can be clicked.

You can scroll the window so that the button is displayed in the middle of it, using the execute_script method and JavaScript's scrollIntoView method.

# Display the button in the middle of the window
$driver->execute_script('arguments[0].scrollIntoView({block: "center"});', $button);

# Click Buttons
$button->click;

Examples to click buttons

Click the button whose text is "Search".

# Click the button whose text is "Search".
my $buttons = $driver->find_elements('button');
for my $button (@$buttons) {
  my $text = $button->get_text;
  if ($text =~ /^\s*Search\s*$/) {
    $button->click;
    last;
  }
}
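Searching for an element by its text appears several times in this article, so the loop can be turned into a reusable helper. This is a minimal sketch. The name first_element_with_text is hypothetical, and the helper treats an exception from find_elements the same as "nothing found".

```perl
use strict;
use warnings;

# Return the first element matching $selector whose text matches $pattern,
# or undef if there is none.
# first_element_with_text is a hypothetical helper name.
sub first_element_with_text {
  my ($driver, $selector, $pattern) = @_;
  # Fall back to an empty list if find_elements throws an exception.
  my $elements = eval { $driver->find_elements($selector, 'css') } || [];
  for my $element (@$elements) {
    return $element if $element->get_text =~ $pattern;
  }
  return;
}

# Usage, assuming $driver is a Selenium::Chrome object:
#   my $button = first_element_with_text($driver, 'button', qr/^\s*Search\s*$/);
#   $button->click if $button;
```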

Get Text

You can get the text of an element using the get_text method of Selenium::Remote::WebElement.

# Get text
my $text = $element->get_text;

After you click a button, you may need to wait a few seconds using the sleep function until the next page is shown.

sleep 3;

Execute JavaScript

You can execute JavaScript using the execute_script method. The first argument is the JavaScript code you want to execute. The second argument is an element. This element is passed to the JavaScript code as the first item of the arguments variable (arguments[0]).

$driver->execute_script('arguments[0].scrollIntoView({block: "center"});', $button);

If you can't find the operation you want in the Selenium::Remote::Driver documentation, you can execute any JavaScript using the execute_script method.
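The execute_script method also returns the value of a JavaScript "return" statement to Perl. The following sketch wraps two small scripts in helper subroutines. The names page_scroll_height and scroll_to_bottom are hypothetical, and the JavaScript snippets are only examples of what you might run.

```perl
use strict;
use warnings;

# Return the scroll height of the current page in pixels.
# $driver is expected to be a Selenium::Chrome object.
# page_scroll_height is a hypothetical helper name.
sub page_scroll_height {
  my ($driver) = @_;
  # A "return" statement in the script passes the value back to Perl.
  return $driver->execute_script('return document.body.scrollHeight;');
}

# Scroll to the bottom of the page, for example to show content
# that is loaded only after scrolling.
# scroll_to_bottom is a hypothetical helper name.
sub scroll_to_bottom {
  my ($driver) = @_;
  $driver->execute_script('window.scrollTo(0, document.body.scrollHeight);');
  return;
}

# Usage, assuming $driver is a Selenium::Chrome object:
#   my $height = page_scroll_height($driver);
#   scroll_to_bottom($driver);
```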

Select Fields

This section explains how to select a value in a select field. The steps are as follows.

1. Display the select field.

2. Click the select field.

3. Input down arrow keys.

4. Input the enter key.

# Display the select field
$driver->execute_script('arguments[0].scrollIntoView({block: "center"});', $select);

# Click the select field
$select->click;

# Input down keys
$select->send_keys(KEYS->{'down_arrow'} x $key_down_count);

# Input enter key
$select->send_keys(KEYS->{'enter'});

"x" is Perl's repetition operator.

Get Page Source

You can get the page source as parsed by the web browser using the get_page_source method. Note that this is not the raw page source sent by the server.

# Get the page source
my $page_source = $driver->get_page_source;

Get body Text

You can get the text in the body tag using the get_body method. The return value contains only text, without HTML tags.

# Get the body text
my $body_text = $driver->get_body;

Table Manipulation

Here is an example of table manipulation that iterates over each row of a table.

# Table manipulation
my $rows;
# Use a descendant selector, because browsers insert a tbody element between table and tr
eval { $rows = $driver->find_elements('table.mytable tr') };
if ($rows) {
  for my $row (@$rows) {
    my $row_text = $row->get_text;
    if ($row_text =~ /\b(\d{12})\b/a) {
      my $id = $1;
      
      print "$id\n";
    }
  }
}
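The regular expression in this example can be tried without a browser. In the following sketch, the sample strings are made up and stand in for the text that get_text might return for each row. It shows that only a run of exactly 12 digits is captured.

```perl
use strict;
use warnings;

# Made-up sample row texts standing in for $row->get_text results.
my @row_texts = (
  'Order 123456789012 shipped',
  'Header row without an id',
  'Ref 1234567890123 has 13 digits, so it is skipped',
);

my @ids;
for my $row_text (@row_texts) {
  # \b on both sides means the digit run must be exactly 12 digits long.
  # The /a modifier restricts \d to the ASCII digits 0-9.
  if ($row_text =~ /\b(\d{12})\b/a) {
    push @ids, $1;
  }
}

print "$_\n" for @ids;
```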

Download Files

This section explains how to download files using Selenium::Chrome.

Set the Download Directory

You can set the download directory using the prefs option ('download.default_directory') of the new method of Selenium::Chrome.

my $driver = Selenium::Chrome->new(
  extra_capabilities => {
    'goog:chromeOptions' => {
      args => ['headless', 'disable-gpu', 'window-size=1920,1080', 'no-sandbox' ],
      prefs => {
          'download.default_directory' => '/tmp'
      },
    }
  }
);

Note that you can't specify home directories as download directories for security reasons.

Download Files

Selenium doesn't have a special feature to download files. Normally, you can download files using the click method. You need to wait for the download to finish using the sleep function.

# Download files
$download_button->click;
sleep 3;
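A fixed sleep may end before a large download completes. As an alternative, the following sketch polls the download directory until it contains a file and no partial download remains. It assumes that the directory is used only for downloads and starts out empty, and that Chrome marks unfinished downloads with the .crdownload extension. The helper name wait_for_download and the path in the usage comment are hypothetical.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Wait until $dir contains at least one regular file and no Chrome
# partial-download files (*.crdownload). Returns the list of file names,
# or an empty list on timeout.
# wait_for_download is a hypothetical helper, not a Selenium method.
sub wait_for_download {
  my ($dir, $timeout) = @_;
  $timeout //= 30;
  my $deadline = time() + $timeout;

  while (1) {
    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    my @names = grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;

    my @partial = grep { /\.crdownload\z/ } @names;
    return @names if @names && !@partial;
    return if time() >= $deadline;
    sleep(0.5);
  }
}

# Usage, assuming /tmp/downloads is an empty directory that was set
# as 'download.default_directory':
#   $download_button->click;
#   my @files = wait_for_download('/tmp/downloads', 60);
```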

Examples:

# Get the download button
my $download_buttons = $driver->find_elements('button');
my $download_button;
for my $button (@$download_buttons) {
  my $button_text = $button->get_text;

  if ($button_text =~ /^\s*Download\s*$/) {
    # Keep the matching button in the outer variable
    $download_button = $button;
    last;
  }
}

# Download the file
if ($download_button) {
  $download_button->click;
  sleep 3;
}

Related Information