JavaScript Supported Web Scraping using Perl and Selenium
I explain how to scrape JavaScript-driven web pages using Perl and Selenium::Remote::Driver. Selenium::Remote::Driver is a Perl module for Selenium. Selenium provides APIs for controlling a web browser, so you can scrape pages after their JavaScript has run.
Download and Install Perl
See the following articles about how to download and install Perl. Windows, Mac, and Linux/UNIX are supported.
Download and Install Google Chrome
I explain how to install the web browser.
Windows and Mac
You can download and install Google Chrome from the following page.
Linux
On Linux, you can usually install Google Chrome using a package manager. I tried installing Google Chrome on some Linux distributions.
Ubuntu
CentOS
Check the version of Google Chrome
Check the version of Google Chrome.
google-chrome --version
An example of the output:
Google Chrome 87.0.4280.88
Download and Install ChromeDriver
ChromeDriver is a web driver. Web drivers are tools for controlling web browsers from programs.
Ubuntu
CentOS
Check the version of ChromeDriver
Check the version of ChromeDriver.
chromedriver --version
An example of the output:
ChromeDriver 87.0.4280.88 (89e2380a3e36c3464b5dd1302349b1382549290d-refs/branch-heads/4280@{#1761})
To prevent problems, the version of ChromeDriver should be the same as the version of Google Chrome.
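The version comparison can also be done in Perl. The following is a minimal sketch that compares only the major version numbers; the helper name major_version is hypothetical and is not part of any module, and the sample strings are copied from the example outputs shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the major version number from the output
# of "google-chrome --version" or "chromedriver --version".
sub major_version {
    my ($version_output) = @_;
    return $version_output =~ /(\d+)\.\d+\.\d+\.\d+/ ? $1 : undef;
}

# Sample strings copied from the example outputs
my $chrome_output = 'Google Chrome 87.0.4280.88';
my $driver_output = 'ChromeDriver 87.0.4280.88 (89e2380a3e36c3464b5dd1302349b1382549290d-refs/branch-heads/4280@{#1761})';

if (major_version($chrome_output) == major_version($driver_output)) {
    print "Major versions match\n";
}
else {
    print "Major versions differ\n";
}
```

In a real script, the two strings could be read from the commands themselves, for example with backticks, assuming both commands are on the PATH.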
Install Selenium::Remote::Driver
Let's install Selenium::Remote::Driver using cpanm or cpan. Selenium::Remote::Driver is a Perl module for Selenium.
# cpanm
cpanm Selenium::Remote::Driver

# cpan
cpan Selenium::Remote::Driver
Get a Web Page using Selenium
Get a Web Page using Selenium.
Load Selenium::Chrome
Let's load Selenium::Chrome. Selenium::Chrome is a subclass of Selenium::Remote::Driver for Google Chrome.
use Selenium::Chrome;
Create a Selenium::Chrome Object
Create a Selenium::Chrome object with headless options.
my $driver = Selenium::Chrome->new(
  extra_capabilities => {
    'goog:chromeOptions' => {
      args => ['headless', 'disable-gpu', 'window-size=1920,1080', 'no-sandbox'],
    },
  },
);
The new method is a constructor, as in ordinary Perl object-oriented programming. Options are specified using hash references and array references.
These options are needed for headless execution. Headless execution means that you get a page without the GUI of Google Chrome.
If you omit these options and no GUI display is available, you will see the following error message.
Could not create new session: unknown error: Chrome failed to start: exited abnormally. (unknown error: DevToolsActivePort file doesn't exist) (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.) at access.pl line 79.
Shutdown a Selenium::Chrome Object
When you have finished using the Selenium::Chrome object, you need to shut it down.
$driver->shutdown_binary;
Get a Web Page and the Title
You can get a web page using the get method.
$driver->get('http://www.google.com');
It is a good idea to wait a few seconds because the page may not be rendered yet.
sleep 3;
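A fixed sleep either waits longer than needed or not long enough. As an alternative, the following is a minimal sketch of a polling helper in plain Perl. The name poll_until and its default values are hypothetical and are not part of Selenium::Remote::Driver.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: call $condition repeatedly until it returns a true
# value or $timeout seconds have passed. Returns the true value, or undef
# on timeout. An exception thrown by $condition counts as "not ready yet".
sub poll_until {
    my ($condition, $timeout, $interval) = @_;
    $timeout  //= 10;
    $interval //= 0.5;

    my $deadline = time() + $timeout;
    while (1) {
        my $result = eval { $condition->() };
        return $result if $result;
        last if time() >= $deadline;
        sleep($interval);
    }
    return;
}

# Stand-in condition for this demo: it becomes true on the third call.
# With a real driver, the condition could be: sub { $driver->get_title }
my $calls = 0;
my $ready = poll_until(sub { ++$calls >= 3 }, 2, 0.1);
print "ready after $calls calls\n" if $ready;
```

Because exceptions are treated as "not ready yet", the condition can also call methods that throw while the page is still loading.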
You can get the title using the get_title method.
my $title = $driver->get_title;
Examples:
This is an example to get a page and the title.
use Selenium::Chrome;

# Create a Selenium::Chrome object
my $driver = Selenium::Chrome->new(
  extra_capabilities => {
    'goog:chromeOptions' => {
      args => ['headless', 'disable-gpu', 'window-size=1920,1080', 'no-sandbox'],
    },
  },
);

# Get a page
$driver->get('http://www.google.com');
sleep 3;

# Get the title
my $title = $driver->get_title;
print "$title\n";

# Shutdown
$driver->shutdown_binary;
The output:
Get the Screenshot
You can take a screenshot using the capture_screenshot method.
$driver->capture_screenshot("screenshot.png");
If you want to save the screenshot to the directory that contains the program, you can use the FindBin module.
use FindBin;
$driver->capture_screenshot("$FindBin::Bin/screenshot.png");
If you want to serve the screenshot file and see it on your web browser, you can use Mojolicious::Plugin::Directory or Mojolicious::Plugin::Directory::Stylish.
Mojolicious::Plugin::Directory
# Install
cpanm Mojolicious::Plugin::Directory

# Serve the current directory from the command line
perl -Mojo -E 'a->plugin("Directory")->start' daemon
Mojolicious::Plugin::Directory::Stylish
# Install
cpanm Mojolicious::Plugin::Directory::Stylish

# Serve the current directory from the command line
perl -Mojo -E 'a->plugin("Directory::Stylish")->start' daemon
Install Fonts
If characters are garbled in your screenshots, you need to install fonts.
I explain how to install fonts on some operating systems.
Ubuntu
CentOS
CSS Selector
If you are familiar with CSS selectors, you can make CSS selectors the default finder.
# Make CSS selectors the default finder
$driver->default_finder('css');
In this page, I use CSS selectors to select HTML elements.
Get An Element
You can get an element using the find_element method with a CSS selector. The return value is a Selenium::Remote::WebElement object.
# Get an element using an id CSS selector
my $element_by_id = $driver->find_element("#myid");

# Get an element using a class CSS selector
my $element_by_class = $driver->find_element(".myclass");

# Get an element using an attribute CSS selector
my $element_by_name = $driver->find_element("[name=id]");
Note that the find_element method throws an exception when the element is not found.
If you want to avoid the exception, wrap the call in an eval block.
my $element;
eval { $element = $driver->find_element("#myid"); };
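If you use the eval block in many places, it can be wrapped in a small helper. The following is a minimal sketch; the name find_element_or_undef is hypothetical and is not a method of Selenium::Remote::Driver. It passes 'css' as the finder explicitly, so it does not depend on the default finder.

```perl
use strict;
use warnings;

# Hypothetical helper: return the element that matches the CSS selector,
# or undef when find_element throws an exception.
# $driver is any object with a find_element method, such as a
# Selenium::Chrome object.
sub find_element_or_undef {
    my ($driver, $selector) = @_;
    my $element = eval { $driver->find_element($selector, 'css') };
    return $element;
}

# Usage with a real driver (not run here):
#   my $element = find_element_or_undef($driver, '#myid');
#   print "Not found\n" unless defined $element;
```

Note that this helper hides every exception, not only "element not found", so a stricter version would inspect $@ before returning undef.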
Often Used CSS Selectors
I introduce some often used CSS selectors.
# id starts with "myid_"
my $selects_first = $driver->find_element('[id^=myid_]');
Get a Child Element
If you want to get a child element of an element, you can use the find_child_element method.
# Get a child element
my $child_element = $driver->find_child_element($element, ".myclass");
Get Multiple Elements
You can get multiple elements using the find_elements method with a CSS selector. The return value is an array reference that contains Selenium::Remote::WebElement objects.
In this example, I find the buttons whose text contains "Signin" using a for statement, an if statement, and a regular expression, and click them.
# Get multiple elements
my $buttons = $driver->find_elements('button');
for my $button (@$buttons) {
  my $text = $button->get_text;
  if ($text =~ /Signin/) {
    $button->click;
  }
}
Note that the find_elements method throws an exception when no elements are found.
If you want to avoid the exception, wrap the call in an eval block.
my $buttons;
eval { $buttons = $driver->find_elements('button'); };
Get Child Elements
If you want to get child elements of an element, you can use the find_child_elements method.
# Get child elements
my $child_elements = $driver->find_child_elements($element, ".myclass");
Text Fields, Password Fields, Text Area
I explain how to manipulate text fields, password fields, and text areas.
Input Text
You can input text in text fields, password fields, and text areas using the send_keys method of Selenium::Remote::WebElement. The text is a Perl string.
# Input text
$input_element->send_keys("perlclub");
The send_keys method supports Unicode. You need to write "use utf8;" and save the source code as UTF-8.
# Input Unicode text
use strict;
use warnings;
use utf8;

# ...

$input_element->send_keys("あいうえお");
Input Keys from Keyboard
I explain the details of inputting keys from the keyboard.
Input Characters
You can input keys to an element using the send_keys method of Selenium::Remote::WebElement.
# Input text
$element->send_keys("perlclub");
Input Unicode Characters
The send_keys method supports Unicode. You need to write "use utf8;" and save the source code as UTF-8.
# Input Unicode text
use strict;
use warnings;
use utf8;

# ...

$element->send_keys("あいうえお");
Input Special Keys
You can input special keys such as "space" and "enter" using the send_keys method with KEYS, which is exported by Selenium::Remote::WDKeys.
use Selenium::Remote::WDKeys;

# ...

# Input space
$element->send_keys(KEYS->{'space'});

# Input enter
$element->send_keys(KEYS->{'enter'});

# Input "up arrow" x 10
$element->send_keys(KEYS->{'up_arrow'} x 10);

# Input "down arrow" x 5
$element->send_keys(KEYS->{'down_arrow'} x 5);
The following is the list of constant keys.
null cancel help backspace tab clear return enter shift control alt pause escape space page_up page_down end home left_arrow up_arrow right_arrow down_arrow insert delete semicolon equals numpad_0 numpad_1 numpad_2 numpad_3 numpad_4 numpad_5 numpad_6 numpad_7 numpad_8 numpad_9 multiply add separator subtract decimal divide f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 command_meta ZenkakuHankaku
Checkbox and Radio Button
I explain how to check whether checkboxes and radio buttons are selected, and how to select them.
Check Whether Checkboxes and Radio Buttons Are Selected
You can check whether a checkbox or radio button is selected using the is_selected method of Selenium::Remote::WebElement.
# Check whether a checkbox or radio button is selected
my $is_selected = $element->is_selected;
Select Checkboxes or Radio Buttons
You can select a checkbox or radio button using the set_selected method of Selenium::Remote::WebElement.
# Select a checkbox or radio button
$element->set_selected;
You may see the following error "Other element would receive the click".
Error while executing command: element click intercepted: element click intercepted: Element <input type="checkbox" class="myclass" value="true" id="myid"> is not clickable at point (1009, 210). Other element would receive the click: <label class="mylabel" for="myid">...</label> (Session info: headless chrome=87.0.4280.88) at /home/perlclub/perl5/perlbrew/perls/perl-5.20.3/lib/site_perl/5.20.3/Selenium/Remote/Driver.pm line 410. at /home/perlclub/perl5/perlbrew/perls/perl-5.20.3/lib/site_perl/5.20.3/Selenium/Remote/Driver.pm line 361.
This is because the label element receives the click that the set_selected method sends to the checkbox.
You can write the following code to click the label instead. The label element is a sibling of #myid.
my $checkbox = $driver->find_element("#myid");
unless ($checkbox->is_selected) {
  # Click the label that covers the checkbox
  my $checkbox_label = $driver->find_element("#myid ~ label");
  $checkbox_label->click;
}
Click Buttons
You can click buttons using the click method of Selenium::Remote::WebElement.
# Click a button
$button->click;
You may see the following error "is not clickable".
...<button>...</button> is not clickable ... at /home/perlclub/perl5/perlbrew/perls/perl-5.20.3/lib/site_perl/5.20.3/Selenium/Remote/Driver.pm line 410. at /home/perlclub/perl5/perlbrew/perls/perl-5.20.3/lib/site_perl/5.20.3/Selenium/Remote/Driver.pm line 361.
Note that a button must be displayed before it can be clicked.
You can scroll the window so that the button is displayed in the middle of the window using the execute_script method and JavaScript's scrollIntoView method.
# Display the button in the middle of the window
$driver->execute_script('arguments[0].scrollIntoView({block: "center"});', $button);

# Click the button
$button->click;
Examples to click buttons
Click the button whose text is "Search".
# Click the button whose text is "Search"
my $buttons = $driver->find_elements('button');
for my $button (@$buttons) {
  my $text = $button->get_text;
  if ($text =~ /^\s*Search\s*$/) {
    $button->click;
    last;
  }
}
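The loop that searches for an element by its text can be factored into a reusable function. The following is a minimal sketch using first from List::Util; the name first_with_text is hypothetical. It only assumes that each element has a get_text method, as Selenium::Remote::WebElement does.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical helper: return the first element whose text matches
# $pattern, or undef when no element matches.
# $elements is an array reference of objects with a get_text method.
sub first_with_text {
    my ($elements, $pattern) = @_;
    return first { $_->get_text =~ $pattern } @$elements;
}

# Usage with a real driver (not run here):
#   my $buttons = $driver->find_elements('button');
#   my $search  = first_with_text($buttons, qr/^\s*Search\s*$/);
#   $search->click if $search;
```

The result of find_elements is assigned to a scalar first so that the helper receives an array reference.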
Get Text
You can get the text of an element using the get_text method of Selenium::Remote::WebElement.
# Get text
my $text = $element->get_text;
After you click a button, you may need to wait a few seconds using the sleep function until the next page is shown.
sleep 3;
Execute JavaScript
You can execute JavaScript using the execute_script method. The first argument is the JavaScript you want to execute. The second argument is an element. This element is passed to the JavaScript as the first element of the arguments variable, arguments[0].
$driver->execute_script('arguments[0].scrollIntoView({block: "center"});', $button);
If you can't find the operation you want in the Selenium::Remote::Driver documentation, you can execute any JavaScript using the execute_script method.
Select Fields
I explain how to select a value of a select field.
1. Display the select field
2. Click the select field
3. Input down keys
4. Input the enter key
use Selenium::Remote::WDKeys;

# Display the select field
$driver->execute_script('arguments[0].scrollIntoView({block: "center"});', $select);

# Click the select field
$select->click;

# Input down keys ($key_down_count is the number of times to press the down key)
$select->send_keys(KEYS->{'down_arrow'} x $key_down_count);

# Input the enter key
$select->send_keys(KEYS->{'enter'});
"x" is a Perl Repeat Operator.
Get Page Source
You can get the page source as parsed by the web browser using the get_page_source method. This is not the raw page source.
# Get the page source
my $page_source = $driver->get_page_source;
Get body Text
You can get the text in the body tag using the get_body method. The return value contains only text, not HTML tags.
# Get the body text
my $body_text = $driver->get_body;
Table Manipulation
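I introduce an example of table manipulation that iterates over each row of a table.
# Table manipulation
# Browsers usually insert a tbody element between table and tr,
# so a descendant selector is used instead of "table.mytable > tr".
my $rows;
eval { $rows = $driver->find_elements('table.mytable tr') };
if ($rows) {
  for my $row (@$rows) {
    my $row_text = $row->get_text;
    # Print the first 12-digit number in the row
    if ($row_text =~ /\b(\d{12})\b/a) {
      my $id = $1;
      print "$id\n";
    }
  }
}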
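The regular expression can be tried without a browser. The following is a minimal sketch; the helper name extract_ids and the sample row texts are made up for illustration and stand in for the results of get_text.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first standalone 12-digit number of each
# row text that has one. The /a modifier (Perl 5.14 or later) restricts \d
# to the ASCII digits 0-9.
sub extract_ids {
    my @row_texts = @_;
    my @ids;
    for my $row_text (@row_texts) {
        push @ids, $1 if $row_text =~ /\b(\d{12})\b/a;
    }
    return @ids;
}

# Made-up sample rows
my @rows = (
    'Order 123456789012 shipped',
    'Header row without an ID',
    'Too long 1234567890123',
);
print "$_\n" for extract_ids(@rows);
```

Only the first row produces an ID. The 13-digit number in the last row is rejected because \b requires a word boundary on both sides of the 12 digits.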
Download Files
I explain how to download files using Selenium::Chrome.
Set the Download Directory
You can set the download directory using the download.default_directory preference in prefs.
my $driver = Selenium::Chrome->new(
  extra_capabilities => {
    'goog:chromeOptions' => {
      args => ['headless', 'disable-gpu', 'window-size=1920,1080', 'no-sandbox'],
      prefs => {
        'download.default_directory' => '/tmp',
      },
    },
  },
);
Note that you can't specify home directories as download directories for security reasons.
Download Files
Selenium doesn't have a special feature for downloading files. Normally, you can download a file by clicking a download button using the click method. You need to wait for the download to finish, for example using the sleep function.
# Download files
$download_button->click;
sleep 3;
Examples:
# Get the download button
my $buttons = $driver->find_elements('button');
my $download_button;
for my $button (@$buttons) {
  my $button_text = $button->get_text;
  if ($button_text =~ /^\s*Download\s*$/) {
    # Keep the matching button in the outer variable
    $download_button = $button;
    last;
  }
}

# Download the file
if ($download_button) {
  $download_button->click;
  sleep 3;
}
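Instead of a fixed sleep, you can poll the download directory until the file appears. The following is a minimal sketch in plain Perl. The name wait_for_download is hypothetical, and the sketch assumes that you know the file name in advance and that Chrome marks unfinished downloads with the .crdownload extension.

```perl
use strict;
use warnings;
use File::Spec;
use Time::HiRes qw(sleep time);

# Hypothetical helper: wait until $filename exists in $dir and no
# unfinished download (*.crdownload) remains in $dir.
# Returns the full path, or undef if $timeout seconds pass first.
sub wait_for_download {
    my ($dir, $filename, $timeout) = @_;
    $timeout //= 30;

    my $path     = File::Spec->catfile($dir, $filename);
    my $deadline = time() + $timeout;
    while (1) {
        opendir(my $dh, $dir) or die "Cannot open $dir: $!";
        my @partial = grep { /\.crdownload\z/ } readdir $dh;
        closedir $dh;

        return $path if -e $path && !@partial;
        last if time() >= $deadline;
        sleep(0.2);
    }
    return;
}

# Usage after clicking the download button (not run here):
#   my $file = wait_for_download('/tmp', 'report.csv', 30);
#   print "Downloaded: $file\n" if defined $file;
```

The file name report.csv in the usage comment is only a placeholder.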