Some tools that are useful when creating a bot, spider, or scraper.


Some of the most used Capybara methods (link or cheat sheet).

Session methods (link): you can set an expectation for current_path or current_url. In a feature test, page is actually an instance of the Capybara::Session class, so it is better to use the name session.

  • visit "/", visit new_project_path. Remember that if you want to go to a different domain than Capybara.app_host, then you need to use the full url (with protocol), so use visit 'http://google.con' rather than a url without the protocol. Any relative url will use Capybara.app_host; just note that it also needs the protocol, Capybara.app_host = 'http://...', otherwise you get the error undefined method + for nil:NilClass
  • within "#login-form" do
  • generate capybara POST request using submit (this does not work for js: true, so please use js: false)

      session = Capybara.current_session.driver
      session.submit :post, customer_sign_up_path, mac: mac
  • scroll to the bottom of the page. Since elements need to be visible when js: true, you can use page.execute_script "window.scrollBy(0,10000)", or make an anchor and use a hash url url/#my-form, since execute_script is not available without js: true. For pagination I prefer to use a helper
    def scroll_down
      var next = $('a.next_page[rel="next"]');
      var maxScrolls = 15;
      var myInterval = setInterval(function(){
        next = $('a.next_page[rel="next"]');
        maxScrolls -= 1;
        if (next.length && maxScrolls > 0) {
        } else {
      }, 200);
  • back button page.driver.go_back

Node actions are on the session object. The argument is a locator that targets the element by its id (without #), name, label text, alt text or inner text (more). Note that the locator is case sensitive. You can NOT use css or xpath (those are only for finders). You can use a substring, or you can require an exact match with exact: true

  • click_on "Submit" (both buttons and links) click_button "Sign in", click_link "Menu". click_on is alias of click_link_or_button which is similar to click_link but for union of :link and :button selectors.
  • fill_in "email", with: '[email protected]' locator is input name, id, test_id, placeholder, label text. Note that it is case sensitive. alternative is find set find("input[name='cc']").set '[email protected]', or using javascript page.execute_script "$('#my-id').val('[email protected]')"
  • check 'my checkbox' (or by id but without # check 'my_check_id'), choose 'my radio button', select 'My Option or Value', from: 'My Select Box', also uncheck uncheck 'my checkbox', unselect
  • page.attach_file 'doc[file]', "#{Rails.root}/test/fixtures/files/computer_text.png", make_visible: true to upload image to file field
  • If you need to fill_in an iframe, then you can access it by id or number using @response object
    within_frame 0 do
  • If you need to switch window, i.e. jump into a new tab opened by target _blank, then you can

        old_window = page.driver.browser.window_handles.last
        new_window = window_opened_by { click_link 'Something' }

        page.within_window new_window do
          # code
        end

        page.switch_to_window new_window

        page.driver.browser.close # this will close tab, not whole window
        page.switch_to_window old_window

* Confirm alert dialog box in
  * selenium `page.driver.browser.switch_to.alert.accept  # can also be .dismiss`
  * webkit `page.accept_confirm { click_link "x" }`, so the action is wrapped in a
`page.accept_confirm` block
* When you find some element you get `Capybara::Node::Element`, but you can
  create new node without going to selenium, using html text

    node = Capybara.string <<-HTML

    HTML
    node.class # => Capybara::Node::Simple
    node.find('#projects').text # => 'Projects'

is actually

    node.find(:css, '#projects').text # => 'Projects'
    node.find(:link_or_button, '.logo') # Capybara::ElementNotFound (Unable to find link or button ".logo")

**Node finders**
use a selector with params `:kind` (optional, defaults to `:css`), `locator` and
filters (some elements have more filters, for example `input` has `type`).
Note that `find(:link, 'Home')` will match two elements in the above example, while
`find('a', text: 'Home')` will match only one, since the first string parameter
`locator`, besides the link text, also matches id, name, value, title, alt, label
(and the `test_id` attribute). So the first symbol parameter is the type (by default `:css`,
in which case the locator is a CSS selector). If it is not `:css` or `:xpath`, the locator matches name, label..
I prefer to enable aria labels and use them as well:
`Capybara.enable_aria_label = true` and `click_on 'my-aria-label'`

* `find 'th', text: 'Total Customers'`
* `find('[ng-model="newExpense.amount"]').set('123')`
* `find_all('input').first.set(123)` but I think it is better to use
  `find('input', match: :first)` since it does not need to find all. Similarly,
  if there are many buttons, you can use `click_on 'Edit', match: :first`
* `find('[data-test="id"]', visible: false)` to find invisible element
* `find('#selector').find(:xpath, '..')` find parent node of selector
  `.find(:xpath, '../..')` is parent of parent (grandparent).
* label that is a child of the next adjacent sibling:
  `<h3>Name2</h3><div><label>enabled</label><label>disabled</label></div>` is
  `find(:xpath, "//h3[contains(text(),'Name2')]/following-sibling::div/label[contains(text(),'enabled')]")`
* to click on select2 I use `find('#original-select-id+span').click` so we find
  the first next sibling of the original select, which was disabled and replaced by
  select2 spans. `find('li', text: select.label).click` also works, but I can not
  find that `li` in the dom.
* Node element: `find('input').trigger('focus')` (does not work in selenium)
**Node matchers** and rspec matchers

* `expect(page.has_css?('.asd')).to be true`
* `expect(page).to have_css(".title", text: "my title")`, `have_text /hi|bye/`,
`have_content`, `have_link`, `have_button`, `have_selector("#project_#{} .name",
text: 'duke')`.
`have_no_selector` is the opposite. `expect(page).not_to have_text`
is not the same as `expect(page).to have_no_text`, since in the latter case it will wait until
the expectation is fulfilled.
To use the wait mechanism for other assertions you can try with
# test/application_system_test_case.rb

  # In tests:
  # wait_until(time: 2.5) do
  #    page.page_path == current_path
  #  end
  def wait_until(time: Capybara.default_max_wait_time)
    Timeout.timeout(time) do
      # retry the block until it returns a truthy value or the timeout fires
      until value = yield
        sleep(0.1)
      end
      value
    end
  end

With all of them you can use `text: '...'` and `count: 2`, which is the number of occurrences.
Instead of `page.body.include? text` (this compares against the html, tags included) use
`page.has_text? text` (this compares only text).
For testing if something is on the page, for example 3 rows, you can use
`page.has_selector? '[name="customer_ids[]"]', count: 3`
* If an element is not visible, you can provide `visible: false` (it does not work
with `have_content "d", visible: false` but works with `have_css 'div', text:
'd', visible: false`). Note that visibility is checked only if `js: true` (and the check passes
if `js: false`), so it is better to always use `visible: false` for some elements
in popups.
* test sort order with a regex: `expect(page).to have_text
/first.*second.*third/`. Create three objects with the first in the middle. If you
have new lines, then you can match multiline `/first.*second.*third/m` or
remove them with `page.body.gsub "\n", ''`

* test if input has value:
  * `expect(page).to have_xpath("//input[@value='John']")`
  * `expect(page).to have_selector("input[value='John']")`. To match disabled
    option you can use `expect(page).to have_selector(:option, 'Name of o',
    disabled: true)`
  * `expect(page).to have_field('Your name', with: 'John')` this does not match
  disabled input field. If you want to match disabled use `have_field('Your
  name', disabled: true, with: 'John')`
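
The sort-order check in the list above depends on a detail of Ruby regexes: `.` does not match a newline unless the `m` flag is set. A minimal plain-Ruby sketch of that behaviour, outside Capybara (the sample strings and the `in_order?` helper are made up for illustration):

```ruby
# Patterns as used in the sort-order expectation.
ORDERED = /first.*second.*third/             # `.` stops at newlines
ORDERED_MULTILINE = /first.*second.*third/m  # `m` lets `.` match "\n" too

# Hypothetical helper: true when the text contains the items in that order.
def in_order?(text, pattern)
  pattern.match?(text)
end

text = "first\nsecond\nthird" # stand-in for page text with one record per line

puts in_order?(text, ORDERED)                             # false
puts in_order?(text, ORDERED_MULTILINE)                   # true
puts in_order?(text.gsub("\n", ''), ORDERED)              # true
puts in_order?("third\nfirst\nsecond", ORDERED_MULTILINE) # false
```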

## Debug

Debug capybara
* `save_and_open_page` to visually inspect the page. It works when `js: false`
and uses the `launchy` gem. It does not load images with a relative path (images on
your server).
* with `Capybara::Screenshot`, screenshots and
html are saved in `tmp/capybara`. If you use `chrome` or another driver that is
not `selenium`, then register it with `Capybara::Screenshot.register_driver`.

Also, if you use system tests you will see the screenshot in the terminal. To disable it you
can `export RAILS_SYSTEM_TESTING_SCREENSHOT=simple` or set ENV

    ENV['RAILS_SYSTEM_TESTING_SCREENSHOT'] = 'simple'
    Capybara::Screenshot.register_driver(:chrome) do |driver, path|
      driver.browser.save_screenshot(path)
    end
    Capybara::Screenshot.register_driver(:headless_chrome) do |driver, path|
      driver.browser.save_screenshot(path)
    end

    # after Saver#save_html
    Capybara::Screenshot.after_save_html do |path|
      $stderr.write('Press ENTER to continue') && $stdin.gets
    end

    # after Saver#save_screenshot
    Capybara::Screenshot.after_save_screenshot do |path|
      path
    end

You can create gifs from a test in two steps. First capture all pages that you are
interested in with a manual screenshot `screenshot_and_save_page`, review them and
rename the last one to `final.png`, and then create an animated gif
with

    convert -delay 50 -loop 0 tmp/capybara/m/screenshot_2018-*.png -delay 400 tmp/capybara/m/final.png animated.gif

## Waiting ajax

capybara is smart enough to wait if some ajax is called and the text is not found yet.
It will retry (default_max_wait_time=2) until it succeeds or the time runs out and a failure is raised. Note
that `!page.has_xpath?('a')` is not the same as `page.has_no_xpath?('a')` in an
example where you are removing `a` in ajax. The first will fail since it finds `a` and then
negates the result (it does not wait when capybara succeeds). The second will wait until it
is removed. So use expectations which are not going to be met until after the ajax completes.
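
The difference between negating a waiting check and using the dedicated "no" check can be tried without a browser. This is only a sketch under stated assumptions: `FakePage`, `has_link?` and `has_no_link?` are hypothetical stand-ins that imitate the retry behaviour, not Capybara's implementation.

```ruby
# Hypothetical stand-in for a page whose link is removed after a delay, as if by ajax.
class FakePage
  def initialize(removed_after:)
    @removed_at = now + removed_after
  end

  # Imitates has_xpath?: retries until the link is present or the wait runs out.
  def has_link?(wait: 1)
    poll(wait) { link_present? }
  end

  # Imitates has_no_xpath?: retries until the link is gone or the wait runs out.
  def has_no_link?(wait: 1)
    poll(wait) { !link_present? }
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def link_present?
    now < @removed_at
  end

  # Returns true as soon as the block is truthy, false once the wait is over.
  def poll(wait)
    deadline = now + wait
    loop do
      return true if yield
      return false if now >= deadline
      sleep 0.02
    end
  end
end

page = FakePage.new(removed_after: 0.2)
puts(!page.has_link?)   # false: the link is found at once, nothing waits for the removal
puts page.has_no_link?  # true: keeps retrying until the link is gone
```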

Wait for ajax to finish is not needed in latest capybara, but here is reference:


    module WaitHelper
      # You can use this flash and force driver to wait more time, especially on
      # destroy action when there is slow deleting data
      # app/views/users/destroy.js.erb
      # window.location.assign('<%= customer_path @customer %>');
      # = 1;
      #
      def wait_for_ajax
        printf ''
        start_time = Time.current
        Timeout.timeout(Capybara.default_max_wait_time) do
          loop until _finished_all_ajax_requests?
        end
        printf '%.2f', Time.current - start_time
      rescue Timeout::Error
        printf "timeout#{Capybara.default_max_wait_time}"
      end

      def _finished_all_ajax_requests?
        output = page.evaluate_script('')
        printf '.' unless
      end

      def wait_for_visible(target)
        Timeout.timeout(Capybara.default_max_wait_time) do
          loop until page.find(target).visible?
        end
      rescue Timeout::Error
        flunk "Expected #{target} to be visible."
      end
    end
    RSpec.configure do |config|
      config.include WaitHelper, type: :feature
    end

for minitest use

    class ActionDispatch::SystemTestCase
      include WaitHelper
    end

This is not needed any more, also
is removed from capybara.

I see only two problems. The first is an ajax loaded form in a modal where you click on the
button to submit it immediately, but the modal uses `fade` and is not visible yet.
The solution is to remove the `fade` class.

To remove the fade transition and transform in tests, so you do not need to sleep and
wait for the animation, put this in your layout
# app/views/layouts/application.html.erb
    <% if Rails.env.test? %>
      <style>
        .fade, .modal-dialog {
          background: blue;
          transform: none!important;
          transition: none!important;
        }
      </style>
    <% end %>

The second is when the response is a redirection `window.location.assign('/users/1')`
(usually just to reload a page). Capybara does not wait for this
`window.location` change when it is run with the `headless_chrome` driver. I tried
with `window.location.replace` and `$(location).attr('href',)`. The only solution is
to use an expectation that finds an element which is not yet on the page. Maybe `visit
customer_path customer` again, before the expectation, could help.

I noticed that when using `js: true` the session is preserved between examples, even
across multiple files. This happens on both chrome and headless_chrome. Since there is
randomization, it can be a tricky problem to reproduce. I tried to add

    config.before(:example) do
      Capybara.reset_sessions!
    end
    before do
      Capybara.reset_session!
      browser = Capybara.current_session.driver.browser
      if browser.respond_to?(:clear_cookies)
        # Rack::MockSession
        browser.clear_cookies
      elsif browser.respond_to?(:manage) and browser.manage.respond_to?(:delete_all_cookies)
        # Selenium::WebDriver
        browser.manage.delete_all_cookies
      else
        raise "Don't know how to clear cookies. Weird driver?"
      end
    end

but the problem remains: sometimes it fails, sometimes it passes.

## Download helpers

Inspect file that is downloaded like `respond_to do |format| format.csv { render
text: CSV.generate { |csv| csv << [1,2] } } end`



    # csv_content = DownloadHelpers.download_content
    # expect(csv_content.count("\n")).to eq 3
    # expect(csv_content).to include
    #
    module DownloadFeatureHelpers
      TIMEOUT = 10
      PATH = Rails.root.join('tmp/downloads')

      extend self

      def downloads
        Dir[PATH.join('*')]
      end

      def download
        downloads.first
      end

      # wait for the file and return its content
      def download_content
        wait_for_download
        File.read(download)
      end

      def wait_for_download
        Timeout.timeout(TIMEOUT) do
          sleep 0.1 until downloaded?
        end
      end

      def downloaded?
        !downloading? && downloads.any?
      end

      # chrome keeps a .crdownload file while the download is in progress
      def downloading?
        downloads.grep(/\.crdownload$/).any?
      end

      def clear_downloads
        FileUtils.rm_f(downloads)
      end
    end

    RSpec.configure do |config|
      config.include DownloadFeatureHelpers, type: :feature
    end
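
The waiting logic of the helper can be tried without Rails or a browser. The sketch below is a hypothetical standalone variant: `DownloadWatcher` and `simulate_download` are names invented here, the directory is passed in instead of coming from `Rails.root`, and the "download" is a thread that renames a `.crdownload` file after a short delay.

```ruby
require 'tmpdir'
require 'timeout'

# Hypothetical standalone variant of the download helper.
class DownloadWatcher
  def initialize(dir, timeout: 5)
    @dir = dir
    @timeout = timeout
  end

  def downloads
    Dir[File.join(@dir, '*')]
  end

  # A partial Chrome download still has the .crdownload suffix.
  def downloading?
    downloads.grep(/\.crdownload\z/).any?
  end

  def downloaded?
    !downloading? && downloads.any?
  end

  # Blocks until the download is complete, then returns the file content.
  def download_content
    Timeout.timeout(@timeout) { sleep 0.05 until downloaded? }
    File.read(downloads.first)
  end
end

# Imitates a browser: writes a partial file, renames it when "finished".
def simulate_download(dir, content)
  partial = File.join(dir, 'report.csv.crdownload')
  File.write(partial, content)
  Thread.new do
    sleep 0.2
    File.rename(partial, File.join(dir, 'report.csv'))
  end
end

Dir.mktmpdir do |dir|
  simulate_download(dir, "id,name\n1,duke\n")
  puts DownloadWatcher.new(dir).download_content
end
```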

Set headless browser

The old way is using a profile

    require 'selenium/webdriver'
    Capybara.register_driver :chrome do |app|
      profile = Selenium::WebDriver::Chrome::Profile.new
      profile['download.default_directory'] = DownloadFeatureHelpers::PATH.to_s
      Capybara::Selenium::Driver.new(app, browser: :chrome, profile: profile)
    end

Another way to set the profile is with prefs in desired_capabilities.

Currently it does not work if you use headless chrome since it is not
supported.

Probably works in headless firefox.

    Capybara.register_driver :headless_chrome do |app|
      desired_capabilities = Selenium::WebDriver::Remote::Capabilities.chrome(
        chromeOptions: { args: %w(headless disable-gpu window-size=1024,768) },
        prefs: {
          "download.default_directory": DownloadFeatureHelpers::PATH.to_s,
        }
      )
      Capybara::Selenium::Driver.new(
        app,
        browser: :chrome,
        desired_capabilities: desired_capabilities,
      )
    end

Usage in a test is to clear downloads first, since a failing expectation will break
and not remove the files.

    RSpec.describe 'Location Reports', js: true do
      it 'downloads' do
        DownloadFeatureHelpers.clear_downloads
        click_on 'Generate Report'
        csv_content = DownloadFeatureHelpers.download_content
        expect(csv_content.count("\n")).to eq 3
        expect(csv_content).to include
      end
    end

# Selenium

Selenium for ruby uses the gem `selenium-webdriver`.
You also need the executables: for firefox `geckodriver` (just download it
to `/usr/local/bin`)
and for chrome `chromedriver` (download it to
`/usr/local/bin`), or you can use `gem 'chromedriver-helper'` that will install
chromedriver to `.rvm/gems/ruby-2.3.3/bin/chromedriver`.

Make sure you have versions of firefox and chrome that match the drivers.

The same DSL is used to drive a browser (selenium-webdriver, chrome-driver or capybara-webkit)
or headless drivers (`:rack_test` or phantomjs). `Capybara.current_driver` could
be `:rack_test` (when no `js: true`) or `:headless_chrome` or `:chrome`.

## Errors

If you see the error `unable to obtain stable firefox connection in 60 seconds
(Selenium::WebDriver::Error::WebDriverError)` you need to `gem
update selenium-webdriver` or install a matching version.

Minitest is included in ruby and also in rails. If your system tests show the error
`Selenium::WebDriver::Error::WebDriverError: unable to connect to chromedriver` then add `gem 'chromedriver-helper'` to your development and
test group.

If you see `KeyError: key not found: 102` then upgrade chromedriver to 2.33 by
downloading it and `mv
chromedriver bin`, or if you are using `chromedriver-helper` run

    rm -rf ~/.chromedriver-helper/
    chromedriver-update

For `Selenium::WebDriver::Error::UnknownError: unknown error: call function
result missing 'value' (Session info: headless chrome=67.0.3396.48)` you need to
update chromedriver to 2.38

If you see the error `SocketError: getaddrinfo: Name or service not known` then make
sure you have defined localhost `127.0.0.1 localhost` in `/etc/hosts`

If you see the error `Selenium::WebDriver::Error::StaleElementReferenceError: stale
element reference: element is not attached to the page document` it could be
that the element was removed, the page was reloaded in javascript, or you use `within
#id` and then make an expectation inside the `within` block. Try to move the expectation
outside of the `within` block or to reload

    page.driver.browser.navigate.refresh
    page.evaluate_script 'window.location.reload()'

Chrome driver usually starts with `data:,` url and than redirects to for example

In the rails console you should see the browser starting with

    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless')
    driver = Selenium::WebDriver.for :chrome #, options: options

To run javascript in a browser use `js: true`. Note that in this mode drop down
links are not visible; you need to click on the dropdown first. Also `data-confirm` will
be ignored.
Note that in this mode you can't use the `rspec-rails` default
`config.use_transactional_fixtures`, since selenium can't know that a refresh or
navigation to another page should be in the same transaction, so you need to use
`database_cleaner` as in the configuration below.


    require 'selenium/webdriver'

    Capybara.register_driver :chrome do |app|
      # set download directory using Profile (can be set using :prefs in options)
      profile = Selenium::WebDriver::Chrome::Profile.new
      profile['download.default_directory'] = DownloadFeatureHelpers::PATH.to_s
      Capybara::Selenium::Driver.new(app, browser: :chrome, profile: profile)
    end

    Capybara.register_driver :headless_chrome do |app|
      # I prefer to use Options instead of Capabilities
      # capabilities = Selenium::WebDriver::Remote::Capabilities.chrome(
      #   chromeOptions: { args: %w(headless disable-gpu window-size=1024,768) }
      # )
      # Capybara::Selenium::Driver.new app, browser: :chrome, desired_capabilities: capabilities
      options = Selenium::WebDriver::Chrome::Options.new(
        args: %w[headless disable-gpu window-size=1024,768],
        # can not use prefs for headless driver since it is not supported
        # prefs: {
        #   "download.default_directory": DownloadFeatureHelpers::PATH.to_s,
        # }
      )
      Capybara::Selenium::Driver.new(app, browser: :chrome, options: options)
    end

    RSpec.configure do |config|
      files = config.instance_variable_get :@files_or_directories_to_run
      if files == ['spec']
        # when run all spec use headless
        Capybara.javascript_driver = :headless_chrome
      else
        Capybara.javascript_driver = :chrome
      end
    end
    Capybara.enable_aria_label = true

    # if you need to use a custom domain, you can set host, but also set server port
    Capybara.app_host = 'http://my-domain.loc:3333'
    Capybara.server_port = 3333
    # you can read host and port Capybara.current_session.server.port
    # normally chrome starts with url: data; and then redirects to app_host
    # app_host should end with .loc or so, so it points to localhost

For newer Firefox I needed to download
geckodriver and put it
somewhere like `/usr/local/bin/geckodriver`. Also Firefox
47.0.1 is suggested, but
my version 50 also works.

# Remote Selenium

The easiest way is to use docker

But when I use novnc it fails, but VNC Viewer works
Failed when connecting: Failed to connect to server ( (code: 1006))
It could be that REMOTE_HOST is not reachable from container. REMOTE_HOST can
be either `selenium` (docker container) or domain name or ip address of host on
which 5900 port is open (it should be reachable from this container). It does
not work for redirection or rDNS records like
(you can check with nmap

You can control a remote selenium server. Download
selenium-server-standalone.jar and run the
selenium server. For the error `Unsupported major.minor
version 52.0` you need to update java: 51 -> java7, 52 -> java8, 53 -> java9.

java -jar selenium-server-standalone.jar


java -jar selenium-server-4.1.2.jar standalone

on initial session `driver = Selenium::WebDriver.for :chrome` there should be a
[LocalDistributor.newSession] - Session request received by the distributor: 
 [Capabilities {}]
19:34:48.755 INFO [ProtocolHandshake.createSession] - Detected dialect: W3C
19:34:48.783 INFO [LocalDistributor.newSession] - Session created by the distributor. Id: 2802E1FC-BAF0-492B-A435-60091646AA71, Caps: Capabilities {acceptInsecureCerts: false, browserName: Safari, browserVersion: 15.3, platformName: macOS, safari:automaticInspection: false, safari:automaticProfiling: false, safari:diagnose: false, safari:platformBuildVersion: 21D62, safari:platformVersion: 12.2.1, safari:useSimulator: false, setWindowRect: true, strictFileInteractability: false, webkit:WebRTC: {DisableICECandidateFiltering: false, DisableInsecureMediaCapture: false}}

I can create :chrome, :safari and :firefox in `rails c`


    driver = Selenium::WebDriver.for :remote, capabilities: :chrome, url: 'http://localhost:4444/wd/hub'

same as

    driver = Selenium::WebDriver.for :chrome, url: ''
    '' #=> nil

    driver = Selenium::WebDriver.for :firefox, url: ''

headless chrome

    options = Selenium::WebDriver::Chrome::Options.new(
      args: %w[headless disable-gpu window-size=1024,768],
    )
    driver = Selenium::WebDriver.for :chrome, url: '', options: options

If there is a screen you can run both headless and not. If you run the server from
ssh (there is no screen) then you can run headless, or you can run the server using the X
virtual frame buffer

xvfb-run java -jar /usr/local/bin/selenium-server-standalone.jar

To save console logs you can use
TODO I still have not succeeded in seeing console logs
# test/a/capybara.rb
# This is used only when you want to save javascript console logs

# Usage in your system test
# class SignUpWizzardTest < ApplicationSystemTestCase
#   test 'sign up' do
#    Capybara.current_driver = :headless_chrome_with_logging
#    visit root_path
#    save_console_logs
#   end
# end

Capybara.register_driver :headless_chrome_with_logging do |app|
  caps = Selenium::WebDriver::Remote::Capabilities.chrome 'goog:loggingPrefs': {
    browser: 'ALL'
  }

  opts = Selenium::WebDriver::Chrome::Options.new
  chrome_args = %w[--headless --no-sandbox]
  chrome_args.each { |a| opts.add_argument a }
  Capybara::Selenium::Driver.new app, browser: :chrome, options: opts, desired_capabilities: caps
end

class ApplicationSystemTestCase < ActionDispatch::SystemTestCase
  def save_console_logs
    console_log = page.driver.browser.manage.logs.get(:browser).map(&:to_s).join("\n")
    File.write Rails.root.join('log', 'capybara_console.log'), console_log, mode: 'w'
  end
end

# Kimurai

require 'kimurai'

class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = [""]

  def parse(response, url:, data: {})
  end
end


Use console

kimurai console --url
# in plains ruby
url = ''
response = Nokogiri::HTML(open(url))

# use non headless, actually fire up a browser
HEADLESS=false kimurai console --engine selenium_chrome --url

To install chromedriver you can use `gem install webdrivers` and symlink
sudo ln -s /home/orlovic/.webdrivers/chromedriver /usr/local/bin/

# or on production with rvm, check that is it in Gemfile (not in TEST)
# /home/deploy/.rbenv/versions/2.6.3/lib/ruby/gems/2.6.0/gems/webdrivers-4.1.3/lib/webdrivers/chromedriver.rb is just ruby file
bundle exec rake webdrivers:chromedriver:update
sudo ln -s /home/deploy/.webdrivers/chromedriver /usr/local/bin/

Use data to pass data between pages

def parse(response, url:, data: {})
  response.xpath().each do |product_url|
    request_to :parse_product, url: product_url[:href],
      data: data.merge(product_name: product_url[:title])
  end
end

def parse_product(response, url:, data: {})
  puts data[:product_name]
end

Save screenshot using `page.driver.browser.save_screenshot 'my-shot.png'`
Save cookies using `browser.driver.save_cookies`
Refresh page `browser.refresh` but to update `response` you need to `response =

# CSV, input & output

Your script probably needs some output; CSV is good enough (it will wrap a field in
quotes if a comma `,` is detected)

    require 'csv'
    CSV.open('candidates.csv', 'w') do |csv|
      csv << [id, name]
    end

or without indent

    output = CSV.open('data/craiglist.csv', 'wb') # folder data must exist
    output << [id, name]
    output.close
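
The quoting behaviour can be checked with the standard `csv` library alone. A minimal sketch, where `write_candidates`, the file name and the rows are made up for illustration:

```ruby
require 'csv'
require 'tmpdir'

# Writes rows to path; CSV quotes any field that contains a comma.
def write_candidates(path, rows)
  CSV.open(path, 'w') do |csv|
    rows.each { |row| csv << row }
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'candidates.csv')
  write_candidates(path, [[1, 'Doe, John'], [2, 'Jane']])
  puts File.read(path)
  # 1,"Doe, John"
  # 2,Jane
  p CSV.read(path) # fields come back as strings, with the comma preserved
end
```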

For strings, you can use mustache

    require 'mustache'
    MESSAGE_TEXT = "Hi there,

    I noticed your great page. Please see my profile here {{profile_url}}

    Thanks!"
    element.send_keys Mustache.render(MESSAGE_TEXT, profile_url: profile_url)

Params can be passed as arguments (the first param is ARGV[0]) or hard coded

    OUTPUT_FILE = ARGV[0] || 'output.csv'
    TEST_MODE = true
    SIMULATE_REAL_USER_DELAY = true

    sleep rand(8..15) if SIMULATE_REAL_USER_DELAY

# Debug

run with: ruby myscript.rb

debug with byebug

or in irb:

driver=wait=id=nil # selenium

agent=page=nil # mechanize


put a rubocop ready break somewhere in your loops:

    loop do
      break if false != true # rubocop ready
    end


# Mechanize

If you have a simple site without ajax, then you can use
Mechanize. It uses
nokogiri and `automatically stores and sends cookies, follows redirects, and can
follow links and submit forms`. It lets you
find forms (or links), fill inputs and submit.

    require 'rubygems'
    require 'mechanize'

    agent = Mechanize.new
    page = agent.get('')
    page.link_with text: 'Next' # exact match
    page.search('#updates div a:first-child') # css match

For using plain selenium (not capybara) you need to implement waiting for ajax
results. I used three steps

    element = nil
    wait.until { element = driver.find_element(:name, 'UserName') }
    element.send_keys "asdasd"

In this way, it waits for the element to appear. In case it does not show up, the
exception `TimeOutError` is raised, so if you expect that, you need to wrap it inside a `begin
rescue end` block. The last line in the `until` block (i.e. the return value) is important,
since if it stays false `TimeOutError` is raised.

    require 'selenium-webdriver'

    USER_EMAIL = '[email protected]'
    USER_PASSWORD = 'asdasd'
    TEST_MODE = false
    SIMULATE_REAL_USER_DELAY = true

    if driver.nil? # driver is defined if we use irb and eval File.open('f').read
      driver = Selenium::WebDriver.for :firefox
      # driver.manage.timeouts.implicit_wait = 10 # do not use implicit wait since it can hang out
      wait = Selenium::WebDriver::Wait.new(timeout: 30) # seconds
      driver.navigate.to ''

      puts 'Signing in…'
      element = nil
      wait.until { element = driver.find_element(:name, 'UserName') }
      element.send_keys USER_EMAIL

      element = driver.find_element(:name, 'Password')
      element.send_keys USER_PASSWORD

      # puts 'Please fill in reCAPTCHA… and click on login'
      # gets
      element.submit
    end

    puts 'Finding profile_search…'
    begin
      wait.until { element = driver.find_element(:xpath, '//*[text()[contains(.,"Profile")]]') }
    rescue Selenium::WebDriver::Error::TimeOutError
      puts 'Missing Profile link…'
      retry or break or next
    end
    element.click unless TEST_MODE

    sleep rand(8..15) if SIMULATE_REAL_USER_DELAY

    begin
      phone_element = driver.find_element :xpath, '//li[contains(text(),"☎")]'
      phone = phone_element.text[1..-1].strip
      puts phone
    rescue Selenium::WebDriver::Error::NoSuchElementError
      # this error is raised when find_element is called outside of wait.until
      phone = nil
    end

    wait.until do
      begin
        driver.find_element(:xpath, '//*[@data-cid]').attribute('data-cid') != id
      rescue Selenium::WebDriver::Error::StaleElementReferenceError
        puts 'old element is no longer attached to the DOM'
        false
      end
    end

    begin
      driver.navigate.to link[:href]
    rescue Net::ReadTimeout
      puts "timeout for #{link[:href]}"
      next
    rescue Selenium::WebDriver::Error::UnhandledAlertError
      puts 'UnhandledAlertError probably some modal dialog on page'
      next
    end

    begin
    rescue Selenium::WebDriver::Error::ElementNotVisibleError
      puts 'apply button hidden'
      next
    end

    driver.switch_to.frame driver.find_elements(:tag_name, 'iframe').last

# XPath

XPath is nice since it can traverse back up the dom tree with `..` and can
select an element based on the existence of a child: `//p[a]` (all `p` with an `a` child).
There are 7 types of nodes: element, attribute, text, namespace, processing-instruction,
comment and document node. *Atomic values* are nodes with no children or parent.
Items are atomic values or nodes.
Each element and attribute has one parent. Element nodes may have zero, one or
more children. Sibling nodes are nodes with the same parent. Ancestors are a node's
parent, the parent's parent etc... Descendants are a node's children, the children's
children etc...
Selecting node by:
* `nodename` selects all nodes with the name nodename
* `/` selects from the root (if the path starts with `/` then it is an absolute path).
  `bookstore/book` selects all book elements that are children of bookstore
* `//` selects nodes from the current node that match the selection no matter
  where they are. `bookstore//book` selects all book elements that are
  descendants of bookstore
* `.` selects the current node
* `..` selects parent of the current node
* `@` selects attributes `//@lang` selects all attributes that are named lang

Predicates are used to find specific node `[1]` or that contains specific value.
Predicates are always in square brackets
* `/bookstore/book[1]` first book that is child of bookstore
* `/bookstore/book[last()]` last book that is child of bookstore
* `/bookstore/book[position()<3]` first two book elements that are children of
  bookstore
* `//title[@lang]` selects all title elements that have attribute named lang
* `//title[@lang='en']` selects all title elements that have attribute lang with
  value en
* `/bookstore/book[price>35]/title` selects title elements of book element
  which have price element with value (inner text) greater than 35
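
The predicates above can be tried from Ruby. This sketch uses REXML (bundled with Ruby) rather than Nokogiri so that it runs without extra gems; the sample XML and the `texts_for` helper are made up for illustration.

```ruby
require 'rexml/document'

# Made-up sample document, modelled on the bookstore paths used above.
BOOKSTORE_XML = <<~XML
  <bookstore>
    <book><title lang="en">Harry Potter</title><price>29.99</price></book>
    <book><title lang="en">Learning XML</title><price>39.95</price></book>
    <book><title>Untitled</title><price>10.00</price></book>
  </bookstore>
XML

# Returns the text of every element matched by the XPath expression.
def texts_for(xpath)
  doc = REXML::Document.new(BOOKSTORE_XML)
  REXML::XPath.match(doc, xpath).map(&:text)
end

p texts_for('/bookstore/book[1]/title')            # title of the first book
p texts_for('/bookstore/book[last()]/title')       # title of the last book
p texts_for('/bookstore/book[position()<3]/title') # titles of the first two books
p texts_for('//title[@lang]')                      # titles that have a lang attribute
p texts_for("//title[@lang='en']")                 # titles with lang="en"
p texts_for('/bookstore/book[price>35]/title')     # titles of books priced above 35
```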

If we use `/bookstore/book/price/text()` then it will select all text from
price nodes.

Wildcards can be used to select unknown nodes
* `*` matches any element node. `/bookstore/*` selects all child elements of
  bookstore
* `@*` matches any attribute node. `//title[@*]` select title elements which
  have at least one attribute
* `node()` matches any kind of node

Several path can be selected using pipe `|` (or operator) `//title | //price`
select all title and price elements. You can use also `=` equal, `+` addition,
`div` division operators...

Axes can be used to traverse (in addition to simply child `/`)
* `child::book`, `descendant::`, `descendant-or-self::`
* `ancestor::`, `ancestor-or-self::`, `parent::`
* `attribute::`
* `namespace::`
* `following::`, `following-sibling::` after current node. To select text after
  certain element you can `//foo/following-sibling::text()[1]`
* `preceding::`, `preceding-sibling::` before current node except ancestors,
  attribute and namespace nodes. for example select li before li with text my
* `self::` current node
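
A few of these axes in action. As a sketch only: it uses REXML (bundled with Ruby) instead of Nokogiri, and the markup, the `for` attributes and the `nodes_for` helper are made up, in the shape of the label example from the finders list.

```ruby
require 'rexml/document'

# Made-up markup: two headings, each followed by a div with two labels.
SETTINGS_HTML = <<~HTML
  <section>
    <h3>Name1</h3>
    <div><label for="n1_on">enabled</label><label for="n1_off">disabled</label></div>
    <h3>Name2</h3>
    <div><label for="n2_on">enabled</label><label for="n2_off">disabled</label></div>
  </section>
HTML

# Returns the nodes matched by the XPath expression.
def nodes_for(xpath)
  REXML::XPath.match(REXML::Document.new(SETTINGS_HTML), xpath)
end

# following-sibling:: the "enabled" label in the div after the Name2 heading
after_name2 = nodes_for(
  "//h3[contains(text(),'Name2')]/following-sibling::div/label[contains(text(),'enabled')]"
)
p after_name2.map { |label| label.attributes['for'] }

# parent (..): the element that contains a given label
p nodes_for("//label[@for='n2_on']/..").map(&:name)

# preceding-sibling:: the label that comes before the "disabled" one
p nodes_for("//label[@for='n2_off']/preceding-sibling::label").map(&:text)
```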

You can check in developer tools
with `$x('//*[po-my-button]')` in the console, or with a *CTRL+f* search in the elements panel.
In ruby you can parse some text with nokogiri (see the Nokogiri cheat sheet)

    doc = Nokogiri::HTML(html_page)
    node = doc.at('some_xpath')
    node = doc.at_css('h1')

Useful selectors:

* find by id `//*[@id='my_id']` (note that it needs quotes inside)
* by class `//a[contains(@class,'my_class')]`
* text `//*[contains(text(),'ABC')]` or `//*[text()='exact_match']`
* parent `../`
* some child of this `.//`
* to get text without child nodes, call `text()` in xpath
* get input by label
  `var input = element(by.xpath("//label[. = '" + labelName + "']/following-sibling::input"));`
* find all text with @, check if they look like an email and join them

selenium example

    email_text = driver.find_element(:xpath, '//*[@id="msg_container"]').find_elements(:xpath, ".//*[text()[contains(.,'@')]]").map { |e| e.text }.join(',')

    email_text = page.search('//text()').map(&:text).join ','

    r = Regexp.new(/\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/)
    emails = email_text.scan(r).uniq
    if emails.length > 1
      puts 'Found several emails in body ' + emails.join(',')
    end
    user_email = emails.first.strip if emails.first
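
The scan-and-deduplicate step can be tried on a plain string. A minimal sketch, assuming a simple email pattern; the `extract_emails` helper, the sample text and the addresses are made up.

```ruby
# Assumed pattern: local part, "@", domain, then a 2-4 letter top-level domain.
EMAIL_PATTERN = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/

# Returns every distinct email-looking substring, in order of first appearance.
def extract_emails(text)
  text.scan(EMAIL_PATTERN).uniq
end

body = 'Write to info@example.com or sales@example.org, again info@example.com.'
emails = extract_emails(body)
puts emails.inspect # ["info@example.com", "sales@example.org"]
puts "Found several emails in body #{emails.join(',')}" if emails.length > 1
```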

# Examples

## Search images and show target page in case of errors

    class ImageService
      attr_accessor :agent

      def initialize
        @agent = Mechanize.new
      end

      def get_links(name)
        page = agent.get ''
        form = page.form('f')
        form.q = name
        page = agent.submit(form)

        # results
        table = page.search(".//table[@class='images_table']").first
        results = []
        table.search('a').each do |a|
          results.append( {
            site_url: a.attributes["href"].to_s[7..-1], # /url?q=
          } )
        end
        results
      rescue
        page.body.to_s
      end
    end

    <% if @results.class == Array %>
      <% @results.each do |res| %>
        <%= res[:image_url] %>
      <% end %>
    <% else %>
      <iframe id="FileFrame" src="about:blank"></iframe>
      <script type="text/javascript">
        var doc = document.getElementById('FileFrame').contentWindow.document;
        doc.open();
        doc.write('<%=raw @results %>');
      </script>
    <% end %>
  • some sites provide nice rss feed, for example elance

Headless chrome on heroku

To run chrome on heroku you need chrome and chromedriver. After adding the buildpacks you need to initialize Selenium with the correct path to google chrome

options = Selenium::WebDriver::Chrome::Options.new
if chrome_bin = ENV.fetch('GOOGLE_CHROME_SHIM', nil)
  options.binary = chrome_bin
end
driver = Selenium::WebDriver.for :chrome, options: options

One alternative solution, which does not rely on selenium


  • Find selectors using javascript and the chrome extension selector gadget (click first on an element so it becomes yellow, and then click on all the yellow ones you want to remove so they are marked red)
  • browser extension nice documentation with images
  • browser extension and api with example integrations
  • without coding, graphical algorithm