@eduardopoleo
Created July 24, 2016 17:29
# lib/tasks/scrape.rake
require 'mechanize'

# Returns the stripped text of the node matched by xpath within row.
def cleaned_text(row, xpath)
  row.at(xpath).text.strip
end

# Converts a dollar amount such as "$123,456.78" into a Float.
def money_value(row, xpath)
  cleaned_text(row, xpath).delete('$,').to_f
end

# The task name is an assumption taken from the filename. Depending on
# :environment assumes a Rails app, so that the Staff model is loaded
# before the scrape runs.
desc 'Scrape the 2014 university salary disclosure into Staff records'
task scrape: :environment do
  # Mechanize setup
  agent = Mechanize.new

  # The starting point, which corresponds to page 1
  page = agent.get('http://www.fin.gov.on.ca/en/publications/salarydisclosure/pssd/orgs-tbs.php?year=2014&organization=universities&page=1')

  # Extract all the pagination links over which we are going to iterate
  page_links = page.search('//thead/tr/td[2]/a')

  page_links.each do |link|
    # Follow the pagination link from page 1 and keep the page it returns
    results_page = page.link_with(text: link.text).click
    puts "------>Scraping results for page #{link.text}<-----------"

    # To debug against a single row: results_page.at('//tbody/tr[1]')
    results_page.search('//tbody/tr').each do |row|
      Staff.create(
        university: cleaned_text(row, 'td[1]/span'),
        last_name: cleaned_text(row, 'td[2]'),
        name: cleaned_text(row, 'td[3]'),
        title: cleaned_text(row, 'td[4]/span'),
        salary: money_value(row, 'td[5]'),
        taxable_benefits: money_value(row, 'td[6]')
      )
    end
  end
end
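
A note on the money parsing above: `String#to_f` returns `0.0` for text it cannot parse, so a blank or unexpected cell would be stored as a zero salary without any warning. The sketch below shows one possible stricter variant. `parse_money` is a hypothetical helper, not part of the original task, and it assumes amounts are formatted like `$123,456.78`.

```ruby
# Hypothetical helper (not in the original gist): parses a dollar string
# into a Float, returning nil instead of 0.0 when the text is not a number.
def parse_money(text)
  cleaned = text.to_s.strip.delete('$,') # drop "$" and thousands separators
  Float(cleaned)                         # raises ArgumentError on bad input
rescue ArgumentError
  nil
end

parse_money('$123,456.78') # => 123456.78
parse_money('n/a')         # => nil
```

Returning `nil` lets the caller decide whether to skip the row or log it, rather than saving a misleading zero.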