Web Scraper with Ruby on Rails

Posted by andi on Wed, 24 Feb 2010 13:55:00 GMT

Well, two good starting points are Mechanize and Hpricot.

Here’s an example of how to use them. It does only one very basic thing: it scrapes the links out of the results of a search for ‘hpricot’ on the search engine ‘forestle’.

Let’s have a look:

# Released under Creative Commons Attribution-Noncommercial-Share Alike 3.0
require 'rubygems'
require 'hpricot'
require 'mechanize'

agent = Mechanize.new

# Fetch the result page for a search on 'hpricot'
page = agent.get('http://de.forestle.org/search.php?q=hpricot')

# Parse the raw HTML with Hpricot
doc = Hpricot(page.body)

# This XPath is tied to the page layout and breaks if the markup changes
(doc/"/html/body/div[2]/div[2]/table/tr/td/div/div/a").each do |result|
  puts result.attributes['href']
end
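
Depending on how the site builds its result links, the hrefs printed above may be relative paths rather than full URLs. Here is a minimal sketch of resolving them against the page URL, using only Ruby's standard `uri` library. The helper name `absolute_url` and the sample hrefs are made up for illustration and are not taken from a real result page.

```ruby
require 'uri'

# Resolve a scraped href against the URL of the page it came from.
# Returns nil for hrefs that cannot be parsed as a URI.
def absolute_url(base, href)
  URI.join(base, href).to_s
rescue URI::InvalidURIError
  nil
end

base = 'http://de.forestle.org/search.php?q=hpricot'

# Hypothetical hrefs, for illustration only.
hrefs = ['/about.php', 'result.php?id=1', 'http://example.com/hpricot']

hrefs.each do |href|
  puts absolute_url(base, href)
end
```

Absolute hrefs pass through unchanged, so the helper can be applied to every link without checking first.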

Another example spider I wrote is this one, which scrapes all of my contacts as vCards from my XING account.

require 'rubygems'
require 'mechanize'

agent = Mechanize.new

page = agent.get 'https://www.xing.com/de/'

form = page.forms.last

form.login_user_name = 'andi@...'
form.login_password = 'Password'

page = agent.submit form

# Follow the "My Contacts" link
page = page.link_with(:text => /My Contacts/).click

# Collect the vCard links on the first page
vcards = page.links_with(:href => /vcard/)

# Follow the "Next" link until there are no more pages
while (next_link = page.link_with(:text => /Next/))
  page = next_link.click
  vcards += page.links_with(:href => /vcard/)
end

puts vcards.size

# Download each vCard and save it under cards/, stripping quotes from the filename
vcards.each do |vcard|
  card = vcard.click
  card.save("cards/#{card.filename.gsub('"','')}")
end
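
The save step above only strips double quotes from the filename. If the filename is treated as untrusted input from the server, a slightly more defensive variant could look like the sketch below. The helper `safe_card_path`, the fallback name `card.vcf`, and the sample filename are assumptions for illustration; the demo writes into a temporary directory instead of talking to XING.

```ruby
require 'fileutils'
require 'tmpdir'

# Build a path inside dir for a server-supplied filename.
# Drops any directory part and replaces every character that is not
# a letter, digit, underscore, dot or hyphen with an underscore.
def safe_card_path(dir, filename)
  name = File.basename(filename.to_s).gsub(/[^\w.\-]/, '_')
  # Fall back to a fixed name for empty names and dot-files
  name = 'card.vcf' if name.empty? || name.start_with?('.')
  File.join(dir, name)
end

# Demo in a throwaway directory so nothing is left behind.
Dir.mktmpdir do |tmp|
  dir = File.join(tmp, 'cards')
  FileUtils.mkdir_p(dir) # make sure the target directory exists
  path = safe_card_path(dir, '"Jane Doe".vcf')
  File.write(path, "BEGIN:VCARD\nEND:VCARD\n")
  puts File.basename(path)
end
```

In the spider, the same helper could wrap the filename before saving, e.g. `card.save(safe_card_path('cards', card.filename))`.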

Any questions left? Ask me or your local scraping guru!

Good Luck!
