Web Scraper with Ruby on Rails
Well, two good libraries to start with are Mechanize and Hpricot.
Here’s an example of how to use them. It covers only the very basics: it scrapes the links out of a search for ‘hpricot’ on the search engine ‘forestle’.
Let’s have a look:
# Released under Creative Commons Attribution-Noncommercial-Share Alike 3.0
require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'mechanize'
agent = Mechanize.new
@page = agent.get('http://de.forestle.org/search.php?q=hpricot')
doc = Hpricot(@page.body)
(doc/"/html/body/div[2]/div[2]/table/tr/td/div/div/a").each do |result|
  puts result.attributes['href']
end

Another example spider I wrote is this one, which scrapes all of my contacts as vCards from my XING account.
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get 'https://www.xing.com/de/'
form = page.forms.last
form.login_user_name = 'andi@...'
form.login_password = 'Password'
page = agent.submit form
# Click the "My Contacts" link; link_with returns the first link whose text matches
page = page.link_with(:text => /My Contacts/).click
vcards = []
vcards += page.links.select {|link| link.uri.to_s.match(/vcard/)}
# Go through the pagination: follow "Next" until no such link is left
while next_link = page.links.select {|link| link.to_s.match(/Next/)}.first
  page = next_link.click
  vcards += page.links.select {|link| link.uri.to_s.match(/vcard/)}
end
puts vcards.size
vcards.each do |vcard|
  card = vcard.click
  card.save("cards/#{card.filename.gsub('"','')}")
end

Any questions left? Ask me, or your local scraping guru!
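In case the pagination loop above is the part that raises questions, here is a small offline sketch of the same pattern that runs without a network or an account. FakeLink and FakePage are hypothetical stand-ins I made up for Mechanize's link and page objects (they are not part of Mechanize), and the names and hrefs are invented sample data. Treat it as an illustration of the loop, not as the way Mechanize works internally.

```ruby
# Offline sketch of the "follow Next until it disappears" loop.
# FakeLink and FakePage are hypothetical stand-ins, not Mechanize classes.
FakeLink = Struct.new(:text, :href, :target) do
  # Like a Mechanize link, the string form is the link text.
  def to_s
    text
  end

  # "Clicking" simply returns the page this link points to.
  def click
    target
  end
end

FakePage = Struct.new(:links)

# Collect every link whose href mentions "vcard", across all pages.
def collect_vcards(page)
  vcards = []
  loop do
    vcards += page.links.select { |link| link.href =~ /vcard/ }
    next_link = page.links.find { |link| link.to_s =~ /Next/ }
    break unless next_link # no "Next" link means this was the last page
    page = next_link.click
  end
  vcards
end

# Build three fake pages, last one first, so each "Next" link has a target.
page3 = FakePage.new([FakeLink.new('Carol', '/vcard/3', nil)])
page2 = FakePage.new([FakeLink.new('Bob', '/vcard/2', nil),
                      FakeLink.new('Next', '/contacts?page=3', page3)])
page1 = FakePage.new([FakeLink.new('Alice', '/vcard/1', nil),
                      FakeLink.new('Next', '/contacts?page=2', page2)])

VCARD_HREFS = collect_vcards(page1).map(&:href)
puts VCARD_HREFS.inspect # prints ["/vcard/1", "/vcard/2", "/vcard/3"]
```

The loop stops on the first page that has no "Next" link, which is the same condition that ends the `while` loop in the XING example.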
Good Luck!