ScraperWiki - when a PDF or Website has to be indexed

Posted by custo Thu, 10 Feb 2011 18:34:00 GMT

Here is another link treat from me for those of us who like to keep things convenient:

ScraperWiki is meant to make it easy to structure information and data that come in heterogeneous formats (HTML, PDF, XML, JSON, etc.) and to share them collaboratively. Have fun!

Web Scraper with Ruby on Rails

Posted by andi Wed, 24 Feb 2010 13:55:00 GMT

Well, some good places to start are Mechanize and Hpricot.

Here’s an example of how to use them. It does only one very basic thing: it scrapes the links out of the results of a search for ‘hpricot’ on the search engine ‘forestle’.

Let’s have a look:

# Released under Creative Commons Attribution-Noncommercial-Share Alike 3.0
require 'rubygems'
require 'hpricot'
require 'mechanize'

agent = Mechanize.new

# Fetch the result page for the query 'hpricot'
page = agent.get('http://de.forestle.org/search.php?q=hpricot')

# Parse the raw HTML with Hpricot
doc = Hpricot(page.body)

# Print the target of every result link; the path mirrors the page layout
(doc/"/html/body/div[2]/div[2]/table/tr/td/div/div/a").each do |result|
  puts result.attributes['href']
end

Another example spider I wrote is this one, which scrapes all of my contacts as vCards from my XING account.

require 'rubygems'
require 'mechanize'

agent = Mechanize.new

page = agent.get 'https://www.xing.com/de/'

form = page.forms.last

form.login_user_name = 'andi@...'
form.login_password = 'Password'

page = agent.submit form

# Click the "My Contacts" link
page = page.link_with(:text => /My Contacts/).click

vcards = []
vcards += page.links.select {|link| link.uri.to_s.match(/vcard/)}

# Follow the "Next" link until the last page is reached
while (next_link = page.link_with(:text => /Next/))
  page = next_link.click
  vcards += page.links.select {|link| link.uri.to_s.match(/vcard/)}
end

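# Report how many vCard links were collected
puts vcards.size

# Download each vCard and save it under cards/, stripping quotes from the filename
vcards.each do |vcard|
  card = vcard.click
  card.save("cards/#{card.filename.gsub('"','')}")
end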

Any questions left? Ask me, or your local scraping guru!

Good luck!

Rails Plugins

Posted by andi Wed, 24 Feb 2010 13:01:00 GMT

While reading an article about how the behavior of Rails plugins will change with Rails 3.0, I stumbled upon the new site ‘railsplugins.org’, which Engine Yard has created to keep track of which plugins are compatible with Rails 3 and Ruby 1.9, whether they run on JRuby, and whether they are thread safe.

There are many more requirements one could place on a Rails plugin, such as solid test coverage and proper reusability (e.g. a slightly different use case than the one the plugin was originally designed for should still be easy to solve with it). Still, it’s a good start and a nice list of plugins, currently counting 145.

After all, we should always have a more detailed overview of all those plugins, collect all the interesting meta-information, and at least remember which plugin we use in which application. A good start could be to extract this plugin list and store it in a database of its own, so that we can connect it with our own meta-information.
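
To make the idea a bit more concrete, here is a minimal sketch of how a scraped plugin list might be combined with our own meta-information. Everything in it is an assumption for illustration: the field names (rails3, ruby19, jruby, thread_safe, used_in), the plugin names and the sample values are made up and do not come from railsplugins.org, and a plain JSON file stands in for a real database.

```ruby
require 'json'

# Combine scraped compatibility flags with our own notes, keyed by plugin name.
# Both arguments are hashes of the form { plugin_name => { attribute => value } }.
def merge_plugin_info(scraped, notes)
  scraped.each_with_object({}) do |(name, flags), merged|
    merged[name] = flags.merge(notes.fetch(name, {}))
  end
end

# Persist the merged registry as JSON (a stand-in for a real database).
def save_registry(registry, path)
  File.write(path, JSON.pretty_generate(registry))
end

# Read a previously saved registry back into a hash.
def load_registry(path)
  JSON.parse(File.read(path))
end

# Made-up sample data, for illustration only.
scraped = {
  'example_plugin' => { 'rails3' => true, 'ruby19' => true,
                        'jruby' => false, 'thread_safe' => true },
  'other_plugin'   => { 'rails3' => false, 'ruby19' => true,
                        'jruby' => true, 'thread_safe' => false }
}
notes = {
  'example_plugin' => { 'used_in' => ['shop_app'] }
}

registry = merge_plugin_info(scraped, notes)

# List the plugins we actually use somewhere, with their Rails 3 status.
registry.each do |name, info|
  next unless info.key?('used_in')
  puts "#{name}: rails3=#{info['rails3']} used_in=#{info['used_in'].join(', ')}"
end
```

Keying both hashes by plugin name keeps the scraped list and our own notes independent, so the list can be scraped again later without losing the notes.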

Who needs help building a custom scraper to scrape out the plugin list? :)