Beginnings of a MySpace Music Scraper
A while back I had occasion to work with some folks who had a handful of PHP scripts for scraping some basic information off MySpace Music. Scraping data off other sites is a bit of a grey area, I suppose, but in this case it was being used to create links to MySpace band players or to pull copies of a band’s promo photograph. Hardly a controversial usage, yet MySpace, for whatever reason, didn’t have a useful API.
These were simple little one-off scripts meant to gather data on bands. I was never really a fan of them, but always too busy to replace them with anything pretty.
Then Hpricot came into my life and I finally had a reason to do it. Hpricot is a sweet HTML parser that uses a syntax similar to jQuery – really handy for treating a page like a resource. Here is a simple class I created to parse and pull some specific data from a MySpace band page:
require 'rubygems'
require 'hpricot'
require 'open-uri'

class MySpace
  attr_accessor :friend_id, :hmodel

  def initialize(friend_id)
    self.friend_id = friend_id
    self.load
  end

  def load
    @hmodel = Hpricot(open("http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=#{@friend_id}"))
  end

  def flashvars
    for flashvars in (@hmodel/"//param[@name = 'flashvars']")
      if flashvars.attributes["value"] && flashvars.attributes["value"].match("^uid")
        return flashvars.attributes["value"]
      end
    end
    return nil
  end

  def plid
    if flashvars = self.flashvars
      matches = flashvars.match('plid=(\d+)&')
      if matches && matches[1]
        return matches[1]
      end
    end
    return nil
  end

  def artid
    if flashvars = self.flashvars
      matches = flashvars.match('artid=(\d+)&')
      if matches && matches[1]
        return matches[1]
      end
    end
    return nil
  end

  def image
    if image_link = @hmodel.search("a#ctl00_cpMain_ctl01_UserBasicInformation1_hlDefaultImage")
      if image_field = image_link.at("img")
        if image = image_field.attributes["src"]
          return image unless (image == "http://x.myspacecdn.com/images/no_pic.gif")
        end
      end
    end
    return nil
  end
end
The two main things I was trying to get at were the plid and the artid, which you can use to create a nice MySpace Player popup. I added another method to pull the image, but that’s as far as I’ve taken it. At some point, it would be simple to add hometown, genre, band name and probably more.
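The plid and artid methods run nearly identical regular expressions against the flashvars string, so they could probably share one lookup. Here is a minimal sketch of that idea, assuming the value really is an ampersand-separated list of key=value pairs, as the regexes in the class imply. The flashvar helper and the sample string are made up for illustration and are not part of the class above. Unlike the original regexes, it does not insist that the value is all digits or that it is followed by another pair.

```ruby
require 'uri'

# Hypothetical helper (not in the original class): look up one key in a
# flashvars-style string such as "uid=1&plid=2&artid=3".
# Assumes the string is form-encoded key=value pairs joined by "&".
def flashvar(flashvars, key)
  return nil if flashvars.nil?
  # decode_www_form returns an array of [key, value] pairs;
  # assoc finds the first pair whose key matches.
  pair = URI.decode_www_form(flashvars).assoc(key)
  pair && pair[1]
end

sample = "uid=123&plid=456&artid=789&other=x" # made-up sample value
puts flashvar(sample, "plid")  # prints 456
puts flashvar(sample, "artid") # prints 789
```

With something like this in place, plid and artid could each shrink to a one-line call, and a missing key would still come back as nil.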
Usage looks something like this (from a Rails app, in the Band model):
def update_myspace_data
  # pass the scraper the band's myspace ID:
  scrape = MySpace.new(self.myspace_id) if self.myspace_id
  if scrape
    # only overwrite the stored values when the scrape found something
    self.myspace_plid = scrape.plid if scrape.plid
    self.myspace_artid = scrape.artid if scrape.artid
  end
end
This makes it a lot easier to update a band with a quick scrape of their page, and if (or probably when) MySpace changes their HTML, it should be fairly simple to locate and modify the corresponding methods in this class instead of hunting down really long regular expressions.