kirkbrown.com

Beginnings of a MySpace Music Scraper

A while back I had occasion to work with some folks who had a handful of PHP scripts for scraping some basic information off of MySpace Music. Scraping data off other sites is a bit of a grey area, I suppose, but in this case it was being used to create links to MySpace band players or to pull copies of a band’s promo photograph. Hardly a controversial usage, yet MySpace, for whatever reason, didn’t have a useful API.

These were simple little one-off scripts meant to gather data on bands. I was never really a fan of them, but always too busy to replace them with anything pretty.

Then hpricot came into my life and I finally had a reason to do it. Hpricot is a sweet HTML parser that uses a syntax similar to jQuery, which is really handy for treating a page like a resource. Here is a simple class I created to parse and pull some specific data from a MySpace band page:

require 'rubygems'
require 'hpricot'
require 'open-uri'

class MySpace
  attr_accessor :friend_id, :hmodel

  def initialize(friend_id)
    self.friend_id = friend_id
    self.load
  end

  def load
    @hmodel = Hpricot(open("http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=#{@friend_id}"))
  end

  # Returns the value of the first <param name="flashvars"> whose value
  # starts with "uid" (the one plid and artid read), or nil if none exists.
  def flashvars
    for param in (@hmodel/"//param[@name = 'flashvars']")
      value = param.attributes["value"]
      return value if value && value.match("^uid")
    end
    return nil
  end

  def plid
    if flashvars = self.flashvars
      matches = flashvars.match('plid=(\d+)&')
      if matches && matches[1]
        return matches[1]
      end
    end
    return nil
  end

  def artid
    if flashvars = self.flashvars
      matches = flashvars.match('artid=(\d+)&')
      if matches && matches[1]
        return matches[1]
      end
    end
    return nil
  end

  def image
    if image_link = @hmodel.search("a#ctl00_cpMain_ctl01_UserBasicInformation1_hlDefaultImage")
      if image_field = image_link.at("img")
        if image = image_field.attributes["src"]
          return image unless (image == "http://x.myspacecdn.com/images/no_pic.gif")
        end
      end
    end
    return nil
  end

end

The two main things I was trying to get at were the plid and the artid, which you can use to create a nice MySpace Player popup. I added another function to pull the image, but that’s as far as I’ve taken it. At some point, it would be simple to add hometown, genre, band name and probably more.
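
Since plid and artid are pulled out of the same flashvars string in the same way, the lookup could be folded into one small helper. What follows is only a sketch: the name flashvar_id and the sample string are made up for illustration and are not taken from a real MySpace page. Unlike the patterns in the class above, it also accepts a value that is the last field in the string (no trailing "&") and only matches a key that starts a field.

```ruby
# Sketch: pull a numeric field out of a flashvars-style string.
# flashvar_id is a hypothetical helper, not part of the class above.
def flashvar_id(flashvars, key)
  return nil unless flashvars
  # The key must start the string or follow "&"; the digits must run
  # up to the next "&" or the end of the string.
  match = flashvars.match(/(?:\A|&)#{Regexp.escape(key)}=(\d+)(?=&|\z)/)
  match && match[1]
end

# Made-up sample value, shaped like the strings the class looks for.
sample = "uid=123&plid=4567&artid=89&type=band"

flashvar_id(sample, "plid")   # => "4567"
flashvar_id(sample, "artid")  # => "89"
flashvar_id(sample, "id")     # => nil ("uid" and "plid" do not count as "id")
```

With something like this in place, the plid and artid methods could each shrink to a single call, e.g. flashvar_id(self.flashvars, "plid").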

Usage looks something like this (from a Rails app, in the Band model):

def update_myspace_data
  # pass the scraper the band's myspace ID:
  scrape = MySpace.new(self.myspace_id) if self.myspace_id
  if scrape
    self.myspace_plid  = scrape.plid if scrape.plid
    self.myspace_artid = scrape.artid if scrape.artid
  end
end

This makes it a lot easier to update a band with a quick scrape of their page, and if (or probably when) MySpace changes their HTML, it should be fairly simple to locate and modify the corresponding functions in this class instead of hunting down really long regular expressions.

Written by kirk

May 4th, 2009 at 4:26 pm

Posted in Ruby
