In my pet project Dealesque, I am trying to compare all offers on a number of Amazon items, the idea being that it can help decide which offers to use to minimize shipping and total cost. Using the Amazon Product Advertising API was the logical first step, but it doesn’t return all the offers for an item. It does, however, return the “more offers URL” for each item. Hence, some good old scraping was due, and none too soon!
A plain wget-like request would not suffice, since Amazon takes care to block unwanted traffic. So, the mechanize gem to the rescue! It lets you present yourself as a real browser by setting the user agent:
require 'mechanize'
agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }
After that, you can navigate the site, click links, read and submit forms, etc.
For scraping, what I actually ended up doing was fetching the “more offers URL” page and parsing its content with Nokogiri. Something like:
page = agent.get(more_offers_url)
root = Nokogiri::HTML(page.content.strip)
scrape_content(root)
For the current development stage, this is doing just fine. Unfortunately, it will not suffice for production use. Amazon will probably throttle the traffic, so some benchmarking will be needed to determine the limits. Proxying the requests will probably be required too. But I'll leave that for another time.
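One simple way to stay under a rate limit is to space the requests out on the client side. The sketch below is only an assumption about how that could be done, not something from Dealesque: the `Throttle` class, its `clock` and `sleeper` parameters, the 0.1 second interval and the placeholder URLs are all hypothetical, and the right interval would have to come out of the benchmarking mentioned above.

```ruby
# Hypothetical helper: makes sure at least min_interval seconds pass
# between the start of one call and the start of the next.
class Throttle
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock       # injectable so the timing can be tested without waiting
    @sleeper = sleeper
    @last_call = nil
  end

  # Waits if the previous call was too recent, then runs the block
  # and returns its result.
  def call
    if @last_call
      wait = @min_interval - (@clock.call - @last_call)
      @sleeper.call(wait) if wait > 0
    end
    @last_call = @clock.call
    yield
  end
end

# The interval is an arbitrary placeholder, not a measured Amazon limit.
throttle = Throttle.new(0.1)
urls = %w[offers-url-1 offers-url-2 offers-url-3] # placeholder values

urls.each do |url|
  throttle.call { puts "fetching #{url}" }
end
```

With Mechanize, the block would wrap the actual request, for example `throttle.call { agent.get(url) }`. For the proxying part, Mechanize has a `set_proxy(address, port)` method on the agent; the address and port would be those of whatever proxy you have available.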
The result of scraping the offers for picked items:
Source: http://shcatula.wordpress.com/2013/05/08/scraping-amazon-item-offers/