Monday 29 April 2013

(W) Amazon leaking data found by one Thoughtful Hacker in July 2012

The recent press by Rapid7 on Amazon S3 buckets leaking data was first shared with me by my partner Ian. In June/July of last year he was working on a side project and shared some disturbing findings with me.

I wanted to share his findings because Rapid7 missed something. Here are his comments to me.

"OK, here are some additional details that I didn't see touched upon. I've been keeping this one quiet within the community, but since Rapid7 broke it, might as well...

Back in June of last year I was working with an Amazon EC2 instance and something caught my eye. I made a mental note to come back and check it out. I did later and found a whopper.

In EC2 you can create additional S3 drives. When you go through that process, you can select a "public" image to use. Just by scrolling through the list, some of these looked like they shouldn't have been public. I later came back and started examining them manually. The first one I tried was pretty significant. Let's just say it was a company that has a fondness for a type of dog and the color red. 'Nuff said. There were tons of email addresses, SSH keys, and so forth all over this.

So I went to work writing a utility for scraping called Snoop.py so I could pull out and analyze more of this stuff and see what the common thread was. As you found, there is lots of stuff exposed on there that shouldn't be.

Now, here's where it gets really interesting.

I found that most, but NOT ALL, of the "public" drives were configured as "public". That is, there was a clear subset that were NOT marked as being public. And I found a really easy way of seeing this in the Amazon portal. Here's how it works. If you go to the S3 side of the house and go to the option where you can see all public images in a list, take a copy of that list. Now, go to where you would create a new drive and attach an image. Take a copy of that, and compare. Back in July through probably September, if you did this you would have a discrepancy -- you could attach to more drives that weren't yours than you could see on the public list. And, these were the most, let's say, juicy.

I let some folks know and made attempts to contact others. At some point, sometime around or after September, it seems those "extra" drives disappeared from view and things went back to normal.

Although clearly people still are leaking data. But I suspect, but have no hard proof, that there was something else wrong in the cloud."

Ian and I got sidetracked with malware work but planned to get back to it. Since Rapid7 released the info, it only seemed right to release this as well.
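What Ian calls "S3 drives" are, in modern AWS terms, EBS volumes created from snapshots, and his portal list-comparison trick can be approximated programmatically. The sketch below is a reconstruction, not his Snoop.py (which was never published): using boto3 (an assumption; his work predates it), it diffs the snapshots your account can restore against those explicitly marked public, surfacing the kind of "extra" shared drives he describes.

```python
# A rough, modern reconstruction of the list-comparison trick described
# above. Not Ian's Snoop.py; assumes boto3 and configured AWS credentials.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def snapshot_ids(**kwargs):
    """Collect snapshot IDs for a DescribeSnapshots query, following pagination."""
    ids = set()
    for page in ec2.get_paginator("describe_snapshots").paginate(**kwargs):
        ids.update(s["SnapshotId"] for s in page["Snapshots"])
    return ids

public = snapshot_ids(RestorableByUserIds=["all"])        # the "public" list
restorable = snapshot_ids(RestorableByUserIds=["self"])   # what we can attach
own = snapshot_ids(OwnerIds=["self"])                     # our own, excluded below

# Anything restorable by us that is neither public nor ours resembles the
# "extra" shared drives Ian describes seeing in the portal.
for snap_id in sorted(restorable - public - own):
    print("shared but not public:", snap_id)
```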

Source: http://hackerhurricane.blogspot.in/2013/03/w-amazon-leaking-data-found-by-one.html

Note:

Roze Tailer is an experienced web scraping consultant who writes articles on LinkedIn email scraping, LinkedIn profile scraping, Amazon data scraping, Yellow Pages data scraping, and product information scraping.

Amazon Price Scraping

Running a software company means that you have to be dynamic, creative, and most of all innovative. I strive every day to create unique and interesting new ways to do business online. Many of my clients sell their products on Amazon, Google Merchant Center, Shopping.com, PriceGrabber, NexTag, and other shopping sites.

Amazon is by far the most powerful, and so I focus much of my efforts on creating software specifically for their portal. I’ve created very lightweight programs that move data from CSV, XML, and other formats to Amazon AWS using the Amazon Inventory API. I’ve also created programs that push data from Magento directly to Amazon, and do this automatically, updating every few hours like clockwork. Some of my customers sell hundreds of thousands of products on Amazon due to this technology.
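As an illustration of the first kind of program, here is a minimal sketch of the CSV-to-flat-file conversion step, using only the Python standard library. The input column names and output fields are assumptions for illustration; Amazon's inventory flat-file templates define the real required columns, and the signed upload to the feeds API is omitted.

```python
# Minimal sketch: convert a distributor CSV into a tab-delimited Amazon
# inventory flat file. Column names are illustrative assumptions; the
# real required fields come from Amazon's flat-file templates.
import csv

FIELDS = ["sku", "price", "quantity", "product-id", "product-id-type"]

def csv_to_flat_file(src_path, dst_path):
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        for row in reader:
            writer.writerow({
                "sku": row["sku"],          # assumed input column names
                "price": row["price"],
                "quantity": row["qty"],
                "product-id": row["upc"],
                "product-id-type": "3",     # 3 denotes UPC in Amazon's templates
            })

csv_to_flat_file("distributor.csv", "amazon_inventory.txt")
```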

Doctrine ORM and Magento

I’m a strong believer in the power of Doctrine ORM in combination with Zend Framework, and I was an early adopter of this technology in production environments. More recently, I’ve been using Doctrine to generate models for Magento and then using those models to develop advanced information scraping systems for price matching my clients’ products against Amazon’s merchants. I prefer Doctrine because the documentation is excellent, the object model makes sense, and it is far easier to use outside of the Magento core.

What is price matching?
Price matching is when you take product data from your database and adjust it to just slightly below the lowest price available on Amazon, subject to certain rules. The challenge here is that most products from distributors don’t have an ASIN (Amazon Standard Identification Number) to check against. Here are the operations of my script to collect data about Amazon products (a sketch of this logic follows the list):

    Loops through all SKUs in catalog_product_entity
    For each SKU, gets the name, ASIN, group, new/used prices, URL, and manufacturer from Amazon
    If the name, manufacturer, and ASIN exist, it stores the entry in an array
    It loops through all the entries for each SKU and checks for _any_ of the following:
        Does the full product name match?
        Does the manufacturer name match?
        Does the product group match?
        (breaking the product name into words) Do any words match?
        If any of the above are true, it adds the entry to the database
    If successful, it enters the data into attributes inside Magento:
        scrape_amazon_name
        scrape_amazon_asin
        scrape_amazon_group
        scrape_amazon_new_price
        scrape_amazon_used_price
        scrape_amazon_manufacturer
    If the data already exists, or partial data exists, it updates the data
    If the data is null or corrupt, it ignores it
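Here is a minimal sketch of that matching pass. The helpers fetch_amazon_entries() and save_magento_attributes() are hypothetical stand-ins for the author’s actual Amazon lookup and Doctrine/Magento write-back, which weren’t published.

```python
# Sketch of the matching pass described above. fetch_amazon_entries() and
# save_magento_attributes() are hypothetical stand-ins, not the author's code.

def words(text):
    """Lower-cased word set, for the loose word-overlap check."""
    return set(text.lower().split())

def matches(product, entry):
    """Accept an Amazon entry if ANY of the listed checks succeed."""
    return (
        entry["name"].lower() == product["name"].lower()              # full name
        or entry["manufacturer"].lower() == product["manufacturer"].lower()
        or entry["group"] == product["group"]                         # product group
        or bool(words(entry["name"]) & words(product["name"]))        # any shared word
    )

def harvest(products):
    for product in products:                    # all SKUs in catalog_product_entity
        for entry in fetch_amazon_entries(product["sku"]):   # hypothetical lookup
            # Only consider entries with the required fields present.
            if not all(entry.get(k) for k in ("name", "manufacturer", "asin")):
                continue
            if matches(product, entry):
                save_magento_attributes(product["sku"], {    # hypothetical write-back
                    "scrape_amazon_name": entry["name"],
                    "scrape_amazon_asin": entry["asin"],
                    "scrape_amazon_group": entry["group"],
                    "scrape_amazon_new_price": entry.get("new_price"),
                    "scrape_amazon_used_price": entry.get("used_price"),
                    "scrape_amazon_manufacturer": entry["manufacturer"],
                })
```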

Data Harvesting
As you can see from the instructions above, my system first imports all the data it can. This process is called harvesting. After all the data is harvested, I use a feed exporter to create a CSV file in Amazon’s format and push it to Amazon AWS via encrypted upload.

Feed Export (Price Matching to Amazon’s Lowest Possible Price)
The feed generator then adjusts the pricing according to certain rules (a sketch in code follows the list):

    The product price is calculated against a “lowest market” percentage. This yields the absolute lowest price the client is willing to offer
    The “Amazon Lowest Price” is then checked against that “Absolute Lowest Sale Price” (A.L.S.P.)
    If the “Amazon Lowest Price” is higher than the A.L.S.P., the system calculates $1 lower than the Amazon Lowest Price and stores that as the price in the feed for use on Amazon
    The system updates the price in our database and freezes the product from future imports, then archives the original import price for reference
    If an ASIN exists, it pushes the data to Amazon using that; if not, it uses the MPN/SKU or UPC
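And here is a minimal sketch of the repricing rule itself. The 0.85 floor percentage and the clamping behavior inside the last dollar are assumptions for illustration; the post doesn’t specify either.

```python
# Sketch of the repricing rule described above. The 0.85 floor percentage
# is an illustrative assumption, not the author's actual value.
LOWEST_MARKET_PCT = 0.85  # client will go no lower than 85% of list price

def feed_price(list_price, amazon_lowest):
    """Return the price to publish in the feed, or None to leave it unchanged."""
    alsp = round(list_price * LOWEST_MARKET_PCT, 2)  # Absolute Lowest Sale Price
    if amazon_lowest > alsp:
        # Undercut the current Amazon low by $1, clamped to the floor
        # (the clamp is my reading; the post doesn't cover the last dollar).
        return round(max(alsp, amazon_lowest - 1.00), 2)
    return None  # Amazon's low is already at or below our floor

print(feed_price(100.00, 95.00))  # list $100, floor $85, Amazon low $95 -> 94.0
```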

Conclusion
This type of system is wonderful because it accurately stores Amazon product data for later use, so we can see trends in price changes. It ensures that my client will always have the absolute lowest price for hundreds of thousands of products on Amazon (or Google/Shopping.com/PriceGrabber/NexTag/Bing). Whenever the system needs to update, it takes around 10 hours to harvest 100,000 products and 5 minutes to export the entire data set to Amazon using my feed software. This makes updating very easy; it can be accomplished in one evening. It is something we can progressively enhance to protect against competitors throughout market cycles, and it’s a system that is easy to upgrade in the event Magento changes its data model.

Upgrades
Since we utilize Doctrine, it all lives outside of Magento, so we can upgrade Magento to a newer version any time we want. Then we just regenerate the database models, and our system automatically becomes compatible with any changes Magento made. I’ll probably come back and do another article on just this topic, as it’s one I’m very interested in writing about.

Source: http://www.christopherhogan.com/2011/11/12/amazon-price-scraping/

Blogging that Pays: How to Make Money with Amazon

Amazon.com is one of the biggest retailers on the Internet. Originally launched as an online bookstore, the site now sells everything from books to electronics to diapers. It’s popular with shoppers because of its selection, low shipping rates and fast service.

But Amazon is also popular with website owners, who can make money through the retailer’s Amazon Associates program.

It’s a simple concept: You place links to Amazon products on your site, and Amazon gives you a cut of the money spent by shoppers who follow those links. The amount you get depends on the number of sales you refer to Amazon, starting at 4 percent and going up to 8.5 percent.
Linking to Amazon

Amazon has an easy tool to help you customize product links.

Building Amazon Associates links is easy. Once you sign up, every time you visit Amazon you’ll see their “Site Stripe” at the top of the page. Just navigate to the item or category you want to promote, and click “link to this page” on the stripe. Then you’ll get the HTML code to add the link.
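The HTML the Site Stripe generates varies, but the essential ingredient is the tag query parameter carrying your Associates ID. As a rough illustration (the ASIN and the tag value example-20 are placeholders), a product link has roughly this shape:

```python
# Rough illustration of the shape of an Associates product link.
# The tag value "example-20" and the ASIN are placeholders.
def associates_link(asin, tag="example-20"):
    return f"https://www.amazon.com/dp/{asin}/?tag={tag}"

print(associates_link("B000000000"))
# https://www.amazon.com/dp/B000000000/?tag=example-20
```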

But links aren’t the only way to earn money through Amazon Associates. The site also offers widgets (interactive blocks that you can feature in your site’s sidebar, for example) and the aStore, which lets you “sell” Amazon products on your site.

First, widgets. Some are personalized, and let you hand-pick products to feature. Others can showcase Amazon’s daily deals or a search box. There’s also a widget that automatically chooses products to display based on the content of your site. The widgets come in many sizes so you can use them wherever you want on your site.

To spotlight even more products, you could set up an aStore. You can define categories, choose products and customize the look and colors to match your site. Plus you can add widgets to show things like your customers’ wish lists or related products.
Getting paid

The 'daily deals' widget is one option for referring your site's visitors to Amazon.

After your site’s visitors buy on Amazon, how do you get paid? You have a few options. You can choose to receive your earnings as an Amazon credit, or get money through direct deposit or check. Payment is issued monthly.

The Amazon Associates site has detailed reports so you can track your earnings, as well as see what your site’s visitors purchased.
Who should use Amazon Associates?

Most people who publish online can find a way to monetize their site with Amazon Associates.

If you’re a new mom who blogs about raising a baby, you can include links to the products you find helpful. Or if you’re a movie or TV buff with an entertainment site, you can send your readers to Amazon to buy DVDs.

But it’s not just bloggers who can benefit from Amazon Associates. Take, for example, a small business that does in-home computer repair but doesn’t have a retail location. The business website could include an aStore of software or accessories that the owner finds helpful. Or, a school could use an aStore to sell summer reading list books, school supplies (organized by grade level), musical instruments for band members or even uniforms. The school could also share its “wishlist” and allow parents and others to donate books, supplies or other needed items.

Source: http://saidigital.co/get-started-making-money-online-with-amazon-associates-3196/

Web Scraper – Amazon Scraper, Amazon.Com Scraper, Web Scraping, Web Scraping Tools

Amazon Scraper
Also known as web harvesting, web scraping is a software technique for extracting information from websites. The tools used for web scraping are generally programs that simulate human exploration of the web, either by implementing low-level Hypertext Transfer Protocol (HTTP) or by embedding a full-fledged web browser such as Internet Explorer or Mozilla. Nowadays you can find web scraping tools designed for particular websites. For instance, an Amazon scraper is a web scraping tool used to crawl, scrape, or extract information from amazon.com.
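To make the low-level HTTP approach concrete, here is a minimal sketch using Python’s requests and Beautiful Soup (a stack assumption; the article doesn’t name one). The productTitle element id is a guess at Amazon’s markup, which changes often, and a real scraper would add throttling and error handling.

```python
# Minimal sketch of the low-level HTTP scraping approach: fetch a page,
# parse the HTML, extract one field. The libraries and the "productTitle"
# element id are assumptions; Amazon's markup changes frequently.
import requests
from bs4 import BeautifulSoup

def product_title(url):
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    node = soup.find(id="productTitle")  # guessed element id
    return node.get_text(strip=True) if node else None

print(product_title("https://www.amazon.com/dp/B000000000"))  # placeholder ASIN
```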

ScrappingExpert.com provides an amazon.com scraper that crawls and fetches information, including site name, model number, title, description, seller details, seller price, shipping price, and URL, from amazon.com in a clean, readable CSV format. This amazon.com extractor provides unlimited data extraction and is equipped with an option to enter multiple search criteria or keywords at a time, which saves a huge amount of time and effort in content searching and extraction. This simple-to-use web scraper extracts unique records and stores them in a simple, structured format in any database of your choice. It is compatible with most recent versions of Windows, including XP, Vista, and Windows 7. A one-screen dashboard shows basic information such as total extracted records, extracted keywords, a view of the results, and elapsed time.

For a trial version of the Amazon product scraper, visit http://www.scrappingexpert.com.

Aruhat Technologies Pvt. Ltd. is a leading provider of web data extraction, screen scraping, web crawling, and data mining solutions. The company owns the website ScrappingExpert.com, which offers data extraction services, email extraction services, data collection services, data mining services, screen scraping services, data extraction software, web grabbers, web spiders, and web bots for extracting information from websites.

Source: http://scrappingexpertarticle.wordpress.com/2010/12/08/web-scraper-amazon-scraper-amazon-com-scraper-web-scraping-web-scraping-tools/

Web scraping Amazon and Rotten Tomatoes

[Rajesh] put web scraping to good use in order to gather the information important to him. He’s published two posts about it. One scrapes Amazon daily to see if the books he wants to read have reached a certain price threshold. The other scrapes Rotten Tomatoes in order to display the audience score next to the critics score for the top renting movies.

Web scraping uses scripts to gather information programmatically from HTML rather than using an API to access data. We recently featured a conceptual tutorial on the topic, and even came across a hack that scraped all of our own posts. [Rajesh's] technique is pretty much the same.

He’s using Python scripts with the Beautiful Soup module to parse the DOM tree for the information he’s after. In the case of the Amazon script, he sets a target price for a specific book and automatically gets an email when the price reaches it. With Rotten Tomatoes, he likes to see the audience score when considering a movie, but the site’s list view doesn’t show it; you have to click through to each movie. His script keeps a database so that it doesn’t continually scrape the same information. The collected numbers are displayed alongside the critics’ scores, as seen above.
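As a flavor of what the Amazon half of this might look like, here is a minimal price-watch sketch using requests, Beautiful Soup, and smtplib. Everything specific in it (the price element’s id, the SMTP host, the addresses, the URL) is an illustrative assumption; [Rajesh]’s actual scripts are in his posts.

```python
# Minimal price-watch sketch in the spirit of [Rajesh]'s Amazon script.
# The price element id, SMTP host, addresses, and URL are assumptions.
import smtplib
from email.message import EmailMessage

import requests
from bs4 import BeautifulSoup

URL = "https://www.amazon.com/dp/B000000000"  # placeholder product URL
TARGET = 15.00                                # alert at or below this price

def current_price():
    resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    node = soup.find(id="priceblock_ourprice")  # guessed element id
    return float(node.get_text(strip=True).lstrip("$").replace(",", ""))

price = current_price()
if price <= TARGET:
    msg = EmailMessage()
    msg["Subject"] = f"Price alert: ${price:.2f}"
    msg["From"] = "watcher@example.com"          # illustrative addresses
    msg["To"] = "me@example.com"
    msg.set_content(f"Target ${TARGET:.2f} reached: ${price:.2f}\n{URL}")
    with smtplib.SMTP("localhost") as smtp:      # assumes a local mail relay
        smtp.send_message(msg)
```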

Source: http://hackaday.com/2013/01/23/web-scraping-amazon-and-rotten-tomatoes/
