Monday 29 December 2014

Web Data Scraping Services At Lowest Rate For Business Directory

We are the world's most trusted provider of directory scraping, business data scraping and email scraping services, delivering the data you need. We scour entire directory databases of doctors, lawyers, brokers, financial advisers and more, and the scraping can be adapted to a particular industry or to a category-wise database.

We are pioneers in worldwide web scraping and data services. We understand the value of our customers' data, and we put the greatest effort into collecting email IDs and other records. We scrape data on lawyers, doctors, brokers, realtors, schools, students, universities, IT managers, pubs, bars, nightclubs, dance clubs, financial advisers, liquor stores, Facebook, Twitter, pharmaceutical companies, mortgage brokers, accounting firms, car dealers, artists, health shops and job portals.

Our business database development services aim to deliver real quality at the lowest price in the industry, and we offer a quick turnaround time on business mailing databases.

Have you tried, with little success, to gather specific information or content and organize it in a folder? You no longer need to worry: our web search and data processing services are the best solution for your problem.

We are currently in an "information explosion" phase, where there is an enormous amount of information and content about any event spread across a large number of channels.

Raw information is of little benefit to you and your customers unless it is organized. We organize the information and material in whatever way you need; for something like a small business guide, we can create a separate, organized folder in less than an hour.

Our technology lets you build a web-specific database configured to your needs. In addition, our services can help you work through the data to identify the sources of information and web pages worth following. This is a cost-effective way to create a database.

We offer directory databases that capture company name, address, state, country, phone, email and website URL, as in projects we have recently completed. We offer a quick turnaround time on business mailing databases, and our business database development services aim to deliver real quality at the lowest price in the industry.

Source:http://www.articlesbase.com/outsourcing-articles/web-data-scraping-services-at-lowest-rate-for-business-directory-5757029.html

Monday 22 December 2014

Scrape Web data using R

Plenty of people have been scraping data from the web using R for a while now, but I just completed my first project and I wanted to share the code with you.  It was a little hard to work through some of the “issues”, but I had some great help from @DataJunkie on twitter.

As an aside, if you are learning R and coming from another package like SPSS or SAS, I highly advise that you follow the hashtag #rstats on Twitter to be amazed by the kinds of data analysis that are going on right now.

One note. When I read in my table, it contained a weird set of characters. I suspect that it is some sort of encoding, but luckily, I was able to get around it by recoding the data from a character factor to a number using the stringr package and some basic regular expressions.

Bring on fantasy football!

################################################################

## Help from the following sources:

## @DataJunkie on twitter

## http://www.regular-expressions.info/reference.html

## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package

## http://stackoverflow.com/questions/2443127/how-can-i-use-r-rcurl-xml-packages-to-scrape-this-webpage

################################################################

library(XML)

library(stringr)

# build the URL

url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",
        "&conference=NFL&year=season_2009",
        "&timeframe=Week1", sep="")

# read the tables and select the one that has the most rows

tables <- readHTMLTable(url)

n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

tables[[which.max(n.rows)]]

# select the table we need - read as a dataframe

my.table <- tables[[7]]

# delete extra columns and keep data rows

View(head(my.table, n=20))

my.table <- my.table[3:nrow(my.table), c(1:3, 5:12, 14:18, 20:21, 23:24) ]

# rename every column

c.names <- c("Name", "Team", "G", "QBRat", "P_Comp", "P_Att", "P_Yds", "P_YpA", "P_Lng", "P_Int", "P_TD", "R_Att",
        "R_Yds", "R_YpA", "R_Lng", "R_TD", "S_Sack", "S_SackYa", "F_Fum", "F_FumL")

names(my.table) <- c.names

# data gets read in with weird symbols - need to remove - initially stored as character factors

# for the loops, I am manually telling the code which regex to use - assumes constant behavior

# depending on where the weird characters are -- is this an encoding?

front <- c(1)

back <- c(4:ncol(my.table))

for(f in front) {

    test.front <- as.character(my.table[, f])

    tt.front <- str_sub(test.front, start=3)

    my.table[,f] <- tt.front

}

for(b in back) {

    test <- as.character(my.table[ ,b])

    tt.back <- as.numeric(str_match(test, "\\-*\\d{1,3}[\\.]*[0-9]*"))

    my.table[, b] <- tt.back
}

str(my.table)

View(my.table)

# clear memory and quit R

rm(list=ls())

q()

n

Source: http://www.r-bloggers.com/scrape-web-data-using-r/

Thursday 18 December 2014

Affordable Tooth Extractions

In recent times, the cost of dental care has skyrocketed. This includes all types of dentistry including teeth cleaning, extractions, and dental surgery. For those who live in Denver, CO, there are many options to choose from when paying for routine or emergency dental care. In fact, having a tooth extraction in Denver may be more affordable than some people realize.

The flat fee for a tooth extraction in Denver may vary between dental offices. The type of extraction can also cause a difference in the price. A simple extraction may cost between $60 and $75, but a wisdom tooth extraction that requires more time and effort could cost much more.

One of the great aspects of having dental services performed in Denver is the variety of payment forms that many dental offices accept. Most dental offices in this area accept several different health insurance plans that will allow patients to only be required to pay a small copay at the time of service. If you have chosen an in-network dental provider for your plan, this copay can be even less.

Many dental offices also provide services to those who have state Medicaid or Medicare as well. While cosmetic dental work may not be covered by these forms of health care, extractions are covered because they are considered a necessary part of the patient's good health. Yearly checkups and teeth cleanings are also normally covered as a preventative measure to avoid poor dental health.

For those who may not have any type of health insurance, dental insurance, or state provided health care plan, most dental offices will offer a payment plan. The total cost will be calculated and can be divided up over a few months to make dental care more easily affordable. This will need to be arranged before services are provided, and you may need to pay a percentage of the cost upfront before any dental work is performed.

So, if you live in the Denver area and need to have a tooth extraction or other dental care, do not fear that it is impossible to obtain. By calling each dental office and discussing the types of payment forms they accept, you may find a payment plan that fits your budget nicely. You can compare the prices and options of all dentists in your area so that you can make a well informed decision more easily.

Source:http://ezinearticles.com/?Affordable-Tooth-Extractions&id=3241427

Monday 15 December 2014

Git workflow for Scrapy projects

Our customers often ask us what’s the best workflow for working with Scrapy projects. A popular approach we have seen and used in the past is to split the spiders folder (typically project/spiders) into two folders: project/spiders_prod and project/spiders_dev, and use the SPIDER_MODULES setting to control which spiders are loaded on each environment. This works reasonably well, until you have to make changes to common code used by many spiders (i.e. code outside the spiders folder), for example common base spiders.
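
For illustration only, here is a minimal sketch of how that split might look in the settings files (the "project" package name and the dev settings file are placeholders, not taken from the post):

# settings.py used in production: only load production spiders
SPIDER_MODULES = ["project.spiders_prod"]
NEWSPIDER_MODULE = "project.spiders_prod"

# settings_dev.py (hypothetical override used in development): load both folders
SPIDER_MODULES = ["project.spiders_prod", "project.spiders_dev"]
NEWSPIDER_MODULE = "project.spiders_dev"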

Nowadays, DVCS (in particular, git) have become more popular and people are quite used to branching, so we recommend using a simple git workflow (similar to GitHub flow) where you branch for every change you make. You keep all changes in a branch while they’re being tested and finally merge to master when they’re finished. This means that master branch is always stable and contains only “production-ready” spiders.

If you are using our Scrapy Cloud platform, you can have 2 projects (myproject-dev, myproject-prod) and use myproject-dev to test the changes in your branch.  scrapy deploy in Scrapy 0.17 now adds the branch name to the version name (when using version=GIT or version=HG), so you can see which branch you are going to run directly on the panel. This is particularly useful with large teams working on a single Scrapy project, to avoid stepping into each other when making changes to common code.

Here is a concrete example to illustrate how this workflow works:

•    the developer decides to work on issue 123 (could be a new spider or fixes to an existing spider)
•    the developer creates a new branch to work on the issue
•    git checkout -b issue123
•    the developer finishes working on the code and deploys to the panel (this assumes scrapy.cfg is configured with a deploy target, and using version=GIT – see here for more information)
•    scrapy deploy dev
•    the developer goes into the panel and runs the spider, where he’ll see the branch name (issue123) that will be run
•    the developer checks the scraped data looks fine through the item browser in the panel
•    whenever issues are found, the developer makes more fixes (always working on the same branch) and deploys new versions
•    once all issues are fixed, the developer merges the branch and deploys to production project
•    git checkout master
•    git merge issue123
•    git pull # make sure to pull latest code before deploying
•    scrapy deploy prod

We recommend you keep your common spiders well-tested and use Spider Contracts extensively to test your final spiders. Otherwise experience tells us that base spiders end up being copied (instead of reused) out of fear of breaking old spiders that depend on them, thus turning their maintenance into a nightmare.
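
For readers unfamiliar with Spider Contracts, here is a rough sketch of what one looks like in a recent Scrapy version (the spider name, URL and scraped fields are placeholders, not taken from the post); running scrapy check then validates the docstring annotations against a live request:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["http://example.com/products"]

    def parse(self, response):
        """Parse a product listing page.

        @url http://example.com/products
        @returns items 1 50
        @returns requests 0 0
        @scrapes name price
        """
        for row in response.css("div.product"):
            yield {
                "name": row.css("h2::text").get(),
                "price": row.css("span.price::text").get(),
            }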

Source:http://blog.scrapinghub.com/2013/03/06/git-workflow-scrapy-projects/

Saturday 13 December 2014

Scrape it – Save it – Get it

I imagine I’m talking to a load of developers. Which is odd seeing as I’m not a developer. In fact, I decided to lose my coding virginity by riding the ScraperWiki digger! I’m a journalist interested in data as a beat so all I need to do is scrape. All my programming will be done on ScraperWiki, as such this is the only coding home I know. So if you’re new to ScraperWiki and want to make the site a scraping home-away-from-home, here are the basics for scraping, saving and downloading your data:

With these three simple steps you can take advantage of what ScraperWiki has to offer – writing, running and debugging code in an easy to use editor; collaborative coding with chat and user viewing functions; a dashboard with all your scrapers in one place; examples, cheat sheets and documentation; a huge range of libraries at your disposal; a datastore with API callback; and email alerts to let you know when your scrapers break.

So give it a go and let us know what you think!

Source:https://blog.scraperwiki.com/2011/04/scrape-it-save-it-get-it/

Sunday 30 November 2014

What you have to know before requesting web scraping services

Before you request web scraping services you have to know what your needs are (what data you need, how it should be structured and where this data can be found).

Step 1: Define what data you need

Data needs depend on your purpose: if you want to find new customers you probably need contact data for players in your industry, and if you want to study your competitors you need to define who they are. Only after that can you select data sources (websites, feeds or other electronic sources) for this extraction.

In many cases, search engines like Google, Bing, Yahoo and others are used to discover and define data sources.

Step 2: Structure of data

Data structure is directly linked to the usage purpose. In many cases the data structure is a table, where a row represents an entity and a cell of that row represents a property of this entity. In other cases the data structure is a chart or another graphic representation built with data extracted from a web source.
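
For instance, a minimal illustration of the one-row-per-entity structure (the companies and values below are invented purely as an example) might look like this:

# each dict is a row (an entity); each key/value pair is a cell (a property)
competitors = [
    {"company": "Acme Ltd",   "website": "acme.example",   "phone": "+1-555-0100", "city": "Boston"},
    {"company": "Globex Inc", "website": "globex.example", "phone": "+1-555-0101", "city": "Denver"},
]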

Step 3: Number of data extractions

In many cases a one-time data extraction is all that is needed. In other cases, when you need a regular report, periodic extractions are needed.

If you have defined all of the above points you are ready to request a quote and a cost estimate from this contact form.

Source: http://thewebminer.com/blog/2013/08/

Thursday 27 November 2014

Scraping SSL Labs Server Test Results With R

    NOTE: Qualys allows automated access to their SSL Server Test site in their T&C’s, and the R function/script provided here does its best to adhere to their guidelines. However, if you launch multiple scripts at one time and catch their attention you will, no doubt, be banned.

This post will show you how to do some basic web page data scraping with R. To make it more palatable to those in the security domain, we’ll be scraping the results from Qualys’ SSL Labs SSL Test site by building an R function that will:

    fetch the contents of a URL with RCurl
    process the HTML page tags with R’s XML library
    identify the key elements from the page that need to be scraped
    organize the results into a usable R data structure

You can skip ahead to the code at the end (or in this gist) or read on for some expository that isn’t in the code’s comments.

Setting up the script and processing flow

We’ll need some assistance from three R packages to perform the scraping, processing and transformation tasks:

library(RCurl) # scraping
library(XML)   # XML (HTML) processing
library(plyr)  # data transformation

If you poke at the SSL Test site with a few different URLs, you’ll see there are three primary inputs to the GET request we’ll need to issue:

    d (the domain)
    s (the IP address to test)
    ignoreMismatch (which we’ll leave as ‘on‘)

You’ll also see that there’s often a delay between issuing a request and getting the results, so we’ll need to build in a GET+check-loop (like the javascript on the page does automagically). Finally, when the results are eventually displayed they are (at least for this example) usually either "Overall Rating" or "Assessment failed", and we’ll use that status result in our tests for what to return.

We’ll account for the domain and IP address in the function parameters along with the amount of time we should pause between GET+check attempts. It’s also a good idea to provide a way to pass in any extra curl options (e.g. in the event folks are behind a proxy server and need to input that to make the requests work). We’ll define the function with some default parameters:

get_rating <- function(site="rud.is", ip="", pause=5, curl.opts=list()) {

}

This definition says that if we just call get_rating(), it will

    default to using "rud.is" as the domain (you can pick what you want in your implementation)
    not supply an IP address (which the script will then have to lookup with nsl)
    will pause 5s between GET+check attempts
    pass no extra curl options

Getting into the details

For the IP address logic, we’ll have to test if we passed in an address string and perform a lookup if not:

# try to resolve IP if not specified; if no IP can be found, return
# a "NA" data frame

  if (ip == "") {

    tmp <- nsl(site)
    if (is.null(tmp)) {
      return(data.frame(site=site, ip=NA, Certificate=NA,
                        Protocol.Support=NA, Key.Exchange=NA,
                        Cipher.Strength=NA)) }
    ip <- tmp
  }

(don’t worry about the return(...) part yet, we’ll get there in a bit).

Once we have an IP address, we’ll need to make the call to the ssllabs.com test site and perform the check loop:

# get the contents of the URL (will be the raw HTML text)
# build the URL with sprintf

rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)

# while we don't find some indication of a completed request,
# pause and try again

while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {
  Sys.sleep(pause)
  rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on", site, ip), .opts=curl.opts)
}

We can then start making some decisions based on the results:

# if the assessment failed, return a data frame of NA's

if (grepl("Assessment failed", rating.dat)) {

  return(data.frame(site=site, ip=NA, Certificate=NA,
                    Protocol.Support=NA, Key.Exchange=NA,
                    Cipher.Strength=NA))
}

# otherwise, parse the resultant HTML

x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

Unfortunately, the results are not “consistent”. While there are plenty of uniquely identifiable <div>s, there are enough differences between runs that we have to be a bit generic in our selection of data elements to extract. I’ll leave the view-source: of a result as an exercise to the reader. For this example, we’ll focus on extracting:

        the overall rating (A-F)
        the “Certificate” score
        the “Protocol Support” score
        the “Key Exchange” score
        the “Cipher Strength” score

There are plenty of additional fields to extract, but you should be able to extrapolate and grab what you want to from the rest of the example.

Extracting the results

We’ll need to delve into XPath to extract the <div> values. We’ll use the xpathSApply function to perform this task. Since there sometimes is a <span> tag within the <div> for the rating and since the rating has a class tag to help identify which color it should be, we use a starts-with selection parameter to just get anything beginning with rating_. If it returns an R list structure, we know we have the one with a <span> element, so we re-issue the call with that extra XPath component.

rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/text()", xmlValue)

if (class(rating) == "list") {

  rating <- xpathSApply(x,"//div[starts-with(@class,'rating_')]/span/text()", xmlValue)
}

For the four attributes (and values) we’ll be extracting, we can use the getNodeSet call, which will give us all of them in a structure we can process with xpathSApply:

labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")

vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

# convert them to vectors

labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)

vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

At this point, labs will be a vector of label names and vals will be the corresponding values. We’ll put them, the original domain and the IP address into a data frame:

# rbind will turn the vector into row elements, with each

# value being in a column

rating.result <- data.frame(site=site, ip=ip,

                            rating=rating, rbind(vals),
                            row.names=NULL)

# we use the labs vector as the column names (in the right spot)    

colnames(rating.result) <- c("site", "ip", "rating",

                              gsub(" ", "\\.", labs))

and return the result:

return(rating.result)

Finishing up

If we run the whole function on one domain we’ll get a one-row data frame back as a result. If we use ldply from the plyr package to run the get_rating function repeatedly on a vector of domains, it will combine them all into one whole data frame. For example:

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

##                site              ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength

## 1            rud.is  184.106.97.102      B         100               70           80              90

## 2 stackoverflow.com 198.252.206.140      A         100               90           80              90

## 3        er-ant.com            <NA>   <NA>        <NA>             <NA>         <NA>            <NA>

There are many tweaks you can make to this function to extract more data and perform additional processing. If you make some of your own changes, you’re encouraged to add to the gist (link above & below) and/or drop a note in the comments.

Hopefully you’ve seen how well-suited R is for this type of operation and have been encouraged to use it in your next attempt at some site/data scraping.

library(RCurl)
library(XML)
library(plyr)

#' get the Qualys SSL Labs rating for a domain+cert
#'
#' @param site domain to test SSL configuration of
#' @param ip address of \code{site} (will resolve it and take\cr
#' first response if not specified, but that may not always work as you expect)
#' @param hide.results ["on"|"off"] should the results show up in the SSL Labs history (default "on")
#' @param pause timeout between tries (default 5s)
#' @param curl.opts options to pass to \code{getURL} i.e. proxy setting
#' @return data frame of results
#'
get_rating <- function(site="rud.is", ip="", hide.results="on", pause=5, curl.opts=list()) {

  # try to resolve IP if not specified; if no IP can be found, return
  # a "NA" data frame
  if (ip == "") {
    tmp <- nsl(site)
    if (is.null(tmp)) { return(data.frame(site=site, ip=NA, Certificate=NA,
                                          Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA)) }
    ip <- tmp
  }

  # need to let it actually process the certificate if not already cached
  rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)

  while(!grepl("(Overall Rating|Assessment failed)", rating.dat)) {
    Sys.sleep(pause)
    rating.dat <- getURL(sprintf("https://www.ssllabs.com/ssltest/analyze.html?d=%s&s=%s&ignoreMismatch=on&hideResults=%s", site, ip, hide.results), .opts=curl.opts)
  }

  if (grepl("Assessment failed", rating.dat)) {
    return(data.frame(site=site, ip=NA, Certificate=NA,
                      Protocol.Support=NA, Key.Exchange=NA, Cipher.Strength=NA))
  }

  x <- htmlTreeParse(rating.dat, useInternalNodes = TRUE)

  # sometimes there is a <span ...> tag in the <div>, which will result in an
  # empty list() object being returned. we check for that and handle it
  # appropriately.
  rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/text()"]])
  if (class(rating) == "list") {
    rating <- xmlValue(x[["//div[starts-with(@class,'rating_')]/span/text()"]])
  }

  # extract the XML objects for the ratings labels & values
  labs <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[@class='chartLabel']")
  vals <- getNodeSet(x,"//div[@class='chartBody']/div[@class='chartRow']/div[starts-with(@class,'chartValue')]")

  # convert them to vectors
  labs <- xpathSApply(labs[[1]], "//div[@class='chartLabel']/text()", xmlValue)
  vals <- xpathSApply(vals[[1]], "//div[starts-with(@class,'chartValue')]/text()", xmlValue)

  # make them into a data frame
  rating.result <- data.frame(site=site, ip=ip, rating=rating, rbind(vals), row.names=NULL)
  colnames(rating.result) <- c("site", "ip", "rating", gsub(" ", "\\.", labs))

  return(rating.result)

}

sites <- c("rud.is", "stackoverflow.com", "er-ant.com")

ratings <- ldply(sites, get_rating)

ratings

##                site              ip rating Certificate Protocol.Support Key.Exchange Cipher.Strength
## 1            rud.is  184.106.97.102      B         100               70           80              90
## 2 stackoverflow.com 198.252.206.140      A         100               90           80              90
## 3        er-ant.com            <NA>   <NA>        <NA>             <NA>         <NA>            <NA>

Source: http://www.r-bloggers.com/scraping-ssl-labs-server-test-results-with-r/

Sunday 23 November 2014

A Content Marketer's Guide to Data Scraping

As digital marketers, big data should be what we use to inform a lot of the decisions we make. Using intelligence to understand what works within your industry is absolutely crucial within content campaigns, but it blows my mind to know that so many businesses aren't focusing on it.

One reason I often hear from businesses is that they don't have the budget to invest in complex and expensive tools that can feed in reams of data to them. That said, you don't always need to invest in expensive tools to gather valuable intelligence — this is where data scraping comes in.

Just so you understand, here's a very brief overview of what data scraping is from Wikipedia:

    "Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program."

Essentially, it involves crawling through a web page and gathering nuggets of information that you can use for your analysis. For example, you could search through a site like Search Engine Land and scrape the author names of each of the posts that have been published, and then you could correlate this to social share data to find who the top performing authors are on that website.

Hopefully, you can start to see how this data can be valuable. What's more, it doesn't require any coding knowledge — if you're able to follow my simple instructions, you can start gathering information that will inform your content campaigns. I've recently used this research to help me get a post published on the front page of BuzzFeed, getting viewed over 100,000 times and channeling a huge amount of traffic through to my blog.

Disclaimer: One thing that I really need to stress before you read on is the fact that scraping a website may breach its terms of service. You should ensure that this isn't the case before carrying out any scraping activities. For example, Twitter completely prohibits the scraping of information on their site. This is from their Terms of Service:

    "crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited"

Google similarly forbids the scraping of content from their web properties:

    Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.

So be careful, kids.

Content analysis

Mastering the basics of data scraping will open up a whole new world of possibilities for content analysis. I'd advise any content marketer (or at least a member of their team) to get clued up on this.

Before I get started on the specific examples, you'll need to ensure that you have Microsoft Excel on your computer (everyone should have Excel!) and also the SEO Tools plugin for Excel (free download here). I put together a full tutorial on using the SEO tools plugin that you may also be interested in.

Alongside this, you'll want a web crawling tool like Screaming Frog's SEO Spider or Xenu Link Sleuth (both have free options). Once you've got these set up, you'll be able to do everything that I outline below.

So here are some ways in which you can use scraping to analyse content and how this can be applied into your content marketing campaigns:

1. Finding the different authors of a blog

Analysing big publications and blogs to find who the influential authors are can give you some really valuable data. Once you have a list of all the authors on a blog, you can find out which of those have created content that has performed well on social media, had a lot of engagement within the comments and also gather extra stats around their social following, etc.

I use this information on a daily basis to build relationships with influential writers and get my content placed on top tier websites. Here's how you can do it:

Step 1: Gather a list of the URLs from the domain you're analysing using Screaming Frog's SEO Spider. Simply add the root domain into Screaming Frog's interface and hit start (if you haven't used this tool before, you can check out my tutorial here).

Once the tool has finished gathering all the URLs (this can take a little while for big websites), simply export them all to an Excel spreadsheet.

Step 2: Open up Google Chrome and navigate to one of the article pages of the domain you're analysing and find where they mention the author's name (this is usually within an author bio section or underneath the post title). Once you've found this, right-click their name and select inspect element (this will bring up the Chrome developer console).

Within the developer console, the line of code associated to the author's name that you selected will be highlighted (see the below image). All you need to do now is right-click on the highlighted line of code and press Copy XPath.

For the Search Engine Land website, the following code would be copied:

//*[@id="leftCol"]/div[2]/p/span/a

This may not make any sense to you at this stage, but bear with me and you'll see how it works.

Step 3: Go back to your spreadsheet of URLs and get rid of all the extra information that Screaming Frog gives you, leaving just the list of raw URLs – add these to the first column (column A) of your worksheet.

Step 4: In cell B2, add the following formula:

=XPathOnUrl(A2,"//*[@id='leftCol']/div[2]/p/span/a")

Just to break this formula down for you, the function XPathOnUrl allows you to use the XPath code directly within (this is with the SEO Tools plugin installed; it won't work without this). The first element of the function specifies which URL we are going to scrape. In this instance I've selected cell A2, which contains a URL from the crawl I did within Screaming Frog (alternatively, you could just type the URL, making sure that you wrap it within quotation marks).

Finally, the last part of the function is our XPath code that we gathered. One thing to note is that you have to remove the quotation marks from the code and replace them with apostrophes. In this example, I'm referring to the "leftCol" section, which I've changed to 'leftCol'; if you don't do this, Excel won't read the formula correctly.

Once you press enter, there may be a couple of seconds delay whilst the SEO Tools plugin crawls the page, then it will return a result. It's worth mentioning that within the example I've given above, we're looking for author names on article pages, so if I try to run this on a URL that isn't an article (e.g. the homepage) I will get an error.

For those interested, the XPath code itself works by starting at the top of the code of the URL specified and following the instructions outlined to find on-page elements and return results. So, for the following code:

//*[@id='leftCol']/div[2]/p/span/a

We're telling it to look for any element (//*) that has an id of leftCol (@id='leftCol') and then go down to the second div tag after this (div[2]), followed by a p tag, a span tag and finally, an a tag (/p/span/a). The result returned should be the text within this a tag.

Don't worry if you don't understand this, but if you do, it will help you to create your own XPath. For example, if you wanted to grab the output of an a tag that has rel=author attached to it (another great way of finding page authors), then you could use some XPath that looked a little something like this:

//a[@rel='author']

As a full formula within Excel it would look something like this:

=XPathOnUrl(A2,"//a[@rel='author']")

Once you've created the formula, you can drag it down and apply it to a large number of URLs all at once. This is a huge time-saver as you'd have to manually go through each website and copy/paste each author to get the same results without scraping – I don't need to explain how long this would take.
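
If you would rather sanity-check the same XPath outside of Excel, here is a rough Python equivalent of dragging the formula down a column of URLs (this is not part of the original Excel workflow; it assumes the requests and lxml packages, and the URLs are placeholders):

import requests
from lxml import html

urls = ["http://example.com/article-1", "http://example.com/article-2"]

for url in urls:
    tree = html.fromstring(requests.get(url).content)
    # same XPath as the Excel formula: grab the text of any <a rel="author"> element
    authors = tree.xpath("//a[@rel='author']/text()")
    print(url, authors[0] if authors else "no author found")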

Now that I've explained the basics, I'll show you some other ways in which scraping can be used…

2. Finding extra details around page authors

So, we've found a list of author names, which is great, but to really get some more insight into the authors we will need more data. Again, this can often be scraped from the website you're analysing.

Most blogs/publications that list the names of the article author will actually have individual author pages. Again, using Search Engine Land as an example, if you click my name at the top of this post you will be taken to a page that has more details on me, including my Twitter profile, Google+ profile and LinkedIn profile. This is the kind of data that I'd want to gather because it gives me a point of contact for the author I'm looking to get in touch with.

Here's how you can do it.

Step 1: First we need to get the author profile URLs so that we can scrape the extra details off of them. To do this, you can use the same approach to find the author's name, with just a little addition to the formula:

=XPathOnUrl(A2,"//a[@rel='author']", "href")

The addition of the "href" part of the formula will extract the output of the href attribute of the a tag. In layman's terms, it will find the hyperlink attached to the author name and return that URL as a result.

Step 2: Now that we have the author profile page URLs, you can go on and gather the social media profiles. Instead of scraping the article URLs, we'll be using the profile URLs.

So, like last time, we need to find the XPath code to gather the Twitter, Google+ and LinkedIn links. To do this, open up Google Chrome and navigate to one of the author profile pages, right-click on the Twitter link and select Inspect Element.

Once you've done this, hover over the highlighted line of code within Chrome's developer tools, right-click and select Copy XPath.

Step 3: Finally, open up your Excel spreadsheet and add in the following formula (using the XPath that you've copied over):

=XPathOnUrl(C2,"//*[@id='leftCol']/div[2]/p/a[2]", "href")

Remember that this is the code for scraping Search Engine Land, so if you're doing this on a different website, it will almost certainly be different. One important thing to highlight here is that I've selected cell C2, which contains the URL of the author profile page and not just the article page. As well as this, you'll notice that I've included "href" at the end because we want the actual Twitter profile URL and not just the word 'Twitter'.

You can now repeat this same process to get the Google+ and LinkedIn profile URLs and add it to your spreadsheet. Hopefully you're starting to see the value in this, and how it can be used to gather a lot of intelligence that can be used for all kinds of online activity, not least your SEO and social media campaigns.

3. Gathering the follower counts across social networks

Now that we have the author's social media accounts, it makes sense to get their follower counts so that they can be ranked based on influence within the spreadsheet.

Here are the final XPath formulae that you can plug straight into Excel for each network to get their follower counts. All you'll need to do is replace the text INSERT SOCIAL PROFILE URL with the cell reference to the Google+/LinkedIn URL:

Google+:

=XPathOnUrl(INSERTGOOGLEPROFILEURL,"//span[@class='BOfSxb']")

LinkedIn:

=XPathOnUrl(INSERTLINKEDINURL,"//dd[@class='overview-connections']/p/strong")

4. Scraping page titles

Once you've got a list of URLs, you're going to want to get an idea of what the content is actually about. Using this quick bit of XPath against any URL will display the title of the page:

=XPathOnUrl(A2,"//title")

To be fair, if you're using the SEO Tools plugin for Excel then you can just use the built-in feature to scrape page titles, but it's always handy to know how to do it manually!

A nice extra touch for analysis is to look at the number of words used within the page titles. To do this, use the following formula:

=CountWords(A2)

From this you can get an understanding of what the optimum title length of a post within a website is. This is really handy if you're pitching an article to a specific publication. If you make the post the best possible fit for the site and back up your decisions with historical data, you stand a much better chance of success.

Taking this a step further, you can gather the social shares for each URL using the following functions:

Twitter:

=TwitterCount(INSERTURLHERE)

Facebook:

=FacebookLikes(INSERTURLHERE)

Google+:

=GooglePlusCount(INSERTURLHERE)

Note: You can also use a tool like URL Profiler to pull in this data, which is much better for large data sets. The tool also helps you to gather large chunks of data from other social networks, link data sources like Ahrefs, Majestic SEO and Moz, which is awesome.

If you want to get even more social stats then you can use the SharedCount API, and this is how you go about doing it…

Firstly, create a new column in your Excel spreadsheet and add the following formula (where A2 is the URL of the webpage you want to gather social stats for):

=CONCATENATE("http://api.sharedcount.com/?url=",A2)

You should now have a cell that contains your webpage URL prefixed with the SharedCount API URL. This is what we will use to gather social stats. Now here's the Excel formula to use for each network (where B2 is the cell that contains the formula above):

StumbleUpon:

=JsonPathOnUrl(B2,"StumbleUpon")

Reddit:

=JsonPathOnUrl(B2,"Reddit")

Delicious:

=JsonPathOnUrl(B2,"Delicious")

Digg:

=JsonPathOnUrl(B2,"Diggs")

Pinterest:

=JsonPathOnUrl(B2,"Pinterest")

LinkedIn:

=JsonPathOnUrl(B2,"Linkedin")

Facebook Shares:

=JsonPathOnUrl(B2,"Facebook.share_count")

Facebook Comments:

=JsonPathOnUrl(B2,"Facebook.comment_count")

Once you have this data, you can start looking much deeper into the elements of a successful post. Here's an example of a chart that I created around a large sample of articles that I analysed within Upworthy.com.

The chart looks at the average number of social shares that an article on Upworthy receives vs the number of words within its title. This is invaluable data that can be used across a whole host of different on-page elements to get the perfect article template for the site you're pitching to.

See, big data is useful!

5. Date/time the post was published

Along with analysing the details of headlines that are working within a site, you may want to look at the optimal posting times for best results. This is something that I regularly do within my blogs to ensure that I'm getting the best possible return from the time I spend writing.

Every site is different, which makes it very difficult for an automated, one-size-fits-all tool to gather this information. Some sites will have this data within the <head> section of their webpages, but others will display it directly under the article headline. Again, Search Engine Land is a perfect example of a website doing this…

So here's how you can scrape this information from the articles on Search Engine Land:

=XPathOnUrl(INSERTARTICLEURL,"//*[@class='dateline']/text()")

Now you've got the date and time of the post. You may want to trim this down and reformat it for your data analysis, but you've got it all in Excel so that should be pretty easy.

Extra reading

Data scraping is seriously powerful, and once you've had a bit of a play around with it you'll also realise that it's not that complicated. The examples that I've given are just a starting point but once you get your creative head on, you'll soon start to see the opportunities that arise from this intelligence.

Here's some extra reading that you might find useful:

    http://findmyblogway.com/scraping-communities-with-xpath/

    http://builtvisible.com/data-entry-is-a-waste-of-time/

    http://www.seotakeaways.com/data-scraping-guide-for-seo/

    http://okdork.com/2014/04/30/the-step-by-step-guide-to-10x-growth-for-any-blog/

TL;DR

    Start using actual data to inform your content campaigns instead of going on your gut feeling.

    Gather intelligence around specific domains you want to target for content placement and create the perfect post for their audience.

    Get clued up on XPath and JSON through using the SEO Tools plugin for Excel.

    Spend more time analysing what content will get you results as opposed to what sites will give you links!

    Check the website's ToS before scraping.

Source:http://moz.com/blog/a-content-marketers-guide-to-data-scraping

Wednesday 19 November 2014

NHL ending dry scraping of ice before overtime

TORONTO (AP) — The NHL will no longer dry scrape the ice before overtime.

Instituted this season in an effort to reduce the number of shootouts, the dry scraping will stop after Friday's games.

The general managers decided at their meeting Tuesday to make the change after the league talked to the players' union the past few days.

Beginning Saturday, ice crews around the league will again shovel the ice after regulation as they did in previous years. The GMs said the dry scrape was causing too much of a delay. Director of hockey operations Colin Campbell said the delays were lasting from more than four minutes to almost seven.

The dry scrape initially had been approved in hopes of reducing shootouts by improving scoring chances without unduly slowing play by recoating the ice.

The GMs also discussed expanded video review, including goaltender interference, and the possibility of three-on-three overtime. The American Hockey League is experimenting with the three-on-three format this season.

This annual meeting the day after the Hockey Hall of Fame induction usually doesn't produce actual changes, with the dry scrape providing an exception.

The main purpose is to set up the March meeting in Boca Raton, Florida, where these items will be further addressed.

Source:http://missoulian.com/sports/hockey/nhl-ending-dry-scraping-of-ice-before-overtime/article_3dd5473c-6102-5800-99f7-2c98be0f99ad.html

Saturday 15 November 2014

Screenscraping from Java using jsoup – effective data gathering from websites

In a recent article I discussed screenscraping in a, in hindsight, fairly clumsy way (http://technology.amis.nl/blog/12786/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities). While preparing for a series of articles on data visualizations, I had need of statistics regarding the Olympic Games – more specifically: the overall medal count per country during the 2008 Beijing Olympic Games. This information is readily available from dozens of websites. However, I could not find one that offered the data in an easy to process XML or CSV format – all websites had human consumers in mind.

With screenscraping, we use a programmatic facility to consume content that is intended to be displayed on screen to human users, and we subsequently process that content by extracting the required data from it. Some web pages are easier to scrape than others – this depends on the richness of the HTML (the poorer the better for scraping), the required interactivity (JavaScript, AJAX – the less the better) and the structure used to present the data (tables, frequently despised by web developers, work rather well).

I came across a tool for screenscraping from Java, called jsoup – http://jsoup.org/. It turned out to be so incredibly easy to use that I thought I should share it.

Getting going with jsoup is as easy as can be:

1. download jsoup-1.6.1.jar (or whatever the latest version is) from http://jsoup.org/download

2. add this jar as a dependency in your project and/or application CLASSPATH

3. make use of jsoup in the code that does the screenscraping.

A simple example of code that uses jsoup (more examples on: http://jsoup.org/cookbook/):

One of the websites offering the overall medal count is http://www.databaseolympics.com/games/gamesyear.htm?g=26. The page looks as follows:

Image

Well, more importantly, the page looks like this:

Image

This means in terms of screenscraping: I will find the medal count for each country inside a TABLE element with styleclass pt8. Each country has a TR element. Only the first TR element does not represent a country score, as it is the table header. The first TD element in the TR represents the country. The name of the country can be retrieved as the text content from the A element in the TD. The next TD elements contain the numbers of medals in Gold, Silver, Bronze and Total.

The corresponding Java code with jsoup boils down to:

public static void main(String[] args) throws IOException, SQLException, InterruptedException {

        Document doc = Jsoup.connect(OlympicMedalMirrorProcessor.baseUrl + "?g=26").get();
        String title = doc.title();
        System.out.println(title);
        Element table = doc.select("table.pt8").get(0);
        Elements trs = table.select("tr");
        Iterator trIter = trs.iterator();
        boolean firstRow = true;
        while (trIter.hasNext()) {


            Element tr = (Element)trIter.next();
            if (firstRow) {
                firstRow = false;
                continue;
            }
            Elements tds = tr.select("td");
            Iterator tdIter = tds.iterator();
            int tdCount = 1;
            String country = null;
            Integer gold = null;
            Integer silver = null;
            Integer bronze = null;
            Integer total = null;
            // process new line
            while (tdIter.hasNext()) {

                Element td = (Element)tdIter.next();
                switch (tdCount++) {
                case 1:
                    country = td.select("a").text();
                    break;
                case 2:
                    gold = Integer.parseInt(td.text());
                    break;
                case 3:
                    silver = Integer.parseInt(td.text());
                    break;
                case 4:
                    bronze = Integer.parseInt(td.text());
                    break;
                case 5:
                    total = Integer.parseInt(td.text());
                    break;
                }

            }
            System.out.println(country + ": gold " + gold + " silver " + silver + " bronze " + bronze + " total " +
                               total);
        } //table rows
}

Source:http://technology.amis.nl/2011/08/03/screenscraping-from-java-using-jsoup-effective-data-gathering-from-websites/

Thursday 13 November 2014

The PromptCloud Advantage- Web Scraping with an Edge

The global market is now more aware of its data scraping needs. And so with the demand, the list of suppliers has grown too. This post is dedicated to bringing out the PromptCloud Advantage among such providers.

1. The know-how- Crawling the web, as mundane as it may sound, is a fairly complex task. No one is to be blamed for overlooking the complexity, as these things surface only after you’ve tried it yourself and delved into the nitty-gritty. The design decisions you take sit at the core of what you build and eventually monetize. And the long-term effects of such architectural choices are as pleasing if you’ve done it right as they are disturbing if you’re not far-sighted.

Although the expertise of building the tech stack for such large-scale data acquisition, distributing your clusters (and putting thoughts into their geographical locations), maintaining queues, databases and backups, does come from ‘been there done that’, we have been lucky to have the tech advantage imbibed into us since inception. Not that we got it right the first time, but our systems have evolved with technologies, improving each day. Now that we have been there in this business for the last 56 months, it does feel like a long journey for our stack and yes, we do know better :)

2. SLAs- SLAs are what bolsters the data itself. PromptCloud’s key SLAs are scale and quality; while not compromising the data coverage or the politeness policies on your sources. Since we perform focused crawls, there’s no dilution of data and you can consume it all or ask us to index it in order to search using logical combinations in queries. For your reference, here’s a list of all SLAs to visit while picking your data service provider.

3. The Experience- There are many scraping tools and crawling services in the market which might just serve the need. What PromptCloud provides is a data acquisition experience, and we go as many extra miles as you’d like us to go for it. By leveraging our DaaS platform, we make sure you get what you need from the time you start your research for a data provider through importing the data feeds into your database. We hear your requirements in detail, make sure we’ve got them right by sharing samples, and go through multiple iterations of reprocessing the data to match your needs while you battle internally on freezing your requirements. But what’s more magical is the way all these feeds get delivered to you, at the intervals you requested; programmatically.

It might seem evident that the SLAs and the know-how fuse to provide the experience, but it’s that additional human touch that actually sustains it. We make sure you’re at peace while our systems handle the roadblocks and sort out the messiness on the web.

Source:https://www.promptcloud.com/blog/the-promptcloud-advantage-web-scraping/

Monday 10 November 2014

Example of Scraping with Selenium WebDriver in C#

In this article I will show you how it is easy to scrape a web site using Selenium WebDriver. I will guide you through a sample project which is written in C# and uses WebDriver in conjunction with the Chrome browser to login on the testing page and scrape the text from the private area of the website.

Downloading the WebDriver

First of all we need to get the latest version of Selenium Client & WebDriver Language Bindings and the Chrome Driver. Of course, you can download WebDriver bindings for any language (Java, C#, Python, Ruby), but within the scope of this sample project I will use the C# binding only. In the same manner, you can use any browser driver, but here I will use Chrome.

After downloading the libraries and the browser driver we need to include them in our Visual Studio solution:

VS Solution

Creating the scraping program

In order to use the WebDriver in our program we need to add its namespaces:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;


Then, in the main function, we need to initialize the Chrome Driver:

using (var driver = new ChromeDriver())

{

 This piece of code searches for the chromedriver.exe file. If this file is located in a directory different from the directory where our program is executed, then we need to specify explicitly its path in the ChromeDriver constructor.

When an instance of ChromeDriver is created, a new Chrome browser will be started. Now we can control this browser via the driver variable. Let’s navigate to the target URL first:

driver.Navigate().GoToUrl("http://testing-ground.scraping.pro/login");

Then we can find the web page elements needed for us to login in the private area of the website:

var userNameField = driver.FindElementById("usr");
var userPasswordField = driver.FindElementById("pwd");
var loginButton = driver.FindElementByXPath("//input[@value='Login']");

Here we search for user name and password fields and the login button and put them into the corresponding variables. After we have found them, we can type in the user name and the password  and press the login button:

userNameField.SendKeys("admin");
userPasswordField.SendKeys("12345");
loginButton.Click();


At this point the new page will be loaded into the browser, and after it’s done we can scrape the text we need and save it into the file:

var result = driver.FindElementByXPath("//div[@id='case_login']/h3").Text;

File.WriteAllText("result.txt", result);

That’s it! At the end, I’d like to give you a bonus – saving a screenshot of the current page into a file:

driver.GetScreenshot().SaveAsFile(@"screen.png", ImageFormat.Png);

The complete program listing

using System.IO;
using System.Text;
using System.Drawing.Imaging; // needed for ImageFormat in the screenshot call
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;


namespace WebDriverTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize the Chrome Driver
            using (var driver = new ChromeDriver())
            {
                // Go to the home page
                driver.Navigate().GoToUrl("http://testing-ground.scraping.pro/login");

                // Get the page elements
                var userNameField = driver.FindElementById("usr");
                var userPasswordField = driver.FindElementById("pwd");
                var loginButton = driver.FindElementByXPath("//input[@value='Login']");

                // Type user name and password
                userNameField.SendKeys("admin");
                userPasswordField.SendKeys("12345");

                // and click the login button
                loginButton.Click();

                // Extract the text and save it into result.txt
                var result = driver.FindElementByXPath("//div[@id='case_login']/h3").Text;
                File.WriteAllText("result.txt", result);

                // Take a screenshot and save it into screen.png
                driver.GetScreenshot().SaveAsFile(@"screen.png", ImageFormat.Png);
            }
        }
    }
}

Also you can download a ready project here.

Conclusion

I hope you are impressed with how easy it is to scrape web pages using the WebDriver. You can naturally press keys and click buttons as you would in working with the browser. You don’t even need to understand what kind of HTTP requests are sent and what cookies are stored; the browser does all this for you. This makes the WebDriver a wonderful tool in the hands of a web scraping specialist.

Source:http://scraping.pro/example-of-scraping-with-selenium-webdriver-in-csharp/

Wednesday 5 November 2014

Application of Web Data Mining in CRM

The process of improving customer relations and interactions and making them more amicable may be termed Customer Relationship Management (CRM). Since web data mining applies various modeling and data analysis methods to detect patterns and relationships in data, it can be used as an effective tool in CRM. By effectively using web data mining you are able to understand what your customers want.

It is important to note that web data mining can be used effectively in searching for the right potential customers to be offered the right products at the right time. The result of this in any business is an increase in the revenue generated. This is made possible as you are able to respond to each customer in an effective and efficient way. The method also utilizes very few resources and can therefore be termed an economical method.

In the next paragraphs we discuss the basic process of customer relationship management and its integration with web data mining services. The following is the basic process that should be used in understanding what your customers need, sending them the right offers and products, and reducing the resources used in managing your customers.

Defining the business objective. Web data mining can be used to define your business objective and communicate it to your customers. By doing research you can determine whether your business objective is communicated well to your customers and clients. Does your business objective take an interest in the customers? Your business goal must be clearly outlined in your CRM. Having a precise, well-defined goal is the surest way of ensuring success in customer relationship management.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/application-web-data-mining-crm/

Monday 8 September 2014

Scraping webdata from a website that loads data in a streaming fashion

I'm trying to scrape some data off of the FEC.gov website using Python for a project of mine. Normally I use Python mechanize and BeautifulSoup to do the scraping.

I've been able to figure out most of the issues but can't seem to get around a problem. It seems like the data is streamed into the table and mechanize.Browser() just stops listening.

So here's the issue: if you visit http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A ... you get the first 500 contributors whose last name starts with A and who have given money to candidate P80003338 ... however, if you use browser.open() at that URL, all you get is the first ~5 rows.

I'm guessing it's because mechanize isn't letting the page fully load before the .read() is executed. I tried putting a time.sleep(10) between the .open() and .read(), but that didn't make much difference.

And I checked: there's no JavaScript or AJAX on the website (or at least none is visible when you use 'view source'), so I don't think it's a JavaScript issue.

Any thoughts or suggestions? I could use Selenium or something similar, but that's something I'm trying to avoid.

-Will

2 Answers

Why not use an HTML parser like lxml with XPath expressions?

I tried

>>> import lxml.html as lh
>>> data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')
>>> name = data.xpath('/html/body/table[2]/tr[5]/td[1]/a/text()')
>>> name
[' AABY, TRYGVE']
>>> name = data.xpath('//table[2]/*/td[1]/a/text()')
>>> len(name)
500
>>> name[499]
' AHMED, ASHFAQ'
>>>



Similarly, you can create XPath expressions of your choice to work with.
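
For reference, here is the same approach as a small non-interactive script (a minimal sketch: the URL and XPath come from the session above, while the output file name is just an example):

import lxml.html as lh

# Parse the results page directly from the URL; as shown in the session
# above, lxml sees the full 500-row table
data = lh.parse('http://query.nictusa.com/cgi-bin/can_ind/2011_P80003338/1/A')

# Text of the first link in the first column of the second table = contributor names
names = data.xpath('//table[2]/*/td[1]/a/text()')

# Save one name per line
with open('contributors.txt', 'w') as f:
    for name in names:
        f.write(name.strip().encode('utf-8') + '\n')

print len(names), 'names saved'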


Source: http://stackoverflow.com/questions/9435512/scraping-webdata-from-a-website-that-loads-data-in-a-streaming-fashion

How can I circumvent page view limits when scraping web data using Python?

I am using Python to scrape US postal code population data from http://www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL format, so my script simply does the following for each postal_code in a range:

    Creates a URL for the given postal code
    Tries to get a response from that URL
    If (2) succeeds, checks the HTTP status code of that URL
    If the status is 200, retrieves the HTML and scrapes the data into a list
    If the status is not 200, passes and counts an error (not a valid postal code/URL)
    If there is no response from the URL because of an error, passes that postal code and counts an error
    At the end of the script, prints the counter variables and a timestamp

The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).

My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?

Thanks for the help! Here is the code:

##POSTAL CODE POPULATION SCRAPER##

import requests
import re
import datetime

def zip_population_scrape():

    """
    This script will scrape population data for postal codes in range
    from city-data.com.
    """
    postal_code_data = [['zip','population']] #list for storing scraped data

    #Counters for keeping track:
    total_scraped = 0
    total_invalid = 0
    errors = 0


    for postal_code in range(1001,5000):

        #This if statement zero-pads postal codes below 10000, because the
        #leading zero is lost when iterating over plain integers
        if postal_code <10000:
            postal_code_string = str(0)+str(postal_code)
        else:
            postal_code_string = str(postal_code)

        #all postal code URLs have the same format on this site
        url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'

        #try to get current URL
        try:
            response = requests.get(url, timeout = 5)
            http = response.status_code

            #print current for logging purposes
            print url +" - HTTP:  " + str(http)

            #if valid webpage:
            if http == 200:

                #save html as text
                html = response.text

                #extra print statement for status updates
                print "HTML ready"

                #try to find two substrings in HTML text
                #add the substring in between them to list w/ postal code
                try:           

                    found = re.search('population in 2011:</b> (.*)<br>', html).group(1)

                    #add to # scraped counter
                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])

                    #print statement for logging
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
                #if substrings not found, try searching for others
                #and doing the same as above   
                except AttributeError:
                    found = re.search('population in 2010:</b> (.*)<br>', html).group(1)

                    total_scraped +=1

                    postal_code_data.append([postal_code_string,found])
                    print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."

            #if http == 404, zip is not valid. Add to counter and print log
            elif http == 404:
                total_invalid +=1

                print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."

            #other http codes: add to error counter and print log
            else:
                errors +=1

                print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."

        #if get url fails by connnection error, add to error count & pass
        except requests.exceptions.ConnectionError:
            errors +=1
            print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
            pass

        #if get url fails by timeout error, add to error count & pass
        except requests.exceptions.Timeout:
            errors +=1
            print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
            pass


    #print final log/counter data, along with timestamp finished
    now= datetime.datetime.now()
    print now.strftime("%Y-%m-%d %H:%M")
    print str(total_scraped) + " total zips scraped."
    print str(total_invalid) + " total unavailable zips."
    print str(errors) + " total errors."



Source: http://stackoverflow.com/questions/25452798/how-can-i-circumvent-page-view-limits-when-scraping-web-data-using-python

Sunday 7 September 2014

Web data scraping (online news comments) with Scrapy (Python)

Since you seem like the try-first, ask-questions-later type (that's a very good thing), I won't give you an answer, but a (very detailed) guide on how to find the answer.

The thing is, unless you are a Yahoo developer, you probably don't have access to the source code you're trying to scrape. That is to say, you don't know exactly how the site is built and how your requests to it as a user are being processed on the server side. You can, however, investigate the client side and try to emulate it. I like using Chrome Developer Tools for this, but you can use others such as FF Firebug.

So first off we need to figure out what's going on. The way it works is: you click on 'show comments', it loads the first ten, then you need to keep clicking for the next ten comments each time. Notice, however, that all this clicking isn't taking you to a different link, but fetches the comments live, which is a very neat UI but for our case requires a bit more work. I can tell two things right away:

    They're using JavaScript to load the comments (because I'm staying on the same page).
    They load them dynamically with AJAX calls each time you click (meaning that instead of loading the comments with the page and just showing them to you, each click does another request to the database).

Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>

By looking at that we still don't know how that's generated or what it does when clicked. Fine. Now, keeping the devtools window open, let's click on it. This opened up the first ten. But in fact, a request was being made for us to fetch them. A request that Chrome devtools recorded. We look in the network tab of the devtools and see a lot of confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1

See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object (looks like a Python dictionary) containing all ten comments which that request fetched. Now that's the request you need to emulate, because that's the one that gives you what you want. First let's translate this to some normal request that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}
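
As a rough sketch (not part of the original answer), that request can be emulated in Python with the requests library, reusing the URL and parameters listed above; the structure of the returned JSON isn't shown in the thread, so the final print is just for inspection:

import requests

url = 'http://news.yahoo.com/_xhr/contentcomments/get_comments/'
params = {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}

# Send the same GET request the page itself sends via AJAX
response = requests.get(url, params=params)

# The endpoint returns JSON; print the top-level keys to see where the comments live
data = response.json()
print data.keys()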

Now it's just a matter of trial-and-error. However, a few things to note here:

    Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So now we understand how to use offset.

    The content_id is probably something that identifies the article you are reading, meaning you need to fetch it from the original page somehow. Try digging around a little, you'll find it.

    Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch the total number of comments somehow (either find out how the page gets it, or just fetch it from within the article itself).

    Using the devtools you have access to all client-side scripts. So by digging you can find that the link to /get_comments/ is kept within a JavaScript object named YUI. You can then try to understand how it is making the request, and try to emulate that (though you can probably figure it out yourself).

    You might need to overcome some security measures. For example, you might need a session key from the original article before you can access the comments. This is used to prevent direct access to some parts of the site. I won't trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in case it shows up.

    Finally, you'll have to parse the JSON object (Python has excellent built-in tools for that) and then parse the HTML comments you are getting (for which you might want to check out BeautifulSoup).

As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task either.

So don't panic.

It's just a matter of digging and digging until you find gold (also, having some basic web knowledge doesn't hurt). Then, if you face a roadblock and really can't go any further, come back here to SO and ask again. Someone will help you.


Source: http://stackoverflow.com/questions/20218855/web-data-scraping-online-news-comments-with-scrapy-python

Saturday 6 September 2014

A good web data extraction/screen scraper program?


I need to capture product data from a site on a regular basis and wondered if anyone knows of a good software program? I've trialed Mozenda but it's a monthly subscription and pricey in the long term. Obviously something that's free would be best, but I don't mind paying either. I just need a decent program that's reliable and doesn't require much programming knowledge.

You can try ScraperWiki.com if you know python.

I've experimented with Screen-Scraper and found it easy to use. The application comes in multiple versions: basic (which is free), professional, and enterprise. Also, multiple platforms are supported.

Hire a programmer to do it so that there is only a one-off cost. I often see similar projects on freelancing websites like Elance and oDesk.

I really like iMacros. You can give it a test drive to see if it meets your needs with the totally free Firefox extension (there are also IE versions), but there are also more full-featured application and "server" versions that have more features and the ability to run unattended.

Here are some other alternatives to consider:

    License the data from the provider. Call em up and ask 'em.

    Use Amazon Mechanical Turk to get humans to copy and paste and format it for ya. They are cheap.

    For automation, it depends on how complicated the HTML is and how often it changes. You could use Excel's Web Data Import if it's really simple.


You can use irobot from IRobotSoft, which is totally free and provides more functionality than other paid software. Watch demos here http://irobotsoft.com/help/ to see how simple it is.

Questions on their forum were answered very quickly.


Source: http://stackoverflow.com/questions/2334164/a-good-web-data-extraction-screen-scraper-program

Friday 5 September 2014

How to login to website and extract data using PHP [closed]

I have installed Tiny Tiny RSS on my computer (Windows) and also have XAMPP installed (localhost).

I want to be able to use PHP to extract data from the Tiny tiny RSS webpage.

I have tried this, which just opens the front page:

<?php
$homepage = file_get_contents('my install tiny tiny rss url');
echo $homepage;
?>

But how do I log in and extract the data?

You can use cURL to send POST data and headers. To log in you need to replicate the exact data exchange between the client and the server.
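
To make that concrete, here is a minimal sketch of the same login flow, written in Python with requests for illustration; the login URL and form field names below are placeholders (inspect the actual login form of your Tiny Tiny RSS install to find the real ones), and the same idea of posting credentials and keeping cookies carries over to PHP cURL:

import requests

# Placeholder URL and field names - check the real login form of your install
login_url = 'http://localhost/tt-rss/login'
payload = {'login': 'admin', 'password': 'secret'}

# A session keeps the cookies returned by the login response,
# so subsequent requests are made as the logged-in user
session = requests.Session()
session.post(login_url, data=payload)

page = session.get('http://localhost/tt-rss/')
print page.text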


Source: http://stackoverflow.com/questions/20611918/how-to-login-to-website-and-extract-data-using-php

Is it ok to scrape data from Google results?

I'd like to fetch results from Google using curl to detect potential duplicate content. Is there a high risk of being banned by Google?

Google will eventually block your IP when you exceed a certain number of requests.



Google disallows automated access in their TOS, so if you accept their terms you would break them.

That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google; they powered their search engine Bing with it. They got caught red-handed in 2011 :)

There are two options to scrape Google results:

1) Use their API

    You can issue around 40 requests per hour. You are limited to what they give you; it's not really useful if you want to track ranking positions or what a real user would see. That's something you are not allowed to gather.

    If you want a higher amount of API requests you need to pay.
    60 requests per hour cost 2000 USD per year, more queries require a custom deal.

2) Scrape the normal result pages

    Here comes the tricky part. It is possible to scrape the normal result pages, but Google does not allow it.
    If you scrape at a rate higher than 15 keyword requests per hour you risk detection; higher than 20/h will get you blocked, in my experience.
    By using multiple IPs you can up the rate, so with 100 IP addresses you can scrape up to 2,000 requests per hour (50k a day).
    There is an open source search engine scraper written in PHP at http://scraping.compunect.com. It reliably scrapes Google, parses the results properly, and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart; otherwise the code will still be useful for learning how it is done.


Source: http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

Thursday 4 September 2014

Data Scraping from PDF and Excel


I am doing a little data scraping. There are 3 types of files from which I am scraping data:

1- HTML
2- PDF
3- Excel (xls)

For HTML I am comfortable; I am using HTML Agility for that.

For PDF and Excel I need suggestions from anyone.



Concerning Excel: if you are in an MS environment you can either do Office Automation or use OLEDB. In a Java environment look at Apache POI.

EDIT: Concerning PDF in Java, try Apache PDFBox. It can also work in .NET using IKVM.

I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF files into Excel. We have used it with great success.

HTML Agility is a library. It's good to use. But then, why do you need separate tools for different data extraction purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all three sources you have specified. Google it.

Source: http://stackoverflow.com/questions/3147803/data-scraping-from-pdf-and-excel

Wednesday 3 September 2014

Excel VBA Data Mining Real-Time Data from a Web Page that Refreshes Data

I want to capture real-time data that updates into a table on a webpage; I would prefer capturing it into Excel using VBA, but I will write it in .NET C# or VB if that is easier.

The data updates about every 1 or 2 seconds, and I want to just grab the latest data quotes and log them into my spreadsheet; the table names are the same, only the data refreshes, and it does so automatically on the web page.

I've done a lot of Excel VBA and I know how to download a URL to a file--this is NOT what I want; I want to gain access to my webpage that is active and grab the data updates after I've logged into my site and selected a webpage that I like.

Is there a simple way to access this data on the webpage from Excel or .Net? Because it refreshes no more than once every 1 or 2 seconds, it is easy to just keep checking it for updates, and I can compare the latest data to see if it actually refreshed.


In Excel 2003, use Data/Import External Data/New Web Query
Browse to your page and select the table you want to import.
After that you can either do a manual Refresh, or use a timer procedure to do something like:

Source: http://stackoverflow.com/questions/9855794/excel-vba-data-mining-real-time-data-from-a-web-page-that-refreshes-data

Tuesday 2 September 2014

Need to pull data from a website…web query? macro?


I have a list of every DOT # (Dept. of Trans.) in the country. I want to find out the insurance effective date for each one of these companies. If you go to http://li-public.fmcsa.dot.gov --> "continue" --> then from the dropdown select "carrier search" and hit "go", it'll take you to a search form (that is the only way to get to this screen).

From there, you can input a DOT # X (use 61222 as an example) and it'll bring you to another screen. Click "view report in HTML" and then down on the bottom you'll see "Active/Pending Insurance". I want to pull the "effective date" from that page and stick it in the spreadsheet next to the DOT # X that I already know.

Of the thousands of DOT #'s in my list, not all will have filings on this website, if that makes a difference.

Can this be done with a Macro or Excel Web Query? I know I probably sound like a total novice, but I'd appreciate any help I could get.

Can you do it? Frankly, even if you could, you'd lock up the spreadsheet while it's doing that processing. And in the end, how would you handle an error halfway through?

I'd not do this in a client-facing application. This sounds more like something to do in a server-side app that can do the processing and gather the information in a more controlled environment. Then your Excel spreadsheet could query that app and get the information in one fell swoop. Error handling is much simpler, and you don't end up sitting there staring at Excel while it works its way through thousands of web sites. It was not built to do that elegantly.

What would you write the web service I'm describing in? Well, it depends on your preference. Me, I'd write it in Ruby on Rails since it can easily handle the scraping aspect of the task and can report the data out easily as well. But it really falls back to whatever you're most comfortable coding in.


Source: http://stackoverflow.com/questions/15286429/need-to-pull-data-from-a-website-web-query-macro

How to extract data from web 2.0 graphs using a scraper


I have recently come across a web page containing a graph object that displays the (x, y) values on the object as the mouse is rolled across it. Is there any way to automate the extraction of this data?

How is the graph data loaded? If it is embedded in the page source then you can extract it with XPath or regex. Otherwise, use Firebug to see how it is loaded.
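
If the data does turn out to be embedded in the page source, a rough sketch of the regex route looks like this (the URL and the data = [...] pattern are hypothetical; the real variable name has to be found by viewing the page source):

import re
import json
import requests

# Hypothetical page: many chart widgets embed their points as a
# JavaScript array such as  data = [[1, 2.5], [2, 3.1], ...]
html = requests.get('http://example.com/page-with-graph').text

match = re.search(r'data\s*=\s*(\[\[.*?\]\])', html)
if match:
    points = json.loads(match.group(1))
    for x, y in points:
        print x, y
else:
    print 'No embedded data found - inspect the AJAX calls with Firebug instead'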



You will need a solution that works inside the web browser, so the AJAX/Javascript is properly rendered.

I have used iMacros with good success for web scraping in the past. There are free/open-source and "PRO" paid editions (comparison table here).

Another option is always to custom code something with the Microsoft webbrowser control.


Source: http://stackoverflow.com/questions/3980774/how-to-extract-data-from-web-2-0-graphs-using-a-scraper

Legality of Web Scraping vs Normal Use


I know the topic of web scraping has been discussed before (example), and I understand it's a bit of a grey area depending on a lot of factors (e.g. the website's terms of use).

What I'd like to ask is: how is web scraping any different from (a) how we access a webpage via a web browser, and (b) how web crawlers (e.g. Google) download and index webpages?

Without knowing the legal background, I can't help but think that they're all just HTTP requests. If web scraping is illegal, then crawling and indexing should be illegal too.

Of course, if your program is hitting the server so hard that it causes a denial of service, it's a different story altogether... my point is simply accessing and using data that is already open to the public.



I know this is a dead thread, but it would be nice to place some legal implications here due to its ranking in my Google search. I cannot help but figure I am not the only one who searches like I do.

Legally, in the US, there are a few factors that seem to be important.

    Are you doing anything that is akin to hacking or gaining unauthorized access under the Computer Fraud and Abuse Act? Exploiting vulnerabilities and passing SQL in the URL to open a database, no matter how badly the site was programmed, is illegal, with up to a 15-year sentence (see the cases where an individual exploited security vulnerabilities in Verizon). Also, add a timeout even if you round-robin or use proxies. DDoS attacks are attacks; 1,000 requests per second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

    Copyright law: As mentioned, pure replication of data is illegal. Even 4% replication has been deemed a breach. With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties.

    Trespass to chattels: The following from Wikipedia says it all.

    U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder's Edge, resulted in an injunction ordering Bidder's Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

    Paywalls and product: When you go behind paywalls and breach a contract by clicking an agreement not to do something and then doing it, you add fuel to the protection of negligence vs. willfulness [an issue for damages and penalties, not guilt] in civil and any criminal trials. (Sorry, I originally wanted to say ignorance, but it really isn't a defense.)

    International: EU law and other law is way more lax. Corporations with big budgets dominate our legal landscape; they control the system in a very real way with their $$$.

Basically, get public information and information that is available without going behind a paywall. Think like a user of the internet and combine a bunch of sources into a unique product. Don't just 'steal' an entire site (it isn't really stealing if it is a government site that offers public data, especially for download, but it is if you download all or even more than a couple of the listings on eBay). Read the terms and conditions to know who actually owns the content.

Here are a few examples. Trulia owns its information, but you could use it to go to an agent's website and collect a legal amount of information. The legal amount is determinable. However, a public MLS listing lookup site with no agreement or terms that offers data to the public is fair game. The MLS number lists, however, are normally not fair game.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a million corporate researchers at your disposal.

As for company policy, it is usually used internally to shield from liability and serves as a warning, but it is not entirely enforceable. The legal parts letting you know about copyrights and such are, and usually are supposed to be, known by everyone. Complete ignorance is not a legal protection. It does provide a ground set of rules. "Be nice, or get banned" is the message, as far as I know.

My personal strategy is to start with public data and embellish it within legal means.


Source: http://stackoverflow.com/questions/14735791/legality-of-web-scraping-vs-normal-use