require 'rubygems'
require 'hpricot'
require "watir"
# Navigate to Google in a new IE instance
# Search for the following: pirates vs ninjas
# Iterate through the first 10 pages of Google results
# On each page collect the following:
#   The blue page title
#   The green URL
# Output all the data into a file named test_output.txt
# Format: "Title - URL"
#
# Prepare Watir
url = "http://www.google.com/"
browser = Watir::Browser.new
browser.goto url
# Input our search term, "pirates vs ninjas", and click the search button
browser.text_field(:name, "q").set "pirates vs ninjas"
browser.button(:name, "btnG").click
results = Array.new
# We loop through 10 pages of results
for page in 1..10
  # Bring Watir's HTML into Hpricot so we can easily process it
  parsed = Hpricot(browser.html)
  # Look in the LIs (class g)
  results << (parsed/"li.g").map do |ele|
    {
      # Grab the title from the inner_text of A (Blue Text)
      :title => ele.at("a").inner_text,
      # Grab the URL from the inner_text of Google's citation (Green Text)
      :url => (ele/"//cite").first.inner_text
    }
  end
  # Use Watir to head over to the next page unless we are done
  browser.link(:text, (page+1).to_s).click unless page >= 10
end
# Time to write to file
outfile = File.new("test_output.txt", "w")
# Results are organized by page.
results.each do |page_results|
  # Write each individual SERP's title and url
  page_results.each do |serp|
    # Clean the nasty - ## - stuff from the URL text
    outfile.puts serp[:title] + " - " + serp[:url].gsub(/ -.*-/, '')
  end
end
# Close the file so the buffered output is flushed to disk
outfile.close
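The cleanup regex in the last step can be tried on its own, without a browser. This is only a sketch: the helper name clean_cite and the sample citation strings are made up for illustration, and the extra strip call is an addition that the original script does not have.

```ruby
# Hypothetical helper; not part of the original script.
def clean_cite(text)
  # " -.*-" matches from the first " -" through the last "-" on the line
  # (for example " - 25k -"); strip then trims any leftover whitespace.
  text.gsub(/ -.*-/, '').strip
end

# Made-up citation strings in the style of Google's green URL text.
puts clean_cite("www.example.com/pirates - 25k -")
puts clean_cite("www.my-site.com/a-b - 12k -")
puts clean_cite("www.example.com/ninjas")
```

Hyphens inside the URL itself survive because the pattern only starts matching at a hyphen that follows a space.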
Refactorings
mischamolhoek.myopenid.com
December 18, 2008 08:52
Just a question: why use Watir instead of open-uri?
Tj Holowaychuk
December 20, 2008 18:44
Yeah, I was just going to ask the same thing. I'm sure you could just use open-uri and GET or POST data as needed. Running a browser in the background seems pretty hack-ish.
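As a rough sketch of the browser-free approach, the page URLs can be built with the standard library alone and then fetched with open-uri. The helper name search_url is hypothetical, and the q and start parameter names, along with 10 results per page, are assumptions about Google's search URL rather than anything confirmed here. Nothing is fetched in this example.

```ruby
require 'uri'

# Hypothetical helper: builds the URL for one results page without fetching it.
# Assumes the search form takes the query in "q" and a zero-based result
# offset in "start".
def search_url(query, page, per_page = 10)
  params = { 'q' => query, 'start' => ((page - 1) * per_page).to_s }
  "http://www.google.com/search?" + URI.encode_www_form(params)
end

# Print the URLs for the first three pages.
(1..3).each { |page| puts search_url('pirates vs ninjas', page) }
```

Each URL could then be handed to open-uri in place of clicking the numbered links in a real browser.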
Matt Downey
January 2, 2009 08:33
The first 65 results are easy to get using Google's AJAX API.
For the 'pirates vs ninjas' example:
getSERPs('pirates vs ninjas')
There is an optional second argument that selects the search service, e.g. 'news':
getSERPs('pirates vs ninjas','news')
require 'rubygems'
require 'json'
require 'net/http'
def google(query, service="web", optionalArgs="")
  url = "http://ajax.googleapis.com/ajax/services/search/#{service}?v=1.0&q=#{URI.encode(query)}&rsz=large" + optionalArgs
  resp = Net::HTTP.get_response(URI.parse(url))
  data = resp.body
  # convert JSON to hash
  result = JSON.parse(data)
  if result["responseStatus"] != 200
    raise "google web service error"
  end
  result = result["responseData"]["results"]
  return result
end
def getSERPs(query, service="web")
  outfile = File.new(query + ".dat", "w")
  # 8 pages of 8 results each
  for p in 0..7
    g = google(query, service, ('&start=' + (p * 8).to_s))
    puts "Got page " + (p + 1).to_s + "..."
    g.each do |r|
      outfile.puts r['titleNoFormatting'] + ' - ' + r['unescapedUrl']
    end
  end
  outfile.close()
  puts "Wrote " + query + ".dat"
end
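The parsing half of this code can be exercised offline. The sketch below feeds a made-up response body through the same field names the code above reads (responseStatus, responseData, results, titleNoFormatting, unescapedUrl). The sample data and the helper names sample_body and serp_lines are hypothetical, and the real API's full response shape is not shown in this thread.

```ruby
require 'json'

# Made-up response body; only the fields read by the code above are included.
def sample_body
  <<~BODY
    {
      "responseStatus": 200,
      "responseData": {
        "results": [
          { "titleNoFormatting": "Pirates vs Ninjas", "unescapedUrl": "http://example.com/pirates" },
          { "titleNoFormatting": "Ninja facts", "unescapedUrl": "http://example.com/ninjas" }
        ]
      }
    }
  BODY
end

# Hypothetical helper: turn one response body into "Title - URL" lines.
def serp_lines(body)
  parsed = JSON.parse(body)
  raise "google web service error" if parsed["responseStatus"] != 200
  parsed["responseData"]["results"].map do |r|
    "#{r['titleNoFormatting']} - #{r['unescapedUrl']}"
  end
end

puts serp_lines(sample_body)
```

Keeping the formatting separate from the HTTP call makes it easy to check the output format without sending any queries.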
vivnet.myopenid.com
January 2, 2009 15:39
Hi guys, I'm interested in an ideal resolution to this problem as well. I was curious, mischamolhoek.myopenid.com and Tj Holowaychuk: how would you approach this problem using open-uri only, without touching Watir? (It would certainly help if you could provide actual code!) Also, Matt Downey, do you happen to know why Google's API restricts you to 65 results rather than, say, 100 or more?
Matt Downey
January 2, 2009 23:06
Well, to answer your question: would you want people scraping all the results from your search engine? Google sure doesn't, and that is why they restrict the number of results you can get through their API. I wrote some code to answer your question below, but keep in mind that if you send Google too many automated queries, they will flag you and you will have to enter a CAPTCHA. To get a lot of results this way, you would have to rotate your IP address through proxies or something similar. Also, net/http and open-uri are kind of slow, and curb would probably be better, but not every system has cURL installed. Probably not the best code, but here you go. If someone could refactor it using Hpricot, I'd love to learn more about that library :-)
require 'open-uri'
require 'rubygems'
require 'rubyful_soup'
def scrapeGoogle(query, pages)
  results = []
  # headings Google inserts above non-web result blocks; these are skipped
  extras = ['News results for ', 'Video results for ', 'Image results for ']
  for page in 1..pages
    startQuery = (page - 1) * 10
    url = "http://www.google.com/search?hl=en&safe=off&q=#{URI.encode(query)}&start=#{startQuery}"
    data = open(url).read # open page w/ open-uri and read it into a string
    # parse page w/ rubyful_soup
    soup = BeautifulSoup.new(data)
    links = soup.find_all('h3', :attrs => {'class' => 'r'}) # find all h3 tags where class is 'r'
    links.each do |link|
      anchor = link.contents[0] # the <a> tag inside the h3
      # make sure the links aren't extra ones...
      # ('a' || 'b' || 'c' evaluates to just 'a', so compare against each label)
      unless extras.any? { |label| anchor.contents[0] == label }
        title = []
        # clean the <em> tags around the query in the title
        anchor.contents.each do |piece|
          if piece.class == Tag # if it's a tag
            piece = piece.contents[0] # save the contents
          end
          title << piece # put each piece in title array
        end
        title = title.join # join the pieces into a string
        # put the results in a hash, and append to an array
        results << {'title' => title, 'href' => anchor['href']}
      end
    end
    # display status message
    puts 'Done scraping page ' + page.to_s + ' of ' + pages.to_s
  end
  return results
end
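One detail of the "extra links" check is worth spelling out: in Ruby, a chain such as 'a' || 'b' || 'c' evaluates to its first truthy operand, so comparing a value against the whole chain only ever tests the first label. A membership test covers all of them. The helper name below is hypothetical; the three labels are the ones from the code above.

```ruby
# Hypothetical helper: true when a heading is one of the extra result blocks.
def extra_result_heading?(text)
  ['News results for ', 'Video results for ', 'Image results for '].include?(text)
end

# The || chain collapses to its first operand before any comparison happens.
collapsed = ('News results for ' || 'Video results for ' || 'Image results for ')
puts collapsed.inspect

# The membership test checks every label.
puts extra_result_heading?('Video results for ')
puts extra_result_heading?('Pirates vs Ninjas')
```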
Tj Holowaychuk
January 3, 2009 03:59
I'm actually working on an SEO library (Ruby, of course), so I will be doing this shortly. I'll post my implementation ASAP.
vivnet.myopenid.com
January 5, 2009 02:59
Does anyone have ideas for improving zigx's original script as is? For example, writing better XPath than in "results << (parsed/"li.g").map do |ele|". Let's say we had to use Watir and Hpricot. Also, I'm curious why he had to use symbols to store the titles and URLs.
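On the symbols question, one possible reading is that nothing forces them: symbols are simply the conventional choice for hash keys in Ruby, and string keys would work as well. Separately, the nested array of pages can be avoided by appending each page's hashes to one flat list. The sketch below uses made-up data in place of the Hpricot output, and the helper names sample_pages and collect_results are hypothetical.

```ruby
# Made-up per-page data standing in for what the Hpricot map returns.
def sample_pages
  [
    [{ :title => 'Pirates vs Ninjas', :url => 'www.example.com/pirates' }],
    [{ :title => 'Ninja facts', :url => 'www.example.com/ninjas' }]
  ]
end

# Hypothetical helper: concat appends each page's hashes to one flat array,
# so writing the file needs a single loop instead of a nested one.
def collect_results(pages)
  results = []
  pages.each { |page_results| results.concat(page_results) }
  results
end

collect_results(sample_pages).each do |serp|
  puts "#{serp[:title]} - #{serp[:url]}"
end
```

In the original loop this would amount to calling results.concat on the mapped array instead of using <<.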
Would love to see a COOLer way to do this. I feel that it can be made better, especially around the "results << (parsed/"li.g").map do |ele|" line. Thanks! (PS: the purpose of the script is included in a comment.)