Would love to see a COOLer way to do this. I feel that this can be made better, especially around the "results << (parsed/"li.g").map do |ele|" line. Thanks! (PS: the purpose of the script is included in a comment.)

require 'rubygems'
require 'hpricot'
require 'watir'

# Navigate to Google in a new IE instance
# Search for the following: pirates vs ninjas
# Iterate through the first 10 pages of Google results
# On each page collect the following:
#   The blue page title
#   The green URL
# Output all the data into a file named test_output.txt
#   Format "Title - URL"

# Prepare Watir
url = "http://www.google.com/"
browser = Watir::Browser.new
browser.goto url

# Input our search term, "pirates vs ninjas", and click the search button
browser.text_field(:name, "q").set "pirates vs ninjas"
browser.button(:name, "btnG").click

results = Array.new

# We loop through 10 pages of results
for page in 1..10

  # Bring Watir's HTML into Hpricot so we can easily process it
  parsed = Hpricot(browser.html)

  # Look in the LIs (class g)
  results << (parsed/"li.g").map do |ele|
    {
      # Grab the title from the inner_text of A (Blue Text)
      :title => ele.at("a").inner_text,
      # Grab the URL from the inner_text of Google's citation (Green Text)
      :url => (ele/"//cite").first.inner_text
    }
  end

  # Use Watir to head over to the next page unless we are done
  browser.link(:text, (page+1).to_s).click unless page >= 10
end

# Time to write to file
outfile = File.new("test_output.txt", "w")

# Results are organized by page.
results.each do |page_results|
  # Write each individual SERP's title and url
  page_results.each do |serp|
    # Clean the nasty - ## - stuff from the URL text
    outfile.puts serp[:title] + " - " + serp[:url].gsub(/ -.*-/, '')
  end
end

# Close the file so the output is flushed to disk
outfile.close

Refactorings

mischamolhoek.myopenid.com

December 18, 2008 08:52

Just a question: why use Watir instead of open-uri?

Tj Holowaychuk

December 20, 2008 18:44

Yeah, I was just going to ask the same thing. I'm sure you could just use open-uri and GET or POST some data as needed. Running a browser in the background seems pretty hack-ish.

Matt Downey

January 2, 2009 08:33

The first 65 results are easy to get using Google's AJAX API.
For the 'pirates vs ninjas' example, call:
getSERPs('pirates vs ninjas')
There is an optional second argument for the search service, which can be set to 'news', etc.:
getSERPs('pirates vs ninjas', 'news')

require 'rubygems'
require 'json'
require 'net/http'

def google(query, service="web", optionalArgs="")
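   # URI.encode was removed in Ruby 3.0; encode_www_form_component escapes the query value
   url = "http://ajax.googleapis.com/ajax/services/search/#{service}?v=1.0&q=#{URI.encode_www_form_component(query)}&rsz=large" + optionalArgs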
   resp = Net::HTTP.get_response(URI.parse(url))
   data = resp.body

   # convert JSON to hash
   result = JSON.parse(data)

   if result["responseStatus"] != 200
      raise "google web service error"
   end
   result = result["responseData"]["results"]
   return result
end

def getSERPs(query, service="web")
  outfile = File.new(query + ".dat", "w")
  for p in 0..7
    g = google(query, service, ('&start=' + (p * 8).to_s))
    puts "Got page " + (p + 1).to_s + "..."
    g.each do |r|
      outfile.puts r['titleNoFormatting'] + ' - ' + r['unescapedUrl']
    end
  end
  outfile.close()
  puts "Wrote " + query + ".dat"
end
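
The paging arithmetic in getSERPs can be checked without touching the network. What follows is only a sketch: it assumes 8 results per page, as the (p * 8) step in the loop implies, and `page_offsets` and `search_url` are hypothetical helper names, not part of any Google library. The endpoint string is copied from the code above.

```ruby
require 'uri'

RESULTS_PER_PAGE = 8 # assumption: page size implied by the (p * 8) step above

# Start offsets for the given number of pages: 0, 8, 16, ...
def page_offsets(pages, per_page = RESULTS_PER_PAGE)
  (0...pages).map { |p| p * per_page }
end

# Build the request URL for one page of results.
def search_url(query, start, service = "web")
  params = URI.encode_www_form("v" => "1.0", "q" => query, "rsz" => "large", "start" => start)
  "http://ajax.googleapis.com/ajax/services/search/#{service}?#{params}"
end

offsets = page_offsets(8)
puts offsets.inspect
puts search_url("pirates vs ninjas", offsets.last)
```

With 8 pages the offsets run from 0 to 56, so the loop as written requests at most 64 results.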

vivnet.myopenid.com

January 2, 2009 15:39

Hi guys, I'm interested in an ideal resolution to this problem as well. I was curious, mischamolhoek.myopenid.com and Tj Holowaychuk: how would you approach this problem using open-uri only and without touching Watir? (It would certainly help if you could provide actual code!) Also, Matt Downey, do you happen to know, by any chance, why Google's API restricts you to 65 results, and not, say, 100 results or more?

Matt Downey

January 2, 2009 23:06

Well, to answer your question: would you want people scraping all the results from your search engine? Google sure doesn't, and that is why they restrict the number of results you can get with their API. I wrote some code to answer your question below, but keep in mind that if you send Google too many automated queries, they will flag you and you will have to enter a CAPTCHA. If you wanted to get a lot of results with this, you would have to rotate your IP address through proxies or some other way. Also, net/http and open-uri are kinda slow, and curb would probably be better, but not all systems have cURL... Probably not the best code, but here ya go. If someone could refactor this using Hpricot, I'd love to learn more about that library :-)

require 'open-uri'
require 'rubygems'
require 'rubyful_soup'

def scrapeGoogle(query, pages)
  results = []
  for page in 1..pages
    startQuery = (page - 1) * 10
    # URI.encode was removed in Ruby 3.0; encode_www_form_component escapes the query value
    url = "http://www.google.com/search?hl=en&safe=off&q=#{URI.encode_www_form_component(query)}&start=#{startQuery}"
    data = URI.open(url).read # open page w/ open-uri and read it into a string
    
    # parse page w/ rubyful_soup
    soup = BeautifulSoup.new(data)
    links = soup.find_all('h3', :attrs => {'class' => 'r'}) # find all h3 tags where class is 'r'
    
    links.each do |link|
      anchor = link.contents[0] # the <a> inside the h3
      # make sure the links aren't extra ones...
      # ('a' || 'b' || 'c' evaluates to 'a', so test membership in a list instead)
      unless ['News results for ', 'Video results for ', 'Image results for '].include?(anchor.contents[0])
        title = []
        # clean the <em> tags around the query in the title
        anchor.contents.each do |piece|
          if piece.is_a?(Tag)         # if it's a tag
            piece = piece.contents[0] # save the contents
          end
          title << piece  # put each piece in title array
        end
        title = title.join  # join the pieces into a string
        # put the results in a hash, and append to an array
        results << {'title' => title, 'href' => anchor['href']}
      end
    end
    # display status message
    puts 'Done scraping page ' + page.to_s + ' of ' + pages.to_s
  end
  
  return results
end
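
A note on filtering out the extra "News/Video/Image results" links: in Ruby, an expression like `'a' || 'b' || 'c'` evaluates to `'a'`, because any non-nil string is truthy. Comparing against such an expression therefore only tests the first label, and a membership test is the safer form. Here is a small sketch of the difference using plain strings. `extra_result?` is a hypothetical helper, and the labels are the ones from the code above.

```ruby
EXTRA_LABELS = ['News results for ', 'Video results for ', 'Image results for '].freeze

# True when the heading text is one of the "extra" result labels.
def extra_result?(text)
  EXTRA_LABELS.include?(text)
end

# The || chain collapses to its first operand:
puts(('News results for ' || 'Video results for ').inspect)

# The membership test checks every label:
puts extra_result?('Video results for ')
puts extra_result?('Pirates vs. Ninjas')
```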

Tj Holowaychuk

January 3, 2009 03:59

I'm actually working on an SEO library (Ruby, of course), so I will be doing this shortly. I will post my implementation ASAP.

vivnet.myopenid.com

January 5, 2009 02:59

Does anyone have ideas on improving zigx's original script as is? Meaning, writing better XPath, perhaps, than in "results << (parsed/"li.g").map do |ele|". Let's say we had to use Watir and Hpricot... Also, I'm curious why he had to use symbols to store the titles and URLs.
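
On those two points, here is a rough sketch that uses plain Ruby data instead of Watir and Hpricot so that it runs on its own. It assumes each page has already been reduced to an array of hashes with :title and :url keys, as the original script does. `flatten_pages`, `format_result` and the sample data are made up for illustration. Symbols are simply the conventional choice for fixed hash keys: a given symbol is one shared object, whereas a string literal normally allocates a new String each time it is evaluated. Collecting with `concat` instead of `<<` keeps the results in one flat array, so the nested loop at write time is no longer needed.

```ruby
# Hypothetical sample data standing in for two scraped result pages.
SAMPLE_PAGES = [
  [{ :title => "Pirates vs. Ninjas", :url => "www.example.com/pvn - 12k -" }],
  [{ :title => "Ninja FAQ", :url => "www.example.org/faq - 8k - Cached -" }]
]

# Collect every page's results into one flat array (no array of arrays).
def flatten_pages(pages)
  results = []
  pages.each { |page_results| results.concat(page_results) }
  results
end

# Format one result as "Title - URL", dropping the " - 12k -" style suffix.
def format_result(result)
  "#{result[:title]} - #{result[:url].gsub(/ -.*-/, '')}"
end

flatten_pages(SAMPLE_PAGES).each { |r| puts format_result(r) }
```

If this is carried over to the original loop, the map call would need braces or surrounding parentheses, because a do...end block placed after a method call with arguments binds to the outer call (here, `concat`) rather than to `map`.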

sd

January 17, 2010 10:00

so can use but i want many people love so much .
