I was bored during Ruby class and started writing an NZB client in Ruby. It works fairly well, but it is slow when it comes to parsing the XML files.
This is what I have so far.

I have pasted only the XML parsing part.
The entire program is here: http://ruby.pastebin.ca/1744726
An example NZB file: http://ruby.pastebin.ca/1744710

require 'net/http'
require 'rexml/document'
require 'download.rb'


max_connections = 5 # max simultaneous connections/threads to run
nzb = []            # array to hold the parsed nzb
import_file = ""    # holds the raw imported nzb

# open the nzb and read each line
File.open(ARGV[0], "r") do |aFile|
  aFile.each_line do |line|
    import_file += line
  end
end


# process the nzb file using xml processing
# this needs to be redone as it's slow
doc = REXML::Document.new(import_file)

count = 0
# confusing to look at, but it searches for <groups><group></group></groups>
while defined? doc.root.elements.to_a[count][1][1][0]
  group = doc.root.elements.to_a[count][1][1][0]

  # searches for <segments>
  if doc.root.elements.to_a[count][3][3]
    count2 = 1
    # searches for multiple segments
    while defined? doc.root.elements.to_a[count][3][count2][0]
      segment = doc.root.elements.to_a[count][3][count2][0]
      nzb << [group, segment]
      #puts segment
      count2 += 2
    end
  else
    segment = doc.root.elements.to_a[count][3][1][0]
    # inserts them into an array, e.g. ["alt.binaries.warez", "1262522465.84379.1@news.astraweb.com"]
    nzb << [group, segment]
  end

  count += 1
end

Refactorings


Elij

January 10, 2010 01:30

When dealing with XML there are two approaches: DOM or SAX. The first loads the entire XML document into a runtime object (that is what REXML::Document.new does). The second gives serial, event-by-event access to the document; it should be faster and keeps the memory footprint roughly constant regardless of file size.

Also, REXML is notoriously slow. I don't know enough about the Ruby scene to recommend an alternative, but from here it looks like you'll need to move to SAX and to something other than REXML.
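To make the DOM vs. SAX distinction concrete, here is a minimal sketch of the event-driven style. It uses REXML's own stream parser only because it ships with the standard distribution, so it shows the approach rather than the fastest option. The NzbListener class and the SAMPLE_NZB data are made up for illustration, and the sketch assumes each <file> lists its <groups> before its <segments>.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects [group, segment] pairs while the parser
# streams through the input, without building a document tree.
class NzbListener
  include REXML::StreamListener

  attr_reader :pairs

  def initialize
    @pairs = []
    @group = nil   # first <group> seen in the current <file>
    @buffer = nil  # text of the element being read, nil outside group/segment
  end

  def tag_start(name, _attrs)
    case name
    when 'file' then @group = nil
    when 'group', 'segment' then @buffer = +''
    end
  end

  def text(data)
    @buffer << data if @buffer
  end

  def tag_end(name)
    case name
    when 'group'
      @group ||= @buffer.strip # keep only the first group of each file
      @buffer = nil
    when 'segment'
      @pairs << [@group, @buffer.strip]
      @buffer = nil
    end
  end
end

# Made-up sample shaped like an nzb file (assumption, not real data).
SAMPLE_NZB = <<~XML
  <nzb>
    <file subject="example">
      <groups><group>alt.binaries.test</group></groups>
      <segments>
        <segment number="1">part1@example.invalid</segment>
        <segment number="2">part2@example.invalid</segment>
      </segments>
    </file>
  </nzb>
XML

listener = NzbListener.new
REXML::Document.parse_stream(SAMPLE_NZB, listener)
nzb = listener.pairs
```

The parser calls tag_start, text and tag_end as it reads, so only the current element's text is held at any time. To read a real file, an open File object could be passed to parse_stream in place of the string.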

Jose

January 10, 2010 05:36

Yeah, I would look into Nokogiri for XML parsing; the difference in speed between REXML and Nokogiri is quite large. If it is absolutely necessary to be as fast as possible, I would use the libxml library directly.

Jose

January 10, 2010 18:46

Here is something that gives the exact same output using Nokogiri. The entire import-file section is no longer needed; it is replaced by a single line.

require 'rubygems'
require 'nokogiri'

nzb = []
doc = Nokogiri::XML(File.read('filename.xml'))

doc.css('file').each do |file|
  nzb << file.css('group').first.content
  nzb << file.css('segment').first.content
end

bain19.myopenid.com

January 11, 2010 03:41

Awesome work Jose. I had to make one change, but it works a million times quicker. The issue was that there can be multiple segments per file, so I just added another loop:

doc.css('file').each do |file|
  file.css('segment').each do |seg|
    nzb << [file.css('group').first.content, seg.content.strip]
  end
end
