ryodo - domain parser

I just created a tiny gem that parses a given domain string and retrieves relevant information such as the TLD, the registered/registrable domain, and the subdomain.

TL;DR

Project on GitHub: github.com/asaaki/ryodo

RubyGems: rubygems.org/gems/ryodo

Gemfile: gem "ryodo"

What is it good for?

Read the explanation from publicsuffix.org:

The Public Suffix List is a cross-vendor initiative to provide an accurate list of domain name suffixes, maintained by the hard work of Mozilla volunteers and by submissions from registries, to whom we are very grateful.

[…]

Since there was and remains no algorithmic method of finding the highest level at which a domain may be registered for a particular top-level domain (the policies differ with each registry), the only method is to create a list. This is the aim of the Public Suffix List.

In short: it is not easy to figure out which part of a given domain string is the registered one, because the registration rules differ between TLDs/suffixes.
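To make the problem concrete, here is a minimal sketch of a list-based lookup. The SUFFIXES set is a hand-picked assumption and not the real Public Suffix List, and registrable_domain is a hypothetical helper that belongs to neither ryodo nor public_suffix. It also ignores the wildcard and exception rules of the real list.

```ruby
require "set"

# Hand-picked sample of suffixes (assumption for illustration only).
# The real Public Suffix List is far larger and also contains wildcard
# and exception rules, which this sketch ignores.
SUFFIXES = Set.new(%w[com jp co.jp uk co.uk]).freeze

# Hypothetical helper: returns the registrable domain (suffix plus one
# label) or nil if no known suffix matches or the name is only a suffix.
def registrable_domain(name)
  labels = name.downcase.split(".")
  # Try the longest candidate suffix first, so "co.jp" wins over "jp".
  (0...labels.size).each do |i|
    next unless SUFFIXES.include?(labels[i..-1].join("."))
    # i == 0 means the whole name is a suffix: nothing registrable.
    return i.zero? ? nil : labels[(i - 1)..-1].join(".")
  end
  nil
end

# Naive approach: just take the last two labels.
naive = "my.awesome.domain.co.jp".split(".").last(2).join(".")
puts naive                                          # co.jp (a suffix, not a domain)
puts registrable_domain("my.awesome.domain.co.jp")  # domain.co.jp
puts registrable_domain("www.example.com")          # example.com
puts registrable_domain("co.jp").inspect            # nil
```

Taking the last two labels works for .com but fails for co.jp, which is why the lookup has to consult a list of known suffixes.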

There are implementations for several languages; a Ruby version called public_suffix is also available.

I started with some tests and playground code, followed by a gem with a C extension that never fully worked. I then decided to write a pure Ruby version, not knowing that one already existed.

In the end I think it was not so bad to do it on my own, because benchmarks showed that my implementation is much faster.

How to use it

A few code examples show how:

Basics

dom = Ryodo.parse("my.awesome.domain.co.jp")
#=> Ryodo::Domain

                  #    SUBDOMAIN  DOMAIN   TLD
dom.tld           #=>                   "co.jp"
dom.domain        #=>            "domain.co.jp"
dom.subdomain     #=> "my.awesome"
dom               #=> "my.awesome.domain.co.jp"
dom.fqdn          #=> "my.awesome.domain.co.jp."

More formats

# all parts are also reversible
# mostly used on domain/FQDN
dom.reverse            #=> "jp.co.domain.awesome.my"
dom.fqdn.reverse       #=> ".jp.co.domain.awesome.my"

dom.to_a               #=> ["my","awesome","domain","co","jp"]
dom.domain.to_a        #=> ["domain","co","jp"]
dom.subdomain.to_a     #=> ["my","awesome"]
dom.fqdn.to_a          #=> ["my","awesome","domain","co","jp",""]

# .to_a also usable with parameter :reverse (or shorthand :r)
dom.domain.to_a(:reverse) #=> ["jp","co","domain"]
dom.fqdn.to_a(:reverse)   #=> ["","jp","co","domain","awesome","my"]
dom.fqdn.to_a(:r)         #=> ["","jp","co","domain","awesome","my"]

You can also call ryodo in different ways:

Ryodo.parse("my.awesome.domain.co.jp")
Ryodo("my.awesome.domain.co.jp")
Ryodo["my.awesome.domain.co.jp"]
ryodo("my.awesome.domain.co.jp")

String extension

It is required automatically.

"my.awesome.domain.co.jp".to_domain
"my.awesome.domain.co.jp".ryodo

URI extension

It has to be required explicitly.

Gemfile

gem "ryodo", :require => ["ryodo","ryodo/ext/uri"]

Usage

require "ryodo/ext/uri" # if not required via Gemfile

uri = URI.parse("http://my.awesome.domain.jp:5555/path")
uri.host
#=> "my.awesome.domain.jp"

uri.host.class
#=> Ryodo::Domain
# but decorates the String class transparently

uri.host.domain
#=> "domain.jp"

Benchmark

Here is the small benchmark I ran:

Setup

A domain input list, taken from publicsuffix.org (the checkPublicSuffix test script under publicsuffix.org/list/). I added some very long domain names with many parts (to see how look-up time scales).

Some of them are also invalid (to test whether your implementation works correctly).

In total, 72 entries to check.

Ruby: 1.9.3-p194, no special patches

We only do a basic parsing and retrieve the registered/registrable domain. (Should hit the most important code of the gems.)

Test script snippet

require "benchmark"
require "ryodo"
require "public_suffix"

# DOMAINS is the array of domain entries

LOOPS = 1_000

Benchmark.bmbm do |b|

  b.report "ryodo" do
    LOOPS.times do
      DOMAINS.each do |domain|
        Ryodo.parse(domain).domain # returns nil if not valid
      end
    end
  end

  b.report "public_suffix" do
    LOOPS.times do
      DOMAINS.each do |domain|
        PublicSuffix.parse(domain).domain rescue nil # it raises if not valid in any way, so we rescue it
      end
    end
  end

end

Caveats

PublicSuffix.parse(…) raises an error if the domain input is invalid (e.g. not a registrable domain). That is why I had to add a rescue statement.

Ryodo.parse(…) does not raise; it returns nil values for invalid input (it only raises if the input is not a String, of course).
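The two error-handling styles can be bridged with a small wrapper. The sketch below uses a stand-in parser, so strict_parse, lenient_parse and InvalidDomain are all hypothetical names and not part of either gem. It only illustrates turning a raising API into a nil-returning one.

```ruby
# Stand-in error class and parser (hypothetical, for illustration only;
# this is neither ryodo nor public_suffix).
class InvalidDomain < StandardError; end

# Raises on input it considers invalid, similar in spirit to a strict parser.
def strict_parse(name)
  raise InvalidDomain, "not a registrable domain: #{name}" unless name.include?(".")
  name
end

# Wraps the raising parser so callers get nil instead of an exception.
# Rescuing one specific class keeps unrelated bugs visible, unlike a
# bare `rescue nil` modifier, which swallows every StandardError.
def lenient_parse(name)
  strict_parse(name)
rescue InvalidDomain
  nil
end

p lenient_parse("example.com")  # "example.com"
p lenient_parse("localhost")    # nil
```

For a benchmark the inline rescue modifier is fine; in application code, a targeted rescue like this is usually the safer choice.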

Result

Rehearsal -------------------------------------------------
ryodo           1.800000   0.000000   1.800000 (  1.809521)
public_suffix  21.880000   0.020000  21.900000 ( 21.907808)
--------------------------------------- total: 23.700000sec

                    user     system      total        real
ryodo           1.770000   0.000000   1.770000 (  1.769734)
public_suffix  22.320000   0.010000  22.330000 ( 22.346013)

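As you can see, ryodo is more than 10 times faster (about 12.6 times in the second run: 22.33 s vs. 1.77 s total).

(Fun fact: my first approach was 6 times slower than public_suffix, so this is an improvement by a factor of about 60!)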

Conclusion

public_suffix is completely okay. If you don't have to query a lot, it will do its job.

ryodo is the better choice if you expect to parse lots of domain data in a short time. I will use it for another project where I expect that kind of parsing volume.

I will also try to extend it with an optional C extension to make it much faster. The current implementation can handle about 40,000 domains/sec, which is quite okay for API usage.

A simple .split(".") could also do the job if you don't need to know which type each part is. In my cases that won't be enough.
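The throughput figure follows directly from the benchmark numbers above: 72 entries times 1,000 loops, divided by the measured wall-clock time. A small sketch of that arithmetic (domains_per_second is a hypothetical helper name; the timings are copied from the second benchmark run):

```ruby
# Figures taken from the benchmark above (second run, "real" column).
ENTRIES = 72
LOOPS   = 1_000
REAL_SECONDS = { "ryodo" => 1.769734, "public_suffix" => 22.346013 }.freeze

# Hypothetical helper: parsed domains per second for a measured wall time.
def domains_per_second(seconds, lookups = ENTRIES * LOOPS)
  lookups / seconds
end

REAL_SECONDS.each do |name, seconds|
  puts format("%-14s %8.0f domains/sec", name, domains_per_second(seconds))
end

speedup = REAL_SECONDS["public_suffix"] / REAL_SECONDS["ryodo"]
puts format("speedup: %.1fx", speedup)
```

With these inputs, ryodo comes out at a little over 40,000 domains/sec and public_suffix at a little over 3,000.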