Parsing Bookmarks With Nokogiri

| 2 min read

I’ve been working on a feature for one of my side projects that requires parsing bookmarks files that have been exported from a web browser. Most browsers export bookmarks in the Netscape Bookmark File Format. This format is an HTML document where the bookmarks are organized according to a standard structure. Microsoft has some helpful documentation MSDN.

While considering how I would implement the feature, I spent a few minutes searching for an existing Ruby gem that already provided the functionality I needed. But, I came up empty-handed. Knowing that Nokogiri was a popular choice among Rubyists for parsing HTML files, I decided to try to implement the feature with it instead.

Nokogiri, for those of you who may be new to it, is “an HTML, XML, and SAX parser with the ability to search documents via XPath and CSS selectors” 1. It’s very fast and very powerful. The poject’s site has a few examples of how it can be used to parse, search, and modify HTML and XML documents.

If all you’re interested in is the URLs then parsing a Netscape Bookmark File is trivial. This can easily be done with Nokogiri in a few lines of code. Here’s an example:

require 'nokogiri'
document = Nokogiri::HTML(File.open('bookmarks.html'))
document.css('dt > a').each { |a| puts a['href'] }

This example uses CSS selectors to find all of the anchor tags (or bookmarks) in the document and retuns the text value of each one.

However, in addition to the bookmarks, I was interested in the folders as well. The reason being that I wanted the ability to use the folder names to categorize the bookmarks. This complicated things a bit. Because the bookmarks may be deeply nested within several levels of folders, I would have to traverse the document recursively much like you would if you were traversing a file system.

My first attempt at this was to use Nokogiri’s traverse feature to recursively walk through all of the nodes in the document. The problem with this approach was that traverse descends in to each node of the document as it’s encountered. Since HTML elements containing folder name and folder items were “siblings”, it was very difficult to determine what folder each bookmark belonged to.

After some trial and error, I ended up using a combination of recursion and XPath selectors. This approach allowed me to identify all the bookmarks in the current level of nesting, process them, and then descend to the next level. It works great and I’m pleased with the results. Here’s what I ended up with:

# bookmark_reader.rb
require 'nokogiri'

class Bookmark
attr_accessor :title, :url, :path

def initialize(path, title, url)
@path = path
@title = title
@url = url
end
end

class BookmarkReader
include Enumerable

def initialize(path)
@path = path
end

def each
doc = Nokogiri::HTML(File.open(@path))
node = doc.at_xpath('//html/body')
traverse(node, '/') { |b| yield b }
end

private

def traverse(node, path, &block)
anchors = node.search('./dt//a')
folder_names = node.search('./dt/h3')
folder_items = node.search('./dl')

anchors.each do |anchor|
yield Bookmark.new(path, anchor.text, anchor['href'])
end

folder_items.size.times do |i|
folder_name = folder_names[i]
folder_item = folder_items[i]
next_path = folder_name.nil? ? path : [path, folder_name].join('/')
traverse(folder_item, next_path, &block)
end
end
end
# USAGE
reader = BookmarkReader.new('bookmarks.html')
reader.each { |b| puts b.title }

Parsing the Netscape Bookmark File with Nokogiri ended up being a fun exercise and more of a challenge than I expected. If you have a better solution, I would love to hear from you.