Documentation / HTML parsing guide


The easiest and most familiar way to extract data from HTML web pages is to use "CSS selectors". These are the same patterns that web stylesheets use to pick out the elements whose spacing, colour and layout they describe.

For more details, read the Nokogiri documentation, or the CSS selector specification.

Getting started

Grab the HTML web page, and parse the HTML using Nokogiri.

require 'nokogiri'
html = ScraperWiki::scrape("https://scraperwiki.com/")
doc = Nokogiri::HTML(html)

Select all <a> elements that are inside <div class="featured">. The selector syntax is the same one used in CSS stylesheets and jQuery.

doc.css('div.featured a').each do |link|
  puts link.class   # the type of each match: Nokogiri::XML::Element
end

Print a whole tag and its contents as HTML (put this and the next example inside the "each" loop, before the "end").

puts link.to_html

Read attributes, such as the target of the <a> tags.

puts link['href']

Simple text extraction

Select the first <strong> element inside <div id="footer_inner">.

el = doc.css("div#footer_inner strong")[0]
puts el

Extract the text from inside the tag.

puts el.content

Deep text extraction

Get all text recursively, throwing away any child tags.

eg = Nokogiri::HTML.fragment('<h2>A thing <b>goes boom</b> up <i>on <em>the tree</em></i></h2>')
puts eg.content # 'A thing goes boom up on the tree'

Sometimes you have nearly pure text elements that still have <i> and <b> elements which you want to retain. Such an element can be extracted using a recursive function.

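def ctext(el)
  # A text node has no children; return its text as-is.
  return el.text if el.text?
  result = []
  el.children.each do |sel|
    # Only elements are checked against the allowed list; text nodes
    # (whose name is "text") must pass through.
    if sel.element? && !["b", "i"].include?(sel.name)
      raise "disallowed tag: " + sel.name
    end
    result.push("<" + sel.name + ">") if sel.element?
    result.push(ctext(sel))
    result.push("</" + sel.name + ">") if sel.element?
  end
  result.join
end

This raises an error if it meets any other element, such as <em>. Pass in the <h2> element itself rather than the fragment, so that <h2> is not checked against the allowed list.

puts ctext(eg.children[0])   # raises "disallowed tag: em"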

Finding data manually

Iterate down through the elements in the document and see the tags and attributes on each element.

require 'json'   # provides to_json, in case it is not already loaded

for el in doc.css('html').children
    puts el.name
    for el2 in el.children
        puts "--" + el2.name + " " + el2.attributes.to_json
    end
end

Navigate around the document.

eg = Nokogiri::HTML.fragment('<h2>A thing <b>goes boom</b> up <i>on <em>the tree</em></i></h2>')
goes_boom = eg.children()[0].children()[1]
puts goes_boom                  # <b>goes boom</b>
puts goes_boom.parent.name      # h2
puts goes_boom.next             #  up 
puts goes_boom.next.next.name   # i
puts goes_boom.parent.children.map {|x| x.name}.join(",") # text,b,text,i