Mida
Mida is a Microdata extractor/parser library for Ruby.
Installation
The latest version of Mida is always published to RubyGems, so installing is as easy as:
$ gem install mida
Requirements:
- Nokogiri
Command Line Usage
To use the command line tool, supply it with the URLs or filenames that you would like to be parsed (by default each item is output as YAML):
mida http://lawrencewoodman.github.com/mida/news/
To restrict the output to items of specific types, use the -t switch followed by a Regular Expression:
mida -t /person/i http://lawrencewoodman.github.com/mida/news/
For more information, see mida's help:
mida -h
Library Usage
The following examples assume that you have required mida and open-uri.
Extracting Microdata from a page
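All the Microdata is extracted from a page when a new Mida::Document instance is created.
To extract all the Microdata from a webpage:
url = 'http://example.com'
# Assign the block's result so that doc is still in scope afterwards.
# On Ruby 3.0 and later, call URI.open instead of open.
doc = open(url) {|f| Mida::Document.new(f, url)}
The top-level Items are then available via doc.items.
To simply list all the top-level Items that have been found: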
puts doc.items
Searching
If you want to search for an Item that has a specific itemtype/vocabulary, this can be done with the search method.
To return all the Items that use one of Google’s Review vocabularies:
doc.search(%r{http://data-vocabulary\.org.*?review.*?}i)
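The pattern passed to search is an ordinary Ruby Regexp, so it can be tried against candidate itemtype strings before running it on a real page. The following is a minimal sketch in plain Ruby that does not use Mida at all; the itemtype strings are samples made up for illustration.

```ruby
# Sample itemtype strings (assumptions for illustration only).
itemtypes = [
  'http://data-vocabulary.org/Review',
  'http://data-vocabulary.org/Review-aggregate',
  'http://data-vocabulary.org/Person',
  'http://example.com/vocab/review'
]

# The same pattern as in the search example above.
pattern = %r{http://data-vocabulary\.org.*?review.*?}i

# Keep only the itemtypes that the pattern matches.
matching = itemtypes.select { |itemtype| itemtype =~ pattern }
puts matching
```

Only the two data-vocabulary.org Review itemtypes are selected: the i flag makes the match case-insensitive, and the Person and example.com entries each fail one half of the pattern.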
Inspecting an Item
Each Item is a Mida::Item instance and has four main methods of interest: type, vocabulary, properties and id.
To find out the itemtype of the Item:
puts doc.items.first.type
To find out the itemid of the Item:
puts doc.items.first.id
Property values are either String or Mida::Item instances.
To see the properties of the Item:
puts doc.items.first.properties
Working with Vocabularies
Mida allows you to define vocabularies, so that input data can be constrained to match expected patterns. By default a generic vocabulary (Mida::GenericVocabulary) is registered, which will match against any itemtype with any number of properties.
If you want to specify a vocabulary, you create a class derived from Mida::Vocabulary and use itemtype, has_one, has_many and extract to describe the vocabulary.
As an example, the following describes a subset of Google’s Review vocabulary:
class Rating < Mida::Vocabulary
  # The dot is escaped so that it only matches a literal '.'
  itemtype %r{http://data-vocabulary\.org/rating}i
  has_one 'best'
  has_one 'worst'
  has_one 'value'
end
class Review < Mida::Vocabulary
  itemtype %r{http://data-vocabulary\.org/review}i
  has_one 'itemreviewed'
  # A rating may be a nested Rating item or plain text
  has_one 'rating' do
    extract Rating, Mida::DataType::Text
  end
end
When you create a class derived from Mida::Vocabulary it automatically registers the Vocabulary.
Now if Mida is parsing some input and manages to match against the Review Vocabulary, it will only allow the specified properties and will reject any that don't have the correct number. It will also set Item#vocabulary accordingly, e.g.
doc.items.first.vocabulary # => Review
If you want to include the properties of another vocabulary, you can use include_vocabulary:
class Thing < Mida::Vocabulary
  itemtype %r{http://example\.com/vocab/thing}i
  has_one 'name', 'description'
end
class Book < Mida::Vocabulary
  itemtype %r{http://example\.com/vocab/book}i
  # Book gains the properties of Thing in addition to its own
  include_vocabulary Thing
  has_one 'title', 'author'
end
class Collection < Mida::Vocabulary
  itemtype %r{http://example\.com/vocab/collection}i
  has_many 'item' do
    extract Thing
  end
end
If you use Book as an item of Collection, this would be accepted because it includes the Thing vocabulary. When examining the item you would find #vocabulary set to Book and you would have access to all the properties of Thing and all the properties of Book.
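To make the inclusion rule concrete, here is a toy model in plain Ruby of how a vocabulary that includes another could be accepted wherever the included one is expected. It is a sketch only and not Mida's implementation: ToyVocabulary, ToyThing, ToyBook, property_names, included_vocabularies and compatible_with? are all hypothetical names invented for this illustration.

```ruby
# Toy stand-in for a vocabulary base class. NOT Mida's implementation.
class ToyVocabulary
  class << self
    # Property names declared by, or included into, this vocabulary.
    def property_names
      @property_names ||= []
    end

    # Vocabularies that have been included into this one.
    def included_vocabularies
      @included_vocabularies ||= []
    end

    # Declare one or more properties.
    def has_one(*names)
      property_names.concat(names)
    end

    # Remember the included vocabulary and copy its property names.
    def include_vocabulary(vocabulary)
      included_vocabularies << vocabulary
      property_names.concat(vocabulary.property_names)
    end

    # True if this vocabulary is the other one or (transitively) includes it.
    def compatible_with?(other)
      equal?(other) || included_vocabularies.any? { |v| v.compatible_with?(other) }
    end
  end
end

class ToyThing < ToyVocabulary
  has_one 'name', 'description'
end

class ToyBook < ToyVocabulary
  include_vocabulary ToyThing
  has_one 'title', 'author'
end

puts ToyBook.compatible_with?(ToyThing)
puts ToyThing.compatible_with?(ToyBook)
puts ToyBook.property_names.inspect
```

In this model ToyBook is accepted where ToyThing is expected, but not the other way round, and ToyBook exposes the property names of both vocabularies.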