Mida
Mida is a Microdata extractor/parser library for Ruby.

Installation
Mida keeps RubyGems up to date with its latest version, so installing is as easy as running gem install mida.

Requirements:
- Nokogiri
Command Line Usage
To use the command line tool, supply it with the URLs or filenames that you would like parsed. By default each item is output as YAML.

If you want to search for specific types, use the -t switch followed by a regular expression.
For more information, look at mida's help.
Library Usage
The following examples assume that you have required mida and open-uri.
Extracting Microdata from a page
All the Microdata is extracted from a page when a new Mida::Document instance is created. To extract all the Microdata from a webpage, open the page with open-uri and create a Mida::Document from it.
The top-level Items will be held in an array accessible via doc.items, where doc is the Mida::Document instance. To list all the top-level Items that have been found, iterate over that array.
Searching
If you want to search for an Item that has a specific itemtype/vocabulary, this can be done with the search method. For example, it can return all the Items that use one of Google’s Review vocabularies.
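To make the idea of searching by itemtype concrete, here is a minimal sketch in plain Ruby. It does not use Mida itself: SearchableItem is a stand-in Struct and search_items is a hypothetical helper, both invented for illustration. It assumes, as with the -t switch, that the itemtype is matched against a regular expression, and that Items nested inside properties are searched too.

```ruby
# Stand-in for an Item (hypothetical; this is not Mida::Item).
SearchableItem = Struct.new(:type, :properties)

# Hypothetical helper: collect every item whose type matches the pattern,
# descending into items nested inside property values.
def search_items(items, pattern)
  items.each_with_object([]) do |item, found|
    found << item if item.type =~ pattern
    nested = item.properties.values.flatten.grep(SearchableItem)
    found.concat(search_items(nested, pattern))
  end
end

person = SearchableItem.new('http://data-vocabulary.org/Person',
                            { 'name' => ['Ann'] })
review = SearchableItem.new('http://data-vocabulary.org/Review',
                            { 'itemreviewed' => ['Cafe'], 'reviewer' => [person] })
items  = [review, SearchableItem.new('http://example.com/Thing', {})]

# Case-insensitive match on any data-vocabulary.org type containing "review".
matches = search_items(items, %r{http://data-vocabulary\.org.*?review}i)
puts matches.map(&:type)
```

Only the Review item matches here. The Person nested inside it is visited but does not match the pattern.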
Inspecting an Item
Each Item is a Mida::Item instance and has four main methods of interest: type, vocabulary, properties and id. To find out the itemtype of the Item, call type; to find out its itemid, call id.
Properties are returned as a hash containing name/value pairs. Each value will be an array of either String or Mida::Item instances. To see the properties of the Item, call properties.
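Because property values can themselves be Items, walking a properties hash usually means recursing. The following sketch shows one way to do that in plain Ruby. It does not load Mida: NestedItem is a stand-in Struct that only mirrors the four methods named above, and describe_properties is a hypothetical helper.

```ruby
# Stand-in for an Item (hypothetical; only mirrors type, vocabulary,
# properties and id as described in this section).
NestedItem = Struct.new(:type, :vocabulary, :properties, :id)

# Render a properties hash as indented lines. Each value is an array whose
# elements are either Strings or nested items.
def describe_properties(properties, indent = 0)
  pad = ' ' * indent
  properties.flat_map do |name, values|
    values.flat_map do |value|
      if value.is_a?(NestedItem)
        ["#{pad}#{name}: (item) #{value.type}"] +
          describe_properties(value.properties, indent + 2)
      else
        ["#{pad}#{name}: #{value}"]
      end
    end
  end
end

reviewer = NestedItem.new('http://data-vocabulary.org/Person', nil,
                          { 'name' => ['Ann'] }, nil)
review   = NestedItem.new('http://data-vocabulary.org/Review', nil,
                          { 'itemreviewed' => ['Cafe'],
                            'reviewer'     => [reviewer, 'Bob'] }, nil)

lines = describe_properties(review.properties)
puts lines
```

Note that the reviewer property holds both a nested item and a plain String, which is why each value has to be checked individually.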
Working with Vocabularies
Mida allows you to define vocabularies, so that input data can be constrained to match expected patterns. By default a generic vocabulary (Mida::GenericVocabulary) is registered, which will match against any itemtype with any number of properties.

If you want to specify a vocabulary, create a class derived from Mida::Vocabulary and use itemtype, has_one, has_many and extract to describe the vocabulary. A subset of Google’s Review vocabulary, for example, can be described this way.
When you create a subclass of Mida::Vocabulary it automatically registers the Vocabulary. Now if Mida is parsing some input and manages to match against the Review Vocabulary, it will only allow the specified properties and will reject any that don't have the correct number. It will also set Item#vocabulary accordingly, in this case to Review.
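The effect of has_one and has_many on incoming properties can be illustrated with a toy model. This is a sketch in plain Ruby, not Mida's implementation: ToyVocabulary and its filter method are hypothetical, and the property names are only examples. It assumes has_one means exactly one value and has_many means one or more.

```ruby
# Toy model of a vocabulary (hypothetical; Mida's real classes are not used).
class ToyVocabulary
  def initialize
    @limits = {} # property name => :one or :many
  end

  def has_one(*names)
    names.each { |name| @limits[name] = :one }
  end

  def has_many(*names)
    names.each { |name| @limits[name] = :many }
  end

  # Keep only declared properties whose number of values fits the declaration.
  def filter(properties)
    properties.select do |name, values|
      case @limits[name]
      when :one  then values.size == 1
      when :many then values.size >= 1
      else false # undeclared property
      end
    end
  end
end

review = ToyVocabulary.new
review.has_one 'itemreviewed', 'rating'
review.has_many 'reviewer'

raw = {
  'itemreviewed' => ['Cafe'],
  'rating'       => ['4', '5'],     # two values for a has_one property
  'reviewer'     => ['Ann', 'Bob'],
  'colour'       => ['red']         # not declared in the vocabulary
}

accepted = review.filter(raw)
p accepted
```

In this model rating is dropped because it has too many values and colour is dropped because it was never declared, leaving only itemreviewed and reviewer.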
If you want to include the properties of another vocabulary you can use include_vocabulary.
For example, suppose a Book vocabulary includes a Thing vocabulary, and a Collection vocabulary only accepts Things as its items. If you gave a Book as an item of Collection, it would be accepted because Book includes the Thing vocabulary. When examining the item you would find #vocabulary set to Book and you would have access to all the properties of Thing and all the properties of Book.
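The relationship between Thing and Book can be modelled with a short plain-Ruby sketch. It is only an illustration of the inclusion rule, not how Mida implements include_vocabulary: IncludingVocabulary, satisfies? and the property names (description, title, author) are all hypothetical.

```ruby
# Toy model of vocabulary inclusion (hypothetical; not Mida's classes).
class IncludingVocabulary
  attr_reader :name, :own_properties, :included

  def initialize(name, own_properties, included = [])
    @name = name
    @own_properties = own_properties
    @included = included
  end

  # Own properties plus those of every included vocabulary, recursively.
  def properties
    (included.flat_map(&:properties) + own_properties).uniq
  end

  # True if this vocabulary is, or includes, the other one.
  def satisfies?(other)
    equal?(other) || included.any? { |vocab| vocab.satisfies?(other) }
  end
end

thing = IncludingVocabulary.new('Thing', ['description'])
book  = IncludingVocabulary.new('Book', ['title', 'author'], [thing])

puts book.satisfies?(thing)  # a Book can stand in where a Thing is expected
puts thing.satisfies?(book)  # but a plain Thing is not a Book
p book.properties
```

The check is deliberately one-directional: including Thing makes Book acceptable wherever a Thing is required, while Book keeps its own identity and its own extra properties.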