Cognition: MiXed Case SeMantiC WeB.


Note that Cognition's built-in document structure parser generates the dcterms:title triple from the <title> element; the GRDDL transformation adds a legacy dc:title predicate.

Table of Contents

  1. What is Cognition?
  2. What is the mixed case semantic Web? What is “gainy” conversion?
  3. What metadata does Cognition parse?
    1. Options
  4. What does Cognition do with this metadata?
    1. Output formats
  5. What input URIs does Cognition accept?
  6. Download
  7. Requirements

What is Cognition?

Cognition is a parser for both “upper case Semantic Web” (RDF, RDFa) and “lower case semantic web” (microformats) technologies. It includes modules for exporting parsed data in a variety of formats, including RDF, vCard, iCalendar, Atom and KML.

Cognition is written in Perl 5 and licensed under the GNU GPL (v3).

What is the mixed case semantic Web? What is “gainy” conversion?

Cognition internally represents all parsed data in an RDF-like triple format. Microformats don't usually contain as much information as is required by RDF — they usually don't have an explicit subject, and predicates aren't namespaced.

Cognition's microformat parsing process assigns explicit URIs to the subjects and prefixes microformat class names with a relevant URI (e.g. urn:ietf:rfc:2426# for vCard). This allows so-called “lower case semantic web” data to mix in with data gleaned from the “upper case Semantic Web” (e.g. RDF). hCards converted to vCards can thus gain information from other sources. This is “gainy” conversion, as opposed to lossy conversion.
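
The idea can be illustrated with a small Python sketch (not Cognition's actual Perl code). The urn:ietf:rfc:2426# prefix comes from the text above; the way subject URIs are minted here (a "#hcard-N" fragment on the page URI) and the function names are assumptions made purely for illustration.

```python
VCARD_NS = "urn:ietf:rfc:2426#"  # prefix named in the text for vCard terms


def hcard_to_triples(page_uri, index, properties):
    """Turn bare microformat properties into (subject, predicate, object) triples."""
    # Assumption: the subject is minted as a fragment URI on the page URI.
    subject = f"{page_uri}#hcard-{index}"
    return [(subject, VCARD_NS + name, value) for name, value in properties.items()]


def merge(*triple_sets):
    """Combine triples from several sources, dropping exact duplicates."""
    seen, merged = set(), []
    for triples in triple_sets:
        for triple in triples:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


# Triples derived from an hCard on the page.
hcard = hcard_to_triples(
    "http://example.org/foo", 1, {"fn": "Alice Example", "tel": "+44 1234 567890"}
)
# Extra data about the same subject from another source, e.g. RDFa.
rdf = [
    ("http://example.org/foo#hcard-1", "http://xmlns.com/foaf/0.1/homepage", "http://example.org/")
]

for triple in merge(hcard, rdf):
    print(triple)
```

Because both sources end up as triples about the same subject URI, an exporter working from the merged list can see the homepage as well as the name and telephone number. That is the "gain".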

What metadata does Cognition parse?

It essentially combines data from three different sources:

  1. RDF
  2. Microformats and other POSH formats (built-in, non-GRDDL support)
  3. HTML document structure

Many of these technologies make use of namespaces. Standard XML namespaces are mostly understood, and namespaces may also be declared using the <link rel="schema.PREFIX"> convention from RFC 2731. (You may run into problems if you define the same prefix differently in different parts of the document.) A number of namespaces are also predefined, so that markup like <meta name="DC.creator"> will "just work" even if the author never explicitly defined the DC prefix.
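
As a rough Python sketch of this kind of prefix resolution (not Cognition's own code): prefixes declared with RFC 2731 style schema links are combined with a predefined table, and prefixed <meta> names are expanded to full URIs. The contents of the predefined table, the case-insensitive prefix matching and the names MetaCollector and resolve are assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumption: a tiny predefined table; the real list is not given in the text.
PREDEFINED = {"dc": "http://purl.org/dc/elements/1.1/"}


class MetaCollector(HTMLParser):
    """Collect schema.* links and prefixed <meta name> values from a page."""

    def __init__(self):
        super().__init__()
        self.prefixes = dict(PREDEFINED)
        self.metas = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel = a.get("rel") or ""
        name = a.get("name") or ""
        if tag == "link" and rel.lower().startswith("schema.") and a.get("href"):
            # RFC 2731 style declaration: <link rel="schema.DC" href="...">
            self.prefixes[rel[len("schema."):].lower()] = a["href"]
        elif tag == "meta" and "." in name:
            self.metas.append((name, a.get("content") or ""))


def resolve(html):
    """Return (predicate URI, value) pairs for every <meta> with a known prefix."""
    collector = MetaCollector()
    collector.feed(html)
    pairs = []
    for name, content in collector.metas:
        prefix, local = name.split(".", 1)
        base = collector.prefixes.get(prefix.lower())
        if base is not None:
            pairs.append((base + local, content))
    return pairs


page = (
    '<head><link rel="schema.EX" href="http://example.org/terms#">'
    '<meta name="EX.colour" content="red">'
    '<meta name="DC.creator" content="Alice"></head>'
)
for pair in resolve(page):
    print(pair)
```

Here the DC prefix resolves without ever being declared on the page, while EX resolves only because of the schema link.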

Note that both HTML and XHTML are supported equally. The stuff that strictly speaking should not work in HTML (e.g. XML namespaces, RDFa) does work: HTML is treated as if it were funny-looking XHTML.

Options

Strict Microformats
When in strict mode, the page must explicitly provide well-known profile URIs for any microformats it uses (except for a handful of microformats for which there do not exist any such profiles yet). Without strict mode, microformats can be freely used without profile URIs.
Strict RDFa
When in strict mode, the page must use the XHTML+RDFa doctype if it wants to make use of RDFa. Without strict mode, RDFa can be used in any old page.
Strict eRDF
Unless you know what you're doing, you really want to keep strict mode on. When strict mode is off, Cognition will attempt to read eRDF information from pages, even if they haven't indicated that they actually contain eRDF markup — for some pages this will result in a lot of useless and meaningless data.
Strict GRDDL
When strict GRDDL is off, Cognition will follow rel=transformation links and use them to glean extra data from the page. When strict GRDDL is on, it will ignore these links unless the GRDDL profile is found — this is the correct GRDDL behaviour.
Check <head profile> for GRDDL
When this is enabled, each profile URI on the page is fetched and inspected for rel=profileTransformation links to use for GRDDL. For most pages, this will be very slow. Note that Cognition includes built-in parsing for most microformats, generally better than GRDDL is able to provide, so for most pages you will not need to use GRDDL anyway.
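
The strict GRDDL rule described above can be sketched in Python as follows. This is an illustration rather than Cognition's implementation: the function name and its arguments are hypothetical, and the profile URI used is the one defined by the W3C GRDDL specification.

```python
# The GRDDL profile URI from the W3C GRDDL specification.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def transformations_to_follow(profiles, transformation_links, strict=True):
    """Decide which rel=transformation links should be applied.

    profiles: profile URIs taken from <head profile>, already split on whitespace
    transformation_links: hrefs of the rel="transformation" links on the page
    """
    if strict and GRDDL_PROFILE not in profiles:
        # Strict mode: without the GRDDL profile the links are ignored.
        return []
    return list(transformation_links)


links = ["http://example.org/extract.xsl"]
print(transformations_to_follow([], links, strict=True))
print(transformations_to_follow([], links, strict=False))
print(transformations_to_follow([GRDDL_PROFILE], links, strict=True))
```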

The command-line version of Cognition includes many more options, but these have not (yet) been exposed in the web interface. Run Cognition with a parameter of --help for more information.

What does Cognition do with this metadata?

Cognition is currently available in three forms. The first is the Cognition library for Perl. This library is capable of parsing an HTML file as RDFa, eRDF, microformats, etc. and making the data available to the calling application as a rudimentary RDF triple store. Export functions are also available to retrieve the data in RDF/XML, RDF/JSON, iCalendar, vCard, etc.

Another form is the cognitiond.pl daemon, which is based on the Perl library. This daemon listens on a TCP port (by default, 24646), waiting to be passed URIs, and outputs data in a chosen format (e.g. RDF/XML, RDF/JSON, iCalendar, vCard). A small command-line client is included which is able to connect to the daemon and output the data to STDOUT. The command-line client is also linked to the Cognition Perl library, and so is able to output data even when it is unable to connect to the daemon.

Lastly, Cognition can be used through a web service at srv.buzzword.org.uk. This web service is still in an experimental stage.

Output formats

RDF/XML
The most detailed output format, including every bit of information which Cognition has been able to glean from the page. To make sense of it you probably need a fairly good knowledge of RDF. RDF/XML is fairly tricky to parse with a plain XML parser — if you need to access the information as RDF triples, RDF/JSON is probably a better bet.
http://www.w3.org/TR/rdf-syntax-grammar/
RDF/JSON
Pretty much as detailed as the RDF/XML output, but far easier to parse — a JSON library should give very easy access to the underlying triples.
http://n2.talis.com/wiki/RDF_JSON_Specification
Turtle or Notation3
Turtle is the most readable RDF-based output. Turtle is a subset of a more complex format called Notation3, so by definition Cognition's Turtle output is also a Notation3 output.
http://www.dajobe.org/2004/01/turtle/
http://www.w3.org/DesignIssues/Notation3
vCard
vCard is an interoperable format for exchanging contact information. Only information about contacts (people, groups, organisations, etc) is output as vCard — e.g. information gleaned from hCard, XFN, FOAF or the W3C PIM vocabulary. Cognition mostly uses vCard 3.0 but includes some properties introduced in the draft vCard 4.0 standard. Cognition's vCards should be importable in most address book applications such as Apple Address Book and Microsoft Outlook.
http://www.ietf.org/rfc/rfc2426.txt
jCard
jCard exposes the same information as vCard, but in a format that's easier to parse. If you want to use Cognition to glean contact information from a web page and then pass that along to another script for further processing, jCard is a good choice of output format.
http://microformats.org/wiki/jcard
iCalendar
iCalendar is an interoperable format for exchanging event and todo information. Only information about events, todo lists, free/busy times, alarms and journals is output as iCalendar, e.g. information gleaned from hCalendar or hAtom. Cognition's iCalendar files should be importable into most calendar applications such as Apple iCal and Microsoft Outlook.
http://www.ietf.org/rfc/rfc2445.txt
Atom
Atom is a format for syndicating (mostly textual) content, such as blog articles or news stories. RSS- and Atom-like content found on pages (including hAtom) is output as Atom and can be read in feed readers such as NewsGator/NetNewsWire or Google Reader.
http://www.ietf.org/rfc/rfc4287.txt
KML
KML is a file format for recording annotated geographic co-ordinates. Any appropriately marked up co-ordinates found on the page, such as co-ordinates for an address or for the location where an event is occurring, can be output as KML, which can be imported into some mapping software, such as Google Earth.
http://code.google.com/apis/kml/documentation/
M3U
This is a playlist format supported by most common media players, such as Winamp, iTunes and XMMS. Any audio files marked up in hAudio or the RDF Audio Vocabulary can be exported as M3U playlists.
http://en.wikipedia.org/wiki/M3U
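
To illustrate why RDF/JSON is described above as easy to consume, here is a minimal Python sketch that flattens an RDF/JSON document into triples using only a standard JSON library. It assumes the subject, predicate, object-list layout of the Talis RDF/JSON specification linked above; the sample document and the function name are made up for the example, and this is not real Cognition output.

```python
import json


def rdf_json_triples(document):
    """Yield (subject, predicate, value, type) tuples from an RDF/JSON string."""
    data = json.loads(document)
    for subject, predicates in data.items():
        for predicate, objects in predicates.items():
            for obj in objects:
                # Each object carries at least "value" and "type"
                # ("uri", "literal" or "bnode").
                yield subject, predicate, obj["value"], obj["type"]


# Hypothetical sample document for illustration.
sample = """{
  "http://example.org/foo#me": {
    "http://xmlns.com/foaf/0.1/name": [
      {"value": "Alice Example", "type": "literal"}
    ],
    "http://xmlns.com/foaf/0.1/homepage": [
      {"value": "http://example.org/", "type": "uri"}
    ]
  }
}"""

for triple in rdf_json_triples(sample):
    print(triple)
```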

What input URIs does Cognition accept?

The web service only accepts HTTP URIs (and a special URI of http://referer which indicates that Cognition should process the referring URI). The command-line client supports a wider variety of URIs including file:// URIs.

Cognition supports a special syntax for fragment identifiers. Given the input URI:

http://example.org/foo#subject(http://example.org/bar)

Cognition will process http://example.org/foo and return all the information it can find about the subject http://example.org/bar. Also, given the input URI:

http://example.org/foo#bar

Cognition will process http://example.org/foo and return all the information it can find about the subject http://example.org/foo#bar.
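
The two cases above can be expressed as a short Python sketch. This is only a reading of the behaviour described in the text, not Cognition's own parsing code, and the function name split_request is hypothetical.

```python
def split_request(uri):
    """Return (document URI, subject URI or None) for a request URI."""
    document, _, fragment = uri.partition("#")
    if not fragment:
        # No fragment: process the whole document, no particular subject.
        return document, None
    if fragment.startswith("subject(") and fragment.endswith(")"):
        # Special syntax: the subject is the URI inside subject(...).
        return document, fragment[len("subject("):-1]
    # Ordinary fragment: the subject is the full URI including the fragment.
    return document, uri


print(split_request("http://example.org/foo#subject(http://example.org/bar)"))
print(split_request("http://example.org/foo#bar"))
```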

Download

Change log.

Requirements

To run Cognition, you will need Perl 5.8 or above, plus a number of Perl modules installed. (All available from CPAN.) The modules marked with an alternative bullet point are used not by Cognition's parsing library, but by "infrastructure" code such as the daemon. The modules in italics are core Perl modules, included with the base Perl distribution (so you shouldn't need to download them).

Cognition has been tested on Mac OS X 10.4 and Mandriva Linux 2008. (There are some bugs in some recent versions of LibXSLT which cause crashes on Mac. You can work around this by disabling GRDDL support using the -o p_grddl=0 option.) It will probably work on Windows too.

Toby Inkster
http://tobyinkster.co.uk
Last modified: 2008-08-20
