Swignition Input
Although the primary purpose of Swignition is to parse HTML, it accepts input in various formats. Its behaviour depends on the Content-Type HTTP header.
Parsing Mode | Perl Module Responsible | Content-Type |
---|---|---|
HTML | Swignition::HtmlParser | |
RDF/XML | Swignition::RdfXmlParser (wrapper for Redland) | |
Turtle / N-Triples | Swignition::TurtleParser (wrapper for Redland) | |
Notation 3 | Swignition::Notation3Parser (wrapper for Redland) | |
TriG | Swignition::TrigParser (wrapper for Redland) | |
Feeds | Swignition::FeedParser (wrapper for Redland) | |
TriX | Swignition::PoxParser | |
XML (generic) | | |
RDF/JSON | Swignition::JsonParser | |
JSON (generic) | | |
Notes on the Content-Type column:
* with the string xmlns:rdf in the first kilobyte of the file
† where the file starts with one of the characters _<# , optionally preceded by white space
‡ with the string <rss or http://www.w3.org/2005/Atom or http://purl.org/atom/ns in the first kilobyte of the file
¶ where the root element is <TriX>
§ if it looks like RDF/JSON
HTML
Swignition is able to parse XHTML and HTML — indeed, this is its primary purpose. It can understand a wide variety of Microformats and other POSH formats (built-in, non-GRDDL support), including:
- adr (extensions)
- figure microformat
- geo (extensions)
- hAtom 0.1 (extensions)
- hAudio 0.9
- hCalendar 1.1
- hCard (extensions), including determining the representative hCard and contact hCard
- hRecipe PROPOSAL (notes)
- hResume 0.1 (notes)
- hReview 0.3 (notes)
- ICBM
- measure DRAFT
- OpenURL COinS (extensions)
- rel-enclosure
- rel-license
- rel-tag (extensions)
- species PROPOSAL (notes)
- XEN
- XFN 1.1 (notes)
- xFolk RC1 (extensions)
- XOXO
It recognises various ways of embedding RDF within HTML:
- RDFa
- GRDDL (transformation languages: XSLT1, RDF-EASE)
- eRDF
- RDF/XML embedded in XHTML using namespaces
- RDF/XML hidden in HTML <!--comments-->
And of course it understands HTML's built-in features for metadata and document structure:
- HTML <title>, <meta> and <link> tags
- Document outline determined by headings and HTML5/XHTML2 sectioning elements
- XHTML @role module
RDF Serialisations (RDF/XML, Notation3, Turtle, N-Triples & TriG)
RDF/XML, Turtle and N-Triples are parsed, as is TriG if you've got a recent enough version of Redland. Notation3 is attempted, but doesn't always work.
Feeds (RSS & Atom)
Swignition uses Redland to convert "tag soup" RSS or Atom into strict RSS 1.0, and is thus able to read them as RDF.
Item descriptions in RSS feeds are treated as HTML (if they look like they're a bit more than plain text) and inspected for Microformats and for RDFa. Entry summaries in Atom feeds are treated similarly if they are explicitly marked as type="html".
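As a rough illustration of the selection step, the Python sketch below picks out Atom entry summaries marked type="html" and applies a guess at a "more than plain text" test to an RSS description. Both function names are hypothetical, and the markup test is an assumption: the actual heuristic Swignition uses is not described here.

```python
import re
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def html_summaries(atom_xml: str) -> list[str]:
    """Return the content of entry summaries explicitly marked type="html"."""
    root = ET.fromstring(atom_xml)
    found = []
    for entry in root.iter(ATOM + "entry"):
        summary = entry.find(ATOM + "summary")
        if summary is not None and summary.get("type") == "html":
            # The XML parser has already unescaped the markup in the text.
            found.append(summary.text or "")
    return found


def more_than_plain_text(description: str) -> bool:
    """Assumed heuristic: the text contains something shaped like an HTML tag."""
    return re.search(r"<[a-zA-Z][^>]*>", description) is not None
```

The strings returned by html_summaries would then be handed to the HTML parser for Microformat and RDFa inspection.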
XML
Any media type containing the string xml, except those noted above, is recognised as XML.
If the root element's tag name is <TriX> then Swignition will parse it as TriX. Swignition will take notice of <?xml-stylesheet ?> processing instructions and run the file through any XSLT1 transformations (in the order they're encountered) before parsing it. Swignition is able to handle multiple TriX graphs (it just merges them into one graph, but does so properly).
For non-TriX XML files, Swignition attempts to parse them using GRDDL.
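The XML decisions described in this section might be sketched like this in Python. It is an illustration only: the function names are hypothetical, the media-type check is assumed to be case-insensitive, the "except those noted above" exclusions are not handled, and the XSLT transformations themselves are left out (the sketch only lists the stylesheet references in document order).

```python
import re
from xml.dom import minidom
from xml.dom.minidom import Node


def is_xml_media_type(content_type: str) -> bool:
    """True if the media type contains the string 'xml' (case-insensitivity assumed)."""
    return "xml" in content_type.lower()


def inspect_xml(source: str) -> tuple[bool, list[str]]:
    """Return (root element is TriX, xml-stylesheet hrefs in encounter order)."""
    doc = minidom.parseString(source)
    stylesheets = []
    for node in doc.childNodes:
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The PI data is a pseudo-attribute string such as: href="a.xsl"
            match = re.search(r'href\s*=\s*(["\'])(.*?)\1', node.data)
            if match:
                stylesheets.append(match.group(2))
    # Ignore any namespace prefix when comparing the root tag name.
    root_name = doc.documentElement.tagName.split(":")[-1]
    return root_name == "TriX", stylesheets
```

A caller would apply the listed stylesheets in order, then parse the result as TriX if the first value is true, or fall back to GRDDL otherwise.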
JSON
Swignition understands RDF/JSON, but will only parse JSON this way if it "seems to be RDF/JSON" (which it checks by looking for colons in the key strings of the JSON root object - they each need a colon). Otherwise, you may explicitly indicate that the file is actually RDF/JSON and not generic JSON by including a link to the RDF/JSON schema. For example:
{ "$schema" : { "$ref" : "http://SOAPjr.org/schemas/RDF_JSON" } , "http://example.org/about" : { "http://purl.org/dc/elements/1.1/title": [ { "type" : "literal" , "value" : "Anna's Homepage." } ] } }
Swignition also understands jsonGRDDL to extract data from generic JSON.
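The "seems to be RDF/JSON" test described above could be sketched as follows. This is a Python illustration rather than Swignition's own code: the function name is hypothetical, and treating an empty root object as generic JSON is an assumption.

```python
import json

RDF_JSON_SCHEMA = "http://SOAPjr.org/schemas/RDF_JSON"


def seems_rdf_json(text: str) -> bool:
    """Guess whether a JSON document should be parsed as RDF/JSON."""
    root = json.loads(text)
    if not isinstance(root, dict):
        return False
    # An explicit link to the RDF/JSON schema settles the question.
    schema = root.get("$schema")
    if isinstance(schema, dict) and schema.get("$ref") == RDF_JSON_SCHEMA:
        return True
    # Otherwise every key of the root object needs a colon.
    # Assumption: an empty root object is treated as generic JSON.
    return bool(root) and all(":" in key for key in root)
```

Note that in the example document above the "$schema" key itself contains no colon, so the colon test alone would fail; it is the schema link that marks the file as RDF/JSON.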