Semantic Portals - The SWED Approach

The Semantic Community Portal approach aims to overcome the limitations of existing approaches to creating and maintaining Web-based community information resources. These include high maintenance costs and overheads, limited scope for third parties to re-use the information, the difficulty of adding new types of information, and others detailed in the Background section.

The details of the original specification for the demonstration system are given in the project specification document. Below we present an overview of the approach, in particular the aspects that make it distinct from traditional and existing approaches.

Overview

Figure 1 gives a basic overview of the Semantic Community Portals Approach. The most striking point when compared to traditional approaches to creating directories (or other types of web-based information) is the separation of data creation and storage from that of publication.

The data (in the standard format of the semantic web, RDF) is created and hosted by the information provider. This can be done in different ways: using a web form that generates the RDF file (as is generally the case with SWED), generating it from existing data in a database, or writing it by hand using a text editor. As long as the final file is in the correct RDF format and contains the expected types of information, the next stage will work.

diagram of overview of semantic Communities portals approach
Figure 1 - Overview of Semantic Communities Portals Approach

The data is harvested (i.e. the file is located, and a copy is made and stored in a database). In the case of SWED it is stored along with the thesauri/vocabularies that are used to categorise the organisations/projects and the display templates associated with the information, although in general these can all be stored independently, even on different servers on the web.

The portal viewer system then imports the information and processes it to display it for the user - dynamically generating the views (based on the templates) as the user browses or searches the site. SWED has chosen a 'faceted browse' interface, in which users can explore the information using facets (classification categories) under which the organisations/projects are classified.
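The idea of faceted browsing can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SWED implementation: the records, the facet names ("topic", "region") and their values are all hypothetical, and plain dictionaries stand in for the RDF data.

```python
from collections import defaultdict

# Hypothetical records: the facet names and values are illustrative only.
records = [
    {"name": "Org A", "topic": ["pollution", "waste"], "region": ["South West"]},
    {"name": "Org B", "topic": ["pollution"], "region": ["London"]},
    {"name": "Org C", "topic": ["wildlife"], "region": ["South West"]},
]


def build_facet_index(records, facets):
    """Map each (facet, value) pair to the set of record names classified under it."""
    index = defaultdict(set)
    for record in records:
        for facet in facets:
            for value in record.get(facet, []):
                index[(facet, value)].add(record["name"])
    return index


def browse(index, selections):
    """Return the names matching every selected (facet, value) pair.

    An empty list of selections returns an empty set in this sketch.
    """
    matches = [index.get(pair, set()) for pair in selections]
    return set.intersection(*matches) if matches else set()


index = build_facet_index(records, ["topic", "region"])
result = browse(index, [("topic", "pollution"), ("region", "South West")])
print(sorted(result))
```

Each additional selection narrows the result set, which is what lets a user explore the directory by combining classification categories rather than typing a query.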

However, perhaps more important than the specific technical architecture used by SWED is that the data is now part of the larger semantic web. Anyone can now harvest the information and make use of it too, for example to produce specialist directories and/or add specialist information to the existing information (e.g. information about museum collections or volunteering opportunities).

This can be done because the basic SWED data records are written in RDF (see above) and use externally available data elements and classification vocabularies. For example, address/contact data are represented using the vCard standard. Other data and classification terms are defined by SWED in a way that is linkable to other widely used vocabularies, e.g. terms in the SWED 'types of activity' classification can be mapped to the very widely used Standard Industrial Classification (SIC) system.

The following sections provide more detail about the approach taken by SWED, including the processes, the data format and why it is so easy to add information to existing RDF-based data.

Data Creation and Storage

diagram of creation and storage of organisations directory

Figure 2 - Creation and Storage of Organisations Directory Information

Figure 2 illustrates the most basic difference between existing approaches and the Semantic Portal approach. The data itself is created and hosted (stored) by the organisations on their own web site (or that of a related organisation, where they do not have a web site). This is exactly like an organisational or project homepage on the Web. You can think of the portal data file like a homepage for the Semantic Web.

The data file can be created in many different ways. Figure 2 shows two of these:

  1. A facilitating organisation (e.g. a directory publisher) might provide a Web form that the organisation can visit. When this is completed the RDF file generated can be sent to or downloaded by the organisation.
  2. A member of the organisation's technical team might produce the RDF using a simple text editor. How the file is created is independent of the subsequent use of it.

In this approach the organisations themselves are responsible for the publication of their own data. They create and update the data file.

example of simple SWED data file in RDF (XML syntax) 

Figure 3 - Example of simple data file in RDF (XML syntax)

The data file is written in the standard language of the Semantic Web, RDF (Resource Description Framework - see glossary). This is the equivalent of HTML for normal Web pages. An example of a simple file is shown in Figure 3.

The data contains various types of information about the organisation (metadata - see glossary) including name, contact details, the topics that describe its areas of interest, the kind of organisation, etc. This rich metadata means that the SWED site can provide a large number of ways of browsing and searching for organisations.

Although the file is not generally read by humans (it is designed for easy automatic processing by computers), it is possible to see that it is made up of properties of the organisation, e.g. the lines with <swed:has_topic at the start are 'topics' that the organisation is categorised under. The long URL-like values are used to indicate unambiguously the specific concepts used in this classification scheme. For example, the term 'enquiries' on its own is ambiguous; with the added http://www.swed.org.uk/2004/etc. it becomes clear that the term 'enquiries' is used in this case as it is used by the organisation that created or controls the http://www.swed.org.uk/2004/etc. web domain. This is called a namespace (see glossary). Ideally the namespace URL would point to a human-readable (and/or machine-readable) definition of the term.
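The way a namespace removes ambiguity can be shown with a small Python sketch. The prefix table below is an assumption made purely for illustration: the URIs are placeholders, not the real SWED namespaces, and the "other" vocabulary is invented.

```python
# Hypothetical prefix table; the URIs are placeholders, not SWED's real namespaces.
PREFIXES = {
    "swed": "http://example.org/swed-terms#",
    "other": "http://example.com/vocab#",
}


def expand(qname, prefixes=PREFIXES):
    """Expand a prefixed name such as 'swed:enquiries' into a full URI."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {qname!r}")
    return prefixes[prefix] + local


a = expand("swed:enquiries")
b = expand("other:enquiries")
print(a)
print(b)
print(a == b)  # the same word in two namespaces gives two distinct identifiers
```

Two vocabularies can therefore both use the word 'enquiries' without clashing, because a program compares the full identifiers rather than the bare term.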

The structure of the data file can be seen more clearly by looking at a diagram illustrating these properties. Figure 4 shows a simplified graphical representation of the data in Figure 3.

example of simple SWED data file in RDF (graph view)
Figure 4 - Simplified graphical representation of the data from Figure 3 above

The central green oval represents the organisation or project (with prorg_number of "prorg104" - this is only used for internal SWED use). Each of the purple lines represents a property of the organisation, e.g. has_primary_prorg_name is the property that defines the name of the organisation or project (in this case with the value "The Environment Council"). The blank green ovals represent 'values' of properties which are more complex, e.g. has_postal_address does not have a single value but is made up of other properties such as Street, Locality, etc.
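The graph described above can also be written down as a list of (subject, property, value) statements. The following Python sketch is an illustration only: the node labels "_:org" and "_:addr", the "swed:" and "vcard:" prefixes, and the address values are assumptions made for the example, and only the property names and the two quoted values come from the description above.

```python
# Each triple is (subject, property, value). "_:org" and "_:addr" are
# node labels chosen for this sketch; the address values are placeholders.
triples = [
    ("_:org", "swed:prorg_number", "prorg104"),
    ("_:org", "swed:has_primary_prorg_name", "The Environment Council"),
    ("_:org", "swed:has_postal_address", "_:addr"),
    ("_:addr", "vcard:Street", "<street placeholder>"),
    ("_:addr", "vcard:Locality", "<locality placeholder>"),
]


def properties_of(node, triples):
    """Collect a node's properties, following blank nodes into nested dicts.

    Assumes the data contains no cycles between blank nodes.
    """
    result = {}
    for subject, prop, value in triples:
        if subject != node:
            continue
        if value.startswith("_:"):
            value = properties_of(value, triples)
        result.setdefault(prop, []).append(value)
    return result


org = properties_of("_:org", triples)
print(org["swed:has_primary_prorg_name"])
print(org["swed:has_postal_address"][0])
```

The nested dictionary under has_postal_address corresponds to a blank green oval in the diagram: a value that is itself made up of further properties.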

Once the data file is made available by the organisation on its own (or another) Web site, it can be collected (harvested) using Web-based computer programs, similar to those used by search engines to collect and index information from Web pages. This means that any directory organisation, or indeed anyone with an Internet connection, can use (reuse) the information.

Collation and Publication

Figure 5 illustrates the collation and publication phases of the Semantic Portal approach. The RDF data files are harvested from the organisations' own Web sites. This is done using a software robot (bot) that systematically retrieves the RDF files of all organisations that are known to the directory.

An organisation might be known to the directory because i) the organisation has registered the location of the file with the directory (as is the case with the SWED Directory), or ii) the bot located the file itself, or iii) the directory organisation has used a third party index of the location of the RDF files.

In most Semantic Web applications similar to the SWED project the data files are harvested on a regular (often daily or hourly) basis. It may also be that it is possible to prompt the system to harvest a particular file.
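A harvesting run over a list of registered file locations might look roughly like the following Python sketch. It is a minimal illustration, not the SWED harvester: the function names, the in-memory store and the example URLs are all hypothetical, and a real system would also parse the RDF and schedule repeat runs.

```python
import urllib.request


def fetch_url(url, timeout=10):
    """Download one file and return its bytes (standard-library HTTP client)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def harvest(registered_urls, store, fetch=fetch_url):
    """Copy every registered file into `store`; return the URLs that failed."""
    failed = []
    for url in registered_urls:
        try:
            store[url] = fetch(url)
        except OSError:
            failed.append(url)  # keep any earlier copy and retry on the next run
    return failed


# Demonstration with a stand-in fetcher so the sketch runs without a network.
def fake_fetch(url):
    if "missing" in url:
        raise OSError("not found")
    return b"<rdf:RDF/>"


store = {}
failed = harvest(
    ["http://example.org/a.rdf", "http://example.org/missing.rdf"],
    store,
    fetch=fake_fetch,
)
print(sorted(store), failed)
```

Passing the fetch function as a parameter keeps the harvesting loop independent of how a file is retrieved, so the same loop could be prompted to re-harvest a single file or be driven by a scheduler.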

diagram of harvesting, collation and publication stages  of SWED

Figure 5 - The Harvesting, Collation and Publication Stages of the Semantic Portals Approach

Once the files have been harvested (step 1 in figure 5) they can be added to the directory publisher's RDF database(s). This database holds copies of the data. These copies are used to create the actual Web pages of the directory Web site (steps 2 and 3 in figure 5). The Web pages are generated using a template-based system, allowing the easy creation and editing of particular views of the information.
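Template-based page generation can be illustrated with Python's standard string templates. This is a sketch under stated assumptions rather than the template system SWED uses: the template text, the field names and the sample record are all hypothetical.

```python
from html import escape
from string import Template

# Hypothetical display template; the field names are illustrative only.
ORG_TEMPLATE = Template("<h1>$name</h1>\n<p>Topics: $topics</p>")


def render(record, template=ORG_TEMPLATE):
    """Fill the template from one record, escaping values for HTML."""
    return template.substitute(
        name=escape(record["name"]),
        topics=escape(", ".join(record["topics"])),
    )


page = render({"name": "Example Org", "topics": ["waste", "energy & climate"]})
print(page)
```

Because the view lives in the template rather than in the data, a different presentation of the same records only requires a different template.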

One other means of publication (more specifically syndication in this case) not detailed in figure 5 is the use of RSS (which stands for RDF Site Summary or Really Simple Syndication depending on the particular version) news feeds. RSS is a standard machine-readable format. It is widely used within the news industry for sharing and publishing categorised summary news feeds, to alert news agencies and customers to timely, relevant news items. Users can set up personal aggregators using various (often freeware) software (e.g. http://disobey.com/amphetadesk/) and choose which news feeds to collate. Using a form of RSS that uses RDF, a SWED-type directory could publish the information so that it can be harvested by users using RSS aggregators. This may be included in the next phase of SWED development.

The Mechanism for Reusing and Enriching the Information

Reuse of information is an integral aspect of the Semantic Web. Because the Semantic Community Portal is based on Semantic Web technical standards (e.g. RDF), other directory organisations will find it easy to harvest and collate the information. Figure 6 illustrates a number of ways that this may happen.

In stage 1 (Figure 6) the directory organisation selectively harvests the RDF files of organisations that are relevant to their particular area of interest (e.g. species conservation, or pollution control). This is possible because the RDF files contain the relevant classifications. The directory organisation then collates the information as before. However, in stage 2a they also add some additional specialist information themselves, thus adding value to the information for their particular specialist community of users (e.g. providing geographically related information). They might use their own vocabulary for categorising or describing the information.
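Selective harvesting of this kind amounts to filtering records by their classifications. The Python sketch below is illustrative only: the record layout, the organisation names and the topic labels are hypothetical, and plain dictionaries stand in for the parsed RDF files.

```python
def select_relevant(records, wanted_topics):
    """Keep records classified under at least one of the wanted topics."""
    wanted = set(wanted_topics)
    return [r for r in records if wanted & set(r["topics"])]


# Hypothetical harvested records.
harvested = [
    {"name": "Org A", "topics": ["pollution control", "recycling"]},
    {"name": "Org B", "topics": ["species conservation"]},
    {"name": "Org C", "topics": ["community arts"]},
]

relevant = select_relevant(harvested, ["pollution control", "species conservation"])
print([r["name"] for r in relevant])
```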

In stage 2b the directory organisation also harvests information from a third party information provider (e.g. the particular types of pollution control services the organisations provide), once again using their own vocabulary for categorising or describing the information. This enriches and adds value to the original information.

illustration of processes to reuse and enrich

Figure 6 - Illustration of processes to reuse and enrich information

The directory provider will then publish their specialist and enriched information to their Web site providing a set of customised views (e.g. Web pages, navigation system, search interfaces, ...) on the information.

Enriching the Basic Data

Enriching the data by integrating it with related information is a central aspect of the Semantic Web. Figure 7 illustrates how simple this can be. Imagine that the 3rd party information provider in figure 6 is providing information related to specialist services offered by a particular type (sub-category) of organisation, e.g. a type of pollution control or monitoring service. They simply need to create an RDF file, using their own property and terms, stating that the relevant organisations offer the specialist service; figure 7 shows a fragment of such a file.

example of a RDF fragment adding third party data
Figure 7 - Adding 3rd Party RDF data

This basically says that the organisation that has the property 'swed:has_primary_url' of "http://www.example.com" (i.e. the organisation with the homepage www.example.com) also has the property 'thirdparty:service' of "http://www.thirdparty.org.uk/terms#foobar". That is, it offers a service called 'foobar' that is categorised using the third party organisation's own vocabulary.

The new data is simply added to the RDF database stored by the 3rd party directory publisher, and can immediately be used to provide the additional information on their Web site, with minimal changes to the software configuration. This includes the ability to search or select organisations on the basis of whether they provide the specialist service.
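The merge described above can be sketched in Python using simple (subject, property, value) statements. This is an illustration under stated assumptions, not how an RDF database performs the merge: the node labels "_:org1" and "_:org2", the second homepage and the helper function names are hypothetical, while the property names and the 'foobar' term are taken from the example above.

```python
def enrich(base, additions):
    """Attach each (homepage, property, value) addition to the matching organisation."""
    by_homepage = {
        value: subject
        for subject, prop, value in base
        if prop == "swed:has_primary_url"
    }
    merged = list(base)
    for homepage, prop, value in additions:
        subject = by_homepage.get(homepage)
        if subject is not None:  # ignore statements about unknown organisations
            merged.append((subject, prop, value))
    return merged


def with_service(triples, service):
    """Select the organisations that offer the given specialist service."""
    return sorted(
        s for s, p, v in triples if p == "thirdparty:service" and v == service
    )


base = [
    ("_:org1", "swed:has_primary_url", "http://www.example.com"),
    ("_:org2", "swed:has_primary_url", "http://www.example.org"),
]
additions = [
    ("http://www.example.com", "thirdparty:service",
     "http://www.thirdparty.org.uk/terms#foobar"),
]

merged = enrich(base, additions)
providers = with_service(merged, "http://www.thirdparty.org.uk/terms#foobar")
print(providers)
```

The original statements are left untouched; the new property is simply one more statement about an organisation that is already identified by its homepage.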

Technical Architecture of SWED

The system specification is described in a separate document which can be found at:

http://www.w3.org/2001/sw/Europe/reports/requirements_demo_2/

The specification document covers the approach in more depth and provides an overview of the system architecture at a technical level. It also includes some examples of potential use cases. Below we simply give a high level review of the system architecture.

Finding Out More

If you would like more information about the Semantic Portals approach, the SWED project more generally, or SWAD-Europe projects, visit our contacts page to find out who to contact for your particular inquiry.