
Master Thesis

Integrating Heterogenous Document Sources with XML and Pipelines into a Portal

Institute of Information Systems and Computer Media, Graz University of Technology

Author: Christian Mayrhuber
Supervisor: Univ.-Ass. Dipl.-Ing. Dr.techn. Harald Krottmaier
Assessor: o. Univ.-Prof. Dr. Dr. h.c. mult. Hermann Maurer

February 9, 2005


Abstract

During the hype about object-oriented programming, the software component technology of UNIX Pipes & Filters was almost forgotten. Yet by using XML as the input/output format for filter programs, multi-channel publishing pipelines can be built. Thanks to the concept of content transformation, a publishing pipeline is able to produce almost any document format. This makes it an easy task to generate HTML and PDF from the same document source. The Apache Cocoon framework develops this idea further by using object-oriented Java components and SAX streams to build publishing pipelines. With components operating on top of a filesystem abstraction, this framework is easily extended to communicate with existing document management systems. Cocoon contains a flow controller able to model the page flow in a high-level language instead of a finite state machine, which greatly enhances maintainability. A powerful XML form processing framework is also included. Based upon this core functionality, the Cocoon project created a versatile portal framework, which is used here to demonstrate the integration of the enterprise content management platform Hyperwave IS/6, remote web sites and the digital library Meta-Search-Engine Daffodil into a portal.


Kurzfassung (abstract in German, translated)

During the hype around object-oriented programming, the software component technology of UNIX Pipes & Filters was almost forgotten. Using XML as the input/output format for the filter programs makes it possible to produce several document formats by transforming the content. It is therefore easy to generate HTML and PDF from one and the same data source. The Apache Cocoon framework extends this idea by using Java components and SAX data streams to model multi-channel publications. Through components that work with a file system abstraction, the framework can easily be extended with access to document management systems. The flow control unit contained in Cocoon allows the page sequence to be programmed in a high-level language, which greatly eases maintenance. The processing of XML forms is supported as well. Based on this core functionality, the Cocoon project developed a versatile portal framework, which is used to demonstrate the integration of the enterprise content management platform Hyperwave IS/6, remote web sites and the digital library meta-search engine Daffodil into a portal.


Contents

Preface

1 Introduction
  1.1 Search
  1.2 Structured Content

2 Methodology
  2.1 Portal Frameworks
    2.1.1 Stateful Portals
    2.1.2 Hyperlinks in a Portal
    2.1.3 Problematic HTML Technologies in a Portal Delivering HTML
  2.2 Remote HTML Content Inclusion
    2.2.1 Content Inclusion with HTML Inline Frames
    2.2.2 Inclusion of Remote Content by Copy
    2.2.3 Legal Considerations
  2.3 The Unrivaled Flexibility of XML Technology
    2.3.1 XPath, XPointer and XInclude
    2.3.2 Extensible Stylesheet Language for Transformations and Formatting Objects
    2.3.3 XML Schema
    2.3.4 XUpdate
    2.3.5 XQuery
    2.3.6 XML:DB API
  2.4 Multi-Channel Publishing
    2.4.1 UNIX Pipes
    2.4.2 Publishing Channel Setup
    2.4.3 Steps 1 and 2: Source Documents
    2.4.4 Steps 3 and 4: Intermediate Data Generation and Distribution
    2.4.5 Steps 5’ and 6’: Generate HTML
    2.4.6 Steps 5” and 6”: Generate PDF

3 Overview of XML Software
  3.1 Cocoon Core
  3.2 Concepts and Pipelines
    3.2.1 Pipelines and Views
    3.2.2 Pipeline Components
  3.3 Control Flow
    3.3.1 MVC revisited
  3.4 Cocoon Forms
  3.5 Portal Engine
    3.5.1 Coplets
    3.5.2 Portal Communication
    3.5.3 Authentication and Profiles
  3.6 NXD Access
    3.6.1 XML:DB Access
    3.6.2 WEBDAV

4 Integration of External Content Into the Cocoon Portal
  4.1 Hyperwave Information Server (HWIS)
    4.1.1 Components
    4.1.2 An Example Portal Page
  4.2 Remote Content
    4.2.1 Components
    4.2.2 Example for the Proxy Coplet
  4.3 Daffodil
    4.3.1 Components
    4.3.2 Daffodil Search Coplet

5 Conclusion and Future Prospects
  5.1 Conclusions
  5.2 Ongoing Developments of Cocoon and eXist
  5.3 An Introduction to the Semantic Web

List of Figures

Listings

Bibliography

Preface

I’d like to thank the Apache Cocoon and eXist open source communities for the numerous hours of work they have put into creating and supporting two shining software products of the XML computing domain. Thanks go to Carsten Ziegler, the developer of the Cocoon Portal Engine, to Wolfgang Meier, the lead developer of the eXist XML database, for giving tips and fixing bugs on various occasions, and to Klaus-Peter Clas for his support, aid and bugfixes during the Daffodil integration phase. Special thanks go to my supervisor Harald Krottmaier for his patience and enduring support and to Georg Gassner for proofreading and improving the English of this thesis. This work was partly supported by DELOS, a Network of Excellence on Digital Libraries (EU FP6, G038-507618).


The thesis consists of the five chapters described below. Chapter 1 gives an introduction to the possibilities of information retrieval in the WWW. Chapter 2 develops methodologies which can be used to conflate content from remote web sites and deliver it to different media. Chapter 3 gives a technical overview of the open-source XML framework Cocoon (http://cocoon.apache.org) from the Apache Software Foundation (http://www.apache.org) and of the native XML database eXist (http://www.exist-db.org). Chapter 4 focuses on the practical integration of the Hyperwave document management system, remote web sites and the Daffodil framework into the Apache Cocoon portal engine. Chapter 5 draws a conclusion and offers an outlook on the future of XML technologies on the WWW.

Chapter 1

Introduction

The World Wide Web (WWW) was designed to make it easy for people to publish documents [1] and to read them. Its frontend programs, the so-called web browsers, have been under active development since their birth [2] and now present a comfortable user interface. However, the core of the WWW offers no technology to ease the lookup of information. Since the World Wide Web was proposed by Tim Berners-Lee on November 12, 1990, the number of web sites has grown from one to 54,407,216 in September 2004 and continues to grow by more than one million sites a month [3]. The predominant way to find information on the Internet is to ask a search engine for web sites. Only twenty percent of people use the navigation presented on a site, while about fifty percent of users are focused on a task and go directly to the search function a web site offers [4]. Fifty-four million sites make it difficult to find exactly the ones that offer what a person is looking for. The extraordinary importance of WWW search can be observed in the user interface of modern web browsers, which allow a WWW search to be performed directly from the browser toolbar. Today's search engines only return high-quality results for written material. Looking for non-written material like images, audio or video often requires special software to obtain reasonably good search results.

1.1 Search

Most people find that a search engine returns many unwanted results when the catchwords they use appear in many different contexts they had not thought of. The only way to improve the quality of the search results is trial and error: either more words are added or different words are used. Most search engines allow the search to be limited to specific sites or domains. Sometimes it is advantageous to whitelist sites known for having valuable content in the field of interest.

Two different search techniques for Web content can be distinguished: Web Crawlers and Meta-Search-Engines.

Web Crawlers are search engines which start with a list of links, like server lists, and look at these pages for more links to add to their information database. This database is then used to answer queries and rank the search results. Two well-known search engines using web crawler technology are Google (http://www.google.com) and AltaVista (http://www.altavista.com).

Meta-Search-Engines do not build a search database by themselves. They just query existing search engines, remove duplicate results and rank them. For instance, Ithaki.net (http://www.ithaki.net/) and ez2Find (http://ez2find.com/) both allow searching the whole Internet by querying multiple search engines.

Recent development of search engines focuses on non-written material and domain-specific information retrieval. These search technologies require a large amount of processing power. Examples are object detection and similarity matching for images, but these techniques are still under development. The most advanced lookup techniques for non-written or domain-specific material exist where metadata (data about data) is available. Very prominent examples are peer-to-peer music sharing applications. These applications use the meta information available in digital audio files, so-called tags, to enable searching by artist, album, genre or song title. High-quality metadata is usually available in the domain of digital libraries. Metadata sets of publications often contain entries for authors, title of the publication, publishing date, publisher, the type of publication and a list of references.

Daffodil

The German Research Foundation (DFG) is funding the Daffodil project (an agent-based architecture for supporting high-level search activities in federated digital libraries for computer science) as part of the strategic research initiative "Distributed Processing and Delivery of Digital Documents". Daffodil is available as a personalized Java™ Web Start client (DAFFODIL webstart, http://www.daffodil.de/webstart.html) or as an agent framework to perform queries. Using Java™ Web Start technology, standalone Java™ software applications can be deployed with a single click over the network; Java™ Web Start ensures that the most current version of the application is deployed, as well as the correct version of the Java™ Runtime Environment (JRE). An agent is a piece of software able to fulfill a task autonomously. Agents of the same type can interact by using predefined communication protocols.
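The meta-search technique described in this section (query several existing engines, remove duplicate results, rank what remains) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two stand-in engines return fixed, made-up results, duplicates are detected with a crude URL normalization, and the ranking rule (summed reciprocal rank) is just one plausible choice, not the method used by Daffodil or by any engine named above.

```python
from typing import Callable, Dict, Iterable, List

# A search wrapper takes a query and returns result URLs, best match first.
SearchWrapper = Callable[[str], List[str]]


def normalize(url: str) -> str:
    """Crude duplicate key: ignore case and a trailing slash (assumption)."""
    return url.lower().rstrip("/")


def meta_search(query: str, wrappers: Iterable[SearchWrapper]) -> List[str]:
    """Query every engine, merge duplicates, rank by summed reciprocal rank."""
    scores: Dict[str, float] = {}
    first_seen: Dict[str, str] = {}
    for wrapper in wrappers:
        for position, url in enumerate(wrapper(query), start=1):
            key = normalize(url)
            first_seen.setdefault(key, url)
            # A result found by several engines, or ranked high, scores more.
            scores[key] = scores.get(key, 0.0) + 1.0 / position
    ranked = sorted(scores, key=lambda k: -scores[k])
    return [first_seen[k] for k in ranked]


# Two stand-in engines with fixed answers (hypothetical data).
def engine_a(query: str) -> List[str]:
    return ["http://example.org/a", "http://example.org/b"]


def engine_b(query: str) -> List[str]:
    return ["http://example.org/b/", "http://example.org/c"]
```

Calling `meta_search("xml", [engine_a, engine_b])` places the result both engines agree on first, because its scores from the two lists add up.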


Basically, Daffodil is a Meta-Search-Engine. It uses search wrappers to query the local search engines of digital libraries with fulltext or metadata. The results are ranked and cached in a relational database and can be retrieved by the agent framework. The Daffodil webstart client offers a variety of features and supports queries for documents, conferences, journals, proceedings and references, including a graph of co-authors.

DAFFODIL is a search system for Digital Libraries aiming at strategic support during the information search process. From a user point of view this strategic support is mainly implemented by high-level search functions, so-called stratagems, which provide functionality beyond today's digital libraries. Through the tight integration of stratagems and with the federation of heterogeneous digital libraries, DAFFODIL reaches high effects of synergy for information and services. These effects provide high-quality metadata for the searcher through an intuitively controllable user interface. The implementation of stratagems follows a tool-based model [5]. A stratagem is a complex set of actions composed of basic moves and tactics that exploit the information structure of a single search domain [5].

1.2 Structured Content

In some cases searching the Web does not return high-quality matches. In such cases using a specialized information source may be advisable.

Web Directory

A web directory is a source of information that provides categorized content. Web directories develop taxonomies and index previously collected and hand-picked information. These directories often provide additional information on topics, suggest similar topics and allow searching within a limited area. The maintenance of a web directory requires a huge amount of work; therefore only a few directories exist which cover the whole WWW. Most directories focus on a specific domain only. The biggest web directory is the Open Directory Project (http://www.dmoz.org), which had collected around four million links by mid-2004 [6]. However, the most popular web directory is Yahoo! (http://www.yahoo.com).

Web Portal

A web portal is a web site that provides unified access to aggregated content on a common topic, enterprise or institution. Portals are designed to use varying kinds of information sources, like remote web sites or distributed applications, to provide a number of different services. Portals usually allow their users to personalize them. With the rise of mobile computing, portals sometimes provide services to mobile devices like cell phones or personal digital assistants (PDAs). In recent years portal technology has become a key component in the consolidation of enterprise information technology. Portals enable enterprises to give their employees access to software services, even if these services are running on different operating systems or hardware platforms. Such portals are called enterprise information portals (EIPs). EIPs work as a single point of access, thus simplifying mandatory services like single sign-on, personalization and the presentation of a uniform user interface. One of the most flexible portal engines is part of the Cocoon XML framework developed by the Apache Software Foundation.


Digital Library

A digital library is a collection of digital content including multimedia works. It stores the digital content and makes it available by utilizing digital storage and retrieval technologies. Digital libraries are usually the highest-quality information sources available on the Internet. A rising number of research papers, often produced with the support of public funding, is only published commercially, making the content inaccessible to students or poorer countries [7]. Digital libraries often use portal technology to offer advanced services like personalization or various levels of subscription. Payment may be required per issue or on a yearly basis.

On-line Encyclopedia

On-line encyclopedias are in principle a special form of digital library, containing a written compendium of human knowledge. Encyclopedias may focus on a specific domain or be of a general nature. Currently the most exciting project is Wikipedia (http://www.wikipedia.org), a free encyclopedia built upon the collaboration of Internet users and Wiki technology. It takes the idea that a Web user can be an author at the same time to the next level.

Persistent Names

On the Internet a digital object can be accessed by a Uniform Resource Locator, URL for short. A URL consists of a protocol, hostname, port and a protocol-specific path identifier. The notion of a URL implies that any digital object retrieval is bound to a hostname. For most content this is sufficient; however, there are cases where a persistent name for a digital object is required that does not depend on a hostname, but stays valid even if the document moves to a different site. The simplest system for persistent names is Persistent URLs (or PURLs). A PURL does not point directly to a digital object, but to a directory service that issues an HTTP redirect to the digital object. PURLs never came into wide usage, but a different system was developed in the publishing domain, where persistent names are required for referencing. This system, called Digital Object Identifier (or DOI; DOI® and DOI.ORG® are registered in the U.S. Patent and Trademark Office), was brought to life and is now administered by the International DOI Foundation.

Application Profile (AP)

Every DOI is associated with one or more Application Profiles. An Application Profile is a group of DOIs which share a common metadata structure and rules. Metadata, policy and business rules are determined by registration agencies. These registration agencies are fully autonomous in their business model and issue DOIs with varying pricing. DOIs can have three levels of relationship with an Application Profile:

• Zero AP: No Application Profile is associated with this DOI. In the beginning all DOIs issued were of this Application Profile. A Zero AP DOI is similar to a PURL; no metadata is associated with it.
• Base AP: DOIs with this Application Profile have the kernel metadata associated. The kernel metadata consists of the DOI value, the DOI AP name and further kernel metadata elements [8].
• Full AP: The Full Application Profile consists of the Base Application Profile, other metadata and business rules. Full Application Profiles exist for text applications (http://www.editeur.org) and for MPEG applications (http://mpeg.telecomitalialab.com), to name a few.

DOI structure

A DOI consists of a prefix, the directory identifier, followed by a / and a suffix resolvable by the directory. A typical DOI might look like:

10.1002/ISBN0-471-58064-3

• 10.1002 is the directory identifier.
• ISBN0-471-58064-3 is the suffix.

DOIs are written in a URI-like naming scheme, for example: doi:10.1002/ISBN0-471-58064-3. These DOIs can be entered into a form at http://www.doi.org/ or retrieved as a hyperlink through a DOI proxy. For instance: http://dx.doi.org/10.1002/ISBN0-471-58064-3

DOI prefix

The DOI prefix is maintained by the International DOI Foundation, which issues prefixes to registration agencies [9]. In the beginning prefixes were sold for a one-time fee of $1000, but they are now only issued to registration agencies which develop application profiles.


DOI suffix

The DOI suffix is an alphanumeric string that is unique within the prefix and issued by registration agencies. Suffixes are case insensitive. Often legacy identifiers like the ISBN, ISSN, etc. are used to generate the suffix. Slashes are allowed in the suffix string to support legacy URL naming schemes like jucs_10_7/partial_categorical_multi_combinators.
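The structure of a DOI described above (prefix, a slash, a case-insensitive suffix that may itself contain slashes, and the hyperlink form served by the DOI proxy) can be sketched in a few lines of Python. The function names are hypothetical and chosen only for this illustration; the sketch does no percent-encoding and does not check that the prefix has actually been issued.

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash."""
    if doi.lower().startswith("doi:"):
        doi = doi[4:]  # accept the URI-like form doi:10.1002/...
    prefix, separator, suffix = doi.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def same_doi(a: str, b: str) -> bool:
    """Compare two DOIs, treating the suffix as case insensitive."""
    prefix_a, suffix_a = split_doi(a)
    prefix_b, suffix_b = split_doi(b)
    return prefix_a == prefix_b and suffix_a.lower() == suffix_b.lower()


def proxy_url(doi: str) -> str:
    """Build the hyperlink form served by the DOI proxy."""
    prefix, suffix = split_doi(doi)
    return f"http://dx.doi.org/{prefix}/{suffix}"
```

Splitting at the first slash only is what keeps suffixes with embedded slashes intact: everything after the first slash belongs to the suffix.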

The next chapter addresses some of the problems that may occur on the way to building a portal with XML technology and data-oriented processing.

Chapter 2

Methodology

This chapter addresses key problems that appear when trying to build a highly integrated site which depends on content from different sources. First, the way web portals work is examined. Next, some additional problems that arise when a portal has to include legacy HTML content are addressed. After that, moving one technology step higher, the domain of XML and its powerful data processing capabilities are examined. Finally, pipeline processing is looked at from a UNIX perspective, with XML as the intermediate language, through the presentation of a comprehensive example. This example introduces the basics of pipeline processing, which are needed to understand the advanced capabilities of the Apache Cocoon framework.

2.1 Portal Frameworks

Portal frameworks are designed to aggregate content from different sources and are therefore a natural way to accomplish the goal of a fully integrated web site.


Figure 2.1: Composition of a Portal Page

Portal frameworks are comprised of:

• Portal: is responsible for aggregating content from different sources and presenting a personalized view to the user.
• Authentication Service: authenticates users against some kind of user database and provides the basis for personalization.
• Event Manager: enables the portal to carry out user interface interaction and answers user requests by forwarding them to portlets.
• Layout Renderer: draws the portlet decoration and controls, see figure 2.1. There may be different renderers for different layouts, like window layouts or tab layouts.
• Portlet: generates a fragment of the content needed for the portal to present a user interface.
• Portlet Container: runs portlets and provides life cycle management for them. Different portlet containers are able to run different kinds of portlets.
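How these components cooperate can be shown with a deliberately small Python sketch. It is not the API of Cocoon or of any portal framework: all class and function names are hypothetical, authentication and event handling are left out, and the window decoration is a made-up piece of markup.

```python
from typing import Dict, List, Protocol


class Portlet(Protocol):
    def render(self) -> str:
        """Return one fragment of content (not a whole page)."""
        ...


class StaticPortlet:
    """Simplest possible portlet: always returns the same fragment."""

    def __init__(self, fragment: str) -> None:
        self.fragment = fragment

    def render(self) -> str:
        return self.fragment


class PortletContainer:
    """Runs portlets and keeps track of the deployed instances."""

    def __init__(self) -> None:
        self._portlets: Dict[str, Portlet] = {}

    def deploy(self, portlet_id: str, portlet: Portlet) -> None:
        self._portlets[portlet_id] = portlet

    def render(self, portlet_id: str) -> str:
        return self._portlets[portlet_id].render()


def decorate(title: str, fragment: str) -> str:
    """Layout renderer: wrap a fragment in a window-style decoration."""
    return f'<div class="window"><h2>{title}</h2>{fragment}</div>'


def render_page(container: PortletContainer, layout: List[str]) -> str:
    """Portal: aggregate the decorated fragments into one page body."""
    return "".join(decorate(pid, container.render(pid)) for pid in layout)
```

The point of the split is that a portlet never produces a whole page: it only returns a fragment, and the portal decides where and how that fragment appears.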

2.1.1 Stateful Portals

Portals need to track user sessions to know how to present the layout to the user. Portal frameworks are web frameworks and are subject to the restrictions of the HTTP protocol. A shortfall of HTTP is that it is stateless: each request is handled independently of earlier ones, which prevents easy tracking of user sessions. Session tracking must be performed by the portal, either by encoding the session id into URLs or by storing it in web browser cookies. Portlets may be hidden, minimized or in some other view state. The implementation of a portlet may have a state, too.

This statefulness is in conflict with the frontends, the web browsers. Web browsers maintain a history list of the most recently viewed web pages. This list can be traversed with the web browser's Forward and Back buttons. Portals can react to this situation in two ways: either disable the browser's toolbar to disable the history function, or ignore client requests from the history list. To be able to distinguish old requests in the browser history from those of the current page, portals have to add a parameter to the hyperlink URLs that advances in time. The following URLs use a request counter:

1. http://myportal.at/somepath?portal-request-counter=0
2. http://myportal.at/somepath?portal-request-counter=1
3. http://myportal.at/somepath?portal-request-counter=2
4. http://myportal.at/somepath?portal-request-counter=3
5. http://myportal.at/somepath?portal-request-counter=4

The portal remembers the current request number through its session management and can discard old requests by comparing it with the portal-request-counter request parameter. If an old request is made, the portal simply presents its current view.
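The request counter mechanism can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of a particular portal: the class name `PortalSession` is hypothetical, the counter is kept in memory per session, and a request without a counter is treated like a stale one, which the text above does not prescribe.

```python
from urllib.parse import parse_qs, urlsplit

COUNTER_PARAM = "portal-request-counter"


class PortalSession:
    """Per-user state; the counter advances with every accepted request."""

    def __init__(self) -> None:
        self.counter = 0

    def link(self, path: str) -> str:
        """Issue a hyperlink stamped with the current counter value."""
        return f"{path}?{COUNTER_PARAM}={self.counter}"

    def accept(self, url: str) -> bool:
        """Return True for a current request, False for one from history."""
        values = parse_qs(urlsplit(url).query).get(COUNTER_PARAM, [])
        if len(values) != 1 or not values[0].isdecimal():
            return False  # assumption: a missing counter counts as stale
        if int(values[0]) != self.counter:
            return False  # stale: the portal just shows its current view
        self.counter += 1  # links on the next page carry the new value
        return True
```

A link replayed from the browser history still carries the old counter value, so `accept` rejects it and the portal can fall back to presenting its current view.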

2.1.2 Hyperlinks in a Portal

Portals need to get information about events. The only way to get information about events by means of a web browser is through specially created URLs. Portals share the URL namespace with their portlets to get notifications. This means that the portal itself must have control over what a URL looks like. Portals normally have at least two kinds of URLs, so-called Render URLs and Action URLs. Render URLs result in changes of the portal user interface. Action URLs are processed by portlets and usually result in content fragment changes. To obtain a valid URL, portlets may only issue requests to the event manager of the portal framework. If the user clicks on such a URL, a request is sent to the portal. This request is then resolved by the event manager, which notifies the portlet. A render URL request resolves directly to the portal and causes a change in the display of a portlet. For instance, a portlet in a window may get minimized.
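The round trip of an action URL (the portlet asks the event manager for a URL, the user clicks it, the event manager resolves the request and notifies the portlet) might look like the following Python sketch. The class `EventManager`, the request parameter `portal-event` and the in-memory event table are assumptions made for this illustration and do not describe the API of any real portal engine.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import parse_qs, urlsplit

Handler = Callable[[Dict[str, str]], None]


class EventManager:
    """Issues portal URLs to portlets and routes clicks back to them."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url
        self._events: Dict[str, Tuple[Handler, Dict[str, str]]] = {}

    def action_url(self, handler: Handler, data: Dict[str, str]) -> str:
        """Register an event and return the URL the portlet may emit."""
        event_id = str(len(self._events))
        self._events[event_id] = (handler, data)
        return f"{self.base_url}?portal-event={event_id}"

    def dispatch(self, url: str) -> bool:
        """Resolve an incoming request and notify the owning portlet."""
        ids = parse_qs(urlsplit(url).query).get("portal-event", [])
        if len(ids) != 1 or ids[0] not in self._events:
            return False  # not a URL issued by this event manager
        handler, data = self._events[ids[0]]
        handler(data)
        return True
```

Because the portlet only ever sees the opaque URL handed out by `action_url`, the portal keeps full control over what its URLs look like.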

2.1.3 Problematic HTML Technologies in a Portal Delivering HTML

HTML was created to describe the look of a whole page, not just a part of it. This design decision, and a rather uncontrolled development of HTML and associated technologies during the early years of the Web, led to a technology mix that is problematic in portal environments.

HTML Frames

HTML frames may be completely unassociated with the view and the state of the portal. They must not be used in a portal because they prevent the portal from controlling the view presented to the user. Inline frames, defined by the iframe element, may however be used with great caution. Inline frame usage is permissible because inline frames can be embedded into HTML elements and do not distort the portal view. Their usage should be avoided, though, because the portal has no control over how the content is displayed.

Cascading Style Sheets (CSS)

CSS definitions should be avoided in generated portlet fragment content, because they may conflict with the ones the portal applies to the page as a whole. Portlet CSS definitions can completely destroy a portal view if not used cautiously. Best practice is to prefix the CSS definitions with a unique namespace to avoid CSS clashes.
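One way to read the namespace advice is to scope every selector of a portlet's stylesheet under a class that is unique to that portlet. The Python sketch below does this for flat stylesheets only; the function name `scope_css` is hypothetical, and the regular expression is an assumption that deliberately ignores comments, @media blocks and other nested constructs, for which a real CSS parser would be needed.

```python
import re

# Matches "selector list { declarations }" in a flat stylesheet.
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def scope_css(css: str, namespace: str) -> str:
    """Prefix each selector of a flat stylesheet with '.<namespace> '."""

    def rewrite(match: "re.Match[str]") -> str:
        selectors = [s.strip() for s in match.group(1).split(",")]
        scoped = ", ".join(f".{namespace} {s}" for s in selectors if s)
        return f"{scoped} {{{match.group(2).strip()}}}"

    return "\n".join(rewrite(m) for m in RULE.finditer(css))
```

With the portlet fragment wrapped in an element carrying the class `news`, `scope_css("p{margin:0}", "news")` yields `.news p {margin:0}`, which can no longer restyle paragraphs elsewhere on the portal page.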

JavaScript

JavaScript libraries may only be defined for the portal as a whole, so that a defined function set is available. Defining JavaScript functions or variables in a portlet fragment may result in a name clash. If per-portlet JavaScript is needed, functions and variables should be prefixed with a unique namespace.

2.2 Remote HTML Content Inclusion

The complexity of HTML may require a huge effort to include the content of a remote site into a portal. Especially the problematic technologies mentioned above may result in huge problems, which will be addressed later. Basically two possibilities exist for including remote HTML. If all of the following statements are true, then Content Inclusion with HTML Inline Frames can be used:

• The content of the remote site can use different CSS than the portal.
• The whole remote site should be presented in the portal.
• There is no need to integrate the menu of the remote site into the menu of the portal.

If not, then Inclusion of Remote Content by Copy should be used to solve the problem.

2.2.1 Content Inclusion with HTML Inline Frames

Including the content of a remote web site in a portlet fragment by means of an iframe element is easy. An IFRAME acts like a little, embedded browser window with its own JavaScript and Cascading Style Sheet definitions, as can be seen in listing 2.1.

Listing 2.1: An IFRAME example

    <iframe src="..." width="...">
      Your browser doesn’t understand inline frames.
      You may directly access the embedded site by clicking on
      <a href="...">Title of Hyperlink</a>
      It is recommended to upgrade your browser to a newer version.
    </iframe>

The src attribute specifies the origin of the remote content. Note: The width attribute of the iframe element should always be specified as a percent (%) value, otherwise a width that is too large may distort the layout of the portal page.
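A portlet that follows this note could generate its fragment with a small helper that refuses anything but a percent width. The Python sketch below is an illustration only: the function name, the default height and the exact fallback wording are assumptions, not part of any portal API.

```python
from html import escape


def iframe_fragment(src: str, link_title: str, width_percent: int = 100,
                    height_px: int = 400) -> str:
    """Return an inline frame whose width is always a percent value."""
    if not 1 <= width_percent <= 100:
        raise ValueError("width must be between 1 and 100 percent")
    url = escape(src, quote=True)  # keep & and quotes from breaking markup
    return (
        f'<iframe src="{url}" width="{width_percent}%" height="{height_px}">'
        "Your browser doesn't understand inline frames. "
        "You may directly access the embedded site by clicking on "
        f'<a href="{url}">{escape(link_title)}</a>.'
        "</iframe>"
    )
```

Rejecting widths above 100 percent at generation time keeps a single misconfigured portlet from distorting the layout of the whole portal page.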

2.2.2 Inclusion of Remote Content by Copy

If IFRAMEs cannot be used, either because the statements in section 2.2 do not all hold or because the developers simply don’t want to use frames, technical difficulties can occur. The following sections try to identify these technical difficulties and provide some advice on how to overcome them.

Extract Content from a HTML Page

HTML consists of a header and a body. The header contains valuable information:

• Title
• Metadata
• Cascading Style Sheet definitions
• References to remote CSS scripts
• JavaScript functions
• Links to external JavaScript

The body contains the layout and content of the page. As portlets produce content fragments, there is a need to extract information out of the page. It may be necessary to include parts of the header as well as parts of the body, or even the whole body. To be flexible enough to pick parts, a portlet needs an HTML parser. The only good HTML parsers available are web browsers and the W3C Tidy tool. W3C Tidy is an HTML beautifier which is able to produce HTML conformant to XML, so-called XHTML. XHTML is a newer revision of the HTML standard expressed in XML. To grab parts of the XHTML page, widely available XML parser technology can be used. Unfortunately some HTML deviates so far from the standards that even Tidy is unable to produce sensible XHTML. The only way to deal with such nonconforming sites is to fix Tidy or the remote site. It is advisable to use W3C Tidy: web browsers are huge software components and will require huge amounts of memory if many instances of them are needed to handle multiple concurrent connections to the portal.
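Once Tidy has produced well-formed XHTML, picking parts out of the page is ordinary XML processing. The following Python sketch uses only the standard library; it assumes the page is already valid XHTML in the XHTML namespace, and the idea of selecting the content by an element id is a hypothetical convention, not something every remote site provides.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialize XHTML without ns0: prefixes


def extract_fragment(xhtml: str, element_id: str) -> Tuple[str, Optional[str]]:
    """Return the page title and the serialized element with the given id."""
    root = ET.fromstring(xhtml)
    title = root.findtext(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title", default="")
    for element in root.iter():
        if element.get("id") == element_id:
            return title, ET.tostring(element, encoding="unicode")
    return title, None  # the page has no element with that id


PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    '<body><div id="nav">menu</div>'
    '<div id="content"><p>Hello</p></div></body></html>'
)
```

Calling `extract_fragment(PAGE, "content")` returns the title together with the content div only, so the remote navigation never reaches the portlet fragment. Pages that reference undefined entities or are not well-formed make the parser raise an error, which is where the Tidy step is needed.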

Tunnel Requests through the Portal

HTML pages contain hyperlinks. As stated above, only the portal is allowed to issue hyperlinks. That means every single hyperlink in the HTML must be rewritten by asking the event manager of the portal engine for its value. This seems simple, but it is not, due to the many ways in which Dynamic HTML (DHTML) can issue requests.
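For plain hyperlinks, the rewriting step can be sketched in Python as shown below. The sketch assumes the fragment is well-formed XHTML and that the portal hands the portlet some function that turns a remote URL into a portal URL; the stand-in `tunnel` function and its `target` parameter are hypothetical. Links triggered by JavaScript are not covered, which is exactly the difficulty discussed next.

```python
import xml.etree.ElementTree as ET
from typing import Callable
from urllib.parse import quote

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialize XHTML without ns0: prefixes


def rewrite_links(fragment: str, issue_url: Callable[[str], str]) -> str:
    """Replace every href on an <a> element by a portal-issued URL."""
    root = ET.fromstring(fragment)
    for anchor in root.iter(f"{{{XHTML_NS}}}a"):
        target = anchor.get("href")
        if target is not None:
            anchor.set("href", issue_url(target))
    return ET.tostring(root, encoding="unicode")


def tunnel(target: str) -> str:
    """Stand-in for the event manager: wrap the remote URL in a portal URL."""
    return "http://myportal.at/portal?target=" + quote(target, safe="")
```

After rewriting, a click on any link in the fragment goes to the portal first, which can then fetch the remote page on the user's behalf.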

JavaScript Links: Many pages contain hyperlinks activated by some JavaScript code, like in listing 2.2.

Listing 2.2: Heise Online Search Dropdown Box (the markup of the listing is not reproduced here; the drop-down box offers the entries Telepolis-Art., Software, c’t-Soft-Link, Treiber, Firmenkontakt, Jobs and "... im Internet")

The above example changes the browser location to a value selected from an HTML form drop-down box. Anyone wanting to tunnel that correctly through a portal would have to write a program that understands what the code above does prior to its execution. Not even web browsers can do that. My advice is not to output something like that in a portlet fragment. It is better to carefully analyze the remote site and split it into content and navigation parts. The navigation can be integrated into a portal navigation structure like a menu or the sitemap. The content can be displayed as a portlet fragment.

HTML Forms: A form is submitted to the destination specified by its action attribute. How the submission takes place is specified in the method and enctype attributes. Most HTML forms use method="POST" to transmit their data, because method="GET" places the data in the URL, whose length is limited in practice by browsers and servers. Unfortunately, most portal engines treat POST requests differently from GET requests. GET requests are transmitted as name-value pairs encoded in URL parameters and can be extracted easily. These parameters can be forwarded to the remote site by constructing a new URL. Portals may use their own request parameters, like portal-request-counter. These portal-specific request parameters must be stripped from the request sent to the remote site, as they may cause unexpected results. POST requests usually contain the form data in the request body and not in the URL, but exceptions are the norm. There are three cases of form definitions which need different treatment:

1. Email forms (the example form definition is missing in the source text). The action URLs of these forms must not be encoded by the portal, because they are processed directly by the web browser, which sends an email upon submission.
2. Forms for which the browser encodes the form values into the request URL (the example form definition is missing in the source text). They must be treated like GET requests.
3. Forms with the attribute method="POST" and without the attribute enctype="application/x-www-form-urlencoded" are "regular" POST requests. They require the request body to be forwarded to the remote resource by issuing a new POST request. POST requests can contain huge amounts of data, which adds load to the server on which the portal is running.

HTML Header: The header of an HTML page may contain definitions and links to various resources. Header content can basically be divided into two types:

1. Content that should appear in the header of the portal page.
2. Links to non-HTML resources.

The ability to add metadata to the header of the portal page is a relatively uncommon feature in portal frameworks, but most portal frameworks allow the title of the portlet to be set by a method call. Basically, JavaScript and CSS definitions must be treated the same way as metadata, though JavaScript can be included in the page body, too. Even more problematic are links to JavaScript or CSS libraries, because it might be necessary for the portal to proxy them. More on how to deal with non-HTML content can be found in the section Non-XML/HTML Content. If it is necessary to include information such as JavaScript, CSS or metadata in the page header, the best way is to use a portal engine with a freely definable portal page generation mechanism. It is impossible to act flexibly if the base technology, the portal framework, does not support it (see footnote 1).

Cascading Style Sheets bear their name because they allow for inheritance of attributes from more than one stylesheet. This sounds helpful, but in combination with web portals it is not. Imagine a complex site being included in a portal. It contains stylesheet definitions which override the ones defined for the portal page. The portal page is likely to look distorted, which makes it necessary to handcraft a "universal" stylesheet that fits both the portal and the remote page. Even if there is no conflict in stylesheet definitions, some sites use the absolute positioning feature of CSS, which may result in an image being displayed outside of the portlet area. Either changing the remote site or handcrafting a custom stylesheet may be necessary.

1 Portlet technology based upon the Java™ Specification Request 168, Java™ Portlet Specification Version 1.0, is not flexible enough for this task.

In some cases sites use a complicated method of stylesheet inclusion where URL rewriting does not help, as Wikipedia does, see listing 2.3. Unless such cases are detected with a CSS parser or regular expressions, a handcrafted stylesheet will be necessary here, too.

Listing 2.3: Complicated Stylesheet Include
(The markup of this listing is missing in the source text; only the fragment /**/ survives.)

Listing 2.4 shows the standard way to import CSS.

Listing 2.4: Standard Stylesheet Include
(The markup of this listing is missing in the source text.)

JavaScript and on-load/on-unload Event Handlers: The body element supports the onload and onunload attributes. These attributes specify event handlers for the events that occur after the page is completely loaded and prior to the destruction of the page. Typically the values of these attributes need to be retained on the portal page. If more than one site is to be included, copying is not sufficient. The following example shows why. Notice that the onunload attribute of listing 2.5 lacks a semicolon at the end of the JavaScript function call.

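Listing 2.5: Remote Site Number 1
(The HTML markup of this listing, including the body element with its onload and onunload attributes, is missing in the source text. The surviving script content is:)

    function onLoad1()
    {
      ...
    }
    function onUnload1()
    {
      ...
    }
    ...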

Listing 2.6: Remote Site Number 2
(The HTML markup of this listing is missing in the source text. The surviving script content is:)

    function onLoad2()
    {
      ...
    }
    function onUnload2()
    {
      ...
    }
    ...

If the values of the onload and onunload attributes are simply concatenated, as in listing 2.7, the JavaScript code becomes invalid, because JavaScript statements must be terminated either by a semicolon or by a line feed. The onUnload1() function call is terminated by neither.

Listing 2.7: Merged Events - Non Working
(The markup of this listing is missing in the source text.)

If the JavaScript statements are on separate lines the semicolon is optional. Listing 2.8 will execute both the onUnload1() and the onUnload2() functions because it presents completely valid JavaScript code.

Listing 2.8: Merged Events - Working
(The HTML markup of this listing is missing in the source text. The surviving script content is:)

    function portal_onLoad()
    {
      onLoad1();
      onLoad2();
    }
    function portal_onUnload()
    {
      onUnload1()
      onUnload2();
    }
    ...

A prefix of portal_ has been added to the onLoad and onUnload function names to reduce the risk of name conflicts for the function names.
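On the portal side, the merge can also be made robust by normalizing each handler value before joining. The helper below is a minimal sketch with an invented name (merge_handlers); it assumes that each attribute value is a simple statement list and does not end in a JavaScript line comment.

```python
def merge_handlers(*snippets):
    """Join JavaScript event handler attribute values into one value.

    Every non-empty snippet is stripped, and a missing trailing semicolon
    is added, so that statements from different sites stay separated.
    """
    statements = []
    for snippet in snippets:
        code = (snippet or "").strip()
        if not code:
            continue
        if not code.endswith(";"):
            code += ";"
        statements.append(code)
    return " ".join(statements)


# The onunload values of the two remote sites, the first without semicolon.
merged_unload = merge_handlers("onUnload1()", "onUnload2();")
```

The merged value can then be placed in the onunload attribute of the portal page or in the body of a portal_onUnload() function.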


Non-XML/HTML Content: Images, Flash animations, Java™ applets and linked JavaScript libraries or CSS definitions need to be made available to the client. If these files are available on public sites, there is no need to tunnel them through the portal, and the links to these resources can be left as they are. In corporations, portals are often used to aggregate different intranet web applications, which are often protected by firewalls. Most of the time only the portal server will be allowed to pass through the firewall, as a protection against intruders and security exploits. In such a case all links to non-XML/HTML content must be rewritten to point to a URL where a proxy service running on the portal server provides this content. The implementation of this proxy service determines its security.

2.2.3 Legal Considerations

The invocation of a link always results in multiple digital copies of the remote content. One or more HTTP proxy caches, the hard disk cache of the web browser and the RAM cache of the browser may be involved in storing them. Directive 2001/29/EC of the European Parliament and of the Council allows temporary storage of digital copies if they are an integral part of a technical process, have no economic significance of their own, and their sole purpose is

1. a transmission over a network to a client by an intermediary, or
2. to enable the lawful use of a work.

Basically this means that HTTP proxies and similar technology can be used legally. It does not cover framing or changing the display of a remote site, because such a technique is seen as editing in the sense of copyright law. This may require an agreement with the copyright owner. If the remote site does not belong to the owner of the portal and is displayed together with other content, it will almost always be necessary to obtain written permission from the creator of the work. If the remote content is not clearly identifiable as remote, or the portal provides similar content of the same domain and is used for business purposes, problems with competition law may arise. Providing remote information in a portal may be viewed as misleading or as free-riding on the work of others. The following situations have led to lawsuits within the EU:

• A part of a web site was included by using frames.
• A hyperlink was directed to a specific part of a remote web site and showed it as if it were part of the linking site.
• A hyperlink looked as if it led to the local site, but pointed to another site which was not mentioned in the link text.

2.3 The Unrivaled Flexibility of XML Technology

XML was introduced as a simpler form of the Standard Generalized Markup Language (SGML) to provide a common ground for data exchange. SGML was seen as too complex to fulfill this task and is common only in high-end publishing systems. To ease its adoption, XML markup is similar to HTML. Unlike HTML, XML has no predefined vocabulary of elements and attributes, which leaves this choice to the user. One of the most important features of XML, if not the most important, is namespace support. Namespaces make it possible to combine various XML languages in a single document without losing the ability to distinguish them. In recent years the development in the XML domain has been remarkable. Many languages based upon XML markup have been designed, and XML Schema and XML Query were developed. These newly standardized technologies complete the data processing abilities of XML. The following sections describe a variety of XML-based languages and technologies and how they fit together.

2.3.1 XPath, XPointer and XInclude

The XML Path Language (XPath) is a language for addressing parts of an XML document. Every XPath expression evaluates to one of four basic types:

• node-set (an unordered collection of nodes)
• boolean
• number
• string

An XPath location can be either a relative or an absolute location in an XML document. It can address element nodes, attribute nodes and node-sets thereof. The set of nodes matched by an XPath location can be restricted further by specifying additional requirements for a match, using comparison operators, functions or predefined variables. XPath supports equality operators and helper functions operating on the four basic types, for instance substring extraction, summation of the values in a node-set, or counting the nodes in a node-set. Examples can be found in the XPath language specification [10].

The XML Pointer Language (XPointer) [11] Framework is an extensible system for XML addressing by means of schemes. As of November 2004, XPointer consists of three schemes:

element() allows basic addressing of XML elements and does not support qualified names (names with namespace prefixes).

xmlns() is used to define XML namespaces for addressing qualified names in the xpointer() scheme.

xpointer() extends the XPath specification with points and ranges to allow the retrieval of XML parts. A point is a position between two XML nodes. A range is the content between two points. Several functions extend the XPath specification so that it can operate on points and ranges. Additionally, the resulting XML document is allowed to contain multiple root nodes. If qualified names are used in the xpointer() expression, their namespaces must be defined by xmlns() schemes.

XML Inclusions (XInclude) consist of a processing model and a syntax for general purpose inclusion. Inclusion is accomplished by merging a number of XML information sets into a single composite infoset [12]. The processing model supports XML includes and plain text includes of resources specified by URI references. The XPointer framework is used to address XML resources. XInclude defines two XML elements, an include and a fallback element. Both must be in the XInclude XML namespace to be usable in any XML document. The include element has the following attributes:

href contains the URI location of the resource to be included. If it is not specified, the current document is referenced.

parse may be text or xml to indicate how the resource specified by href should be parsed. If omitted, the value xml is implied.

xpointer contains the XPointer expression that specifies which part of the resource to include. If parse equals xml, either xpointer or href must be specified.

encoding specifies the document encoding for text includes, because it may be impossible to detect. It has no effect for resources included as XML.

accept mirrors the value of the Accept header of the HTTP protocol to aid content negotiation.

accept-language mirrors the value of the Accept-Language header of the HTTP protocol.

The fallback element appears as a child of the include element and may contain any XML elements, with the exception that include is the only element allowed from the XInclude namespace. The fallback element is used to return XML that indicates an error.
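The href and parse attributes can be tried out with the small XInclude processor in the Python standard library. The sketch below rests on several assumptions: the document names and contents are invented, the resources are served from an in-memory dictionary instead of real URIs, and xml.etree.ElementInclude recognizes the namespace http://www.w3.org/2001/XInclude. As far as I know this module does not evaluate the xpointer attribute and does not process fallback elements, so only whole-resource includes are shown.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical resources, keyed by the href value used in the host document.
DOCUMENTS = {
    "people.xml": "<people><person>Yi Fang</person></people>",
    "note.txt": "plain text note",
}


def loader(href, parse, encoding=None):
    """Resolve an include: parsed XML for parse='xml', a string for 'text'."""
    data = DOCUMENTS[href]
    if parse == "xml":
        return ET.fromstring(data)
    return data


HOST = """<document xmlns:xi="http://www.w3.org/2001/XInclude">
  <section><xi:include href="people.xml"/></section>
  <remark><xi:include href="note.txt" parse="text"/></remark>
</document>"""

root = ET.fromstring(HOST)
# Replaces every xi:include element in place with the loaded resource.
ElementInclude.include(root, loader=loader)
included_person = root.find("section/people/person")
```

After processing, the include elements are gone: the XML resource has become a sub-tree of section, and the text resource has become character data of remark.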

2.3.2 Extensible Stylesheet Language for Transformations and Formatting Objects

The Extensible Stylesheet Language for Transformations (XSLT) is a language in XML markup designed to transform an XML document into another XML or plain text document. XSLT is used to transform XML documents of various dialects into an XML document with a structure and vocabulary that some software can process, for instance XHTML, Wireless Markup Language (WML), Scalable Vector Graphics (SVG) or Really Simple Syndication (RSS). XSLT works by applying templates to XML nodes matching XPath patterns. This results in a transformation of an input XML tree into an output XML tree.

XSLT shines in transformations between XML dialects but falls short in producing print document formats like PostScript or the Portable Document Format (PDF). The main problems with print formats are their binary nature and pagination. In XML documents binary data can only be expressed with numeric character references, which are difficult to write if a sequence of them is needed. To be able to split the text flow into pages, the stylesheet processor needs to know how to paginate, render and express the output in the desired destination format. Due to these needs it is logical to define a page description language. This language is called Formatting Objects, FO for short. The Formatting Objects page description language divides documents into pages and blocks, which can be arranged side by side or one after another. XSL-FO supports regions, margins, areas, width, height, sequence and numbering of pages. Blocks and page contents can be filled with markup for borders, spacing, columns, paragraphs, lists, tables, lines, pictures and formatted text.

The generation of the output document is handled by a FO processor. Every document format needs a specialized FO processor that can interpret the FO language and knows how to produce the specific output format.

As illustrated in Figure 2.2 the FO processor is usually used after an XSL transformation to produce a printable document.

Figure 2.2: XSL Processing Chain

2.3.3 XML Schema

XML Schema is the successor of Document Type Definitions (DTDs) and makes it possible to define the structure and data types of an XML document in a formal manner. It fulfills the following duties for XML [13]:

Validation: XML Schema can validate the structure and data of XML documents. Validation works as a firewall against invalid documents. Data processing applications can skip their own document validation because the parser has already validated the document.

Data Description Language (DDL): In the domain of relational databases the structure of a database is determined by SQL CREATE, ALTER or DROP statements. XML documents with a schema definition already have a defined structure and defined data types. XML Schema therefore works implicitly as a data description language. It can also be used to define equivalents of the RDBMS normal forms for XML databases [14, 15]. An XML Schema definition may help XML databases to optimize their on-disk storage structures, because the layout is known prior to data import.

Data Binding: The idea of data binding is to import the data of XML documents directly into the data structures of applications without having to work with the DOM or SAX application programming interfaces (APIs). This has the advantage of a reduced error probability. Two kinds of data binding exist, runtime data binding and design phase data binding. Runtime data binding tools analyze the structure of documents and applications and allow a mapping between them to be defined. Design phase data binding tools work with a data model that is formally defined in an XML Schema.

Guided Editing Applications: As XML Schema defines the structure and data types of an XML document, it is possible to write a general XML editor that helps in producing valid documents. Another class of guided editing applications provides specialized editing functionality for a specific schema.

2.3.4 XUpdate

XUpdate [16] is a language for updating data represented in XML. It is an XML markup language that addresses XML data by means of XPath. The vocabulary is defined in the namespace http://www.xmldb.org/xupdate. It supports

• inserts of
  – elements
  – attributes
  – text
  – processing instructions
  – comments
• appending of a node as a child
• updating of the content of an existing node
• removing of a node
• renaming of existing elements and attributes

XUpdate does not support transactions, locking or conditional statements, which vastly limits its usage. Currently no other specified language exists that provides XML updates. XML Query Update Extensions may address that in the future.
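The operation types listed above can be illustrated on an in-memory tree. The following sketch is not XUpdate: the Python standard library contains no XUpdate processor, so the calls below only mirror what update, append, rename and remove do to a document. The sample data and element names are invented, and the paths use the XPath subset understood by xml.etree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names are made up for this sketch.
doc = ET.fromstring(
    "<addresses>"
    "<address id='1'><town>Graz</town></address>"
    "<address id='2'><town>Wien</town></address>"
    "<address id='3'><town>Linz</town></address>"
    "</addresses>"
)

# update: replace the content of an existing node
doc.find("address[@id='1']/town").text = "Leoben"

# append: add a node as the last child of the selected node
country = ET.SubElement(doc.find("address[@id='1']"), "country")
country.text = "at"

# rename: change the name of an existing element
doc.find("address[@id='2']/town").tag = "city"

# remove: ElementTree keeps no parent pointers, so removal goes via the parent
doc.remove(doc.find("address[@id='3']"))

result = ET.tostring(doc, encoding="unicode")
```

In XUpdate each of these steps would be an element of the XUpdate namespace whose select attribute carries the XPath expression.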

2.3.5 XQuery

The mission of the XML Query project is to provide flexible query facilities to extract data from real and virtual documents on the World Wide Web, therefore finally providing the needed interaction between the Web world and the database world [17]. At the time of writing, the XQuery 1.0 specification is not yet final but is a usable working draft. XQuery is an extension of XPath version 2.0 and does not operate on the syntax of an XML document but on its abstract, logical structure, known as the XQuery 1.0 and XPath 2.0 data model. The XQuery language does not use XML markup but has a syntactic grammar of its own. Its special feature is the FLWOR expression (pronounced "flower"). FLWOR is short for FOR-LET-WHERE-ORDER BY-RETURN and works similarly to SELECT-FROM-WHERE-ORDER BY statements in SQL. FLWOR expressions are used to combine and restructure XML data. Because of that, some people call XQuery another version of XSLT. This comparison falls short, because XQuery operates on a data model and not on XML input, which enables its use as a database query language.

A FLWOR expression binds variables to values in for and let clauses. Such a binding of a variable to a value is called a tuple. The for clauses produce a stream of tuples. A let clause can store this tuple stream in a variable, which can be used in later where, order by and return clauses. Consider the following rather complicated example from XQuery Use Cases Q12 [18]. This query tries to find pairs of books that have different titles but the same set of authors, possibly in a different order.

Listing 2.9: bstore1.example.com/bib.xml example data
(The XML markup of this listing is missing in the source text. The surviving text content of the four book entries is:)

    TCP/IP Illustrated; Stevens W.; Addison-Wesley; 65.95
    Advanced Programming in the Unix environment; Stevens W.; Addison-Wesley; 65.95
    Data on the Web; Abiteboul Serge; Buneman Peter; Suciu Dan; Morgan Kaufmann Publishers; 39.95
    The Economics of Technology and Content for Digital TV; Gerbarg Darcy; CITI; Kluwer Academic Publishers; 129.95

In listing 2.10, lines 3 and 4 produce streams of tuples containing references to book entries, bound to the variables book1 and book2. In the let clauses at lines 5 and 8 the variables aut1 and aut2 store the authors of their respective book entries, sorted by their last and then by their first name in document order (see footnote 2). The where clause at line 11 selects all books in book1 that precede book2 in document order, do not have the same title, but have deeply equal lists of authors. For the deep-equal(A, B) function to return true, the sub-trees A and B must consist of equal values. The return clause on line 14 produces book-pair entries containing the titles of the books with the same list of authors, see listing 2.11.

2 The document order is the order in which nodes appear in the XML serialization of a document. The document order does not change during the processing of a given query and is thus called stable.

Listing 2.10: An XQuery FLWOR example
(Line 1 and everything after the beginning of line 11 are missing in the source text.)

     1
     2  {
     3  for $book1 in doc("http://bstore1.example.com/bib.xml")//book,
     4      $book2 in doc("http://bstore1.example.com/bib.xml")//book
     5  let $aut1 := for $a in $book1/author
     6               order by $a/last, $a/first
     7               return $a
     8  let $aut2 := for $a in $book2/author
     9               order by $a/last, $a/first
    10               return $a
    11  where $book1 ...
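The logic of the query as described in the text (two for clauses, two sorted author lists, a where condition on document order, title and authors) can be restated in Python for readers unfamiliar with XQuery. This is an analogue, not XQuery, and it makes assumptions: the root element name bib and the inline sample data are chosen for the sketch, only the element names book, title, author, last and first are taken from the query, and deep-equal is approximated by comparing (last, first) name tuples.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Inline sample data using the titles and author names of listing 2.9.
BIB = """<bib>
  <book><title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author></book>
  <book><title>Advanced Programming in the Unix environment</title>
    <author><last>Stevens</last><first>W.</first></author></book>
  <book><title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author></book>
</bib>"""


def sorted_authors(book):
    """Mirror the inner FLWOR: authors ordered by last, then first name."""
    return sorted(
        (a.findtext("last"), a.findtext("first")) for a in book.findall("author")
    )


def book_pairs(bib_xml):
    """Return title pairs of books with different titles but equal authors."""
    books = ET.fromstring(bib_xml).findall(".//book")  # document order
    pairs = []
    # combinations() yields each pair once, with book1 preceding book2.
    for book1, book2 in combinations(books, 2):
        different_title = book1.findtext("title") != book2.findtext("title")
        if different_title and sorted_authors(book1) == sorted_authors(book2):
            pairs.append((book1.findtext("title"), book2.findtext("title")))
    return pairs


pairs = book_pairs(BIB)
```

With the sample data, the two books by W. Stevens form the only pair.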

(The beginning of this shell listing is missing in the source text. It resumes at the end of a command that is started in the background.)

        ... \
        listing.fo &

    cat document.xml |                                    # Step 1
      xsltproc --xinclude document2intermediate.xsl - |   # Steps 2 and 3
      tee -a intermediate.xml |                           # Step 4
      xsltproc intermediate2html.xsl - > listing.html     # Steps 5' and 6'

    # Step 6''
    xmlto pdf listing.fo

    # cleanup: remove the FIFO and temp file
    rm -f listing.fo intermediate.xml

2.4.3 Steps 1 and 2: Source Documents

Listing 2.13 is the origin of the publishing channels. It contains a title, a section and two paragraphs. The include element in line 11 is important: it tells an XInclude transformer to include all PP elements and their children from the file people.xml. The PP elements are identified through the spp prefix, which specifies the XML namespace urn:liberty:id-sis-pp:2003-08 used by personal profiles of the Liberty Alliance Project (see footnote 8). Any XInclude transformer can identify XInclude elements by the namespace http://www.w3.org/2003/XInclude, which is bound to the element prefix xi here.

Listing 2.13: document.xml
(The XML markup of this listing is missing in the source text. The only surviving text content is: "This document contains a sample list of J.UCS paper authors.")

8 http://projectliberty.org

The file people.xml, listing 2.14, contains an example list of Journal of Universal Computer Science (J.UCS) paper authors. A listing of them should be provided in both HTML and PDF. All authors are in the XML namespace of the personal profile definition of the Liberty Alliance Project. This namespace must match the namespace used in the xpointer attribute of the include element, see listing 2.13.

Listing 2.14: people.xml
(The XML markup of this listing is missing in the source text. The surviving text content of the three author entries is:)

    Álvaro Reis Figueira
    Álvaro; Figueira; Reis
    urn:liberty:id-sis-pp:addrType:work
    DCC-FC & LIACC. Universidade do Porto; pt
    urn:liberty:id-sis-pp:msgType:work
    urn:liberty:id-sis-pp:msgMethod:email
    urn:liberty:id-sis-pp:msgTechnology:email

    Yi Fang
    Yi Fang
    urn:liberty:id-sis-pp:addrType:work
    New York University; ny; us

    Anthony MacDonald
    Anthony MacDonald
    urn:liberty:id-sis-pp:addrType:work
    Department of Computer Science and Electrical Engineering and Software Verification Research Centre, The University of Queensland; 4072; Brisbane; Queensland; au
    urn:liberty:id-sis-pp:msgType:work
    urn:liberty:id-sis-pp:msgMethod:email
    urn:liberty:id-sis-pp:msgTechnology:email
    [email protected]

2.4.4 Steps 3 and 4: Intermediate Data Generation and Distribution

The code in listing 2.15 converts the elements from the personal profile namespace of the Liberty Alliance Project to a simpler format used for further processing. Elements in the XSL namespace are processed by any XSLT transformer. Four templates are defined to match different elements:

/: This is the starting point. It matches the root element of document.xml and calls all other templates.

spp:PP: Matches all PP elements and outputs person, name, address and contact elements. Calls the template matching spp:Address elements.

spp:Address: Builds the address string, separated by commas.

*: Matches all elements not matched by any other template and copies them to the output.

The output of the XSLT transformer can be seen in listing 2.16.

Listing 2.15: document2intermediate.xsl
(The XSLT markup of this listing is missing in the source text; only the literal comma separators survive.)

The intermediate format (listing 2.16) is much easier for humans to interpret than the raw data of the Liberty Alliance Project. It also makes writing further stylesheets much easier. Notice that all elements are in the same namespace.

Listing 2.16: Contents of Processing Step intermediate.xml
(The XML markup of this listing is missing in the source text. The surviving text content is:)

    This document contains a sample list of J.UCS paper authors.

    Álvaro Reis Figueira
    DCC-FC & LIACC. Universidade do Porto, Porto, pt
    [email protected]

    Yi Fang
    New York University, ny, us

    Anthony MacDonald
    Department of Computer Science and Electrical Engineering and Software Verification Research Centre, The University of Queensland, 4072 Brisbane, Queensland, au
    [email protected]

As illustrated in Figure 2.4 the output of the XSLT transformer is passed to the program tee (Step 4) and distributed to the named pipe intermediate.xml and to stdout for further processing.

2.4.5 Steps 5' and 6': Generate HTML

To generate HTML from intermediate.xml only an XSL transformation is required. XSLT processors can generally output HTML and thus work as XSLT transformer and HTML serializer at the same time. This behavior is achieved by setting the output method to html, see listing 2.17, line 6. Listing 2.17 contains four templates to generate the output:

/: This is the starting point. It matches the root element of intermediate.xml. It sets the title, generates the HTML skeleton and calls all other templates.

section: Generates a heading and a paragraph and calls the paragraph templates.

paragraph: Applies all templates and encloses their output in an HTML paragraph.

person: Writes the name in bold font, generates a hyperlink for the email address and prints the address in italic.

Listing 2.17: intermediate2html.xsl
(The XSLT markup of this listing is missing in the source text; only the literal text mailto: survives.)

Listing 2.18 (listing.html) contains the output after the XSLT transformation and serialization. A visual representation in the form of a browser window can be seen in figure 2.5.

Listing 2.18: listing.html
(The HTML markup of this listing is missing in the source text. The surviving text content is:)

    J.UCS Authors
    A excerpt of J.UCS authors
    This document contains a sample list of J.UCS paper authors.

    Álvaro Reis Figueira [email protected].pt
    DCC-FC & LIACC. Universidade do Porto, Porto, pt

    Yi Fang
    New York University, ny, us

    Anthony MacDonald anti@csee.uq.edu.au
    Department of Computer Science and Electrical Engineering and Software Verification Research Centre, The University of Queensland, 4072 Brisbane, Queensland, au

Figure 2.5: Author Listing in Browser Window

2.4.6 Steps 5'' and 6'': Generate PDF

In order to obtain PDF from XML, the content of intermediate.xml has to be transformed to XSL-FO by an XSLT transformer. All output elements must be in the XSL-FO namespace, otherwise the XSL-FO serializer will not find the elements designated for it. Listing 2.19 contains four templates to generate formatting objects output. The templates match the same elements as in intermediate2html.xsl, listing 2.17:

/: This is the starting point. It matches the root element of intermediate.xml. It generates a default layout master and a page sequence. All other templates are called as subsequent steps.

section: Generates a block in Arial, 20pt, to print the heading and a block in Times, 12pt, for the author listing. Calls the paragraph templates.

paragraph: Applies all templates and encloses their output in a block. Below the block, a vertical space of 12pt is inserted.

person: Generates a block indented by 1cm with a padding of 6pt each above and below. Prints the name in a bold font, the email address in blue underlined text and the postal address in italic.

The output of the XSLT transformer can be seen in listing 2.20.

Listing 2.19: intermediate2fo.xsl
(The XSLT markup of this listing is missing in the source text; only the literal text email: survives.)

To obtain a PDF from the code in listing 2.20 (listing.fo) a PDF serializer needs to be used. This example calls xmlto on listing.fo to generate a PDF file. The output of this serialization process can be seen in figure 2.6.


Listing 2.20: listing.fo
(The XSL-FO markup of this listing is missing in the source text. The surviving text content is:)

    A excerpt of J.UCS authors
    This document contains a sample list of J.UCS paper authors.

    Álvaro Reis Figueira
    email: [email protected]
    DCC-FC & LIACC. Universidade do Porto, Porto, pt

    Yi Fang
    New York University, ny, us

    Anthony MacDonald
    email: [email protected]
    Department of Computer Science and Electrical Engineering and Software Verification Research Centre, The University of Queensland, 4072 Brisbane, Queensland, au

Figure 2.6: Author Listing as PDF

As soon as a generator exists as a common starting point, any desired output format for which a serializer exists can be generated by using an intermediate XML markup language and XML transformers. The XInclude extension makes it possible to take only parts of an XML page, an ability of particular interest for portals. Picking only parts of a page may bypass the hardest problems that occur when trying to include XHTML from a remote site. If XML pipeline processing is combined with portal technology, the streams that include content become visible and can easily be restricted by access control lists.

The above example showed how to design the publishing channels avoiding dual processing steps. Using UNIX pipes for multi-channel publishing is nevertheless suboptimal because every XML processing step requires parsing of the XML input. It would be better if parsing the stdin XML stream and writing XML text to stdout could be avoided by using some internal XML representation for the stream, instead of bytes. This is something that some XML frameworks are able to do. One of these frameworks is Apache Cocoon which is discussed in the next chapter. If XML databases are used for XML document storage then XQuery can function as a language to drive a generator program. This could save step 2 and step 3 of figure 2.4. XML databases usually provide a caching mechanism to speed up queries, thus increasing the throughput of the publishing channels. Web sites have a much higher number of accesses than of updates.
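The idea of passing an internal XML representation between stages instead of bytes can be sketched with SAX events. The example below is only an illustration in Python's xml.sax module and is not Cocoon code: the two filter classes, the element names and the function run_pipeline are invented. The input is parsed exactly once, each transformer stage receives and forwards events, and text is written only by the serializer at the end.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameFilter(XMLFilterBase):
    """Transformer stage: renames one element and forwards all other events."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self._old, self._new = old, new

    def startElement(self, name, attrs):
        super().startElement(self._new if name == self._old else name, attrs)

    def endElement(self, name):
        super().endElement(self._new if name == self._old else name)


class UppercaseFilter(XMLFilterBase):
    """Transformer stage: converts character data to upper case."""

    def characters(self, content):
        super().characters(content.upper())


def run_pipeline(xml_text):
    """Generator -> two transformers -> serializer, without re-parsing."""
    out = io.StringIO()
    reader = xml.sax.make_parser()                     # generator
    stage1 = RenameFilter(reader, "paragraph", "p")    # transformer 1
    stage2 = UppercaseFilter(stage1)                   # transformer 2
    stage2.setContentHandler(XMLGenerator(out, encoding="utf-8"))  # serializer
    stage2.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


output = run_pipeline("<doc><paragraph>hello</paragraph></doc>")
```

In the UNIX pipe version each of the two transformer stages would have to parse its stdin and serialize to stdout again.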


Chapter 3

Overview of XML Software

XML technology is relatively new. Software based on XML is not yet as sophisticated as software originating from older fields of computing such as the relational database world. There is still a lot of research and experimentation going on into how to achieve the following in the XML software domain while sustaining the flexibility of XML itself:

• simplicity of use
• software design patterns
• division of labor
• scalability
• good performance
• transaction security

This chapter provides an overview of the XML web development framework Apache Cocoon and its access to the open-source XML database eXist. Both projects are under heavy development, even at their software cores.


Cocoon 1.0 started as a Java™ Servlet that was able to transform an XML file with an XSLT stylesheet. Through active development by the Cocoon open-source community, the Cocoon 1.x series turned into an XML-based publishing system. During the Cocoon 1.x era performance problems became evident, and the framework was redesigned around pipelined Simple API for XML (SAX) processing and caching. The 2.x series of Cocoon is a sophisticated component framework based on Model-View-Controller (MVC). Many components have been developed that ease the following tasks:

• Multi-channel publishing
• Content aggregation
• Centralized application flow control
• Handling and validation of forms
• Internationalization
• Building portals

Before the Cocoon concepts and pipelines are examined, a short introduction to the Cocoon core is given.

3.1 Cocoon Core

Cocoon 2.1.x uses the Apache Avalon Excalibur Component Manager (ECM) as its core component architecture. Avalon components are configured in a file called cocoon.xconf. A description of this file is beyond the scope of this master thesis.


Basically, the Cocoon core provides four features of interest to a web application developer:

1. A filesystem abstraction called source. The capabilities of a source are determined by which of the following Java™ interfaces it implements:

   Source: A read-only source that does not have collections. A collection is a directory abstraction.
   TraversableSource: A read-only source that can be traversed like a directory.
   ModifiableSource: A writeable source that does not have collections.
   ModifiableTraversableSource: A writeable source that can be traversed like a directory. It allows collections to be created.
   MoveableSource: A source that can move collection entries.
   InspectableSource: A source that has annotated metadata (properties).
   LockableSource: A source supporting a locking mechanism.
   VersionableSource: A source supporting version control.

   Sources are accessed in Cocoon through a source resolving mechanism that works on a URI basis. For instance, there is a source for filesystem access and a source that handles Web URLs through the java.net.URLConnection class.

2. Flexible logging functionality for all Cocoon components through the Apache Logkit. Logging can be configured in a file called logkit.xconf.

3. A transient and a persistent Java™ object cache store, used to cache results that have already been processed.

4. Input/output modules to read and write internal data structures such as session or request objects without the need to write Java™ code.

In Web development, programmers and graphic designers often have to interfere with each other’s work. This can reduce the productivity of both. Every team should work in its own domain to keep productivity at a high level.

3.2 Concepts and Pipelines

Cocoon encourages Separation of Concerns (SoC). The basic idea of SoC is to have different teams working in their own areas without having to deal with other teams and their work. In Cocoon, SoC forms a pyramid of contracts, shown in figure 3.1, where the connecting lines denote the existing contracts.

Figure 3.1: The Cocoon Pyramid of Contracts, source http://cocoon.apache.org/2.1/introduction.html

The innovation of Cocoon is the missing contract between Logic and Style. This makes it easier for the Management to control the web application and for the different groups Logic, Content and Style to accomplish their task.


In Cocoon the Pipes & Filters design pattern is used extensively. The page content flows through pipelines composed of Cocoon components. This page flow is controlled by the Flow Controller, illustrated in figure 3.2.

Figure 3.2: A Typical Cocoon Application

As the section Multi-Channel Publishing in chapter 2 has demonstrated, multi-channel publishing is easy with a design in the style of Pipes & Filters. Unlike UNIX filter programs, Cocoon components do not communicate through a stream of bytes but through a stream of SAX events. This spares every component the XML parsing and serialization step and thus speeds up XML processing. The development of Cocoon 1.x showed that SAX events carry less overhead than a component architecture operating on DOM trees.
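The difference between a byte-stream pipe and an event-stream pipeline can be illustrated with a small sketch. The following JavaScript is not Cocoon code; it only mimics the idea that each component hands already parsed events (start element, text, end element) directly to the next component, so that markup text is produced only once, at the end. All names (generate, upperCaseText, serialize) are invented for this illustration.

```javascript
// Illustrative sketch only: pipeline stages exchange parsed events
// instead of markup text. None of these names are Cocoon APIs.

// Generator: produces a stream of SAX-like events.
function* generate(authors) {
  yield { type: "start", name: "authors" };
  for (const author of authors) {
    yield { type: "start", name: "author" };
    yield { type: "text", value: author };
    yield { type: "end", name: "author" };
  }
  yield { type: "end", name: "authors" };
}

// Transformer: consumes events and emits (possibly modified) events.
function* upperCaseText(events) {
  for (const event of events) {
    yield event.type === "text"
      ? { type: "text", value: event.value.toUpperCase() }
      : event;
  }
}

// Serializer: the only stage that turns events into markup text
// (character escaping is omitted for brevity).
function serialize(events) {
  let out = "";
  for (const event of events) {
    if (event.type === "start") out += "<" + event.name + ">";
    else if (event.type === "end") out += "</" + event.name + ">";
    else out += event.value;
  }
  return out;
}

const markup = serialize(upperCaseText(generate(["Ada"])));
console.log(markup);
```

No stage between the generator and the serializer ever sees or parses markup text, which is the saving the paragraph above attributes to SAX streams.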

3.2.1 Pipelines and Views

Cocoon pipelines are configured in so-called sitemaps, in a file called sitemap.xmap. A sitemap contains the registration of the available components and the pipelines that map URIs to component arrangements. Sitemaps can be cascaded by mounting sub-sitemaps. A Cocoon pipeline is itself just a Cocoon component. Different pipeline implementations are available:

non-caching: A pipeline implementation that does no caching at all.
caching: Caches the entire pipeline according to the longest cacheable path (usually the pipeline endpoint).
caching-point: Caches the entire pipeline according to the longest cacheable path and, in addition, the parts of a pipeline that are used by multiple cocoon-views. Step 4 in figure 2.4 would make up such a caching point for a cocoon-view for HTML and a cocoon-view for PDF.

Pipelines can be accessed from within a sitemap via the cocoon source protocol. Two addressing schemes exist:

cocoon:/some-pipe queries the pipeline some-pipe in the current sitemap.
cocoon://path/some-pipe queries the pipeline some-pipe, resolving the sitemap through the path of mounted sub-sitemaps. The resolving process starts in the root sitemap.
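The two addressing schemes differ only in where the lookup starts. A minimal sketch of that distinction follows; the function resolveCocoonUri, its return shape and the sitemap path "/portal/" are hypothetical and do not correspond to any Cocoon API.

```javascript
// Illustrative sketch: where the lookup of a cocoon: URI starts.
// resolveCocoonUri and its return shape are hypothetical.
function resolveCocoonUri(uri, currentSitemap) {
  if (uri.startsWith("cocoon://")) {
    // Two slashes: resolution starts in the root sitemap.
    return { sitemap: "/", path: uri.slice("cocoon://".length) };
  }
  if (uri.startsWith("cocoon:/")) {
    // One slash: resolution stays in the current sitemap.
    return { sitemap: currentSitemap, path: uri.slice("cocoon:/".length) };
  }
  throw new Error("not a cocoon: URI: " + uri);
}

const fromCurrent = resolveCocoonUri("cocoon:/some-pipe", "/portal/");
const fromRoot = resolveCocoonUri("cocoon://path/some-pipe", "/portal/");
console.log(fromCurrent, fromRoot);
```

The check for the two-slash form has to come first, because every cocoon:// URI also starts with cocoon:/.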

Internal Pipelines are pipelines that are accessible only from within other Cocoon pipelines, not from outside. A pipeline is made internal by setting its internal-only attribute to true.

Resources are a special kind of pipeline that is invoked by a map:call element from within the sitemap. A resource can contain any part of a pipeline without restriction and exists to ease the maintenance of complex sitemaps. Its function in a sitemap is similar to that of a subroutine in a programming language.

Views are a technology orthogonal to pipelines. They allow an additional exit point to be defined in pipeline processing. Such an exit point is specified by defining a view and labelling a pipeline component with the label given in the view definition. A view of a pipeline component’s output can be requested by specifying the cocoon-view=LABEL request parameter in the HTTP request URL. Views were originally added to offer a semantic search feature in Apache Cocoon. A view like "content" could be used to get the XML representation of a document without add-ons such as navigation buttons. Views are also useful for debugging pipelines.
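The idea of a labelled exit point can be sketched in a few lines. The pipeline below is a plain array of hypothetical step functions working on strings, not a Cocoon sitemap, and the sketch leaves out how the output of a view is finally serialized.

```javascript
// Illustrative sketch of a view as an additional pipeline exit point.
// The steps and the label "content" are hypothetical.
const pipeline = [
  { label: "content", run: () => "<doc>Hello</doc>" },       // generator
  { run: (xml) => xml.replace("</doc>", "<nav/></doc>") },   // adds navigation
  { run: (xml) => xml.replace(/doc/g, "html") }              // styling step
];

// Runs the steps in order. If a view label is requested, processing
// leaves the pipeline right after the component carrying that label.
function runPipeline(steps, viewLabel) {
  let data;
  for (const step of steps) {
    data = step.run(data);
    if (viewLabel !== undefined && step.label === viewLabel) break;
  }
  return data;
}

const fullPage = runPipeline(pipeline);
const contentView = runPipeline(pipeline, "content");
console.log(fullPage, contentView);
```

Requesting the "content" view yields the document before navigation and styling are applied, which is also what makes views handy for inspecting intermediate pipeline output.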

3.2.2 Pipeline Components

Only a few components are mandatory for a pipeline to function. Basically, a pipeline could consist of just a Generator and a Serializer component, or of a Reader component alone. Most of the time, however, a Transformer and a Matcher or Selector component are also necessary. Listing 3.1 contains a basic pipeline that generates HTML from XML files. The example uses a caching pipeline matching all *.html files. The corresponding XML source files are fed into a generator component, transformed by an XSLT transformer component and serialized as HTML by an HTML serializer.

Listing 3.1: A Basic Pipeline

Generator: A pipeline fragment always starts with a generator. Generators are components that output a stream of SAX events from various sources. To name a few:

FileGenerator reads XML files from a source and generates a SAX stream from them.
HTMLGenerator reads HTML resources, converts them to XHTML and generates an XHTML SAX stream.
MP3DirectoryGenerator provides a SAX stream XML representation of a directory listing of MP3 files with associated tag information.
JXTemplateGenerator generates a SAX stream from an XML file containing JXTemplate language statements.
XQueryGenerator generates a SAX stream from an XQuery on an XML database.
PortalGenerator generates the portal for the current user.
AsciiArtSVGGenerator generates an SVG SAX stream from simple plain text ASCII-art files.

Serializer: Serializers are the end of a pipeline fragment and output various formats converted from the input SAX stream. If a pipeline is used as an internal pipeline, this step is always omitted.

XMLSerializer serializes any XML SAX stream to an XML text stream.
HTMLSerializer outputs HTML from an XHTML SAX stream.
SVGSerializer generates JPEG or PNG images from an SVG SAX stream.
FOPSerializer generates Printer Control Language (PCL), Postscript (PS) or PDF from an XSL-FO SAX stream.
RTFSerializer generates Rich Text Format (RTF) from an XSL-FO SAX stream. RTF can be read by Microsoft Word.
HSSFSerializer generates a Microsoft Excel sheet from a Gnumeric XML formatted SAX stream.
ZIPSerializer generates a ZIP archive from a SAX stream containing the archive contents specification.

Reader: A reader is a component that reads non-XML content and outputs non-XML content. A reader is always both the start and the end point of a pipeline fragment. Readers never output a SAX stream but a byte stream and can thus never be plugged into a SAX pipeline. Readers are especially useful for images and other downloadable non-XML content. Some implementations of a reader are:

ResourceReader serves binary data from a Source. It uses HTTP headers to support HTTP caching.
ImageReader serves image content from a Source. It can generate thumbnails or greyscale images.
DatabaseReader reads a binary resource from a relational database.
AxisRPCReader accepts SOAP requests and generates SOAP responses from a Cocoon application.

Transformer: A transformer is a component that receives a SAX stream and outputs a SAX stream, which allows it to inspect and modify the stream as needed. A few prominent implementations are:

TraxTransformer performs XSLT transformations with XSL stylesheets.
XIncludeTransformer includes content specified by XInclude statements.
SQLTransformer performs SQL queries and fills their results into the output SAX stream.
JXTemplateTransformer executes JXTemplate language markup.
I18nTransformer internationalizes markup text through message lookup with keys. It supports the internationalization of date, time, number, percent and currency formatting.

Note: A pipeline can function without a transformer, but nearly all pipelines will require a transformation step.

Matcher: Matchers decide whether a pipeline fragment is executed during a request. Various implementations of matchers exist. A few examples are:

WildcardURIMatcher matches URIs based upon "*" and "**" wildcards.
RegexpURIMatcher matches URIs with Perl regular expressions.
CookieMatcher matches cookies against a given name.
HeaderMatcher matches a request header against a given name.
LocaleMatcher matches locales in a number of ways. The locale to use can be provided in the following ways:
• request parameter
• session attribute
• cookie
• sitemap parameter
• user agent (the default language of the browser)
• default locale set in matcher configuration

The above list is by no means complete. Matchers can be cascaded in a sitemap. For instance, it is possible to use a WildcardURIMatcher and a LocaleMatcher in cascade. In a pipeline, multiple matchers can be placed one after another. The pipeline will only process the pipeline fragment of the first matcher that matches.
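The "first matching matcher wins" rule and wildcard matching can be sketched as follows. This is not the WildcardURIMatcher implementation. The sketch assumes that "*" matches within one path segment while "**" may also cross "/" characters, and the fragment names are invented.

```javascript
// Illustrative sketch: wildcard URI matching where the first match wins.
// Assumption: "*" stays within one path segment, "**" also crosses "/".
function wildcardToRegExp(pattern) {
  const source = pattern
    .split("**")
    .map((part) =>
      part
        .split("*")
        // Escape regular expression metacharacters in the literal text.
        .map((literal) => literal.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
        .join("([^/]*)")
    )
    .join("(.*)");
  return new RegExp("^" + source + "$");
}

// Returns the first fragment whose pattern matches the URI, together
// with the text captured by the wildcards, or null if none matches.
function firstMatch(fragments, uri) {
  for (const fragment of fragments) {
    const hit = wildcardToRegExp(fragment.pattern).exec(uri);
    if (hit) return { name: fragment.name, groups: hit.slice(1) };
  }
  return null;
}

const fragments = [
  { pattern: "*.html", name: "html-page" },
  { pattern: "**.html", name: "nested-html-page" },
  { pattern: "**", name: "fallback" }
];

console.log(firstMatch(fragments, "index.html"));
console.log(firstMatch(fragments, "docs/index.html"));
```

Because evaluation stops at the first hit, the order of the fragments matters: the catch-all pattern has to come last.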

Selector: Selectors are components similar to matchers. While matchers either match or do not, selectors evaluate simple boolean expressions and execute a pipeline fragment if the expression evaluates to true. Selectors are similar to the XSL choose statement or to the if/else and switch statements of programming languages. Some selectors are:

BrowserSelector tests a browser pattern against the HTTP User-Agent header.
DateSelector selects depending on whether a configured time or date is before or after the current time or date.
HostSelector matches a string against the host parameter of the HTTP request. Useful for virtual hosting.
ResourceExistsSelector selects the first of a set of resources that exists on a source. Can be used to select resources (usually files) that appear under multiple names.

Action: An action is a component that makes some data available for further matching or selecting in the sitemap. An action may skip the processing of its nested elements. A sitemap is meant to contain definitions and no logic of any kind. Actions break this rule and are able to clutter the pipeline definitions in the sitemap to an almost unreadable extent. No examples of actions are given because the author objects to their usage. A better way to control a Cocoon application is Control Flow.

3.3 Control Flow

Finite State Machines (FSM) are the traditional way of describing page transitions in web applications. Most of the time they are used at a single point of access, a Java™ Servlet that dispatches requests. FSMs are best visualized by graphs in which an edge represents a request and a node the state after the request. These FSM graphs tend to become unreadable for complex applications.

"Cocoon has advanced control flow, the ability to describe the order of Web pages that have to be sent to the client, at any given point in time in an application."¹ (Ovidiu Predescu)

Most programming languages are rather good at modelling the inherent complexity of real life but lack support for describing parallel processing. Continuations [21] are an innovation of computer science theory. An implementation can be found in the Scheme programming language [22].

"Continuations represent the future of a computation at a particular point in program execution."²

In Cocoon a continuation is an object that contains a snapshot of a flow program. It comprises a snapshot of the call stack plus the local variables and the program counter. A continuation makes it possible to continue the execution of the flow program at the point where the snapshot was taken. If the program state were not stored in a continuation object, the whole thread would have to be kept in order to continue processing at the next request sent by the browser. Cocoon uses JavaScript as its primary page-flow language because the interpreted nature of JavaScript makes a page flow easier to adapt. When JavaScript is used as a flow language it is called FlowScript. The concept of continuations has been included into Rhino³; since then Rhino can be used without modification in Cocoon. FlowScript can create Java™ objects and call methods on them. Additionally, JavaFlow is available. JavaFlow uses a Java™ bytecode interpreter to run Java™ class files. It is currently under development and not as stable as FlowScript.

One of the objects available in FlowScript is the cocoon object. Among others, this object contains the two methods sendPageAndWait(String page_name, Object[] to_pass) and sendPage(String page_name, Object[] to_pass). The method sendPageAndWait sends a page to the client and stops processing until the next request. The method sendPage sends a page back to the client, but the script continues to execute. Listing 3.2 shows a simple FlowScript example found in the Cocoon user documentation⁴.

¹ http://cocoon.apache.org/2.1/userdocs/flow/index.html
² From the Scheme FAQ, "What are continuations?", http://www.schemers.org/Documents/FAQ/#id2515418
³ Rhino JavaScript interpreter, http://www.mozilla.org/rhino/
⁴ http://cocoon.apache.org/2.1/userdocs/flow/continuations.html

Listing 3.2: calculator.js, a FlowScript Calculator

 1  function calculator()
 2  {
 3    var a, b, operator;
 4
 5    cocoon.sendPageAndWait("getA.html");
 6    a = cocoon.request.get("a");
 7
 8    cocoon.sendPageAndWait("getB.html");
 9    b = cocoon.request.get("b");
10
11    cocoon.sendPageAndWait("getOperator.html");
12    operator = cocoon.request.get("op");
13
14    try {
15      if (operator == "plus")
16        cocoon.sendPage("result.html", {result: a + b});
17      else if (operator == "minus")
18        cocoon.sendPage("result.html", {result: a - b});
19      else if (operator == "multiply")
20        cocoon.sendPage("result.html", {result: a * b});
21      else if (operator == "divide")
22        cocoon.sendPage("result.html", {result: a / b});
23      else
24        cocoon.sendPage("invalidOperator.html", {operator: operator});
25    }
26    catch (exception) {
27      cocoon.sendPage("error.html", {message: "Operation failed: " + exception.toString()});
28    }
29  }

The page flow of listing 3.2 is as follows:

1. Send the HTML form getA.html to the client and stop processing the script (line 5).
2. The browser submits the form getA.html.
3. Store the value of the request parameter a in variable a (line 6). Send the form getB.html to the browser and stop processing (line 8).
4. The browser submits the form getB.html.
5. Store the value of the request parameter b in variable b (line 9). Send the form getOperator.html to the browser and stop processing (line 11).
6. The browser submits the form getOperator.html.
7. Store the value of the request parameter op in variable operator (line 12). Do the calculation according to the operator and pass the result to the Cocoon pipeline producing result.html (lines 16, 18, 20 and 22). If the operator is invalid, pass the invalid operator to the invalidOperator.html pipeline (line 24). If an exception occurs, catch it and display the pipeline error.html with the error message (line 27). The invocation of sendPage does not halt the script. The script finishes after sending one of the above pages.

The FlowScript objects passed to the pipelines in step 7 can be accessed by the JXTemplate generator, which fills their values into some XML markup. Screen shots of the page flow can be seen in figure 3.3.
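The suspend-and-resume behaviour of this page flow can be imitated in plain JavaScript with generator functions. The sketch below is an analogy, not Cocoon code: yield plays the role of sendPageAndWait, a Map stands in for the store of continuations, and the names calculatorFlow, handleRequest and continuations are invented. Request parameters are treated as strings and converted with Number, because + applied to two strings would concatenate them. A generator can only move forward, so resuming the same continuation twice is not covered.

```javascript
// Illustrative sketch: generator functions stand in for continuations.
// `yield` plays the role of sendPageAndWait; all names are hypothetical.
function* calculatorFlow() {
  // Each yield hands out a page name and suspends until the next request.
  const a = Number((yield "getA.html").a);
  const b = Number((yield "getB.html").b);
  const op = (yield "getOperator.html").op;
  if (op === "plus") return { page: "result.html", result: a + b };
  if (op === "minus") return { page: "result.html", result: a - b };
  return { page: "invalidOperator.html", operator: op };
}

const continuations = new Map(); // continuation ID -> suspended flow
let nextId = 0;

// Starts a new flow, or resumes the flow stored under the given ID.
function handleRequest(params, continuationId) {
  const flow = continuationId === undefined
    ? calculatorFlow()
    : continuations.get(continuationId);
  continuations.delete(continuationId);
  const step = flow.next(params);
  if (step.done) return step.value;       // flow has finished
  const id = String(++nextId);
  continuations.set(id, flow);            // flow is suspended
  return { page: step.value, continuationId: id };
}

const first = handleRequest({});
const second = handleRequest({ a: "2" }, first.continuationId);
const third = handleRequest({ b: "3" }, second.continuationId);
const last = handleRequest({ op: "plus" }, third.continuationId);
console.log(first.page, second.page, third.page, last);
```

Each response carries the ID under which the suspended flow was stored, and the following request has to present that ID to be routed back to the right point in the script.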

Figure 3.3: Screen Shots of the Calculator Example

The sitemap for calculator.js would look like listing 3.3. The sitemap needs to be told which script and language to use, and it needs two extra pipelines that are not necessary without FlowScript: a pipeline to invoke the FlowScript function and a pipeline to match continuations. A continuation ID is available through the cocoon object in the JXTemplate generator. The value of cocoon.continuation.id needs to be added to the action parameter of the form, i.e. the form submission endpoint. Otherwise the request would not be matched and processed by the continuation invocation pipeline.

Listing 3.3: sitemap.xmap, the Sitemap for calculator.js

In the following section the model, the view and the controller in a Cocoon application are explained.

3.3.1 MVC revisited

Originally Cocoon possessed no flow controller, just pipelines. In the light of pipelines the definition of MVC is easy:

Model is the source content, probably a file, or the output of a generator.
View is the serializer. It could output HTML or PDF, for instance.
Controller is the pipeline definition itself. It controls the result of the pipeline.

With the innovation of FlowScript an additional controller appears on top of the simple MVC pattern. Some Cocooners call this MVC+, meaning multiple MVC patterns: many micro-MVCs in the shape of pipeline definitions and a global MVC for the application flow. The top-level MVC pattern consists of:

Model is the business logic, comprised of Java™ objects like Java™Beans or EJBs.
View is the pipeline delivering content to the client.
Controller is the Cocoon flow controller steering the page flow to the client.

A developer should always be aware of these MVC patterns. The FlowScript calculator example, listing 3.2, violates the top-level MVC pattern by mixing model and controller. A FlowScript script should not perform business logic calculations itself; it should use business logic to direct the page flow. The business logic forming the model should reside in external components like Java™Beans, EJBs or Spring⁵ components. Cocoon possesses an XML form framework that relies heavily on the flow controller, templates and XSLT.

⁵ Spring is a lightweight enterprise component framework, http://www.springframework.org/
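How the calculator could respect the top-level MVC pattern can be sketched in plain JavaScript. In a real Cocoon application the model would be a Java™ component. Here a plain object named calculatorModel stands in for it, and choosePage represents the part of the flow script that only decides which page to send. Both names are invented for this illustration.

```javascript
// Illustrative sketch: business logic lives in a model object, the flow
// function only chooses the next page. All names are hypothetical.
const calculatorModel = {
  operations: {
    plus: (a, b) => a + b,
    minus: (a, b) => a - b,
    multiply: (a, b) => a * b,
    divide: (a, b) => a / b
  },
  supports(operator) {
    return Object.prototype.hasOwnProperty.call(this.operations, operator);
  },
  calculate(operator, a, b) {
    return this.operations[operator](a, b);
  }
};

// Controller: decides which page to send and delegates the arithmetic.
function choosePage(model, operator, a, b) {
  if (!model.supports(operator)) {
    return { page: "invalidOperator.html", data: { operator: operator } };
  }
  return {
    page: "result.html",
    data: { result: model.calculate(operator, a, b) }
  };
}

const ok = choosePage(calculatorModel, "multiply", 6, 7);
const bad = choosePage(calculatorModel, "modulo", 6, 7);
console.log(ok, bad);
```

With this split, adding an operation touches only the model, while the page flow stays unchanged.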

3.4 Cocoon Forms

The Cocoon form framework provides an advanced multi-channel form processing environment that supports data types and server-side form validation. To create a Cocoon form, two XML files are required:

Form Definition: The form definition defines the widgets of a form. It contains all information required to describe the widgets, except their presentation. For example, the fields of each widget include data type, label, hint, help and validation information. The form definition is used to generate a form instance containing the widget objects, which are used in subsequent requests to validate the submitted form automatically.
Form Template: The form template is an XML page that describes the layout of the form and the styling of the form widgets. The widgets are referenced by the ids given in the form definition file. An example of widget styling is forcing a text input widget to be displayed as a password input.

To use the form framework, a form publishing pipeline needs to be defined in the sitemap. A form publishing pipeline comprises at least:

1. A generator that provides the form template.
2. The FormsTemplateTransformer, which transforms the widget templates into an XML GUI representation of the widgets.
3. An XSL transformer, which renders the XML GUI widgets into something displayable on the client, usually XHTML. Cocoon 2.1.6 comes with two XSL stylesheets, one for simple HTML output and another for DHTML output.
4. A serializer, which streams the rendered output to the client.


The form publishing pipeline needs a FlowScript environment similar to the one in the calculator example, listing 3.3.

Figure 3.4: Sequence of First Form Request

Figure 3.4 shows the sequence of actions taken during the first request from the client. To control the processing of the form, FlowScript needs to be written. First a form instance has to be created from within FlowScript:

    var myform = new Form("PATH/TO/FORM/DEFINITION/FILE");

Next the form needs to be presented to the client. The form instance must be told to display itself using a form publishing pipeline:

    myform.showForm("NAME OF THE FORM PUBLISHING PIPELINE");

This creates a continuation object that is able to continue the flow control script later at its current position. After that the form is rendered according to the form template and the XSLT stylesheet.


Figure 3.5: Processing of Subsequent Form Submissions

After the form submission the continuation is used to proceed with the flow where it stopped, see figure 3.5. The widget instances automatically validate the submitted form. Next, FlowScript can be used to perform custom validations not covered by the form validation. If the submitted form is incomplete or invalid, it is redisplayed by the form publishing pipeline. If the form is valid, business logic can be invoked from FlowScript. Finally the continuation is invalidated and a page notifies the client that the form was processed successfully.
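The validate-and-redisplay cycle described above can be sketched independently of Cocoon. In the following JavaScript the widget definition format, the functions validateForm and processForm and the sample data are all invented. The array of submissions stands in for the successive requests sent by the browser.

```javascript
// Illustrative sketch of the show/validate/redisplay loop. The widget
// definition format and all names are hypothetical, not Cocoon APIs.
const formDefinition = {
  email: { required: true, validate: (v) => v.includes("@") },
  age: { required: false, validate: (v) => /^\d+$/.test(v) }
};

// Returns the ids of all widgets whose submitted value is not acceptable.
function validateForm(definition, submission) {
  const invalid = [];
  for (const [id, widget] of Object.entries(definition)) {
    const value = submission[id];
    const missing = value === undefined || value === "";
    if (missing ? widget.required : !widget.validate(value)) invalid.push(id);
  }
  return invalid;
}

// Redisplays the form until a submission validates.
function processForm(definition, submissions) {
  let displays = 0;
  for (const submission of submissions) {
    displays += 1;
    if (validateForm(definition, submission).length === 0) {
      return { valid: true, displays: displays, data: submission };
    }
  }
  return { valid: false, displays: displays };
}

const outcome = processForm(formDefinition, [
  { email: "", age: "30" },                 // required field missing
  { email: "user-at-example", age: "30" },  // fails the e-mail check
  { email: "user@example.org", age: "30" }  // accepted
]);
console.log(outcome);
```

Only after the loop has accepted a submission would business logic be invoked with the validated data.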

3.5 Portal Engine

Cocoon provides a very versatile portal engine which is rooted in the concepts of Cocoon, see section 3.2. Cocoon’s portal engine is a framework that allows any component to be replaced by a custom implementation if necessary. A developer can freely define the look of the portal page, the type of its content and the way links are handled. The look of the URLs issued by the portal engine’s event manager can also be changed. However, the portal engine is currently largely undocumented terrain, which makes it difficult to develop a portal application. As mentioned in section 2.1, a portal is comprised of portlets, which are called coplets in Cocoon.

3.5.1 Coplets

It wouldn’t be Cocoon if there was only a single implementation of a coplet. Every type of coplet supports the following attributes in its configuration:

buffer: The output of a coplet can be buffered before it is streamed to the client. If an exception occurs during processing and the output is not buffered, the whole portal view may be rendered invalid. Can be either true or false (default false).
timeout: The portal will wait the specified number of seconds for the coplet to answer. If not specified, the portal will wait indefinitely (no timeout is set).
sizeable: Defines whether a user is allowed to minimize or maximize the coplet view. May be either true or false (default true).
size: Defines the size of the initial display provided to the client. A value of 0 means minimized, a value of 1 means maximized (default 1).
mandatory: Determines whether a user may remove the coplet from his view. It may be desirable to prevent the removal of coplets such as navigation coplets. In this case the attribute must be set to true (default false).
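The defaults listed above can be summarized as a small data structure. The sketch models a coplet configuration as a plain JavaScript object. This is only a reading aid and not the XML format in which Cocoon stores coplet configurations, and the function copletConfig is invented.

```javascript
// Illustrative sketch: the common coplet attributes with the defaults
// listed above, modelled as a plain object. Not a Cocoon API.
const COPLET_DEFAULTS = {
  buffer: false,    // output is not buffered
  timeout: null,    // null: no timeout, wait indefinitely
  sizeable: true,   // user may minimize or maximize
  size: 1,          // 1 = maximized, 0 = minimized
  mandatory: false  // user may remove the coplet
};

// Merges explicit settings over the defaults without modifying either.
function copletConfig(settings) {
  return Object.assign({}, COPLET_DEFAULTS, settings);
}

// A navigation coplet that users must not remove.
const navigation = copletConfig({ mandatory: true, buffer: true, timeout: 10 });
console.log(navigation);
```

Attributes that are not set explicitly, such as sizeable and size in this example, keep their default values.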

It is possible to define custom attributes in several places which can be used for inter-coplet communication. These attributes can be retrieved from within the sitemap through XPath expressions. The following coplets exist in the Cocoon 2.1.6 release:

URICoplet: A URICoplet is the simplest coplet type. It allows ordinary pipelines to be accessed by specifying a Cocoon pipeline URI, hence its name. It has some additional attributes in its configuration:

uri: Specifies the Cocoon URI from which the stream is read. This entry is mandatory.
error-uri: In addition to the uri attribute, another URI may be specified, which is shown if an error occurs while retrieving the URI given by the uri attribute.
handleParameters: If this attribute is set to true and the uri attribute addresses an internal Cocoon pipeline, the HTTP request parameters are forwarded to the coplet. This may be necessary for HTML form handling (default false).

CachingURICoplet: It extends the URICoplet with the ability to cache the output of a pipeline until the coplet receives an event. For instance, an event is issued if a form is submitted by the client. This coplet is able to retain the pipeline content of dynamic applications during changes of the portal view. This functionality is required to show a Cocoon application in a coplet. The meaning of its special attributes is:

uri: The meaning of the uri attribute changes with this coplet. It does not point to a Cocoon pipeline that should be included, but to the Cocoon pipeline where the coplet instance resides.
temporary:application-uri: This URI points to the pipeline of a Cocoon application and is mandatory.

ApplicationCoplet: The ApplicationCoplet is an advanced form of the CachingURICoplet, meant to aid in the integration of external applications into the coplet. Its configuration attributes are:

uri: This URI points to a Cocoon proxy pipeline for the coplet and not to the URL of the external HTML application. The proxy pipeline works as a data pump for the ApplicationCoplet. It is required to fetch the remote content and rewrite the hyperlinks of the HTML pages.
link: This attribute is used to store the current link to the remote site. It is usually empty. To define the start URL of the remote site see below.

This coplet requires an additional mapping between external applications and coplet instances. This is accomplished by an application-coplet-binding.xml file, which contains the following attributes:

start-uri: Contains the starting URL of the remote site to include.
encoding: Defines the content encoding for the W3C-tidy step required in the proxy pipeline.
user-agent: Contains an HTTP User-Agent string which the proxy HTTP client uses to connect to the remote site. For example: Mozilla/5.0 (Linux; U; en_US, de_AT@euro, de, en_GB, en; rv:1.2.1) Gecko/20021130.

PortletCoplet: This coplet is for the inclusion of Java™ Specification Request 168 (JSR-168) portlets [23]. Remark: the name portlet originates from the JSR-168 specification, hence the name of this coplet. The JSR-168 portlets are configured in the JSR-168 portlet container. The coplet requires the following attribute:

portlet: Refers to the portlet’s entity-id, e.g. the window-id of the portlet. The naming may change with the JSR-168 portlet container. In cocoon-2.1.6 it is webapp.PortletTitle.

3.5.2 Portal Communication

Communication in the portal takes place by sending events. Events can be emitted by the client by accessing an event URL issued by the link service of the portal. No coplet is allowed to use the link service to retrieve an event URL. The link service may otherwise generate an invalid URL, because the event manager does not yet know about the other events occurring in the portal page. This knowledge is only available in one of the last processing steps in the rendering pipeline of the portal page.

Events: The Cocoon portal event system works on a subscription basis. Java™ objects can subscribe to a specific type of event and are notified of events of that type. For instance, coplets receive notifications of the type CopletInstanceEvent. For inter-coplet communication CopletJXPathEvents are used. These events allow the attributes of coplets to be set. Each CopletJXPathEvent can target any number of coplets and attributes. Such events can be included in the SAX stream of a coplet pipeline by creating link elements in the XML namespace of the CopletTransformer. The CopletTransformer is the last transformer in the portal page pipeline and converts these link elements into portal URLs bound to a CopletJXPathEvent. The next time the client accesses such an event URL, the DefaultJXPathSubscriber is notified by the event manager of the portal and sets the values of the coplet attributes accordingly.
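The subscription mechanism can be sketched with a minimal publish/subscribe class. The JavaScript below only mirrors the description. The class EventManager, the shape of the event object and the coplet names are invented, and the subscriber merely plays the role that the text assigns to the DefaultJXPathSubscriber.

```javascript
// Illustrative sketch of a subscription-based event manager. The class
// and the event shape are hypothetical and only mirror the description.
class EventManager {
  constructor() {
    this.subscribers = new Map(); // event type -> list of callbacks
  }
  subscribe(type, callback) {
    if (!this.subscribers.has(type)) this.subscribers.set(type, []);
    this.subscribers.get(type).push(callback);
  }
  publish(event) {
    // Only subscribers of this event type are notified.
    for (const callback of this.subscribers.get(event.type) || []) {
      callback(event);
    }
  }
}

const coplets = { news: { size: 1 }, search: { query: "" } };
const manager = new EventManager();

// The subscriber applies the attribute changes carried by the event.
manager.subscribe("CopletJXPathEvent", (event) => {
  for (const change of event.changes) {
    coplets[change.coplet][change.attribute] = change.value;
  }
});

// One event may target several coplets and attributes.
manager.publish({
  type: "CopletJXPathEvent",
  changes: [
    { coplet: "news", attribute: "size", value: 0 },
    { coplet: "search", attribute: "query", value: "XML" }
  ]
});
console.log(coplets);
```

Events of a type nobody subscribed to are simply dropped, so components stay decoupled from each other and only agree on event types.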

Bookmarks: Bookmarks are a special way of issuing events in the portal. The URLs generated by the link service of the portal are generally neither memorable nor persistent. A bookmark maps a URL to an event. The mapping between URLs and events is defined in a file called bookmarks.xml. The look of a URL is configured in the sitemap of the portal. A BookmarkAction issues events to the portal depending on a bookmark-id and an event value. The display of the portal page changes according to the fired events.

3.5.3 Authentication and Profiles

Authentication is separated from the portal. One could write a totally independent authentication layer to protect the URL space of the sitemap. It is nevertheless sensible to use Cocoon’s authentication framework, because the portal view can then be loaded depending on the authenticated user or the role of the user. The availability of coplets and the page layout can be configured and stored in XML files. The naming of the XML files is defined in the sitemap. If XML files are not sufficient, custom components for profile management must be introduced.

3.6 NXD Access

Cocoon supports access to native XML databases (NXD) through its XML:DB framework block. This framework was designed to access the Xindice XML database of the Apache Software Foundation. It seems that this database project has come to a standstill: there have been no releases for almost one year. The open source community has developed another NXD, called eXist. The eXist project embraced Cocoon’s XML:DB block by adding XQuery support. eXist is governed by the Lesser General Public License (LGPL) and not by the Apache License, which prevents the eXist extensions from being merged back into the XML:DB block of Cocoon. The distribution file of eXist includes a stripped down version of Cocoon and a Java™ GUI client for database maintenance. Once installed, the database can either be used in embedded mode or accessed remotely through various methods: XML:DB, Web-based Distributed Authoring and Versioning (WEBDAV), REST, SOAP and XML-RPC. For Cocoon only the XML:DB and WEBDAV methods are of interest.

3.6.1 XML:DB Access

XML:DB can be used to access the services defined in the XML:DB specification. It is possible to access multiple eXist instances in embedded mode by using XML:DB access.

XML:DB Services: eXist fully supports XML:DB Core Level 0 and Core Level 1. No further Core Levels had been defined as of January 2005. The following services exist in the database:

XQueryService: Supports the XQuery specification.
XUpdateQueryService: Supports the XUpdate specification.


Cocoon Components: The following Cocoon components are available:

XMLDBSource: This source functions as a limited filesystem driver for Cocoon. Due to restrictions in the XML:DB specification it can only support the Source and ModifiableSource interfaces, which allow reading and writing of files. Directories are non-existent.

XQueryGenerator: This component performs XQuery queries in the database and is the most powerful component of the eXist package for Cocoon. The XQueryGenerator supports caching of queries. It requires a file in which the XQuery is defined. The location of the documents is specified as a configuration parameter in the sitemap.

XMLDBTransformer: This transformer listens to some XML elements in the SAX stream and performs actions specified by them. It is vastly inferior to the XQueryGenerator.

3.6.2 WEBDAV

WEBDAV can be used to store plain XML documents in the database, because Cocoon provides a Source implementation for it. Furthermore WEBDAV is a popular way to manage documents on web servers. Some operating systems even allow WEBDAV servers to be accessed like any other network drive. If eXist is run in embedded mode, WEBDAV access cannot be used.

WEBDAVSource: This source is particularly interesting because it is more advanced than the XMLDBSource. Luckily eXist supports hierarchical names for document storage. The source supports the following interfaces:
• Source
• ModifiableSource
• TraversableSource
• ModifiableTraversableSource
• MoveableSource
• InspectableSource

The implementation of these interfaces provides the equivalent functionality of a filesystem with extended attributes, but without a locking mechanism.

Chapter 4

Integration of External Content Into the Cocoon Portal

This chapter deals with components for document sources and how to use them in the portal engine of Apache Cocoon. First an overview of the Hyperwave IS/6 enterprise content management platform is given. Next, useful Cocoon components are developed for it. After that it is shown how coplets can be built with these components. Then the integration of remote websites into the portal is shown. This is accomplished by aggregating the remote content with the aid of a proxy and link rewriting. Finally an agent based search system for digital libraries, Daffodil, is integrated into the portal. Due to the nature of Daffodil this integration requires a cache.


4.1 Hyperwave Information Server (HWIS)

The Hyperwave Information Server has its roots in the Hyper-G project originally developed by the Graz University of Technology. Development now takes place at Hyperwave AG¹. The Hyper-G project at Graz University of Technology builds upon the WWW and tackles some of its main shortcomings: in particular the lack of composite and hierarchical structures, the embodiment of links within documents and the inadequate provision for cross server and focused searches [24]. Since the launch of the Hyperwave AG development and support company this product has developed into a full-fledged clusterable enterprise platform. HWIS provides an identical programming interface for JavaScript, Java, C++ and COM (HWAPI) [25]. The HWAPI is used to develop components for Cocoon.

4.1.1 Components

HWIS manages its documents in collections. Each collection has at least one parent collection, except the root collection. This forms a collection hierarchy similar to a filesystem hierarchy, which makes it possible to create filesystem-like components for Cocoon.

¹ http://www.hyperwave.com

HyperwaveSource

The HyperwaveSource is a protocol driver for Cocoon. Once it is registered in the component configuration file of Cocoon (cocoon.xconf), the source is accessible in the sitemap through a URI similar to the URL scheme of the FTP protocol:

hyperwave://user:password@host:port/collection/path/to/object
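Because the URI follows the familiar user:password@host:port layout, its parts can be separated with the generic java.net.URI parser of the JDK. The sketch below is only an illustration: the class HyperwaveUriSketch and its parts method are hypothetical, and how HyperwaveSource itself parses the URI is not stated in the text.

```java
import java.net.URI;

// Illustration only: splits a hyperwave:// URI with the JDK's generic parser.
// This is not the parsing code of HyperwaveSource.
final class HyperwaveUriSketch {

    /** Returns {user, password, host, port, path}; port is "-1" when absent. */
    static String[] parts(String uriText) {
        URI uri = URI.create(uriText);
        String userInfo = uri.getUserInfo();   // "user:password" or null
        String user = null;
        String password = null;
        if (userInfo != null) {
            int colon = userInfo.indexOf(':');
            user = colon < 0 ? userInfo : userInfo.substring(0, colon);
            password = colon < 0 ? null : userInfo.substring(colon + 1);
        }
        return new String[] {
            user, password, uri.getHost(), String.valueOf(uri.getPort()), uri.getPath()
        };
    }

    public static void main(String[] args) {
        // Host name, port and credentials are made-up example values.
        String[] p = parts("hyperwave://alice:secret@hwis.example.org:4000/collection/path/to/object");
        System.out.println(String.join(" | ", p));
    }
}
```

The path component corresponds to the position of the object in the HWIS collection hierarchy.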

The HyperwaveSource implements the following interfaces:
• ModifiableTraversableSource
• MoveableSource
• InspectableSource

It can create documents and collections, it can be traversed like a directory, and it can move collection entries without the need to copy them first and delete them afterwards. Furthermore HyperwaveSource is able to read and write attributes of Hyperwave objects through the InspectableSource interface. While a byte stream is being written to an object, a lock is placed on the object. This eliminates a common cause of errors during concurrent read/write accesses. HWIS supports explicit locking and version control of objects. This functionality is not implemented in HyperwaveSource.

The aim of HyperwaveSource was to remove the need for filesystem access in Cocoon applications so that they can take advantage of the Hyperwave clustered storage system. Note: Explicit locking for HyperwaveSource can be provided by implementing the LockableSource interface. Implementing the VersionableSource interface would allow Cocoon applications to take advantage of HWIS version management.

HyperwaveIndexGenerator

In Cocoon a component called SourceIndexGenerator exists. It provides an XML representation of a multi-level object hierarchy listing of a traversable source component. It does so by traversing the hierarchy, accessing every single object one after another. This is extremely slow due to the numerous object accesses. The HWAPI provides a method to quickly retrieve the contents of a collection, which is used in the HyperwaveIndexGenerator to provide an XML object listing identical to that of the SourceIndexGenerator. It should be used as a drop-in replacement for the SourceIndexGenerator when dealing with Hyperwave servers.

HyperwaveSearchGenerator

HWIS contains a flexible and powerful search engine, called Verity. Verity is able to perform full text searches and searches in attributes of objects. Cocoon already contains a search component which is able to provide full text search, the SearchGenerator. HyperwaveSearchGenerator takes advantage of the comprehensive search feature of HWAPI. Comprehensive search, see figure 4.1, exceeds the basic capability of full text search. As a result, the XML output of HyperwaveSearchGenerator is incompatible with that of Cocoon’s SearchGenerator component. Comprehensive search provides three kinds of queries which can be combined:

extending keyquery: A query on attributes of HWIS objects. The objects with matching attributes are added to the search results.
full text query with restricting keyquery: Provides a full text query on HWIS objects. The search result list can be restricted by requiring additional attribute matches.
external full text query: Can search on web resources external to HWIS if an external index was configured in the user.xml file of the HWIS configuration.

Figure 4.1: How Comprehensive Search Works [25]

HyperwaveSearchGenerator paginates the search result list and returns the first page of results while the search is still running in the background to provide a quick response. Consecutive pages are obtained from the cache of search results in HWIS. An example of a full text query can be seen in figure 4.2.

In the next section the structure of a portal page will be described by means of a portal example. A Cocoon portal page of the Journal of Universal Computer Science (J.UCS) was chosen as an example.


Figure 4.2: A Simple Full Text Search with HyperwaveSearchGenerator

4.1.2 An Example Portal Page

Figure 4.3 shows the navigation tab "Browse" of the J.UCS example portal. The portal page is comprised of a page envelope and the content of the tabs. The page envelope wraps the tab content (coplets) into a predefined page design. It contains the page header and the tab navigation including its border.

The browse tab uses two coplets, a navigation coplet and a view coplet. The browse tab view is composed of three instances of these coplets:

JucsMenu-browse: Instance of the navigation coplet operating in vertical menu mode.
JucsBreadcrumbs-browse: Instance of the navigation coplet operating in breadcrumbs mode.
JucsView-browse: Instance of the view coplet providing the content view area.

Figure 4.3: J.UCS Browse Tab, Contents of Volume 10

Each coplet instance requires a resource attribute specifying the path on HWIS to display. Each coplet instance provides hyperlinks which influence the view of the other coplets. All coplets use CopletJXPathEvent events and the CopletTransformer to generate them. If the coplets were connected directly to each other, each one of the coplets would need a customized XSLT stylesheet to be able to create event links for the other two coplets. Doing so would prevent coplet reuse because the coplets could only work in this very specific page layout. This problem can be solved by using a mediator component to prevent direct event connections between coplets. The BookmarkAction is able to dynamically generate events upon HTTP requests. It requires a mapping from an event to an event subscriber, see listing 4.1.

Listing 4.1: bookmarks.xml [XML markup not preserved; surviving values of the four event definitions, in order:]

layout, maintab, aspectDatas/tab
coplet, JucsMenu-browse, attributes/resource
coplet, JucsBreadcrumbs-browse, attributes/resource
coplet, JucsView-browse, attributes/resource

Every event element has an id attribute which is used by the BookmarkAction to generate an event. bookmarks.xml defines four events. The first event definition on line 8 is able to switch the tab of the tab navigation bar. The other three event definitions map to the coplet instances in the browse view. Each event definition for the browse view sets the resource attribute of the coplet instance to a value. In order to explain how this mapping works with the portal, a line by line explanation of listing 4.2 is given.

Listing 4.2: Portal Bookmark sitemap.xmap Snippet [XML markup not preserved]

Consider the following request to myportal.at: http://myportal.at/browse?id=jucs_10_1/discovering_student_models_in

This request is matched by the matcher in line 2 of listing 4.2. The part {request-param:id} in line 4 uses the request parameter input module and is substituted with the value of the id request parameter. In the next line the portal page will be generated from the URI:

cocoon:/html?top=0&menu=jucs_10_1/discovering_student_models_in&breadcrumbs=jucs_10_1/discovering_student_models_in&view=jucs_10_1/discovering_student_models_in

The above URI is matched by the matcher in line 10. In line 12 the BookmarkAction inspects the request parameters for a mapping in bookmarks.xml. All four request parameters match and the BookmarkAction uses the LinkService of the portal to create a request parameter able to issue the following events:

top: Set the tab bar’s tab number to 0 (first tab).
menu: Set the resource attribute of the JucsMenu-browse coplet instance to jucs_10_1/discovering_student_models_in.
breadcrumbs: Set the resource attribute of the JucsBreadcrumbs-browse coplet instance to jucs_10_1/discovering_student_models_in.
view: Set the resource attribute of the JucsView-browse coplet instance to jucs_10_1/discovering_student_models_in.
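The two steps described here, substituting {request-param:id} and mapping the resulting request parameters to events, can be modelled in a few lines. This is a sketch under stated assumptions: BookmarkFlowSketch is a hypothetical class, the URI template is inferred from the expanded URI quoted in the text, the grouping of the values from listing 4.1 into one target per event id is assumed, and the real BookmarkAction produces portal events through the LinkService rather than strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of the bookmark flow; not the Cocoon BookmarkAction.
final class BookmarkFlowSketch {

    private static final Pattern REQUEST_PARAM =
            Pattern.compile("\\{request-param:([^}]+)\\}");

    // Event id -> "target type, target id, attribute path" (assumed grouping of listing 4.1).
    private static final Map<String, String> BOOKMARKS = new LinkedHashMap<>();
    static {
        BOOKMARKS.put("top", "layout maintab aspectDatas/tab");
        BOOKMARKS.put("menu", "coplet JucsMenu-browse attributes/resource");
        BOOKMARKS.put("breadcrumbs", "coplet JucsBreadcrumbs-browse attributes/resource");
        BOOKMARKS.put("view", "coplet JucsView-browse attributes/resource");
    }

    /** Replaces every {request-param:NAME} with the value of request parameter NAME. */
    static String expand(String template, Map<String, String> requestParams) {
        Matcher matcher = REQUEST_PARAM.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = requestParams.getOrDefault(matcher.group(1), "");
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Describes the event for each request parameter that has a bookmark mapping. */
    static List<String> resolve(Map<String, String> requestParams) {
        List<String> events = new ArrayList<>();
        for (Map.Entry<String, String> param : requestParams.entrySet()) {
            String target = BOOKMARKS.get(param.getKey());
            if (target != null) {   // parameters without a mapping are ignored
                events.add(target + " = " + param.getValue());
            }
        }
        return events;
    }
}
```

Keeping the mapping in one table is what lets the coplets stay unaware of each other: only the table knows which coplet instance an event id addresses.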

The request parameter created by the BookmarkAction is available in the sitemap as {uri} for further use. In line 15 of listing 4.2 this request parameter is used to generate the portal page from a similar request:

cocoon:/xml?cocoon-portal-event=5

This request is matched in line 23 and used to generate an XML version of the portal page by the PortalGenerator in line 25. The PortalGenerator uses the cocoon-portal-event request parameter to fire the events registered by the BookmarkAction and later calls the publishing pipelines of the coplets defined for the view.

The SAX stream of the portal page flows through the transformers in lines 16 and 17 and then through the serializer in line 5, back to the client where it is rendered, see figure 4.4.

Figure 4.4: J.UCS Browse Tab, Paper Format Selection

4.2 Remote Content

This section shows how to include remote web sites in the Cocoon portal. An overview of the methodology and related problems can be found in section 2.2. First it introduces helpful Cocoon components dealing with HTML and remote sources. Finally it shows how these components are used in the proxy coplet to embed remote sites as in figure 4.5.


Figure 4.5: Inclusion of informatik.tugraz.at without Frames

4.2.1 Components

Cocoon pipelines can only deal with SAX streams. Unfortunately HTML is not well-formed XML. An HTML to XHTML converter is needed to be able to produce a SAX stream for further processing. The HTMLGenerator and the ProxyTransformer need to perform this tidy step. To retrieve binary content the ProxyReader is needed. The CopletEventLinkTransformer is required to convert hyperlinks and forms to links processable by the CopletTransformer component.

HTMLGenerator

The HTMLGenerator reads an HTML document from a source and puts it through the W3C tidy tool to obtain XHTML. An XML parser is then used to produce a SAX stream. It can be used in any Cocoon pipeline as a starting point for processing. In principle, the HTMLGenerator functions as a FileGenerator for HTML files.

ProxyTransformer

The ProxyTransformer watches the SAX stream for an envelope element in order to replace it with the remote document. Next the remote document is fetched with a java.net.URLConnection. Note: Custom protocol handlers can be written for URLConnections². After fetching the document it is processed with W3C tidy. Finally it is parsed and the SAX stream is included into the transformer output at the position of the envelope element. The ProxyTransformer can only be used within an ApplicationCoplet and requires an envelope-tag sitemap parameter to name the envelope element.

ProxyReader

A remote site will contain binary content, like images or archive files, which cannot be serialized as a SAX stream. The portal engine is solely SAX stream based and cannot pipe byte streams. This limitation requires binary resources to be marked with a prefix to bypass the portal engine. The ProxyReader is then able to read remote binary content and deliver it to the client. The ProxyReader further requires the cocoon-portal-copletid and cocoon-portal-portalname request parameters to look up connection data from the coplet configuration.

² Custom URL protocol handlers: http://java.sun.com/developer/onlineTraining/protocolhandlers/


CopletEventLinkTransformer

This transformer examines XHTML anchors and forms to generate links processable by the CopletTransformer and can automatically add a prefix to the URL string of href or src attributes in other elements. The CopletEventLinkTransformer can detect off-site links and links that should open a new browser window. The behaviour can be tuned through sitemap parameters. A description of the parameters is given below:

attribute-name: The name of the temporary attribute to which the URI is set. The default is application-uri. This is the correct setting for CachingURICoplet instances.
cid-attribute-name: The coplet instance data attribute to which the URI is set. If this attribute is not specified, the temporary attributes of the CachingURICoplet are used.
portal-link-prefix: If specified, links will be prepended with the prefix instead of generating portal-link events. Its use is suggested if the BookmarkAction should be used for the remote site.
resource-prefix: If specified, all binary resources will be prepended by the resource prefix.
resource-extension-pattern: A Perl regular expression pattern that matches URLs of links. If a match occurs they are treated as binary content. These URLs are mostly used as download links.
protocol-exlude-pattern: This Perl regular expression pattern prevents foreign URLs and local anchors from being rewritten.
open-in-new-window-pattern: This Perl regular expression pattern determines links that should be opened in a new window.
site-include-pattern: A Perl regular expression pattern which forces matching URLs to be piped through the portal. This is useful if a site has many different domain names (virtual hosts).
debug-link: If true, output debugging attributes. The attribute names are prefixed by debug-. These attributes contain the original link attributes plus the results from the calls to getLink() and getPrefixLink(). This is useful for checking that link rewriting works as desired and for avoiding broken links.
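How such patterns could divide links into binary resources, links left untouched and links piped through the portal is sketched below. Everything in it is an assumption made for illustration: the class LinkClassifierSketch, the concrete patterns and the order in which they are checked are hypothetical, and java.util.regex is used in place of a Perl regular expression library. The actual precedence rules of the CopletEventLinkTransformer are not given in the text.

```java
import java.util.regex.Pattern;

// Hypothetical classifier; patterns and their precedence are assumptions,
// not the behaviour of the CopletEventLinkTransformer.
final class LinkClassifierSketch {

    enum Kind { BINARY, UNTOUCHED, PORTAL }

    // Plays the role of resource-extension-pattern: typical download or image suffixes.
    static final Pattern RESOURCE_EXTENSION =
            Pattern.compile(".*\\.(png|gif|jpg|zip|pdf)", Pattern.CASE_INSENSITIVE);

    // Plays the role of protocol-exlude-pattern: local anchors and anything with a scheme.
    static final Pattern PROTOCOL_EXCLUDE =
            Pattern.compile("(#|[a-zA-Z][a-zA-Z0-9+.-]*:).*");

    // Plays the role of site-include-pattern: absolute URLs that still belong to the site.
    static final Pattern SITE_INCLUDE =
            Pattern.compile("https?://(www\\.)?heise\\.de/.*");

    static Kind classify(String href) {
        if (RESOURCE_EXTENSION.matcher(href).matches()) {
            return Kind.BINARY;      // would receive the resource prefix
        }
        if (SITE_INCLUDE.matcher(href).matches()) {
            return Kind.PORTAL;      // forced through the portal despite its scheme
        }
        if (PROTOCOL_EXCLUDE.matcher(href).matches()) {
            return Kind.UNTOUCHED;   // foreign URL or local anchor, not rewritten
        }
        return Kind.PORTAL;          // relative link, rewritten to a coplet link
    }
}
```

With these example patterns a relative link such as /newsticker/index.html is rewritten, while #top and mailto: links are left alone.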

4.2.2 Example for the Proxy Coplet

This example uses the ProxyTransformer and CopletEventLinkTransformer components to include Heise Online (http://www.heise.de) as a coplet in the portal. For binary content a prefix of http://www.heise.de/ is added to make the browser fetch it directly from the remote site. The proxy coplet uses the link attribute of the ApplicationCoplet, see section 3.5.1. The sitemap, listing 4.3, uses the JXTemplateGenerator to generate the page envelope. The template, listing 4.4, requires two parameters to display a line at the top of the coplet page stating where the content originates from. The link parameter contains the URL of the currently displayed remote page and is read via the coplet sitemap input module ({coplet:attributes/link}) from the link attribute of the coplet instance. The site parameter contains the text which is displayed as the anchor text, see the yellow line in figure 4.6.

Listing 4.3: sitemap.xmap of the Proxy Coplet [XML markup not preserved]

The ProxyTransformer, listing 4.3 line 7, is instructed to replace the element in listing 4.4, line 17, with the remote document. The CopletEventLinkTransformer, listing 4.3 line 10, rewrites URLs and forms in the XHTML from www.heise.de to coplet links like:

These coplet links are used by the CopletTransformer to generate hyperlinks for the portal via the LinkService of the portal engine. All binary resources that are detected by the CopletEventLinkTransformer are prepended by http://www.heise.de/. All URLs starting with any protocol, for example http:// or ftp://, are treated as off-site links and will open in a new browser window. In the sitemap, listing 4.3, line 16, the XHTML head and body elements are encapsulated in an intermediary page envelope language residing in the pd (Portal Document) XML namespace, see listing 4.4 lines 3, 7, 8, 15 and 18.

Figure 4.6: Inclusion of www.heise.de without Frames

Listing 4.4: include.jx, The Template File of the Proxy Coplet [XML markup not preserved; surviving text on line 9: "The content of this page is included from"]

The result of this processing is included in the portal page by the portal generator, see listing 4.2 line 25. After that the intermediary portal document page language is transformed back to HTML (line 16). Next the CopletTransformer generates hyperlinks by means of the LinkService of the portal engine (line 17). Finally the whole page is streamed to the browser where it is displayed as in figure 4.6.

4.3 Daffodil

Daffodil, the high-level search tool for digital libraries, works as a Meta-Search-Engine by using wrappers around the search engines of digital libraries. This approach delivers the most accurate results but has the major disadvantage that the slowest digital library search engine determines the response time of Daffodil. The Daffodil developers suggest allowing at least 60 seconds for Daffodil to respond. Such a slow response time may be accepted once for high quality search results but not for the retrieval of subsequent result pages. Therefore a cache for the retrieved results is needed. This cache is implemented as permanent storage in an XML document on an eXist NXD, which automatically builds up a database of digital library metadata.

4.3.1 Components

The components required for a Daffodil search can be split into two groups. The DaffodilQueryStringBuilder is for use with FlowScript. All other components are used by the DaffodilSearchGenerator.

DaffodilQueryStringBuilder

The DaffodilQueryStringBuilder is a Java Bean intended to be loaded by FlowScript to generate a valid query string for Daffodil out of the form fields
• Author
• Title
• Free-Text
• Year

plus the logical operators AND, OR and NOT.

DaffodilSearchGenerator

This pipeline component performs a search on Daffodil, but requires a variety of helper classes. To illustrate how the helper classes relate to each other, a class diagram is provided in figure 4.7. Request processing takes place in the following order: At the first request the DaffodilSearchGenerator takes a valid Daffodil search string as sitemap or request parameter. Next it generates a query id for new searches. After that the DaffodilSearchClient is used to retrieve the search results. Next the search results are streamed as SAX events to the next pipeline component. All subsequent requests to retrieve the next search result page must contain the query id, the result to start from and the number of results on the page.
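The three values that follow-up requests must carry (query id, start index, page size) are enough to cut one page out of a stored result list. The sketch below only illustrates that slicing: ResultPagingSketch is a hypothetical class, and keeping the results in an in-memory map keyed by query id is an assumption, since the text does not say how the generator holds results between requests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical paging helper; the in-memory map is an assumption for illustration.
final class ResultPagingSketch {

    private final Map<String, List<String>> resultsByQueryId = new HashMap<>();
    private int nextId = 0;

    /** Stores the results of a new search and returns the generated query id. */
    String register(List<String> results) {
        String queryId = "q" + (nextId++);
        resultsByQueryId.put(queryId, new ArrayList<>(results));
        return queryId;
    }

    /** Returns at most count results beginning at index start, or an empty list. */
    List<String> page(String queryId, int start, int count) {
        List<String> all = resultsByQueryId.get(queryId);
        if (all == null || start < 0 || start >= all.size() || count <= 0) {
            return Collections.emptyList();
        }
        int end = Math.min(all.size(), start + count);
        return new ArrayList<>(all.subList(start, end));
    }
}
```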


Figure 4.7: Class Diagram of DaffodilSearchGenerator

DaffodilSearchClient: The search client first performs a metadata search with a MetadataAgent to retrieve a list of document ids. After that an XMLDBAgent is used to retrieve cached detail results by document id. Next a DetailAgent is used to query Daffodil for metadata details of the document ids not found in the cache. After that all retrieved metadata details are stored in the NXD by using the XMLDBAgent. Finally a list of search results is returned to the DaffodilSearchGenerator.
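The sequence above is a cache-aside lookup: ask the cache first, query the slow backend only for what is missing, then store the new entries. The following sketch shows that control flow under stated assumptions. The interface names echo the text, but their method signatures are hypothetical, and a plain Map stands in for the eXist database reached through the XMLDBAgent.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical signatures; a Map stands in for the NXD behind the XMLDBAgent.
final class CachedSearchSketch {

    interface MetadataAgent { List<String> search(String query); }

    interface DetailAgent { Map<String, String> details(Collection<String> ids); }

    private final MetadataAgent metadataAgent;
    private final DetailAgent detailAgent;
    private final Map<String, String> cache;

    CachedSearchSketch(MetadataAgent metadataAgent, DetailAgent detailAgent,
                       Map<String, String> cache) {
        this.metadataAgent = metadataAgent;
        this.detailAgent = detailAgent;
        this.cache = cache;
    }

    List<String> search(String query) {
        // 1. The metadata search yields the document ids.
        List<String> ids = metadataAgent.search(query);

        // 2. Find out which ids are not cached yet.
        List<String> missing = new ArrayList<>();
        for (String id : ids) {
            if (!cache.containsKey(id)) {
                missing.add(id);
            }
        }

        // 3. Query details only for the missing ids and store them in the cache.
        if (!missing.isEmpty()) {
            cache.putAll(detailAgent.details(missing));
        }

        // 4. Assemble the result list in the order of the metadata search.
        List<String> results = new ArrayList<>();
        for (String id : ids) {
            String detail = cache.get(id);
            if (detail != null) {   // ids whose details could not be fetched are skipped
                results.add(detail);
            }
        }
        return results;
    }
}
```

Because every successful detail lookup ends up in the cache, repeated searches shrink the set of ids that still have to be fetched from Daffodil.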

MetadataAgent: A MetadataAgent performs a query for document metadata by using a query string. Currently one implementation, the ExternalMetadataAgent, exists. It uses the Daffodil agent framework to perform a metadata query.

DetailAgent: A DetailAgent can retrieve more details for a document id, for example the journal in which the document was published or the abstract of the document. The DetailAgent is the major performance bottleneck, which is why three different implementations of it exist.

ExternalDetailAgent: This is the default implementation. It uses a single detail query to retrieve the details for all documents on a result page at once. It uses the Daffodil agent framework. A failure to retrieve all documents due to internal timeouts on the Daffodil server is not uncommon.
ThreadedExternalDetailAgent: This implementation was inspired by a fork bomb. A page with twenty results creates twenty connections, each in a separate thread performing a detail query for a single document id. Surprisingly this approach results in fewer errors than the ExternalDetailAgent implementation. The Daffodil agent framework is used to perform the queries.
ThreadedSoapDetailAgent: Simultaneous SOAP requests are used to retrieve the details for the documents on a page because the SOAP interface is unable to provide a result for more than one document id. This approach is the least favourable: it is slower than any other implementation and generates the most errors.
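The one-query-per-document idea of the threaded variants can be sketched with an ExecutorService. This is a minimal illustration, not the thesis implementation: ThreadedDetailSketch and fetchOne are hypothetical, and the real agents talk to Daffodil through its agent framework or SOAP. The sketch also suggests one possible reason for the lower error rate, namely that a failing lookup loses only its own entry instead of the whole page; the text itself does not give the cause.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical helper: one task per document id, failures stay isolated.
final class ThreadedDetailSketch {

    static Map<String, String> fetchAll(List<String> ids,
                                        Function<String, String> fetchOne,
                                        long timeoutSeconds) {
        // One thread per id, as in the description of the threaded agents.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, ids.size()));
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String id : ids) {
                Callable<String> task = () -> fetchOne.apply(id);
                pending.put(id, pool.submit(task));
            }
            Map<String, String> details = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    // The timeout applies to each lookup separately.
                    details.put(entry.getKey(),
                                entry.getValue().get(timeoutSeconds, TimeUnit.SECONDS));
                } catch (ExecutionException | TimeoutException failed) {
                    // A failed or slow lookup costs only this one entry.
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return details;
        } finally {
            pool.shutdownNow();   // cancel lookups that are still running
        }
    }
}
```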

XMLDBAgent: This interface has only one implementation, the ExistXMLDBAgent. It provides a client for updates and queries to the eXist XML database. This class is able to perform synchronized inserts of search results into the database by means of XUpdate and can retrieve metadata query results for document ids by using an XPath expression.


4.3.2 Daffodil Search Coplet

First the Cocoon form in figure 4.8 is verified by FlowScript, similar to the calculator example in section 3.4. Next the DaffodilQueryStringBuilder is used to compute a valid search string. Afterwards the search string is submitted to the search result pipeline as a sitemap parameter to perform a search with DaffodilSearchGenerator. An XSLT transformer converts the XML search results output into an HTML page. Next the CopletEventLinkTransformer is used to convert HTML links into a format suitable for the CopletTransformer. This coplet output is included into the portal page. Finally the coplet links are converted by the CopletTransformer into portal HTML links and the page is transferred to the client. A result page of a search for author "Norbert Fuhr" can be seen in figure 4.9. Requests for subsequent search results are handled directly by the DaffodilSearchGenerator without calling the flow controller.

Figure 4.8: CForms Search Dialog for Daffodil

Figure 4.9: Daffodil Search Results for Author Norbert Fuhr

Chapter 5

Conclusion and Future Prospects

XML based technologies are among the fastest evolving in the computing field. Namespace support in XML allows the standardization efforts of these technologies to continue in parallel. XML has already achieved what SGML never could: becoming a universal base for data exchange. Even the industry-fortified RDBMS world is looking into the emerging field of Native XML Databases.

5.1 Conclusions

During the work with Apache Cocoon the author came to the following conclusions:
• Pipelines composed of Generator, Transformer and Serializer are an elegant design directly expressing the data flow from the data source to the client.
• Using FlowScript is much easier than defining a finite state machine (FSM) to control the page flow.
• Due to its modularity and filesystem abstraction the core architecture of Cocoon is among the most flexible and powerful Web-Tier frameworks in existence.
• Flexibility bears the danger of taking unfavorable approaches in application development, which may result in difficulties maintaining these applications.
• Cocoon encourages the use of specialized teams through Separation Of Concerns (SOC).
• The introduction of continuations in Cocoon control flow makes it possible to write page logic in an elegant way.
• In Cocoon it is rather easy to mix MVC patterns, or to clutter the sitemap with components doing application logic.
• Due to the foundation on XML most newly developed web technologies can easily be integrated into Cocoon.
• Hyperwave can be integrated into Cocoon in an elegant manner by using a filesystem-like driver called Source.
• A huge number of concepts and technologies have to be learned to use Cocoon effectively.
• Portals make it possible to create a centralized access point to a heterogeneous information infrastructure of an organization.
• HTML pages are not meant to be composed from different sources, so a certain effort is required to get it right.
• With the aid of XSLT, menus from external applications can be integrated into the menu of the portal.
• Integrated information portals may be more effective than a redevelopment of legacy applications.
• Portals require a huge amount of care to prevent breakage due to a misbehaving portlet.
• Extra effort is required to obtain persistent and human readable links when a portal is used.
• Daffodil was never meant to be used in a WWW search engine. Its response time is far too long for Web searches.
• The eXist database is a good performer even for big XML documents of around one hundred megabytes in size.
• The on-disk storage structure of eXist is still immature. Complex documents may fail to store.

The examples have shown the Cocoon portal engine to be very versatile in the field of content integration. It is more powerful than most competing products on the market. Like most open source projects, Cocoon is lacking documentation, which results in many hours spent reading the source code.

5.2 Ongoing Developments of Cocoon and eXist

The future development of Cocoon focuses largely on the separation of non-core components into blocks and on building and deploying them dynamically. Furthermore the following enhancements are planned¹:
• Finish the work on the first stable release of Cocoon Forms.
• A new template engine.
• Migration of FlowScript to the new Rhino JavaScript interpreter.
• Inheritable Views.

Currently the development of the eXist NXD strives for source stabilization in order to ship version 1.0. Next a full implementation of the upcoming XQuery standard is planned. After that the development is likely to focus on XQuery Update Extensions, as XUpdate is a weak spot of XML databases. A wish to rewrite the WEBDAV implementation was also mentioned².

Today’s WWW is rooted in HTML. The next generation of the Web, called the Semantic Web, will be based upon XML, and XML frameworks like Apache Cocoon are likely to play an important role in building it.

¹ Planning Cocoon’s future: http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=110120731301758&w=2

5.3

An Introduction to the Semantic Web The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation [26].

The Semantic Web adds machine readable information to the WWW in a way understandable by machines and it encompasses the following technologies:

XML is used to structure documents with common syntactic requirements. XML Schema is a language for defining the structure and datatypes of a XML document. 2

eXist Roadmap http://wiki.exist-db.org/space/Roadmap

5.3. An Introduction to the Semantic Web

115

RDF is a markup language for metadata and consists of triples of subject, predicate and object. Predicate and object pairs can occur multiple times for a subject.

    subject is the URI of the resource.
    predicate names a metadata element such as author or title.
    object is the value of the named metadata element, i.e. the name of the author or the title.

An example may be more descriptive:

Listing 5.1: An RDF example

    Tim Berners-Lee
    The Semantic Web
    Scientific American
    2001-05-01
    en

The above example uses two XML namespaces, rdf and dc, to distinguish between RDF markup and the metadata. The dc namespace points to the Dublin Core Metadata standard, enabling any agent software that knows this standard to derive the meaning by applying processing rules. Such an agent would be aware that the author of the document specified by the rdf:about attribute is Tim Berners-Lee, and that the document is titled The Semantic Web, was published in Scientific American on 1 May 2001, and is written in English.

RDF Schema is a language used to extend the basic capabilities of RDF. It provides means to define application-specific properties and classes for RDF. Classes in RDF Schema are similar to classes in object-oriented languages. RDF resources can be defined as subclasses and instances of classes.

Web Ontology Language (OWL): An ontology is a system of terms defined in a formal manner. Ontologies differ from controlled vocabularies in allowing rules and relations to be defined between terms. OWL extends the capabilities of RDF and RDF Schema to represent the meaning of the metadata elements and their relationships. The RDF example from Listing 5.1 could become understandable for agent software that does not know the meaning of the Dublin Core metadata vocabulary, if an ontology exists that maps from Dublin Core to a meaning known by the agent software.
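
The way an agent reads an RDF description as subject, predicate and object triples can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the portal implementation: the resource URI http://example.org/the-semantic-web is a placeholder, the Dublin Core element names (dc:creator, dc:title, dc:publisher, dc:date, dc:language) are assumed for the metadata values of Listing 5.1, and AGENT_VOCABULARY is a hypothetical stand-in for a real ontology mapping.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumed RDF/XML description; the resource URI is a placeholder and the
# choice of dc:* elements is an assumption made for this sketch.
RDF_XML = f"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://example.org/the-semantic-web">
    <dc:creator>Tim Berners-Lee</dc:creator>
    <dc:title>The Semantic Web</dc:title>
    <dc:publisher>Scientific American</dc:publisher>
    <dc:date>2001-05-01</dc:date>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for simple literal properties."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree spells qualified names as "{namespace}local";
            # joining both parts yields the predicate URI.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, (prop.text or "").strip()))
    return triples


# Hypothetical stand-in for an ontology: maps Dublin Core predicates to
# terms that an agent unfamiliar with Dublin Core already understands.
AGENT_VOCABULARY = {
    DC_NS + "creator": "author",
    DC_NS + "title": "title",
}

triples = extract_triples(RDF_XML)
for subject, predicate, obj in triples:
    print(subject, AGENT_VOCABULARY.get(predicate, predicate), obj)
```

Every child element of rdf:Description contributes one triple for the same subject, which mirrors the statement that predicate and object pairs can occur multiple times for a subject. A real agent would use a full RDF parser and an OWL ontology rather than a dictionary lookup.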

Large sites with unified data sources and metadata collections, such as digital libraries, publishers, and on-line music collections, are expected to be among the first to adopt Semantic Web technologies. A lot of work will have to be done to merge and map existing domain-specific metadata schemes into a domain-specific ontology ready for automated processing. For instance, it is not even clear what the relation between the DOI application profiles and the Semantic Web will be. Due to the complexity of the Semantic Web, the whole project may face major difficulties. Nevertheless, its creators expect Semantic Web technologies to come into broad use around the year 2010. Although the W3C has released some use cases for the Semantic Web, it is largely unknown which technologies will evolve. Only one thing is certain now: the next generation of the World Wide Web will be built to serve its users in a better way.

List of Figures

2.1 Composition of a Portal Page
2.2 XSL Processing Chain
2.3 Multi-Tier Architecture
2.4 Multi-Channel Publishing via UNIX Pipes
2.5 Author Listing in Browser Window
2.6 Author Listing as PDF
3.1 The Cocoon Pyramid of Contracts
3.2 A Typical Cocoon Application
3.3 Screen Shots of the Calculator Example
3.4 Sequence of First Form Request
3.5 Processing of Subsequent Form Submissions
4.1 How Comprehensive Search Works
4.2 A Simple Full Text Search with HyperwaveSearchGenerator
4.3 J.UCS Browse Tab, Contents of Volume 10
4.4 J.UCS Browse Tab, Paper Format Selection
4.5 Inclusion of informatik.tugraz.at without Frames
4.6 Inclusion of www.heise.de without Frames
4.7 Class Diagram of DaffodilSearchGenerator
4.8 CForms Search Dialog for Daffodil
4.9 Daffodil Search Results for Author Norbert Fuhr

Listings

2.1 An IFRAME example
2.2 Heise Online Search Dropdown Box
2.3 Complicated Stylesheet Include
2.4 Standard Stylesheet Include
2.5 Remote Site Number 1
2.6 Remote Site Number 2
2.7 Merged Events - Non Working
2.8 Merged Events - Working
2.9 bstore1.example.com/bib.xml example data
2.10 An XQuery FLWOR example
2.11 XQuery expected result
2.12 An XML UNIX Pipeline Example
2.13 document.xml
2.14 people.xml
2.15 document2intermediate.xsl
2.16 Contents of Processing Step intermediate.xml
2.17 intermediate2html.xsl
2.18 listing.html
2.19 intermediate2fo.xsl
2.20 listing.fo
3.1 A Basic Pipeline
3.2 calculator.js, a FlowScript Calculator
3.3 sitemap.xmap, the Sitemap for calculator.js
4.1 bookmarks.xml
4.2 Portal Bookmark sitemap.xmap Snippet
4.3 sitemap.xmap of the Proxy Coplet
4.4 include.jx, The Template File of the Proxy Coplet
5.1 An RDF example

Bibliography

[1] Wikipedia - World Wide Web. http://en.wikipedia.org/wiki/World_Wide_Web#Publishing_web_pages [visited December, 2004].

[2] Deja Vu: (re-)creating web history. http://www.dejavu.org [visited December, 2004].

[3] Netcraft: September 2004 web server survey. http://news.netcraft.com/archives/2004/08/31/september_2004_web_server_survey.html [visited December, 2004].

[4] Jakob Nielsen. Alertbox: Search and you may find. Technical report, useit.com, 1997. http://www.useit.com/alertbox/9707b.html [visited December, 2004].

[5] Sascha Kriewel, Claus-Peter Klas, André Schaefer, and Norbert Fuhr. Daffodil - strategic support for user-oriented access to heterogeneous digital libraries. D-Lib Magazine, 10(6), June 2004.

[6] Wikipedia - Open Directory Project. http://en.wikipedia.org/wiki/Open_Directory_Project#Size_of_directory [visited December, 2004].

[7] Rainer Kuhlen. Nachhaltigkeit muss nicht Verknappung bedeuten - in Richtung Wissensökologie. FIfF-Kommunikation, 2004.

[8] Dr. Norman Paskin. The DOI Handbook, chapter Appendix 6, pages 147–149. International DOI Foundation, 4.0.0 edition, April 2004. http://www.doi.org/handbook_2000/appendix_6.pdf [visited December, 2004].

[9] Dr. Norman Paskin. The DOI Handbook, chapter 8, pages 94–95. International DOI Foundation, 4.0.0 edition, April 2004. http://www.doi.org/handbook_2000/registration_agencies.html [visited December, 2004].

[10] James Clark and Steve DeRose. XML Path Language (XPath). Technical report, World Wide Web Consortium, 1999. http://www.w3c.org/TR/xpath [visited December, 2004].

[11] Paul Grosso, Eve Maler, Jonathan Marsh, and Norman Walsh. XPointer Framework. Technical report, World Wide Web Consortium, 2003. http://www.w3.org/TR/xptr-framework/ [visited December, 2004].

[12] Jonathan Marsh and David Orchard. XML Inclusions (XInclude). Technical report, World Wide Web Consortium, 2004. http://www.w3.org/TR/xinclude/ [visited December, 2004].

[13] Eric van der Vlist. XML Schema. O’Reilly & Associates, 2002.

[14] Will Provost. Normalizing XML, part 1. Technical report, O’Reilly XML.com, November 2002. http://www.xml.com/pub/a/2002/11/13/normalizing.html [visited December, 2004].

[15] Will Provost. Normalizing XML, part 2. Technical report, O’Reilly XML.com, December 2002. http://www.xml.com/pub/a/2002/12/04/normalizing.html [visited December, 2004].

[16] Andreas Laux and Lars Martin. XML Update language. Technical report, XML:DB, September 2000. http://xmldb-org.sourceforge.net/xupdate/xupdate-wd.html [visited December, 2004].

[17] XML Query (XQuery). http://www.w3c.org/XML/Query [visited December, 2004].

[18] XML use cases: Queries and results, Q12. http://www.w3c.org/TR/xquery-use-cases/#xmp-queries-results-q12 [visited December, 2004].

[19] Kimbro Staken. An introduction to the XML:DB API. Technical report, O’Reilly XML.com, 2002.

[20] Douglas McIlroy, editor. Mass Produced Software Components. NATO, 1968. http://www.cs.dartmouth.edu/~doug/components.txt [visited December, 2004].

[21] John Reynolds. Definitional interpreters for higher order programming languages. In ACM Conference Proceedings, pages 717–740. ACM, 1972.

[22] R. Kent Dybvig. The Scheme Programming Language, chapter 3 Going Further, pages 55–82. The MIT Press, 2003.

[23] Alejandro Abdelnur and Stefan Hepper. Java™ Portlet Specification version 1.0. Technical report, Java™ Community Process, 2003. http://jcp.org/en/jsr/detail?id=168 [visited December, 2004].

[24] Hermann Maurer. Hyperwave the Next Generation Web Solution, chapter 2 Information Systems and the Internet, page 18. Addison Wesley, 1996.

[25] Hyperwave IS/6 Release 3 Programmers Guide. Technical report, Hyperwave AG, 2004.

[26] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001. http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C7084A9809EC588EF21 [visited December, 2004].