1 <?xml version="1.0" standalone="no"?>
2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1//EN"
3 "http://www.oasis-open.org/docbook/xml/4.1/docbookx.dtd"
5 <!ENTITY % local SYSTEM "local.ent">
7 <!ENTITY % entities SYSTEM "entities.ent">
9 <!ENTITY % idcommon SYSTEM "common/common.ent">
14 <title>Pazpar2 - User's Guide and Reference</title>
16 <firstname>Sebastian</firstname><surname>Hammer</surname>
19 <firstname>Adam</firstname><surname>Dickmeiss</surname>
22 <firstname>Marc</firstname><surname>Cromme</surname>
25 <firstname>Jakub</firstname><surname>Skoczen</surname>
27 <releaseinfo>&version;</releaseinfo>
29 <year>&copyright-year;</year>
30 <holder>Index Data</holder>
34 Pazpar2 is a high-performance, user interface-independent, data
35 model-independent metasearching
36 middle-ware featuring merging, relevance ranking, record sorting,
40 This document is a guide and reference to Pazpar2 version &version;.
45 <imagedata fileref="common/id.png" format="PNG"/>
48 <imagedata fileref="common/id.eps" format="EPS"/>
55 <chapter id="introduction">
56 <title>Introduction</title>
58 Pazpar2 is a stand-alone metasearch client with a web-service API, designed
59 to be used either from a browser-based client (JavaScript, Flash, Java,
60 etc.), from server-side code, or any combination of the two.
61 Pazpar2 is a highly optimized client designed to
62 search many resources in parallel. It implements record merging,
63 relevance-ranking and sorting by arbitrary data content, and facet
64 analysis for browsing purposes. It is designed to be data model
65 independent, and is capable of working with MARC, DublinCore, or any
66 other <ulink url="&url.xml;">XML</ulink>-structured response format
67 -- <ulink url="&url.xslt;">XSLT</ulink> is used to normalize and extract
68 data from retrieval records for display and analysis. It can be used
69 against any server which supports the
70 <ulink url="&url.z39.50;">Z39.50</ulink> and <ulink url="&url.sru;">SRU/SRW</ulink>
72 backend modules can be used to support a large number of other protocols
73 (please contact Index Data for further information about this).
76 Additional functionality, such as
77 user management and attractive displays, is expected to be implemented by
78 applications that use Pazpar2. Pazpar2 is user interface independent.
79 Its functionality is exposed through a simple REST-style web-service API,
80 designed to be simple to use from an Ajax-enabled browser, Flash
81 animation, Java applet, etc., or from a higher-level server-side language
82 like PHP or Java. Because session information can be shared between
83 browser-based logic and your server-side scripting, there is tremendous
84 flexibility in how you implement your business logic on top of Pazpar2.
87 Once you launch a search in Pazpar2, the operation continues behind the
88 scenes. Pazpar2 connects to servers, carries out searches, and
89 retrieves, deduplicates, and stores results internally. Your application
90 code may periodically inquire about the status of an ongoing operation,
91 and ask to see records or other result set facets. Results become
92 available immediately, and it is easy to build end-user interfaces which
93 feel extremely responsive, even when searching more than 100 servers
97 Pazpar2 is designed to be highly configurable. Incoming records are
98 normalized to XML/UTF-8, and then further normalized using XSLT to a
99 simple internal representation that is suitable for analysis. By
100 providing XSLT stylesheets for different kinds of result records, you
101 can tune Pazpar2 to work against different kinds of information
102 retrieval servers. Finally, metadata is extracted, in a configurable
103 way, from this internal record, to support display, merging, ranking,
104 result set facets, and sorting. Pazpar2 is not bound to a specific model
105 of metadata, such as DublinCore or MARC -- by providing the right
106 configuration, it can work with a number of different kinds of data in
107 support of many different applications.
110 Pazpar2 is designed to be efficient and scalable. You can set it up to
111 search several hundred targets in parallel, or you can use it to support
112 hundreds of concurrent users. It is implemented with the same attention
113 to performance and economy that we use in our indexing engines, so that
114 you can focus on building your application, without worrying about the
115 details of metasearch logic. You can devote all of your attention to
116 usability and let Pazpar2 do what it does best -- metasearch.
119 If you wish to connect to commercial or other databases which do not
120 support open standards, please contact Index Data. We have a licensing
121 agreement with a third party vendor which will enable Pazpar2 to access
122 thousands of online databases, in addition to the vast number of catalogs
123 and online services that support the Z39.50/SRU/SRW protocols.
126 Pazpar2 is our attempt to re-think the traditional paradigms for
127 implementing and deploying metasearch logic, with an uncompromising
128 approach to performance, and attempting to make maximum use of the
129 capabilities of modern browsers. The demo user interface that
130 accompanies the distribution is but one example. If you think of new
131 ways of using Pazpar2, we hope you'll share them with us, and if we
132 can provide assistance with regards to training, design, programming,
133 integration with different backends, hosting, or support, please don't
134 hesitate to contact us. If you'd like to see functionality in Pazpar2
135 that is not there today, please don't hesitate to contact us. It may
136 already be in our development pipeline, or there might be a
137 possibility for you to help out by sponsoring development time or
138 code. Either way, get in touch and we will give you straight answers.
144 Pazpar2 is covered by the GNU General Public License (GPL) version 2.
145 See <xref linkend="license"/> for further information.
149 <chapter id="installation">
150 <title>Installation</title>
152 The Pazpar2 package is very small. It includes documentation as well
153 as the Pazpar2 server. The package also includes a simple user
154 interface, test1, which consists of a single HTML page and a single
155 JavaScript file to illustrate the use of Pazpar2.
158 Pazpar2 depends on the following tools/libraries:
160 <varlistentry><term><ulink url="&url.yaz;">YAZ</ulink></term>
163 The popular Z39.50 toolkit for the C language.
164 YAZ <emphasis>must</emphasis> be compiled with Libxml2/Libxslt support.
168 <varlistentry><term><ulink url="&url.icu;">International
169 Components for Unicode (ICU)</ulink></term>
172 ICU provides Unicode support for non-English languages with
173 character sets outside the range of 7-bit ASCII, like
174 Greek, Russian, German and French. Pazpar2 uses the ICU
175 Unicode character conversions, Unicode normalization, case
176 folding and other fundamental operations needed in
177 tokenization, normalization and ranking of records.
180 Compiling, linking, and usage of the ICU libraries is optional,
181 but strongly recommended for usage in an international
189 In order to compile Pazpar2, a C compiler which supports C99 or later
193 <section id="installation.unix">
194 <title>Installation on Unix (from Source)</title>
196 The latest source code for Pazpar2 is available from
197 <ulink url="&url.pazpar2.download;"/>.
198 Most systems provide binary packages for at least some
199 of the required tools.
200 If, for example, the Libxml2/Libxslt libraries
201 are already installed as development packages, use those.
205 Ensure that the development libraries and header files are
206 available on your system before compiling Pazpar2. For installation
207 of YAZ, refer to the YAZ installation chapter.
210 gunzip -c pazpar2-version.tar.gz|tar xf -
218 The <literal>make install</literal> will install manpages as well as the
219 Pazpar2 server, <literal>pazpar2</literal>,
220 in PREFIX<literal>/sbin</literal>.
221 By default, PREFIX is <literal>/usr/local/</literal>. This can be
222 changed with the configure option <option>--prefix</option>.
226 <section id="installation.win32">
227 <title>Installation on Windows (from Source)</title>
229 Pazpar2 can be built for Windows using
230 <ulink url="&url.vstudio;">Microsoft Visual Studio</ulink>.
231 The support files for building Pazpar2 on Windows are located in the
232 <filename>win</filename> directory. The compilation is performed
233 using the <filename>win/makefile</filename>, which is to be
234 processed by the NMAKE utility that is part of Visual Studio.
237 Ensure that the development libraries and header files are
238 available on your system before compiling Pazpar2. For installation
239 of YAZ, refer to the YAZ installation chapter.
240 It is easiest if YAZ and Pazpar2 are unpacked in the same
241 directory (side-by-side).
244 The compilation is tuned by editing the makefile of Pazpar2.
245 The process is similar to YAZ. Adjust the various directories
246 <literal>YAZ_DIR</literal>, <literal>ZLIB_DIR</literal>, ..
249 Compile Pazpar2 by invoking <application>nmake</application> in
250 the <filename>win</filename> directory.
251 The resulting binaries of the build process are located in the
252 <filename>bin</filename> directory of the Pazpar2 source
253 tree - including the <filename>pazpar2.exe</filename> and necessary DLLs.
256 The Windows version of Pazpar2 is a console application. It may
257 be installed as a Windows Service by passing the option
258 <literal>-install</literal> to the pazpar2 program. This will
259 register Pazpar2 as a service and use the other options provided
260 in the same invocation. For example:
263 ..\bin\pazpar2 -install -f pazpar2.cfg -l pazpar2.log
265 The Pazpar2 service may now be controlled via the Service Control
266 Panel. It may be unregistered by passing the <literal>-remove</literal>
270 ..\bin\pazpar2 -remove
275 <section id="installation.test1">
276 <title>Installation of test1 interface</title>
278 In this section we outline how to install a simple interface that
279 is part of the Pazpar2 source package. Note that Debian users can
280 save time by just installing package <literal>pazpar2-test1</literal>.
283 A web server must be installed and running on the system, such as Apache.
287 Start the Pazpar2 daemon using the 'in-source' binary of the Pazpar2
288 daemon. On Unix the process is:
291 cp pazpar2.cfg.dist pazpar2.cfg
293 ../src/pazpar2 -f pazpar2.cfg
298 copy pazpar2.cfg.dist pazpar2.cfg
299 copy edu.xml settings
300 ..\bin\pazpar2 -f pazpar2.cfg
302 This will start a Pazpar2 listener on port 9004. It will proxy
303 HTTP requests to port 80 on localhost, which we assume will be the regular
304 HTTP server on the system. Inspect and modify pazpar2.cfg as needed
305 if this is to be changed. The pazpar2.cfg includes settings from the
306 directory <filename>settings</filename>.
310 Open a new console and continue with the remaining steps.
311 For more information about pazpar2 options refer to the manpage.
315 The test1 UI is located in <literal>www/test1</literal>. Ensure this
316 directory is available to the web server by copying
317 <literal>test1</literal> to the document root, creating a symlink, or
318 using Apache's <literal>Alias</literal> directive.
322 The test1 interface should now be available on port 9004.
325 If you don't see the test1 interface, check whether test1 is really available
326 on the same URL but on port 80. If it is not, the configuration of Apache
327 (or your other web server) is not correct.
330 In order to use Apache as frontend for the interface on port 80
331 for public access etc., refer to
332 <xref linkend="installation.apache2proxy"/>.
336 <section id="installation.debian">
337 <title>Installation on Debian GNU/Linux</title>
339 Index Data provides Debian packages for Pazpar2. These are prepared
340 for Debian versions Etch and Lenny (as of 2007).
341 These packages are available at
342 <ulink url="&url.pazpar2.download.debian;"/>.
346 <section id="installation.apache2proxy">
347 <title>Apache 2 Proxy</title>
350 <ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html">
352 </ulink> which allows Pazpar2 to become a backend to an Apache 2
353 based web service. The Apache 2 proxy must operate in the
354 <emphasis>Reverse</emphasis> Proxy mode.
358 On a Debian based Apache 2 system, the relevant modules can
361 sudo a2enmod proxy_http
366 Traditionally Pazpar2 interprets URL paths with suffix
367 <literal>/search.pz2</literal>.
370 url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass"
371 >ProxyPass</ulink> directive of Apache must be used to map a URL path
372 to the Pazpar2 server (listening port).
377 The ProxyPass directive takes a prefix rather than
378 a suffix as URL path. It is important that the JavaScript code
379 uses the prefix given for it.
383 <example id="installation.apache2proxy.example">
384 <title>Apache 2 proxy configuration</title>
386 If Pazpar2 is running on port 8004 and the portal is using
387 <filename>search.pz2</filename> inside a portal in the directory
388 <filename>/myportal/</filename>, we could use the following
389 Apache 2 configuration:
392 <IfModule mod_proxy.c>
396 AddDefaultCharset off
401 ProxyPass /myportal/search.pz2 http://localhost:8004/search.pz2
412 <title>Using Pazpar2</title>
414 This chapter provides a general introduction to the use and
415 deployment of Pazpar2.
418 <section id="architecture">
419 <title>Pazpar2 and your systems architecture</title>
421 Pazpar2 is designed to provide asynchronous, behind-the-scenes
422 metasearching functionality to your application, exposing this
423 functionality using a simple webservice API that can be accessed
424 from any number of development environments. In particular, it is
425 possible to combine Pazpar2 either with your server-side dynamic
426 website scripting, with scripting or code running in the browser, or
427 with any combination of the two. Pazpar2 is an excellent tool for
428 building advanced, Ajax-based user interfaces for metasearch
429 functionality, but it isn't a requirement -- you can choose to use
430 Pazpar2 entirely as a backend to your regular server-side scripting.
431 When you do use Pazpar2 in conjunction
432 with browser scripting (JavaScript/Ajax, Flash, applets,
433 etc.), there are special considerations.
437 Pazpar2 implements a simple but efficient HTTP server, and it is
438 designed to interact directly with scripting running in the browser
439 for the best possible performance, and to limit overhead when
440 several browser clients generate numerous webservice requests.
441 However, it is still desirable to use a conventional webserver,
442 such as Apache, to serve up graphics, HTML documents, and
443 server-side scripting. Because the security sandbox environment of
444 most browser-side programming environments only allows communication
445 with the server from which the enclosing HTML page or object
446 originated, Pazpar2 is designed so that it can act as a transparent
447 proxy in front of an existing webserver (see <xref
448 linkend="pazpar2_conf"/> for details).
449 In this mode, all regular
450 HTTP requests are transparently passed through to your webserver,
451 while Pazpar2 only intercepts search-related webservice requests.
455 If you want to expose your combined service on port 80, you can
456 either run your regular webserver on a different port, a different
457 server, or a different IP address associated with the same server.
461 Pazpar2 can also work behind
462 a reverse proxy. Refer to <xref linkend="installation.apache2proxy"/>
463 for more information.
464 This allows your existing HTTP server to operate on port 80 as usual.
465 Pazpar2 can be started on another (internal) port.
469 Sometimes, it may be necessary to implement functionality on your
470 regular webserver that makes use of search results, for example to
471 implement data import functionality, emailing results, history
472 lists, personal citation lists, interlibrary loan functionality,
473 etc. Fortunately, it is simple to exchange information between
474 Pazpar2, your browser scripting, and backend server-side scripting.
475 You can send a session ID and possibly a record ID from your browser
476 code to your server code, and from there use Pazpar2's webservice API
477 to access result sets or individual records. You could even 'hide'
478 all of Pazpar2's functionality behind your own API implemented on
479 the server-side, and access that from the browser or elsewhere. The
480 possibilities are just about endless.
484 <section id="data_model">
485 <title>Your data model</title>
487 Pazpar2 does not have a preconceived model of what makes up a data
488 model. There are no assumptions that records have specific fields or
489 that they are organized in any particular way. The only assumption
490 is that data comes packaged in a form that the software can work
491 with (presently, that means XML or MARC), and that you can provide
492 the necessary information to massage it into Pazpar2's internal
497 Handling retrieval records in Pazpar2 is a two-step process. First,
498 you decide which data elements of the source record you are
499 interested in, and you specify any desired massaging or combining of
500 elements using an XSLT stylesheet (MARC records are automatically
501 normalized to <ulink url="&url.marcxml;">MARCXML</ulink> before this step).
502 If desired, you can run multiple XSLT stylesheets in series to accomplish
503 this, but the output of the last one should be a representation of the
504 record in a schema that Pazpar2 understands.
508 The intermediate, internal representation of the record looks like
511 <record xmlns="http://www.indexdata.com/pazpar2/1.0"
512 mergekey="title The Shining author King, Stephen">
514 <metadata type="title">The Shining</metadata>
516 <metadata type="author">King, Stephen</metadata>
518 <metadata type="kind">ebook</metadata>
520 <!-- ... and so on -->
524 As you can see, there isn't much to it. There are really only a few
525 important elements to this file.
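As an illustration only, the internal record shown above can be read with a few lines of Python. This is a minimal sketch, not part of Pazpar2: the function name read_record and the SAMPLE constant are hypothetical, and the sample simply repeats the three metadata elements from the example record.

```python
import xml.etree.ElementTree as ET

# Namespace of the internal record format, as given in this guide.
PZ2_NS = "http://www.indexdata.com/pazpar2/1.0"

# Hypothetical sample: the example record from this section, closed off
# after its first three metadata elements.
SAMPLE = """<record xmlns="http://www.indexdata.com/pazpar2/1.0"
        mergekey="title The Shining author King, Stephen">
  <metadata type="title">The Shining</metadata>
  <metadata type="author">King, Stephen</metadata>
  <metadata type="kind">ebook</metadata>
</record>"""


def read_record(xml_text):
    """Return (mergekey, fields) for one internal record.

    fields maps each metadata 'type' to the list of values found,
    since a record may carry several elements of the same type.
    """
    root = ET.fromstring(xml_text)
    fields = {}
    for elem in root.findall(f"{{{PZ2_NS}}}metadata"):
        fields.setdefault(elem.get("type"), []).append(elem.text or "")
    # The mergekey attribute is optional, so this may be None.
    return root.get("mergekey"), fields


if __name__ == "__main__":
    mergekey, fields = read_record(SAMPLE)
    print(mergekey)
    print(fields)
```

The sketch keeps a list per metadata type so that repeated elements, for example several authors, are not lost.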
529 Elements should belong to the namespace
530 <literal>http://www.indexdata.com/pazpar2/1.0</literal>.
531 If the root node contains the
532 attribute 'mergekey', then every record that generates the same
533 merge key (normalized for case differences, white space, and
534 truncation) will be joined into a cluster. In other words, you
535 decide how records are merged. If you don't include a merge key,
536 records are never merged. The 'metadata' elements provide the meat
537 of the elements -- the content. The 'type' attribute is used to
538 match each element against processing rules that determine what
539 happens to the data element next.
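The clustering rule described above can be sketched in Python as follows. This is an assumption-laden illustration rather than Pazpar2's actual implementation: the helper names normalize_mergekey and cluster are hypothetical, records are modelled as plain dictionaries, and the normalization covers only case and white space (the truncation step mentioned above is not modelled).

```python
from collections import defaultdict


def normalize_mergekey(key):
    """Assumed normalization: fold case and collapse runs of white space."""
    return " ".join(key.lower().split())


def cluster(records):
    """Group records (dicts with an optional 'mergekey') into clusters.

    Records that produce the same normalized merge key share a cluster.
    Records without a merge key are never merged: each forms its own cluster.
    """
    merged = defaultdict(list)
    unmerged = []
    for record in records:
        key = record.get("mergekey")
        if key is None:
            unmerged.append([record])
        else:
            merged[normalize_mergekey(key)].append(record)
    return list(merged.values()) + unmerged


if __name__ == "__main__":
    records = [
        {"id": 1, "mergekey": "title The Shining author King, Stephen"},
        {"id": 2, "mergekey": "Title  the shining AUTHOR king, stephen"},
        {"id": 3},  # no merge key, so it stays on its own
    ]
    for group in cluster(records):
        print([record["id"] for record in group])
```

Because the merge key is produced by your stylesheet, changing how the key is built changes which records end up in the same cluster.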
543 The next processing step is the extraction of metadata from the
544 intermediate representation of the record. This is governed by the
545 'metadata' elements in the 'service' section of the configuration
546 file. See <xref linkend="config-server"/> for details. The metadata
547 in the retrieval record ultimately drives merging, sorting, ranking,
548 the extraction of browse facets, and display, all configurable.
552 <section id="client">
553 <title>Client development overview</title>
555 You can use Pazpar2 from any environment that allows you to use
556 webservices. The initial goal of the software was to support
557 Ajax-based applications, but there literally are no limits to what
558 you can do. You can use Pazpar2 from JavaScript, Flash, Java, etc.,
559 on the browser side, and from any development environment on the
560 server side, and you can pass session tokens and record IDs freely
561 around between these environments to build sophisticated applications.
562 Use your imagination.
566 The webservice API of Pazpar2 is described in detail in <xref
567 linkend="pazpar2_protocol"/>.
571 In brief, you use the 'init' command to create a session, a
572 temporary workspace which carries information about the current
573 search. You start a new search using the 'search' command. Once the
574 search has been started, you can follow its progress using the
575 'stat', 'bytarget', 'termlist', or 'show' commands. Detailed records
576 can be fetched using the 'record' command.
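The command sequence above can be sketched as a small Python helper that builds the request URLs. Treat it as a rough illustration: the base URL assumes the listener on port 9004 and the /search.pz2 path used elsewhere in this guide, the function name command_url is hypothetical, and the parameter names command, session, query, and id are assumptions here that should be checked against the protocol reference.

```python
from urllib.parse import urlencode

# Assumed base URL: local listener on port 9004 and the /search.pz2 path.
BASE_URL = "http://localhost:9004/search.pz2"


def command_url(command, session=None, **params):
    """Build the URL for one webservice command.

    The parameter names used here (command, session, and any extras such
    as query or id) are assumptions; see the protocol reference.
    """
    query = {"command": command}
    if session is not None:
        query["session"] = session
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


if __name__ == "__main__":
    # Create a session; the response would carry the session ID to reuse.
    print(command_url("init"))
    # Placeholder session ID, for illustration only.
    session_id = "12345"
    # Start a search, then follow its progress and fetch results.
    print(command_url("search", session=session_id, query="computer science"))
    print(command_url("stat", session=session_id))
    print(command_url("show", session=session_id))
    print(command_url("record", session=session_id, id="some-record-id"))
```

Keeping URL construction in one function makes it easy to issue the same commands from server-side code and to pass the session ID on to browser-side code.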
582 <section id="nonstandard">
583 <title>Connecting to non-standard resources</title>
585 Pazpar2 uses Z39.50 as its switchboard language -- i.e. as far as it
586 is concerned, all resources speak Z39.50, or its webservices derivatives,
587 SRU/SRW. It is, however, equipped
588 to handle a broad range of different server behavior, through
589 configurable query mapping and record normalization. If you develop
590 configuration, stylesheets, etc., for a new type of resource, we
591 encourage you to share your work. But you can also use Pazpar2 to
592 connect to hundreds of resources that do not support standard
597 For a growing number of resources, Z39.50 is all you need. Over the
598 last few years, a number of commercial, full-text resources have
599 implemented Z39.50. These can be used through Pazpar2 with little or
600 no effort. Resources that use non-standard record formats will
601 require a bit of XSLT work, but that's all.
605 But what about resources that don't support Z39.50 at all? Some resources might
606 support OpenSearch, private, XML/HTTP-based protocols, or something
607 else entirely. Some databases exist only as web user interfaces and
608 will require screen-scraping. Still others exist only as static
609 files, or perhaps as databases supporting the OAI-PMH protocol.
610 There is hope! Read on.
614 Index Data continues to advocate the support of open standards. We
615 work with database vendors to support standards, so you don't have
616 to worry about programming against non-standard services. We also
617 provide tools (see <ulink
618 url="http://www.indexdata.com/simpleserver">SimpleServer</ulink>)
619 which make it comparatively easy to build gateways against servers
620 with non-standard behavior. Again, we encourage you to share any
621 work you do in this direction.
625 But the bottom line is that working with non-standard resources in
626 metasearching is really, really hard. If you want to build a
627 project with Pazpar2, and you need access to resources with
628 non-standard interfaces, we can help. We run gateways to more than
629 2,000 popular, commercial databases and other resources,
631 to plug them directly into Pazpar2. For a small annual fee per
632 database, we can help you establish connections to your licensed
633 resources. Meanwhile, you can help! If you build your own
634 standards-compliant gateways, host them for others or share the
635 code. And tell your vendors that they can save everybody money and
636 increase the appeal of their resources by supporting standards.
640 There are those who will ask us why we are using Z39.50 as our
641 switchboard language rather than a different protocol. Basically,
642 we believe that Z39.50 is presently the most widely implemented
643 information retrieval protocol that has the level of functionality
644 required to support a good metasearching experience (structured
645 searching, structured, well-defined results). It is also compact and
646 efficient, and there is a very broad range of tools available to
651 <section id="unicode">
652 <title>Unicode Compliance</title>
654 Pazpar2 is Unicode compliant and language and locale aware but relies
655 on character encoding for the targets to be specified correctly if
656 the targets themselves are not UTF-8 based (most aren't).
657 Just a few badly behaving targets can spoil the search experience
658 considerably if for example Greek, Russian or otherwise non 7-bit ASCII
659 search terms are entered. In these cases some targets return
660 records irrelevant to the query, and the result screens will be
661 cluttered with noise.
664 While noise from misbehaving targets cannot be removed, it can
665 be reduced using truly Unicode based ranking. This is an
666 option which is available to the system administrator if ICU
667 support is compiled into Pazpar2, see
668 <xref linkend="installation"/> for details.
671 In addition, the ICU tokenization and normalization rules must
672 be defined in the master configuration file described in
673 <xref linkend="config-server"/>.
677 </chapter> <!-- Using Pazpar2 -->
679 <reference id="reference">
680 <title>Reference</title>
681 <partintro id="reference-introduction">
683 The material in this chapter is drawn directly from the individual
690 <appendix id="license"><title>License</title>
694 Copyright © &copyright-year; Index Data.
698 Pazpar2 is free software; you can redistribute it and/or modify it under
699 the terms of the GNU General Public License as published by the Free
700 Software Foundation; either version 2, or (at your option) any later
705 Pazpar2 is distributed in the hope that it will be useful, but WITHOUT ANY
706 WARRANTY; without even the implied warranty of MERCHANTABILITY or
707 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
712 You should have received a copy of the GNU General Public License
713 along with Pazpar2; see the file LICENSE. If not, write to the
714 Free Software Foundation,
715 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
724 <!-- Keep this comment at the end of the file
729 sgml-minimize-attributes:nil
730 sgml-always-quote-attributes:t
733 sgml-parent-document: nil
734 sgml-local-catalogs: nil
735 sgml-namecase-general:t