1 <?xml version="1.0" standalone="no"?>
2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
3 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"
5 <!ENTITY % local SYSTEM "local.ent">
7 <!ENTITY % entities SYSTEM "entities.ent">
9 <!ENTITY % idcommon SYSTEM "common/common.ent">
14 <title>Pazpar2 - User's Guide and Reference</title>
16 <firstname>Sebastian</firstname><surname>Hammer</surname>
19 <firstname>Adam</firstname><surname>Dickmeiss</surname>
22 <firstname>Marc</firstname><surname>Cromme</surname>
25 <firstname>Jakub</firstname><surname>Skoczen</surname>
28 <firstname>Mike</firstname><surname>Taylor</surname>
31 <firstname>Dennis</firstname><surname>Schafroth</surname>
33 <releaseinfo>&version;</releaseinfo>
<year>&copyright-year;</year>
36 <holder>Index Data</holder>
40 Pazpar2 is a high-performance metasearch engine featuring
41 merging, relevance ranking, record sorting,
43 It is middleware: it has no user interface of its own, but can be
44 configured and controlled by an XML-over-HTTP web-service to provide
45 metasearching functionality behind any user interface.
48 This document is a guide and reference to Pazpar2 version &version;.
53 <imagedata fileref="common/id.png" format="PNG"/>
56 <imagedata fileref="common/id.eps" format="EPS"/>
63 <chapter id="introduction">
64 <title>Introduction</title>
66 <section id="what.pazpar2.is">
67 <title>What Pazpar2 is</title>
69 Pazpar2 is a stand-alone metasearch engine with a web-service API, designed
70 to be used either from a browser-based client (JavaScript, Flash,
72 etc.), from server-side code, or any combination of the two.
73 Pazpar2 is a highly optimized client designed to
74 search many resources in parallel. It implements record merging,
75 relevance-ranking and sorting by arbitrary data content, and facet
76 analysis for browsing purposes. It is designed to be data-model
77 independent, and is capable of working with MARC, DublinCore, or any
78 other <ulink url="&url.xml;">XML</ulink>-structured response format
79 -- <ulink url="&url.xslt;">XSLT</ulink> is used to normalize and extract
80 data from retrieval records for display and analysis. It can be used
81 against any server which supports the
82 <ulink url="&url.z39.50;">Z39.50</ulink>, <ulink url="&url.sru;">SRU/SRW</ulink>
83 or <ulink url="&url.solr;">SOLR</ulink> protocol. Proprietary
84 backend modules can function as connectors between these standard
85 protocols and any non-standard API, including web-site scraping, to
86 support a large number of other protocols.
Additional functionality, such as
user management and attractive displays, is expected to be implemented by
applications that use Pazpar2. Pazpar2 itself is user-interface independent.
92 Its functionality is exposed through a simple XML-based web-service API,
93 designed to be easy to use from an Ajax-enabled browser, Flash
94 animation, Java applet, etc., or from a higher-level server-side language
95 like PHP, Perl or Java. Because session information can be shared between
96 browser-based logic and server-side scripting, there is tremendous
97 flexibility in how you implement application-specific logic on top
101 Once you launch a search in Pazpar2, the operation continues behind the
102 scenes. Pazpar2 connects to servers, carries out searches, and
103 retrieves, deduplicates, and stores results internally. Your application
104 code may periodically inquire about the status of an ongoing operation,
105 and ask to see records or result set facets. Results become
available immediately, and it is easy to build end-user interfaces that
feel extremely responsive, even when searching more than 100 servers
111 Pazpar2 is designed to be highly configurable. Incoming records are
112 normalized to XML/UTF-8, and then further normalized using XSLT to a
113 simple internal representation that is suitable for analysis. By
114 providing XSLT stylesheets for different kinds of result records, you
115 can configure Pazpar2 to work against different kinds of information
116 retrieval servers. Finally, metadata is extracted in a configurable
117 way from this internal record, to support display, merging, ranking,
118 result set facets, and sorting. Pazpar2 is not bound to a specific model
119 of metadata, such as DublinCore or MARC: by providing the right
120 configuration, it can work with any combination of different kinds of data
121 in support of many different applications.
124 Pazpar2 is designed to be efficient and scalable. You can set it up to
125 search several hundred targets in parallel, or you can use it to support
126 hundreds of concurrent users. It is implemented with the same attention
127 to performance and economy that we use in our indexing engines, so that
128 you can focus on building your application without worrying about the
129 details of metasearch logic. You can devote all of your attention to
130 usability and let Pazpar2 do what it does best -- metasearch.
133 Pazpar2 is our attempt to re-think the traditional paradigms for
134 implementing and deploying metasearch logic, with an uncompromising
135 approach to performance, and attempting to make maximum use of the
136 capabilities of modern browsers. The demo user interface that
137 accompanies the distribution is but one example. If you think of new
138 ways of using Pazpar2, we hope you'll share them with us, and if we
can provide assistance with regard to training, design, programming,
140 integration with different backends, hosting, or support, please don't
141 hesitate to contact us. If you'd like to see functionality in Pazpar2
142 that is not there today, please don't hesitate to contact us. It may
143 already be in our development pipeline, or there might be a
144 possibility for you to help out by sponsoring development time or
145 code. Either way, get in touch and we will give you straight answers.
151 Pazpar2 is covered by the GNU General Public License (GPL) version 2.
152 See <xref linkend="license"/> for further information.
156 <section id="connectors">
157 <title>Connectors to non-standard databases</title>
159 If you need to access commercial or open access resources that don't support
160 Z39.50 or SRU, one approach would be to use a tool like <ulink
161 url="&url.simpleserver;">SimpleServer</ulink> to build a
162 gateway. An easier option is to use Index Data's <ulink
163 url="&url.mkc;">MasterKey Connect</ulink>
service, which will expose virtually <emphasis>any</emphasis> resource
through Z39.50/SRU, making it easy to integrate with Pazpar2.
The service is hosted, so all you have to do is let us
know which resources you are interested in, and we operate the gateways,
or Connectors, for you for a low annual charge.
169 Types of resources supported include
170 commercial databases, free online resources, and even local resources;
171 almost anything that can be accessed through a web-facing user
172 interface can be accessed in this way.
173 Contact <email>info@indexdata.com</email> for more information.
174 See <xref linkend="masterkey_connect"/> for an example.
179 <title>A note on the name Pazpar2</title>
The name Pazpar2 derives from three sources. On one hand, it is
182 Index Data's second major piece of software that does parallel
183 searching of Z39.50 targets. On the other, it is a near-homophone
184 of Passpartout, the ever-helpful servant in Jules Verne's novel
185 Around the World in Eighty Days (who helpfully uses the language
of his master). Finally, "passe par tout" means something like
"passes through anything" in French -- in other words, a universal
solution, or, if you like, a MasterKey.
193 <chapter id="installation">
194 <title>Installation</title>
196 The Pazpar2 package includes documentation as well
197 as the Pazpar2 server. The package also includes a simple user
198 interface called "test1", which consists of a single HTML page and a single
199 JavaScript file to illustrate the use of Pazpar2.
202 Pazpar2 depends on the following tools/libraries:
204 <varlistentry><term><ulink url="&url.yaz;">YAZ</ulink></term>
207 The popular Z39.50 toolkit for the C language.
208 YAZ <emphasis>must</emphasis> be compiled with Libxml2/Libxslt support.
212 <varlistentry><term><ulink url="&url.icu;">International
213 Components for Unicode (ICU)</ulink></term>
216 ICU provides Unicode support for non-English languages with
character sets outside the range of 7-bit ASCII, like
218 Greek, Russian, German and French. Pazpar2 uses the ICU
219 Unicode character conversions, Unicode normalization, case
220 folding and other fundamental operations needed in
221 tokenization, normalization and ranking of records.
224 Compiling, linking, and usage of the ICU libraries is optional,
225 but strongly recommended for usage in an international
In order to compile Pazpar2, a C compiler which supports C99 or later
is required.
237 <section id="installation.unix">
238 <title>Installation from source on Unix (including Linux, MacOS, etc.)</title>
240 The latest source code for Pazpar2 is available from
241 <ulink url="&url.pazpar2.download;"/>.
242 Most Unix-based operating systems have the required
243 tools available as binary packages.
244 For example, if Libxml2/libXSLT libraries
245 are already installed as development packages, use these.
249 Ensure that the development libraries and header files are
250 available on your system before compiling Pazpar2. For installation
251 of YAZ, refer to the Installation chapter of the YAZ manual at
252 <ulink url="&url.yaz.install;"/>.
255 Once the dependencies are in place, Pazpar2 can be unpacked and
256 installed as follows:
259 tar xzf pazpar2-VERSION.tar.gz
266 The <literal>make install</literal> will install manpages as well as the
267 Pazpar2 server, <literal>pazpar2</literal>,
268 in PREFIX<literal>/sbin</literal>.
By default, PREFIX is <literal>/usr/local/</literal>. This can be
changed with the configure option <option>--prefix</option>.
274 <section id="installation.win32">
275 <title>Installation from source on Windows</title>
277 Pazpar2 can be built for Windows using
278 <ulink url="&url.vstudio;">Microsoft Visual Studio</ulink>.
The support files for building Pazpar2 on Windows are located in the
<filename>win</filename> directory. The compilation is performed
using the <filename>win/makefile</filename>, which is to be
processed by the NMAKE utility that is part of Visual Studio.
285 Ensure that the development libraries and header files are
available on your system before compiling Pazpar2. For installation
of YAZ, refer to
the Installation chapter of the YAZ manual at
289 <ulink url="&url.yaz.install;"/>.
290 It is easiest if YAZ and Pazpar2 are unpacked in the same
291 directory (side-by-side).
294 The compilation is tuned by editing the makefile of Pazpar2.
295 The process is similar to YAZ. Adjust the various directories
296 <literal>YAZ_DIR</literal>, <literal>ZLIB_DIR</literal>, etc.,
300 Compile Pazpar2 by invoking <application>nmake</application> in
301 the <filename>win</filename> directory.
The binaries resulting from the build are located in the
<filename>bin</filename> directory of the Pazpar2 source
tree, including <filename>pazpar2.exe</filename> and the necessary DLLs.
307 The Windows version of Pazpar2 is a console application. It may
be installed as a Windows Service by passing the option
<literal>-install</literal> to the pazpar2 program. This will
310 register Pazpar2 as a service and use the other options provided
311 in the same invocation. For example:
314 ..\bin\pazpar2 -install -f pazpar2.cfg -l pazpar2.log
316 The Pazpar2 service may now be controlled via the Service Control
317 Panel. It may be unregistered by passing the <literal>-remove</literal>
321 ..\bin\pazpar2 -remove
326 <section id="installation.test1">
327 <title>Installation of test interfaces</title>
329 In this section we show how to make available the set of simple
330 interfaces that are part of the Pazpar2 source package, and which
331 demonstrate some ways to use Pazpar2. (Note that Debian users can
332 save time by just installing the package <literal>pazpar2-test1</literal>.)
335 A web server, such as Apache, must be installed and running on the system.
Start the Pazpar2 daemon using the 'in-source' binary.
On Unix the process is:
343 cp pazpar2.cfg.dist pazpar2.cfg
344 ../src/pazpar2 -f pazpar2.cfg
349 copy pazpar2.cfg.dist pazpar2.cfg
350 ..\bin\pazpar2 -f pazpar2.cfg
352 This will start a Pazpar2 listener on port 9004. It will proxy
353 HTTP requests to port 80 on localhost, which we assume will be the regular
354 HTTP server on the system. Inspect and modify pazpar2.cfg as needed
355 if this is to be changed. The pazpar2.cfg file includes settings from the
356 file <filename>settings/edu.xml</filename>
361 The test UIs are located in <literal>www</literal>. Ensure that this
362 directory is available to the web server by copying
363 <literal>www</literal> to the document root,
364 using Apache's <literal>Alias</literal> directive, or
365 creating a symbolic link: for example, on a Debian or Ubuntu
366 system with Apache2 installed from the standard package, you might
367 make the link as follows:
370 sudo ln -s `pwd`/www /var/www/pazpar2-demo
375 This makes the test applications visible at
376 <ulink url="http://localhost/pazpar2-demo/"/>
but they cannot be run successfully from that URL, as they submit
search requests back to the server from which they were served,
379 and Apache2 doesn't know how to handle them. Instead, the test
380 applications must be accessed from Pazpar2 itself, acting as a
381 proxy to Apache2, at the URL
382 <ulink url="http://localhost:9004/pazpar2-demo/"/>
386 From here, the demo applications can be
387 accessed: <literal>test1</literal>, <literal>test2</literal> and
388 <literal>jsdemo</literal>
389 are pure HTML+JavaScript setups, needing no server-side
391 <literal>demo</literal>
392 requires PHP on the server.
395 If you don't see the test interfaces, check whether they are available
396 on port 80 (i.e. directly from the Apache2 server). If not, the
397 Apache configuration is incorrect.
400 In order to use Apache as frontend for the interface on port 80
401 for public access etc., refer to
402 <xref linkend="installation.apache2proxy"/>.
406 <section id="installation.debian">
407 <title>Installation on Debian GNU/Linux and Ubuntu</title>
409 Index Data provides Debian and Ubuntu packages for Pazpar2.
410 As of February 2010, these
411 are prepared for Debian versions Etch, Lenny and Squeeze; and for
412 Ubuntu versions 8.04 (hardy), 8.10 (intrepid), 9.04 (jaunty) and
413 9.10 (karmic). These packages are available at
414 <ulink url="&url.pazpar2.download.debian;"/> and
415 <ulink url="&url.pazpar2.download.ubuntu;"/>.
419 <section id="installation.apache2proxy">
420 <title>Apache 2 Proxy</title>
424 url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html">
427 which allows Pazpar2 to become a backend to an Apache 2
428 based web service. The Apache 2 proxy must operate in the
429 <emphasis>Reverse</emphasis> Proxy mode.
433 On a Debian based Apache 2 system, the relevant modules can
436 sudo a2enmod proxy_http proxy_balancer
441 Traditionally Pazpar2 interprets URL paths with suffix
442 <literal>/search.pz2</literal>.
445 url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass">
directive of Apache must be used to map a URL path
to the Pazpar2 server (listening port).
The ProxyPass directive takes a prefix rather than
a suffix as URL path. It is important that the JavaScript code
uses the prefix given for it.
460 <example id="installation.apache2proxy.example">
461 <title>Apache 2 proxy configuration</title>
If Pazpar2 is running on port 8004 and the portal in directory
<filename>/myportal/</filename> uses
<filename>search.pz2</filename>, we could use the following
Apache 2 configuration:
469 <IfModule mod_proxy.c>
473 AddDefaultCharset off
478 ProxyPass /myportal/search.pz2 http://localhost:8004/search.pz2
489 <title>Using Pazpar2</title>
491 This chapter provides a general introduction to the use and
492 deployment of Pazpar2.
495 <section id="architecture">
496 <title>Pazpar2 and your systems architecture</title>
498 Pazpar2 is designed to provide asynchronous, behind-the-scenes
499 metasearching functionality to your application, exposing this
500 functionality using a simple webservice API that can be accessed
501 from any number of development environments. In particular, it is
502 possible to combine Pazpar2 either with your server-side dynamic
503 website scripting, with scripting or code running in the browser, or
504 with any combination of the two. Pazpar2 is an excellent tool for
505 building advanced, Ajax-based user interfaces for metasearch
506 functionality, but it isn't a requirement -- you can choose to use
507 Pazpar2 entirely as a backend to your regular server-side scripting.
508 When you do use Pazpar2 in conjunction
509 with browser scripting (JavaScript/Ajax, Flash, applets,
510 etc.), there are special considerations.
514 Pazpar2 implements a simple but efficient HTTP server, and it is
515 designed to interact directly with scripting running in the browser
516 for the best possible performance, and to limit overhead when
517 several browser clients generate numerous webservice requests.
518 However, it is still desirable to use a conventional webserver,
519 such as Apache, to serve up graphics, HTML documents, and
520 server-side scripting. Because the security sandbox environment of
521 most browser-side programming environments only allows communication
522 with the server from which the enclosing HTML page or object
523 originated, Pazpar2 is designed so that it can act as a transparent
524 proxy in front of an existing webserver (see <xref
525 linkend="pazpar2_conf"/> for details).
526 In this mode, all regular
527 HTTP requests are transparently passed through to your webserver,
528 while Pazpar2 only intercepts search-related webservice requests.
532 If you want to expose your combined service on port 80, you can
533 either run your regular webserver on a different port, a different
534 server, or a different IP address associated with the same server.
Pazpar2 can also work behind
a reverse proxy. Refer to <xref linkend="installation.apache2proxy"/>
for more information.
541 This allows your existing HTTP server to operate on port 80 as usual.
542 Pazpar2 can be started on another (internal) port.
546 Sometimes, it may be necessary to implement functionality on your
547 regular webserver that makes use of search results, for example to
548 implement data import functionality, emailing results, history
549 lists, personal citation lists, interlibrary loan functionality,
550 etc. Fortunately, it is simple to exchange information between
551 Pazpar2, your browser scripting, and backend server-side scripting.
552 You can send a session ID and possibly a record ID from your browser
code to your server code, and from there use Pazpar2's webservice API
to access result sets or individual records. You could even 'hide'
all of Pazpar2's functionality behind your own API implemented on
the server-side, and access that from the browser or elsewhere. The
557 possibilities are just about endless.
561 <section id="data_model">
562 <title>Your data model</title>
564 Pazpar2 does not have a preconceived model of what makes up a data
565 model. There are no assumptions that records have specific fields or
566 that they are organized in any particular way. The only assumption
567 is that data comes packaged in a form that the software can work
568 with (presently, that means XML or MARC), and that you can provide
569 the necessary information to massage it into Pazpar2's internal
574 Handling retrieval records in Pazpar2 is a two-step process. First,
575 you decide which data elements of the source record you are
576 interested in, and you specify any desired massaging or combining of
577 elements using an XSLT stylesheet (MARC records are automatically
578 normalized to <ulink url="&url.marcxml;">MARCXML</ulink> before this step).
579 If desired, you can run multiple XSLT stylesheets in series to accomplish
580 this, but the output of the last one should be a representation of the
581 record in a schema that Pazpar2 understands.
585 The intermediate, internal representation of the record looks like
588 <record xmlns="http://www.indexdata.com/pazpar2/1.0"
589 mergekey="title The Shining author King, Stephen">
591 <metadata type="title" rank="2">The Shining</metadata>
593 <metadata type="author">King, Stephen</metadata>
595 <metadata type="kind">ebook</metadata>
596 <!-- ... and so on -->
600 As you can see, there isn't much to it. There are really only a few
601 important elements to this file.
605 Elements should belong to the namespace
606 <literal>http://www.indexdata.com/pazpar2/1.0</literal>.
607 If the root node contains the
608 attribute 'mergekey', then every record that generates the same
609 merge key (normalized for case differences, white space, and
610 truncation) will be joined into a cluster. In other words, you
611 decide how records are merged. If you don't include a merge key,
records are never merged. The 'metadata' elements provide the meat
of the record -- the content. The 'type' attribute is used to
match each element against processing rules that determine what
happens to the data element next. The 'rank' attribute
specifies a multiplier for ranking for this element.
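The merging rule can be illustrated with a small Python sketch. It is not
Pazpar2's implementation: records are modelled as plain dictionaries, the
function names are hypothetical, and normalization is reduced to case
folding and white-space collapsing (the truncation step mentioned above is
left out).

```python
from collections import defaultdict


def normalize_mergekey(key):
    """Fold case and collapse runs of white space (assumed normalization)."""
    return " ".join(key.lower().split())


def cluster_records(records):
    """Group records (dicts) that share a normalized merge key.

    Records without a 'mergekey' entry are never merged, so each one
    ends up in a cluster of its own.
    """
    merged = defaultdict(list)
    singles = []
    for record in records:
        key = record.get("mergekey")
        if key:
            merged[normalize_mergekey(key)].append(record)
        else:
            singles.append([record])
    return list(merged.values()) + singles


records = [
    {"mergekey": "title The Shining author King, Stephen", "kind": "ebook"},
    {"mergekey": "TITLE the  shining AUTHOR king, stephen", "kind": "audio"},
    {"kind": "ebook"},  # no merge key: stays on its own
]
clusters = cluster_records(records)
print([len(c) for c in clusters])  # prints [2, 1]
```

The first two records differ only in case and spacing of the merge key, so
they form one cluster; the third has no merge key and stays separate.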
620 The next processing step is the extraction of metadata from the
621 intermediate representation of the record. This is governed by the
622 'metadata' elements in the 'service' section of the configuration
623 file. See <xref linkend="config-server"/> for details. The metadata
624 in the retrieval record ultimately drives merging, sorting, ranking,
625 the extraction of browse facets, and display, all configurable.
629 Pazpar2 1.6.37 and later also allows already clustered records to
630 be ingested. Suppose a database already clusters for us and we would like
631 to keep that cluster for Pazpar2. In that case we can generate a
632 <literal>cluster</literal> wrapper element that holds individual
633 <literal>record</literal> elements.
636 Cluster record example:
638 <cluster xmlns="http://www.indexdata.com/pazpar2/1.0">
640 <metadata type="title" rank="2">The Shining</metadata>
641 <metadata type="author">King, Stephen</metadata>
642 <metadata type="kind">ebook</metadata>
645 <metadata type="title" rank="2">The Shining</metadata>
646 <metadata type="author">King, Stephen</metadata>
647 <metadata type="kind">audio</metadata>
654 <section id="client">
655 <title>Client development overview</title>
657 You can use Pazpar2 from any environment that allows you to use
webservices. The initial goal of the software was to support
Ajax-based applications, but it is not limited to them.
You can use Pazpar2 from JavaScript, Flash, Java, etc.,
661 on the browser side, and from any development environment on the
662 server side, and you can pass session tokens and record IDs freely
663 around between these environments to build sophisticated applications.
664 Use your imagination.
668 The webservice API of Pazpar2 is described in detail in <xref
669 linkend="pazpar2_protocol"/>.
673 In brief, you use the 'init' command to create a session, a
674 temporary workspace which carries information about the current
675 search. You start a new search using the 'search' command. Once the
676 search has been started, you can follow its progress using the
677 'stat', 'bytarget', 'termlist', or 'show' commands. Detailed records
678 can be fetched using the 'record' command.
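As a rough illustration of the command sequence above, the following Python
sketch only builds the request URLs; it does not contact a server. The
endpoint (port 9004 with the traditional /search.pz2 path) and the
placeholder session value are assumptions made for the example, and the
'session' and 'query' parameter names should be checked against the
protocol reference.

```python
from urllib.parse import urlencode

# Assumed endpoint: the demo listener on port 9004 from the installation
# chapter, with the traditional /search.pz2 path.
BASE_URL = "http://localhost:9004/search.pz2"


def command_url(command, **params):
    """Build the URL for one webservice command (hypothetical helper)."""
    query = {"command": command}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


# A typical sequence; "SESSION" stands for the ID returned by 'init'.
init_url = command_url("init")
search_url = command_url("search", session="SESSION", query="water")
show_url = command_url("show", session="SESSION")
print(init_url)
print(search_url)
print(show_url)
```

A client would request the 'init' URL first, read the session ID from the
response, and then poll with 'stat' or 'show' after starting the search.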
684 <section id="unicode">
685 <title>Unicode Compliance</title>
Pazpar2 is Unicode compliant and language and locale aware, but relies
on the character encoding of the targets being specified correctly if
the targets themselves are not UTF-8 based (most aren't).
Just a few badly behaving targets can spoil the search experience
considerably if, for example, Greek, Russian or other non-ASCII
search terms are entered. In these cases some targets return
693 records irrelevant to the query, and the result screens will be
694 cluttered with noise.
While noise from misbehaving targets cannot be removed, it can
be reduced using truly Unicode-based ranking. This is an
699 option which is available to the system administrator if ICU
700 support is compiled into Pazpar2, see
701 <xref linkend="installation"/> for details.
704 In addition, the ICU tokenization and normalization rules must
705 be defined in the master configuration file described in
706 <xref linkend="config-server"/>.
710 <section id="load_balancing">
711 <title>Load balancing</title>
Just like any web server, Pazpar2 can be load balanced by a standard
hardware or software load balancer as long as session stickiness
is ensured. If you are already running the Apache2 web server in front
of Pazpar2 and use the Apache mod_proxy module to 'relay' client
requests to Pazpar2, this setup can be easily extended to include
load balancing capabilities.
719 To do so you need to enable the
720 <ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html">
723 module in your Apache2 installation.
727 On a Debian based Apache 2 system, the relevant modules can
730 sudo a2enmod proxy_http
735 The mod_proxy_balancer can pass all 'sessionsticky' requests to the
736 same backend worker as long as the requests are marked with the
737 originating worker's ID (called 'route'). If the Pazpar2 serverID is
738 configured (by setting an 'id' attribute on the 'server' element in
739 the Pazpar2 configuration file) Pazpar2 will append it to the
740 'session' element returned during the 'init' in a mod_proxy_balancer
Since the 'session' is then re-sent by the client (for all Pazpar2
requests besides 'init'), the balancer can use the marker to pass
744 the request to the right route. To do so the balancer needs to be
745 configured to inspect the 'session' parameter.
748 <example id="load_balancing.example">
749 <title>Apache 2 load balancing configuration</title>
With four Pazpar2 instances running on the same host, on ports
8004-8007 and with serverIDs pz1, pz2, pz3 and pz4 respectively, we
could use the following Apache 2 configuration to expose a single
Pazpar2 'endpoint' on a standard
(<filename>/pazpar2/search.pz2</filename>) location:
759 AddDefaultCharset off
765 # 'route' has to match the configured pazpar2 server ID
766 <Proxy balancer://pz2cluster>
767 BalancerMember http://localhost:8004 route=pz1
768 BalancerMember http://localhost:8005 route=pz2
769 BalancerMember http://localhost:8006 route=pz3
770 BalancerMember http://localhost:8007 route=pz4
773 # route is resent in the 'session' param which has the form:
# 'sessid.serverid', understandable by mod_proxy_balancer
775 # this is not going to work if the client tampers with the 'session' param
776 ProxyPass /pazpar2/search.pz2 balancer://pz2cluster lbmethod=byrequests stickysession=session nofailover=On
The 'ProxyPass' line sets up a reverse proxy for requests to
'/pazpar2/search.pz2' and delegates all requests to the load balancer
(virtual worker) with name 'pz2cluster'.
Sticky sessions are enabled and implemented using the 'session' parameter.
The 'Proxy' section lists all the servers (real workers) which the
784 load balancer can use.
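To make the 'sessid.serverid' convention concrete, here is a minimal Python
sketch of how a session value can be split into the session ID and the
route. It only illustrates the format described above; the function name
and the sample values are hypothetical, and the actual routing is done by
mod_proxy_balancer, not by code like this.

```python
def split_session(session):
    """Split a 'sessid.serverid' value into (session ID, route).

    The route is None when no server ID was appended.
    """
    sessid, dot, route = session.partition(".")
    return sessid, (route if dot else None)


print(split_session("2044502273.pz3"))  # prints ('2044502273', 'pz3')
print(split_session("2044502273"))      # prints ('2044502273', None)
```

In the configuration above, a request carrying a session value ending in
'.pz3' would be passed to the worker declared with route=pz3.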
791 <section id="relevance_ranking">
792 <title>Relevance ranking</title>
Pazpar2 uses a variant of the term frequency–inverse document frequency
(tf-idf) ranking algorithm.
798 The Tf-part is straightforward to calculate and is based on the
799 documents that Pazpar2 fetches. The idf-part, however, is more tricky
800 since the corpus at hand is ONLY the relevant documents and not
801 irrelevant ones. Pazpar2 does not have the full corpus -- only the
802 documents that match a particular search.
Computation of the Tf-part is based on the normalized documents.
The length, the position and terms are thus normalized at this point.
Also, the computation is performed for each document received from the
target, before merging takes place. The result of a TF computation is
added to the TF total of a cluster. Thus, if a document occurs twice,
then the TF-part is doubled. That, however, can be adjusted, because the
TF-part may be divided by the number of documents in a cluster.
The algorithm used by Pazpar2 has two phases. In phase one
Pazpar2 computes a tf-array. This is done as records are
fetched from the database. This phase uses the rank weight
<literal>w</literal> and the rank tweaks <literal>lead</literal>,
<literal>follow</literal> and <literal>length</literal>.
foreach document in a cluster
  for i = 1, .. N: (each term)
    foreach pos (where term i occurs in field)
      // w is configured weight for field
      // pos is position of term in field
      w[i] += w / (1 + log2(1+lead*pos))
      w[i] += w[i] * follow / (1+log2(d))
    // length: length of field (number of terms that is)
    if (length strategy is "linear")
      tf[i] += w[i] / length;
    else if (length strategy is "log")
      tf[i] += w[i] / log2(length);
    else if (length strategy is "none")
      tf[i] += w[i];
842 In phase two, the idf-array is computed and the final score
843 is computed. This is done for each cluster as part of each show command.
844 The rank tweak <literal>cluster</literal> is in use here.
// dococcur[i]: number of records where term occurs
// doctotal: number of records
for i = 1, .., N (each term)
  idf[i] = log(1 + doctotal / dococcur[i])

for i = 1, .., N: (each term)
  if (cluster is "yes")
    tf[i] = tf[i] / cluster_size;
  relevance += 100000 * tf[i] / idf[i];
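The two building blocks of the pseudocode can be tried out with a short
Python sketch. It is an illustration under stated assumptions, not the
Pazpar2 implementation: it handles one term in one field, leaves out the
<literal>follow</literal> tweak, assumes term positions count from zero,
and uses hypothetical function names. It also stops short of combining the
two values into a final relevance score.

```python
import math


def term_weight(positions, length, weight=1.0, lead=0.0, strategy="linear"):
    """Phase one for a single term in a single field.

    positions: positions at which the term occurs (assumed zero-based)
    length:    number of terms in the field
    weight:    configured rank weight of the field (w in the pseudocode)
    lead:      tweak that damps occurrences late in the field
    strategy:  length normalization, one of "linear", "log" or "none"
    """
    w = 0.0
    for pos in positions:
        w += weight / (1 + math.log2(1 + lead * pos))
    if strategy == "linear":
        return w / length
    if strategy == "log":
        # As in the pseudocode; only meaningful for fields longer than one term.
        return w / math.log2(length)
    if strategy == "none":
        return w
    raise ValueError("unknown length strategy: " + strategy)


def idf(doctotal, dococcur):
    """Phase two idf for a term occurring in dococcur of doctotal records."""
    return math.log(1 + doctotal / dococcur)


# A term at positions 0 and 3 of a four-term field with weight 2.
print(term_weight([0, 3], 4, weight=2.0, lead=1.0))
print(term_weight([0, 3], 4, weight=2.0, lead=0.0))
print(idf(10, 5))
```

With <literal>lead</literal> set to 0 every occurrence contributes the full
field weight; a positive <literal>lead</literal> makes the occurrence at
position 3 count less than the one at position 0.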
862 For controlling the ranking parameters, refer to the
863 <link linkend="service-rank">rank</link> element of the
865 Refer to the <link linkend="metadata-rank">rank</link> attribute
866 of the metadata element for how to control ranking for individual
869 </section> <!-- relevance_ranking -->
871 <section id="masterkey_connect">
872 <title>Pazpar2 and MasterKey Connect</title>
874 MasterKey Connect is a hosted connector, or gateway, service that exposes
875 whatever searchable resources you need. Since the service exposes all
876 resources using Z39.50 (or SRU), it is easy to set up Pazpar2 to use the
877 service. In particular, since all connectors expose basically the same core
878 behavior, it is a good use of Pazpar2's mechanism for managing default
879 behaviors across similar databases.
882 After installation of Pazpar2, the directory
883 <filename>/etc/pazpar2/settings/mkc</filename> (location may
884 vary depending on installation preferences) contains an example setup that
885 searches two different resources through a MasterKey Connect demo account.
886 The file mkc.xml contains default parameters that will work for all
887 MasterKey Connect resources (if you decide to become a customer of the
888 service, you will substitute your own account credentials for
889 the guest/guest). The other files contain specific information about
890 a couple of demonstration resources.
To play with the demo, just create a symlink from
<filename>/etc/pazpar2/services-enabled/default.xml</filename>
to <filename>/etc/pazpar2/services-available/mkc.xml</filename>,
and restart Pazpar2. You should now be able to search the two demo
898 resources using JSDemo or any user interface of your choice.
899 If you are interested in learning more about MasterKey Connect, or to
900 try out the service for free against your favorite online resource, just
901 contact us at <email>info@indexdata.com</email>.
905 </chapter> <!-- Using Pazpar2 -->
907 <reference id="reference">
908 <title>Reference</title>
909 <partintro id="reference-introduction">
911 The material in this chapter is drawn directly from the individual
918 <appendix id="license">
919 <title>License</title>
Copyright &copy; &copyright-year; Index Data.
927 Pazpar2 is free software; you can redistribute it and/or modify it under
928 the terms of the GNU General Public License as published by the Free
929 Software Foundation; either version 2, or (at your option) any later
934 Pazpar2 is distributed in the hope that it will be useful, but WITHOUT ANY
935 WARRANTY; without even the implied warranty of MERCHANTABILITY or
936 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
941 You should have received a copy of the GNU General Public License
942 along with Pazpar2; see the file LICENSE. If not, write to the
943 Free Software Foundation,
944 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
953 <!-- Keep this comment at the end of the file