1 <?xml version="1.0" standalone="no"?>
2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
3 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"
5 <!ENTITY % local SYSTEM "local.ent">
7 <!ENTITY % entities SYSTEM "entities.ent">
9 <!ENTITY % idcommon SYSTEM "common/common.ent">
14 <title>Pazpar2 - User's Guide and Reference</title>
16 <firstname>Sebastian</firstname><surname>Hammer</surname>
19 <firstname>Adam</firstname><surname>Dickmeiss</surname>
22 <firstname>Marc</firstname><surname>Cromme</surname>
25 <firstname>Jakub</firstname><surname>Skoczen</surname>
28 <firstname>Mike</firstname><surname>Taylor</surname>
31 <firstname>Dennis</firstname><surname>Schafroth</surname>
33 <releaseinfo>&version;</releaseinfo>
<year>&copyright-year;</year>
36 <holder>Index Data</holder>
40 Pazpar2 is a high-performance metasearch engine featuring
41 merging, relevance ranking, record sorting,
43 It is middleware: it has no user interface of its own, but can be
44 configured and controlled by an XML-over-HTTP web-service to provide
45 metasearching functionality behind any user interface.
48 This document is a guide and reference to Pazpar2 version &version;.
53 <imagedata fileref="common/id.png" format="PNG"/>
56 <imagedata fileref="common/id.eps" format="EPS"/>
63 <chapter id="introduction">
64 <title>Introduction</title>
66 <section id="what.pazpar2.is">
67 <title>What Pazpar2 is</title>
69 Pazpar2 is a stand-alone metasearch engine with a web-service API, designed
70 to be used either from a browser-based client (JavaScript, Flash,
72 etc.), from server-side code, or any combination of the two.
73 Pazpar2 is a highly optimized client designed to
74 search many resources in parallel. It implements record merging,
75 relevance-ranking and sorting by arbitrary data content, and facet
76 analysis for browsing purposes. It is designed to be data-model
77 independent, and is capable of working with MARC, DublinCore, or any
78 other <ulink url="&url.xml;">XML</ulink>-structured response format
79 -- <ulink url="&url.xslt;">XSLT</ulink> is used to normalize and extract
80 data from retrieval records for display and analysis. It can be used
81 against any server which supports the
82 <ulink url="&url.z39.50;">Z39.50</ulink>,
83 <ulink url="&url.sru;">SRU/SRW</ulink>
84 or <ulink url="&url.solr;">Solr</ulink> protocol. Proprietary
85 backend modules can function as connectors between these standard
86 protocols and any non-standard API, including web-site scraping, to
87 support a large number of other protocols.
90 Additional functionality such as
91 user management and attractive displays are expected to be implemented by
92 applications that use Pazpar2. Pazpar2 itself is user-interface independent.
93 Its functionality is exposed through a simple XML-based web-service API,
94 designed to be easy to use from an Ajax-enabled browser, Flash
95 animation, Java applet, etc., or from a higher-level server-side language
96 like PHP, Perl or Java. Because session information can be shared between
97 browser-based logic and server-side scripting, there is tremendous
98 flexibility in how you implement application-specific logic on top
102 Once you launch a search in Pazpar2, the operation continues behind the
103 scenes. Pazpar2 connects to servers, carries out searches, and
104 retrieves, deduplicates, and stores results internally. Your application
105 code may periodically inquire about the status of an ongoing operation,
106 and ask to see records or result set facets. Results become
available immediately, and it is easy to build end-user interfaces that
108 feel extremely responsive, even when searching more than 100 servers
112 Pazpar2 is designed to be highly configurable. Incoming records are
113 normalized to XML/UTF-8, and then further normalized using XSLT to a
114 simple internal representation that is suitable for analysis. By
115 providing XSLT stylesheets for different kinds of result records, you
116 can configure Pazpar2 to work against different kinds of information
117 retrieval servers. Finally, metadata is extracted in a configurable
118 way from this internal record, to support display, merging, ranking,
119 result set facets, and sorting. Pazpar2 is not bound to a specific model
120 of metadata, such as DublinCore or MARC: by providing the right
121 configuration, it can work with any combination of different kinds of data
122 in support of many different applications.
125 Pazpar2 is designed to be efficient and scalable. You can set it up to
126 search several hundred targets in parallel, or you can use it to support
127 hundreds of concurrent users. It is implemented with the same attention
128 to performance and economy that we use in our indexing engines, so that
129 you can focus on building your application without worrying about the
130 details of metasearch logic. You can devote all of your attention to
131 usability and let Pazpar2 do what it does best -- metasearch.
134 Pazpar2 is our attempt to re-think the traditional paradigms for
135 implementing and deploying metasearch logic, with an uncompromising
136 approach to performance, and attempting to make maximum use of the
137 capabilities of modern browsers. The demo user interface that
138 accompanies the distribution is but one example. If you think of new
139 ways of using Pazpar2, we hope you'll share them with us, and if we
can provide assistance with regard to training, design, programming,
141 integration with different backends, hosting, or support, please don't
142 hesitate to contact us. If you'd like to see functionality in Pazpar2
143 that is not there today, please don't hesitate to contact us. It may
144 already be in our development pipeline, or there might be a
145 possibility for you to help out by sponsoring development time or
146 code. Either way, get in touch and we will give you straight answers.
152 Pazpar2 is covered by the GNU General Public License (GPL) version 2.
153 See <xref linkend="license"/> for further information.
157 <section id="connectors">
158 <title>Connectors to non-standard databases</title>
160 If you need to access commercial or open access resources that don't support
161 Z39.50 or SRU, one approach would be to use a tool like <ulink
162 url="&url.simpleserver;">SimpleServer</ulink> to build a
163 gateway. An easier option is to use Index Data's <ulink
164 url="&url.mkc;">MasterKey Connect</ulink>
165 service, which will expose virtually <emphasis>any</emphasis> resource
through Z39.50/SRU, making it easy to integrate with Pazpar2.
The service is hosted, so all you have to do is let us
know which resources you are interested in, and we operate the gateways,
or Connectors, for you for a low annual charge.
170 Types of resources supported include
171 commercial databases, free online resources, and even local resources;
172 almost anything that can be accessed through a web-facing user
173 interface can be accessed in this way.
174 Contact <email>info@indexdata.com</email> for more information.
175 See <xref linkend="masterkey_connect"/> for an example.
180 <title>A note on the name Pazpar2</title>
The name Pazpar2 derives from three sources. On one hand, it is
183 Index Data's second major piece of software that does parallel
184 searching of Z39.50 targets. On the other, it is a near-homophone
185 of Passpartout, the ever-helpful servant in Jules Verne's novel
186 Around the World in Eighty Days (who helpfully uses the language
of his master). Finally, "passe par tout" means something like
"passes through anything" in French -- in other words, a universal
189 solution, or if you like a MasterKey.
194 <chapter id="installation">
195 <title>Installation</title>
197 The Pazpar2 package includes documentation as well
198 as the Pazpar2 server. The package also includes a simple user
199 interface called "test1", which consists of a single HTML page and a single
200 JavaScript file to illustrate the use of Pazpar2.
203 Pazpar2 depends on the following tools/libraries:
205 <varlistentry><term><ulink url="&url.yaz;">YAZ</ulink></term>
208 The popular Z39.50 toolkit for the C language.
209 YAZ <emphasis>must</emphasis> be compiled with
210 <ulink url="&url.libxml2;">Libxml2</ulink>/<ulink url="&url.libxslt;">Libxslt</ulink> support.
213 It is highly recommended that YAZ is also compiled with
214 <ulink url="&url.icu;">ICU</ulink> support.
In order to compile Pazpar2, a C compiler which supports C99 or later
is required.
225 <section id="installation.unix">
226 <title>Installation from source on Unix (including Linux, MacOS, etc.)</title>
228 The latest source code for Pazpar2 is available from
229 <ulink url="&url.pazpar2.download;"/>.
230 Most Unix-based operating systems have the required
231 tools available as binary packages.
232 For example, if Libxml2/libXSLT libraries
233 are already installed as development packages, use these.
237 Ensure that the development libraries and header files are
238 available on your system before compiling Pazpar2. For installation
239 of YAZ, refer to the Installation chapter of the YAZ manual at
240 <ulink url="&url.yaz.install;"/>.
243 Once the dependencies are in place, Pazpar2 can be unpacked and
244 installed as follows:
247 tar xzf pazpar2-VERSION.tar.gz
254 The <literal>make install</literal> will install manpages as well as the
255 Pazpar2 server, <literal>pazpar2</literal>,
256 in PREFIX<literal>/sbin</literal>.
By default, PREFIX is <literal>/usr/local/</literal>. This can be
changed with the configure option <option>--prefix</option>.
262 <section id="installation.win32">
263 <title>Installation from source on Windows</title>
265 Pazpar2 can be built for Windows using
266 <ulink url="&url.vstudio;">Microsoft Visual Studio</ulink>.
The support files for building Pazpar2 on Windows are located in the
<filename>win</filename> directory. The compilation is performed
using the <filename>win/makefile</filename>, which is to be
processed by the NMAKE utility that is part of Visual Studio.
273 Ensure that the development libraries and header files are
available on your system before compiling Pazpar2. For installation
of YAZ, refer to
the Installation chapter of the YAZ manual at
277 <ulink url="&url.yaz.install;"/>.
278 It is easiest if YAZ and Pazpar2 are unpacked in the same
279 directory (side-by-side).
282 The compilation is tuned by editing the makefile of Pazpar2.
283 The process is similar to YAZ. Adjust the various directories
284 <literal>YAZ_DIR</literal>, <literal>ZLIB_DIR</literal>, etc.,
288 Compile Pazpar2 by invoking <application>nmake</application> in
289 the <filename>win</filename> directory.
290 The resulting binaries of the build process are located in the
<filename>bin</filename> directory of the Pazpar2 source
292 tree - including the <filename>pazpar2.exe</filename> and necessary DLLs.
295 The Windows version of Pazpar2 is a console application. It may
296 be installed as a Windows Service by adding option
297 <literal>-install</literal> for the pazpar2 program. This will
298 register Pazpar2 as a service and use the other options provided
299 in the same invocation. For example:
302 ..\bin\pazpar2 -install -f pazpar2.cfg -l pazpar2.log
304 The Pazpar2 service may now be controlled via the Service Control
305 Panel. It may be unregistered by passing the <literal>-remove</literal>
309 ..\bin\pazpar2 -remove
314 <section id="installation.test1">
315 <title>Installation of test interfaces</title>
317 In this section we show how to make available the set of simple
318 interfaces that are part of the Pazpar2 source package, and which
319 demonstrate some ways to use Pazpar2. (Note that Debian users can
320 save time by just installing the package <literal>pazpar2-test1</literal>.)
323 A web server, such as Apache, must be installed and running on the system.
327 Start the Pazpar2 daemon using the 'in-source' binary of the Pazpar2
328 daemon. On Unix the process is:
331 cp pazpar2.cfg.dist pazpar2.cfg
332 ../src/pazpar2 -f pazpar2.cfg
337 copy pazpar2.cfg.dist pazpar2.cfg
338 ..\bin\pazpar2 -f pazpar2.cfg
340 This will start a Pazpar2 listener on port 9004. It will proxy
341 HTTP requests to port 80 on localhost, which we assume will be the regular
342 HTTP server on the system. Inspect and modify pazpar2.cfg as needed
343 if this is to be changed. The pazpar2.cfg file includes settings from the
344 file <filename>settings/edu.xml</filename>
349 The test UIs are located in <literal>www</literal>. Ensure that this
350 directory is available to the web server by copying
351 <literal>www</literal> to the document root,
352 using Apache's <literal>Alias</literal> directive, or
353 creating a symbolic link: for example, on a Debian or Ubuntu
354 system with Apache2 installed from the standard package, you might
355 make the link as follows:
358 sudo ln -s `pwd`/www /var/www/pazpar2-demo
363 This makes the test applications visible at
364 <ulink url="http://localhost/pazpar2-demo/"/>
but they cannot be run successfully from that URL, as they submit
search requests back to the server from which they were served,
367 and Apache2 doesn't know how to handle them. Instead, the test
368 applications must be accessed from Pazpar2 itself, acting as a
369 proxy to Apache2, at the URL
370 <ulink url="http://localhost:9004/pazpar2-demo/"/>
374 From here, the demo applications can be
375 accessed: <literal>test1</literal>, <literal>test2</literal> and
376 <literal>jsdemo</literal>
377 are pure HTML+JavaScript setups, needing no server-side
379 <literal>demo</literal>
380 requires PHP on the server.
383 If you don't see the test interfaces, check whether they are available
384 on port 80 (i.e. directly from the Apache2 server). If not, the
385 Apache configuration is incorrect.
388 In order to use Apache as frontend for the interface on port 80
389 for public access etc., refer to
390 <xref linkend="installation.apache2proxy"/>.
394 <section id="installation.debian">
395 <title>Installation on Debian GNU/Linux and Ubuntu</title>
397 Index Data provides Debian and Ubuntu packages for Pazpar2 and YAZ.
398 Refer to these directories:
399 <ulink url="&url.pazpar2.download;debian/"/> and
400 <ulink url="&url.pazpar2.download;ubuntu/"/>.
404 <section id="installation.centos">
405 <title>Installation on RedHat / CentOS</title>
407 Index Data provides CentOS packages for Pazpar2 and YAZ.
409 <ulink url="&url.pazpar2.download;redhat/centos"/> for
414 <section id="installation.apache2proxy">
415 <title>Apache 2 Proxy</title>
419 url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html">
422 which allows Pazpar2 to become a backend to an Apache 2
423 based web service. The Apache 2 proxy must operate in the
424 <emphasis>Reverse</emphasis> Proxy mode.
428 On a Debian based Apache 2 system, the relevant modules can
431 sudo a2enmod proxy_http proxy_balancer
436 Traditionally Pazpar2 interprets URL paths with suffix
437 <literal>/search.pz2</literal>.
440 url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass">
directive of Apache must be used to map a URL path
to the Pazpar2 server (listening port).
449 The ProxyPass directive takes a prefix rather than
a suffix as URL path. It is important that the JavaScript code
451 uses the prefix given for it.
455 <example id="installation.apache2proxy.example">
456 <title>Apache 2 proxy configuration</title>
If Pazpar2 is running on port 8004 and the portal uses
<filename>search.pz2</filename> within the directory
<filename>/myportal/</filename>, we could use the following
Apache 2 configuration:
464 <IfModule mod_proxy.c>
468 AddDefaultCharset off
473 ProxyPass /myportal/search.pz2 http://localhost:8004/search.pz2
484 <title>Using Pazpar2</title>
486 This chapter provides a general introduction to the use and
487 deployment of Pazpar2.
490 <section id="architecture">
491 <title>Pazpar2 and your systems architecture</title>
493 Pazpar2 is designed to provide asynchronous, behind-the-scenes
494 metasearching functionality to your application, exposing this
495 functionality using a simple webservice API that can be accessed
496 from any number of development environments. In particular, it is
497 possible to combine Pazpar2 either with your server-side dynamic
498 website scripting, with scripting or code running in the browser, or
499 with any combination of the two. Pazpar2 is an excellent tool for
500 building advanced, Ajax-based user interfaces for metasearch
501 functionality, but it isn't a requirement -- you can choose to use
502 Pazpar2 entirely as a backend to your regular server-side scripting.
503 When you do use Pazpar2 in conjunction
504 with browser scripting (JavaScript/Ajax, Flash, applets,
505 etc.), there are special considerations.
509 Pazpar2 implements a simple but efficient HTTP server, and it is
510 designed to interact directly with scripting running in the browser
511 for the best possible performance, and to limit overhead when
512 several browser clients generate numerous webservice requests.
513 However, it is still desirable to use a conventional webserver,
514 such as Apache, to serve up graphics, HTML documents, and
515 server-side scripting. Because the security sandbox environment of
516 most browser-side programming environments only allows communication
517 with the server from which the enclosing HTML page or object
518 originated, Pazpar2 is designed so that it can act as a transparent
519 proxy in front of an existing webserver (see <xref
520 linkend="pazpar2_conf"/> for details).
521 In this mode, all regular
522 HTTP requests are transparently passed through to your webserver,
523 while Pazpar2 only intercepts search-related webservice requests.
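The split between intercepted and pass-through requests can be pictured with a small Python sketch. It is only an illustration of the rule described above, not Pazpar2's actual implementation: the function names are hypothetical, and the test on the <literal>/search.pz2</literal> path suffix is an assumption taken from the convention used elsewhere in this guide.

```python
def handled_by_pazpar2(path: str, suffix: str = "/search.pz2") -> bool:
    """Return True when the request would be answered by Pazpar2 itself.

    Assumption: a request counts as search-related when its path
    (ignoring any query string) ends with the given suffix.
    """
    return path.split("?", 1)[0].endswith(suffix)


def route(path: str) -> str:
    """Name the component that serves a request in transparent-proxy mode."""
    return "pazpar2" if handled_by_pazpar2(path) else "webserver"
```

With this rule, a request for <literal>/pazpar2-demo/search.pz2?command=init</literal> stays with Pazpar2, while a request for <literal>/pazpar2-demo/index.html</literal> is passed on to the regular webserver.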
527 If you want to expose your combined service on port 80, you can
528 either run your regular webserver on a different port, a different
529 server, or a different IP address associated with the same server.
533 Pazpar2 can also work behind
a reverse proxy. Refer to <xref linkend="installation.apache2proxy"/>
535 for more information.
536 This allows your existing HTTP server to operate on port 80 as usual.
537 Pazpar2 can be started on another (internal) port.
541 Sometimes, it may be necessary to implement functionality on your
542 regular webserver that makes use of search results, for example to
543 implement data import functionality, emailing results, history
544 lists, personal citation lists, interlibrary loan functionality,
545 etc. Fortunately, it is simple to exchange information between
546 Pazpar2, your browser scripting, and backend server-side scripting.
547 You can send a session ID and possibly a record ID from your browser
code to your server code, and from there use Pazpar2's webservice API
to access result sets or individual records. You could even 'hide'
all of Pazpar2's functionality behind your own API implemented on
551 the server-side, and access that from the browser or elsewhere. The
552 possibilities are just about endless.
556 <section id="data_model">
557 <title>Your data model</title>
559 Pazpar2 does not have a preconceived model of what makes up a data
560 model. There are no assumptions that records have specific fields or
561 that they are organized in any particular way. The only assumption
562 is that data comes packaged in a form that the software can work
563 with (presently, that means XML or MARC), and that you can provide
564 the necessary information to massage it into Pazpar2's internal
569 Handling retrieval records in Pazpar2 is a two-step process. First,
570 you decide which data elements of the source record you are
571 interested in, and you specify any desired massaging or combining of
572 elements using an XSLT stylesheet (MARC records are automatically
573 normalized to <ulink url="&url.marcxml;">MARCXML</ulink> before this step).
574 If desired, you can run multiple XSLT stylesheets in series to accomplish
575 this, but the output of the last one should be a representation of the
576 record in a schema that Pazpar2 understands.
580 The intermediate, internal representation of the record looks like
583 <record xmlns="http://www.indexdata.com/pazpar2/1.0"
584 mergekey="title The Shining author King, Stephen">
586 <metadata type="title" rank="2">The Shining</metadata>
588 <metadata type="author">King, Stephen</metadata>
590 <metadata type="kind">ebook</metadata>
591 <!-- ... and so on -->
595 As you can see, there isn't much to it. There are really only a few
596 important elements to this file.
600 Elements should belong to the namespace
601 <literal>http://www.indexdata.com/pazpar2/1.0</literal>.
602 If the root node contains the
603 attribute 'mergekey', then every record that generates the same
604 merge key (normalized for case differences, white space, and
605 truncation) will be joined into a cluster. In other words, you
606 decide how records are merged. If you don't include a merge key,
607 records are never merged. The 'metadata' elements provide the meat
of the elements -- the content. The 'type' attribute is used to
match each element against processing rules that determine what
happens to the data element next. The 'rank' attribute
specifies a multiplier for ranking for this element.
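To make the role of the merge key concrete, here is a minimal Python sketch of merge-key clustering. It is an illustration under stated assumptions rather than Pazpar2's own code: the function names are hypothetical, the normalization covers only case and white space (truncation is not modelled), and records are reduced to a merge key plus an opaque payload.

```python
from collections import defaultdict


def normalize_mergekey(key: str) -> str:
    """Fold case and collapse runs of white space (assumed normalization)."""
    return " ".join(key.lower().split())


def cluster_records(records):
    """Group (mergekey, payload) pairs into clusters.

    Records that share a normalized merge key end up in one cluster;
    records without a merge key are never merged.
    """
    clusters = defaultdict(list)
    for index, (mergekey, payload) in enumerate(records):
        if mergekey:
            clusters[normalize_mergekey(mergekey)].append(payload)
        else:
            # No merge key: give the record a slot of its own.
            clusters[("unmerged", index)].append(payload)
    return list(clusters.values())
```

Two records whose merge keys differ only in capitalization or spacing would thus be joined, while two records without any merge key stay separate.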
615 The next processing step is the extraction of metadata from the
616 intermediate representation of the record. This is governed by the
617 'metadata' elements in the 'service' section of the configuration
618 file. See <xref linkend="config-server"/> for details. The metadata
619 in the retrieval record ultimately drives merging, sorting, ranking,
620 the extraction of browse facets, and display, all configurable.
624 Pazpar2 1.6.37 and later also allows already clustered records to
625 be ingested. Suppose a database already clusters for us and we would like
626 to keep that cluster for Pazpar2. In that case we can generate a
627 <literal>cluster</literal> wrapper element that holds individual
628 <literal>record</literal> elements.
631 Cluster record example:
633 <cluster xmlns="http://www.indexdata.com/pazpar2/1.0">
635 <metadata type="title" rank="2">The Shining</metadata>
636 <metadata type="author">King, Stephen</metadata>
637 <metadata type="kind">ebook</metadata>
640 <metadata type="title" rank="2">The Shining</metadata>
641 <metadata type="author">King, Stephen</metadata>
642 <metadata type="kind">audio</metadata>
649 <section id="client">
650 <title>Client development overview</title>
652 You can use Pazpar2 from any environment that allows you to use
653 webservices. The initial goal of the software was to support
654 Ajax-based applications, but there literally are no limits to what
you can do. You can use Pazpar2 from JavaScript, Flash, Java, etc.,
656 on the browser side, and from any development environment on the
657 server side, and you can pass session tokens and record IDs freely
658 around between these environments to build sophisticated applications.
659 Use your imagination.
663 The webservice API of Pazpar2 is described in detail in <xref
664 linkend="pazpar2_protocol"/>.
668 In brief, you use the 'init' command to create a session, a
669 temporary workspace which carries information about the current
670 search. You start a new search using the 'search' command. Once the
671 search has been started, you can follow its progress using the
672 'stat', 'bytarget', 'termlist', or 'show' commands. Detailed records
673 can be fetched using the 'record' command.
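As a rough illustration of that command sequence, the following Python sketch builds request URLs and reads the session ID out of an 'init' response. It is a sketch under assumptions, not a reference client: the helper names are hypothetical, the base URL reuses the port 9004 listener from the installation chapter, and the parameter names 'session' and 'query' should be checked against the protocol reference before use.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def command_url(base: str, command: str, **params: str) -> str:
    """Build a webservice URL of the form BASE?command=NAME&key=value."""
    query = urlencode({"command": command, **params})
    return f"{base}?{query}"


def session_from_init(response_xml: str) -> str:
    """Extract the session ID from the XML returned by the 'init' command.

    Assumption: the response carries the ID in a 'session' child element.
    """
    root = ET.fromstring(response_xml)
    session = root.findtext("session")
    if session is None:
        raise ValueError("init response carries no session element")
    return session


# Hypothetical endpoint; adjust host, port and path to your installation.
BASE = "http://localhost:9004/search.pz2"
init_url = command_url(BASE, "init")
```

A client would fetch <literal>init_url</literal>, pass the response body to <literal>session_from_init</literal>, and then include the returned session ID in every later call, for example <literal>command_url(BASE, "search", session=sid, query="water")</literal>, followed by repeated 'stat' or 'show' requests while the search is running.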
679 <section id="unicode">
680 <title>Unicode Compliance</title>
682 Pazpar2 is Unicode compliant and language and locale aware but relies
683 on character encoding for the targets to be specified correctly if
684 the targets themselves are not UTF-8 based (most aren't).
Just a few badly behaving targets can spoil the search experience
considerably if, for example, Greek, Russian or other non-ASCII
687 search terms are entered. In these cases some targets return
688 records irrelevant to the query, and the result screens will be
689 cluttered with noise.
While noise from misbehaving targets cannot be removed, it can
693 be reduced using truly Unicode based ranking. This is an
694 option which is available to the system administrator if ICU
695 support is compiled into YAZ, see
696 <xref linkend="installation"/> for details.
699 In addition, the ICU tokenization and normalization rules must
700 be defined in the master configuration file described in
701 <xref linkend="config-server"/>.
705 <section id="load_balancing">
706 <title>Load balancing</title>
Just like any web server, Pazpar2 can be load balanced by a standard
hardware or software load balancer as long as session stickiness
is ensured. If you are already running the Apache2 web server in front
of Pazpar2 and use the Apache mod_proxy module to 'relay' client
requests to Pazpar2, this setup can easily be extended to include
load balancing capabilities.
714 To do so you need to enable the
715 <ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html">
718 module in your Apache2 installation.
722 On a Debian based Apache 2 system, the relevant modules can
725 sudo a2enmod proxy_http
730 The mod_proxy_balancer can pass all 'sessionsticky' requests to the
731 same backend worker as long as the requests are marked with the
732 originating worker's ID (called 'route'). If the Pazpar2 serverID is
733 configured (by setting an 'id' attribute on the 'server' element in
734 the Pazpar2 configuration file) Pazpar2 will append it to the
735 'session' element returned during the 'init' in a mod_proxy_balancer
Since the 'session' is then re-sent by the client (for all Pazpar2
requests besides 'init'), the balancer can use the marker to pass
739 the request to the right route. To do so the balancer needs to be
740 configured to inspect the 'session' parameter.
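The routing decision the balancer makes can be sketched in a few lines of Python. This is only an illustration of the idea, not how mod_proxy_balancer is implemented: the function names are hypothetical, and the sketch assumes the 'session' value has the form 'sessid.serverid', with the server ID following the first dot.

```python
def route_from_session(session: str):
    """Return the server ID appended to a session value, or None if absent."""
    sessid, dot, server_id = session.partition(".")
    return server_id if dot and server_id else None


def pick_backend(session: str, members: dict, default: str) -> str:
    """Choose the backend whose route matches the session's server ID.

    'members' maps route names to backend URLs; 'default' is used when
    the session carries no known route.
    """
    return members.get(route_from_session(session), default)
```

For instance, with members mapping 'pz1' to 'pz4' onto four local ports, a session value ending in '.pz3' would always be sent to the third instance, which is what keeps a session on the worker that created it.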
743 <example id="load_balancing.example">
744 <title>Apache 2 load balancing configuration</title>
With four Pazpar2 instances running on the same host, on ports
8004-8007 and with server IDs pz1, pz2, pz3 and pz4 respectively, we
could use the following Apache 2 configuration to expose a single
Pazpar2 'endpoint' at a standard
(<filename>/pazpar2/search.pz2</filename>) location:
754 AddDefaultCharset off
# 'route' has to match the configured pazpar2 server ID
<Proxy balancer://pz2cluster>
  BalancerMember http://localhost:8004 route=pz1
  BalancerMember http://localhost:8005 route=pz2
  BalancerMember http://localhost:8006 route=pz3
  BalancerMember http://localhost:8007 route=pz4
</Proxy>

# route is resent in the 'session' param which has the form:
# 'sessid.serverid', understandable by mod_proxy_balancer
# this is not going to work if the client tampers with the 'session' param
ProxyPass /pazpar2/search.pz2 balancer://pz2cluster lbmethod=byrequests stickysession=session nofailover=On
The 'ProxyPass' line sets up a reverse proxy for requests to
775 ‘/pazpar2/search.pz2’ and delegates all requests to the load balancer
776 (virtual worker) with name ‘pz2cluster’.
777 Sticky sessions are enabled and implemented using the ‘session’ parameter.
778 The ‘Proxy’ section lists all the servers (real workers) which the
779 load balancer can use.
786 <section id="relevance_ranking">
787 <title>Relevance ranking</title>
Pazpar2 uses a variant of the term frequency–inverse document frequency
(Tf-idf) ranking algorithm.
793 The Tf-part is straightforward to calculate and is based on the
794 documents that Pazpar2 fetches. The idf-part, however, is more tricky
795 since the corpus at hand is ONLY the relevant documents and not
796 irrelevant ones. Pazpar2 does not have the full corpus -- only the
797 documents that match a particular search.
Computation of the Tf-part is based on the normalized documents.
The length, the position and terms are thus normalized at this point.
Also, the computation is performed for each document received from the
target - before merging takes place. The result of a TF-computation is
804 added to the TF-total of a cluster. Thus, if a document occurs twice,
805 then the TF-part is doubled. That, however, can be adjusted, because the
806 TF-part may be divided by the number of documents in a cluster.
809 The algorithm used by Pazpar2 has two phases. In phase one
Pazpar2 computes a tf-array. This is done as records are
fetched from the database. In this phase, the rank weight
<literal>w</literal> and the rank tweaks <literal>lead</literal>,
<literal>follow</literal> and <literal>length</literal> are in use.
foreach document in a cluster
  foreach field
    for i = 1, .. N: (each term)
      foreach pos (where term i occurs in field)
        // w is configured weight for field
        // pos is position of term in field
        w[i] += w / (1 + log2(1+lead*pos))
        w[i] += w[i] * follow / (1+log2(d))
      // length: length of field (number of terms that is)
      if (length strategy is "linear")
        tf[i] += w[i] / length;
      else if (length strategy is "log")
        tf[i] += w[i] / log2(length);
      else if (length strategy is "none")
        tf[i] += w[i];
837 In phase two, the idf-array is computed and the final score
838 is computed. This is done for each cluster as part of each show command.
839 The rank tweak <literal>cluster</literal> is in use here.
// dococcur[i]: number of records where term occurs
// doctotal: number of records
for i = 1, .., N (each term)
  idf[i] = log(1 + doctotal / dococcur[i])

for i = 1, .., N: (each term)
  if (cluster is "yes")
    tf[i] = tf[i] / cluster_size;
  relevance += 100000 * tf[i] * idf[i];
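The two phases can also be written as a short Python sketch, which may be easier to experiment with than the pseudocode. It is a simplified illustration, not the Pazpar2 implementation, and it rests on several assumptions: the function names and the dictionary layout of a field are hypothetical, term indexes are 0-based, the 'follow' tweak is left out because the pseudocode does not define d, a term that occurs in no record gets an idf of zero, one-term fields are left unscaled under the "log" strategy, and tf and idf are combined as a product, which is the conventional tf-idf form.

```python
import math


def term_frequencies(fields, n_terms, lead=0.0, length_strategy="linear"):
    """Phase one: accumulate tf[i] over the fields of one record.

    Each field is a dict with 'weight' (configured rank weight),
    'length' (number of terms in the field) and 'positions'
    (query term index -> list of positions in the field).
    """
    tf = [0.0] * n_terms
    for field in fields:
        for i in range(n_terms):
            w_i = 0.0
            for pos in field["positions"].get(i, []):
                # Later positions count less when 'lead' is above zero.
                w_i += field["weight"] / (1 + math.log2(1 + lead * pos))
            if length_strategy == "linear":
                tf[i] += w_i / field["length"]
            elif length_strategy == "log":
                # log2(1) is 0, so one-term fields are left unscaled.
                scale = math.log2(field["length"]) if field["length"] > 1 else 1.0
                tf[i] += w_i / scale
            else:  # "none"
                tf[i] += w_i
    return tf


def relevance(tf, doc_occur, doc_total, cluster_size=1, cluster=False):
    """Phase two: combine tf with idf into a relevance score."""
    score = 0.0
    for tf_i, occur in zip(tf, doc_occur):
        idf_i = math.log(1 + doc_total / occur) if occur else 0.0
        if cluster:
            tf_i = tf_i / cluster_size
        score += 100000 * tf_i * idf_i
    return score
```

Calling <literal>term_frequencies</literal> once per record and summing the results per cluster mirrors the first phase; <literal>relevance</literal> is then evaluated per cluster each time results are shown.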
857 For controlling the ranking parameters, refer to the
858 <link linkend="service-rank">rank</link> element of the
860 Refer to the <link linkend="metadata-rank">rank</link> attribute
861 of the metadata element for how to control ranking for individual
864 </section> <!-- relevance_ranking -->
866 <section id="masterkey_connect">
867 <title>Pazpar2 and MasterKey Connect</title>
869 MasterKey Connect is a hosted connector, or gateway, service that exposes
870 whatever searchable resources you need. Since the service exposes all
871 resources using Z39.50 (or SRU), it is easy to set up Pazpar2 to use the
872 service. In particular, since all connectors expose basically the same core
873 behavior, it is a good use of Pazpar2's mechanism for managing default
874 behaviors across similar databases.
877 After installation of Pazpar2, the directory
878 <filename>/etc/pazpar2/settings/mkc</filename> (location may
879 vary depending on installation preferences) contains an example setup that
880 searches two different resources through a MasterKey Connect demo account.
881 The file mkc.xml contains default parameters that will work for all
882 MasterKey Connect resources (if you decide to become a customer of the
883 service, you will substitute your own account credentials for
884 the guest/guest). The other files contain specific information about
885 a couple of demonstration resources.
889 To play with the demo, just create a symlink from
890 <filename>/etc/pazpar2/services-enabled/default.xml</filename>
to <filename>/etc/pazpar2/services-available/mkc.xml</filename>,
and restart Pazpar2. You should now be able to search the two demo
893 resources using JSDemo or any user interface of your choice.
If you are interested in learning more about MasterKey Connect, or in
trying out the service for free against your favorite online resource, just
896 contact us at <email>info@indexdata.com</email>.
900 </chapter> <!-- Using Pazpar2 -->
902 <reference id="reference">
903 <title>Reference</title>
904 <partintro id="reference-introduction">
906 The material in this chapter is drawn directly from the individual
913 <appendix id="license">
914 <title>License</title>
Copyright &copy; &copyright-year; Index Data.
922 Pazpar2 is free software; you can redistribute it and/or modify it under
923 the terms of the GNU General Public License as published by the Free
924 Software Foundation; either version 2, or (at your option) any later
929 Pazpar2 is distributed in the hope that it will be useful, but WITHOUT ANY
930 WARRANTY; without even the implied warranty of MERCHANTABILITY or
931 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
936 You should have received a copy of the GNU General Public License
937 along with Pazpar2; see the file LICENSE. If not, write to the
938 Free Software Foundation,
939 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
948 <!-- Keep this comment at the end of the file