<?xml version="1.0" standalone="no"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"
<!ENTITY % local SYSTEM "local.ent">
<!ENTITY % entities SYSTEM "entities.ent">
<!ENTITY % idcommon SYSTEM "common/common.ent">
<title>Pazpar2 - User's Guide and Reference</title>
<firstname>Sebastian</firstname><surname>Hammer</surname>
<firstname>Adam</firstname><surname>Dickmeiss</surname>
<firstname>Marc</firstname><surname>Cromme</surname>
<firstname>Jakub</firstname><surname>Skoczen</surname>
<firstname>Mike</firstname><surname>Taylor</surname>
<firstname>Dennis</firstname><surname>Schafroth</surname>
<releaseinfo>&version;</releaseinfo>
<year>&copyright-year;</year>
<holder>Index Data</holder>
Pazpar2 is a high-performance metasearch engine featuring
merging, relevance ranking, record sorting, and facet analysis.
It is middleware: it has no user interface of its own, but can be
configured and controlled by an XML-over-HTTP web-service to provide
metasearching functionality behind any user interface.
This document is a guide and reference to Pazpar2 version &version;.
<imagedata fileref="common/id.png" format="PNG"/>
<imagedata fileref="common/id.eps" format="EPS"/>
<chapter id="introduction">
<title>Introduction</title>
<section id="what.pazpar2.is">
<title>What Pazpar2 is</title>
Pazpar2 is a stand-alone metasearch engine with a web-service API, designed
to be used either from a browser-based client (JavaScript, Flash,
etc.), from server-side code, or any combination of the two.
Pazpar2 is a highly optimized client designed to
search many resources in parallel. It implements record merging,
relevance-ranking and sorting by arbitrary data content, and facet
analysis for browsing purposes. It is designed to be data-model
independent, and is capable of working with MARC, DublinCore, or any
other <ulink url="&url.xml;">XML</ulink>-structured response format
-- <ulink url="&url.xslt;">XSLT</ulink> is used to normalize and extract
data from retrieval records for display and analysis. It can be used
against any server which supports the
<ulink url="&url.z39.50;">Z39.50</ulink>, <ulink url="&url.sru;">SRU/SRW</ulink>
or <ulink url="&url.solr;">Solr</ulink> protocol. Proprietary
backend modules can function as connectors between these standard
protocols and any non-standard API, including web-site scraping, to
support a large number of other protocols.
Additional functionality, such as
user management and attractive displays, is expected to be implemented by
applications that use Pazpar2. Pazpar2 itself is user-interface independent.
Its functionality is exposed through a simple XML-based web-service API,
designed to be easy to use from an Ajax-enabled browser, Flash
animation, Java applet, etc., or from a higher-level server-side language
like PHP, Perl or Java. Because session information can be shared between
browser-based logic and server-side scripting, there is tremendous
flexibility in how you implement application-specific logic on top
of Pazpar2.
Once you launch a search in Pazpar2, the operation continues behind the
scenes. Pazpar2 connects to servers, carries out searches, and
retrieves, deduplicates, and stores results internally. Your application
code may periodically inquire about the status of an ongoing operation,
and ask to see records or result set facets. Results become
available immediately, and it is easy to build end-user interfaces that
feel extremely responsive, even when searching more than 100 servers.
Pazpar2 is designed to be highly configurable. Incoming records are
normalized to XML/UTF-8, and then further normalized using XSLT to a
simple internal representation that is suitable for analysis. By
providing XSLT stylesheets for different kinds of result records, you
can configure Pazpar2 to work against different kinds of information
retrieval servers. Finally, metadata is extracted in a configurable
way from this internal record, to support display, merging, ranking,
result set facets, and sorting. Pazpar2 is not bound to a specific model
of metadata, such as DublinCore or MARC: by providing the right
configuration, it can work with any combination of different kinds of data
in support of many different applications.
Pazpar2 is designed to be efficient and scalable. You can set it up to
search several hundred targets in parallel, or you can use it to support
hundreds of concurrent users. It is implemented with the same attention
to performance and economy that we use in our indexing engines, so that
you can focus on building your application without worrying about the
details of metasearch logic. You can devote all of your attention to
usability and let Pazpar2 do what it does best -- metasearch.
Pazpar2 is our attempt to re-think the traditional paradigms for
implementing and deploying metasearch logic, with an uncompromising
approach to performance, and attempting to make maximum use of the
capabilities of modern browsers. The demo user interface that
accompanies the distribution is but one example. If you think of new
ways of using Pazpar2, we hope you'll share them with us, and if we
can provide assistance with regard to training, design, programming,
integration with different backends, hosting, or support, please don't
hesitate to contact us. The same applies if you'd like to see
functionality in Pazpar2 that is not there today. It may
already be in our development pipeline, or there might be a
possibility for you to help out by sponsoring development time or
code. Either way, get in touch and we will give you straight answers.
Pazpar2 is covered by the GNU General Public License (GPL) version 2.
See <xref linkend="license"/> for further information.
<section id="connectors">
<title>Connectors to non-standard databases</title>
If you need to access commercial or open access resources that don't support
Z39.50 or SRU, one approach would be to use a tool like <ulink
url="&url.simpleserver;">SimpleServer</ulink> to build a
gateway. An easier option is to use Index Data's <ulink
url="&url.mkc;">MasterKey Connect</ulink>
service, which will expose virtually <emphasis>any</emphasis> resource
through Z39.50/SRU, making it easy to integrate with Pazpar2.
The service is hosted, so all you have to do is let us
know which resources you are interested in, and we operate the gateways,
or Connectors, for you for a low annual charge.
Types of resources supported include
commercial databases, free online resources, and even local resources;
almost anything that can be accessed through a web-facing user
interface can be accessed in this way.
Contact <email>info@indexdata.com</email> for more information.
See <xref linkend="masterkey_connect"/> for an example.
<title>A note on the name Pazpar2</title>
The name Pazpar2 derives from three sources. On the one hand, it is
Index Data's second major piece of software that does parallel
searching of Z39.50 targets. On the other, it is a near-homophone
of Passepartout, the ever-helpful servant in Jules Verne's novel
Around the World in Eighty Days (who helpfully uses the language
of his master). Finally, "passe par tout" means something like
"passes through anything" in French -- in other words, a universal
solution, or, if you like, a MasterKey.
<chapter id="installation">
<title>Installation</title>
The Pazpar2 package includes documentation as well
as the Pazpar2 server. The package also includes a simple user
interface called "test1", which consists of a single HTML page and a single
JavaScript file to illustrate the use of Pazpar2.
Pazpar2 depends on the following tools/libraries:
<varlistentry><term><ulink url="&url.yaz;">YAZ</ulink></term>
The popular Z39.50 toolkit for the C language.
YAZ <emphasis>must</emphasis> be compiled with Libxml2/Libxslt support.
<varlistentry><term><ulink url="&url.icu;">International
Components for Unicode (ICU)</ulink></term>
ICU provides Unicode support for non-English languages with
character sets outside the range of 7-bit ASCII, like
Greek, Russian, German and French. Pazpar2 uses the ICU
Unicode character conversions, Unicode normalization, case
folding and other fundamental operations needed in
tokenization, normalization and ranking of records.
Compiling, linking, and usage of the ICU libraries is optional,
but strongly recommended for usage in an international
environment.
In order to compile Pazpar2, a C compiler which supports C99 or later
is required.
<section id="installation.unix">
<title>Installation from source on Unix (including Linux, MacOS, etc.)</title>
The latest source code for Pazpar2 is available from
<ulink url="&url.pazpar2.download;"/>.
Most Unix-based operating systems have the required
tools available as binary packages.
For example, if Libxml2/libXSLT libraries
are already installed as development packages, use these.
Ensure that the development libraries and header files are
available on your system before compiling Pazpar2. For installation
of YAZ, refer to the Installation chapter of the YAZ manual at
<ulink url="&url.yaz.install;"/>.
Once the dependencies are in place, Pazpar2 can be unpacked and
installed as follows:
tar xzf pazpar2-VERSION.tar.gz
The <literal>make install</literal> step installs manpages as well as the
Pazpar2 server, <literal>pazpar2</literal>,
in PREFIX<literal>/sbin</literal>.
By default, PREFIX is <literal>/usr/local/</literal>. This can be
changed with the configure option <option>--prefix</option>.
<section id="installation.win32">
<title>Installation from source on Windows</title>
Pazpar2 can be built for Windows using
<ulink url="&url.vstudio;">Microsoft Visual Studio</ulink>.
The support files for building Pazpar2 on Windows are located in the
<filename>win</filename> directory. The compilation is performed
using the <filename>win/makefile</filename>, which is to be
processed by the NMAKE utility, part of Visual Studio.
Ensure that the development libraries and header files are
available on your system before compiling Pazpar2. For installation
of YAZ, refer to
the Installation chapter of the YAZ manual at
<ulink url="&url.yaz.install;"/>.
It is easiest if YAZ and Pazpar2 are unpacked in the same
directory (side-by-side).
The compilation is tuned by editing the makefile of Pazpar2.
The process is similar to YAZ. Adjust the various directories
<literal>YAZ_DIR</literal>, <literal>ZLIB_DIR</literal>, etc.,
as needed.
Compile Pazpar2 by invoking <application>nmake</application> in
the <filename>win</filename> directory.
The resulting binaries of the build process are located in the
<filename>bin</filename> directory of the Pazpar2 source
tree, including <filename>pazpar2.exe</filename> and the necessary DLLs.
The Windows version of Pazpar2 is a console application. It may
be installed as a Windows Service by passing the option
<literal>-install</literal> to the pazpar2 program. This will
register Pazpar2 as a service and use the other options provided
in the same invocation. For example:
..\bin\pazpar2 -install -f pazpar2.cfg -l pazpar2.log
The Pazpar2 service may now be controlled via the Service Control
Panel. It may be unregistered by passing the <literal>-remove</literal>
option:
..\bin\pazpar2 -remove
<section id="installation.test1">
<title>Installation of test interfaces</title>
In this section we show how to make available the set of simple
interfaces that are part of the Pazpar2 source package, and which
demonstrate some ways to use Pazpar2. (Note that Debian users can
save time by just installing the package <literal>pazpar2-test1</literal>.)
A web server, such as Apache, must be installed and running on the system.
Start the Pazpar2 daemon using the 'in-source' binary of the Pazpar2
daemon. On Unix the process is:
cp pazpar2.cfg.dist pazpar2.cfg
../src/pazpar2 -f pazpar2.cfg
And on Windows:
copy pazpar2.cfg.dist pazpar2.cfg
..\bin\pazpar2 -f pazpar2.cfg
This will start a Pazpar2 listener on port 9004. It will proxy
HTTP requests to port 80 on localhost, which we assume will be the regular
HTTP server on the system. Inspect and modify pazpar2.cfg as needed
if this is to be changed. The pazpar2.cfg file includes settings from the
file <filename>settings/edu.xml</filename>.
The test UIs are located in <literal>www</literal>. Ensure that this
directory is available to the web server by copying
<literal>www</literal> to the document root,
using Apache's <literal>Alias</literal> directive, or
creating a symbolic link: for example, on a Debian or Ubuntu
system with Apache2 installed from the standard package, you might
make the link as follows:
sudo ln -s `pwd`/www /var/www/pazpar2-demo
This makes the test applications visible at
<ulink url="http://localhost/pazpar2-demo/"/>
but they cannot be run successfully from that URL, as they submit
search requests back to the server from which they were served,
and Apache2 doesn't know how to handle them. Instead, the test
applications must be accessed from Pazpar2 itself, acting as a
proxy to Apache2, at the URL
<ulink url="http://localhost:9004/pazpar2-demo/"/>
From here, the demo applications can be
accessed: <literal>test1</literal>, <literal>test2</literal> and
<literal>jsdemo</literal>
are pure HTML+JavaScript setups, needing no server-side
scripting, whereas
<literal>demo</literal>
requires PHP on the server.
If you don't see the test interfaces, check whether they are available
on port 80 (i.e. directly from the Apache2 server). If not, the
Apache configuration is incorrect.
In order to use Apache as frontend for the interface on port 80
for public access etc., refer to
<xref linkend="installation.apache2proxy"/>.
<section id="installation.debian">
<title>Installation on Debian GNU/Linux and Ubuntu</title>
Index Data provides Debian and Ubuntu packages for Pazpar2.
As of February 2010, these
are prepared for Debian versions Etch, Lenny and Squeeze; and for
Ubuntu versions 8.04 (hardy), 8.10 (intrepid), 9.04 (jaunty) and
9.10 (karmic). These packages are available at
<ulink url="&url.pazpar2.download.debian;"/> and
<ulink url="&url.pazpar2.download.ubuntu;"/>.
<section id="installation.apache2proxy">
<title>Apache 2 Proxy</title>
Apache 2 has a
<ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html">
proxy module</ulink>
which allows Pazpar2 to become a backend to an Apache 2
based web service. The Apache 2 proxy must operate in the
<emphasis>Reverse</emphasis> Proxy mode.
On a Debian based Apache 2 system, the relevant modules can
be enabled with:
sudo a2enmod proxy_http proxy_balancer
Traditionally Pazpar2 interprets URL paths with the suffix
<literal>/search.pz2</literal>.
The
<ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy.html#proxypass">
ProxyPass</ulink>
directive of Apache must be used to map a URL path
to the Pazpar2 server (listening port).
The ProxyPass directive takes a prefix rather than
a suffix as URL path. It is important that the JavaScript code
uses the prefix given for it.
<example id="installation.apache2proxy.example">
<title>Apache 2 proxy configuration</title>
If Pazpar2 is running on port 8004 and the portal is using
<filename>search.pz2</filename> inside the portal in directory
<filename>/myportal/</filename>, we could use the following
Apache 2 configuration:
<IfModule mod_proxy.c>
  AddDefaultCharset off
  ProxyPass /myportal/search.pz2 http://localhost:8004/search.pz2
</IfModule>
<title>Using Pazpar2</title>
This chapter provides a general introduction to the use and
deployment of Pazpar2.
<section id="architecture">
<title>Pazpar2 and your systems architecture</title>
Pazpar2 is designed to provide asynchronous, behind-the-scenes
metasearching functionality to your application, exposing this
functionality using a simple webservice API that can be accessed
from any number of development environments. In particular, it is
possible to combine Pazpar2 either with your server-side dynamic
website scripting, with scripting or code running in the browser, or
with any combination of the two. Pazpar2 is an excellent tool for
building advanced, Ajax-based user interfaces for metasearch
functionality, but it isn't a requirement -- you can choose to use
Pazpar2 entirely as a backend to your regular server-side scripting.
When you do use Pazpar2 in conjunction
with browser scripting (JavaScript/Ajax, Flash, applets,
etc.), there are special considerations.
Pazpar2 implements a simple but efficient HTTP server, and it is
designed to interact directly with scripting running in the browser
for the best possible performance, and to limit overhead when
several browser clients generate numerous webservice requests.
However, it is still desirable to use a conventional webserver,
such as Apache, to serve up graphics, HTML documents, and
server-side scripting. Because the security sandbox environment of
most browser-side programming environments only allows communication
with the server from which the enclosing HTML page or object
originated, Pazpar2 is designed so that it can act as a transparent
proxy in front of an existing webserver (see <xref
linkend="pazpar2_conf"/> for details).
In this mode, all regular
HTTP requests are transparently passed through to your webserver,
while Pazpar2 only intercepts search-related webservice requests.
If you want to expose your combined service on port 80, you can
run your regular webserver on a different port, on a different
server, or on a different IP address associated with the same server.
Pazpar2 can also work behind
a reverse proxy. Refer to <xref linkend="installation.apache2proxy"/>
for more information.
This allows your existing HTTP server to operate on port 80 as usual.
Pazpar2 can be started on another (internal) port.
Sometimes, it may be necessary to implement functionality on your
regular webserver that makes use of search results, for example to
implement data import functionality, emailing results, history
lists, personal citation lists, interlibrary loan functionality,
etc. Fortunately, it is simple to exchange information between
Pazpar2, your browser scripting, and backend server-side scripting.
You can send a session ID and possibly a record ID from your browser
code to your server code, and from there use Pazpar2's webservice API
to access result sets or individual records. You could even 'hide'
all of Pazpar2's functionality behind your own API implemented on
the server-side, and access that from the browser or elsewhere. The
possibilities are just about endless.
<section id="data_model">
<title>Your data model</title>
Pazpar2 does not have a preconceived model of what makes up a data
model. There are no assumptions that records have specific fields or
that they are organized in any particular way. The only assumption
is that data comes packaged in a form that the software can work
with (presently, that means XML or MARC), and that you can provide
the necessary information to massage it into Pazpar2's internal
representation.
Handling retrieval records in Pazpar2 is a two-step process. First,
you decide which data elements of the source record you are
interested in, and you specify any desired massaging or combining of
elements using an XSLT stylesheet (MARC records are automatically
normalized to <ulink url="&url.marcxml;">MARCXML</ulink> before this step).
If desired, you can run multiple XSLT stylesheets in series to accomplish
this, but the output of the last one should be a representation of the
record in a schema that Pazpar2 understands.
The intermediate, internal representation of the record looks like
this:
<record xmlns="http://www.indexdata.com/pazpar2/1.0"
        mergekey="title The Shining author King, Stephen">
  <metadata type="title" rank="2">The Shining</metadata>
  <metadata type="author">King, Stephen</metadata>
  <metadata type="kind">ebook</metadata>
  <!-- ... and so on -->
</record>
As you can see, there isn't much to it. There are really only a few
important elements to this file.
Elements should belong to the namespace
<literal>http://www.indexdata.com/pazpar2/1.0</literal>.
If the root node contains the
attribute 'mergekey', then every record that generates the same
merge key (normalized for case differences, white space, and
truncation) will be joined into a cluster. In other words, you
decide how records are merged. If you don't include a merge key,
records are never merged. The 'metadata' elements provide the meat
of the record -- the content. The 'type' attribute is used to
match each element against processing rules that determine what
happens to the data element next. The 'rank' attribute
specifies a multiplier for ranking for this element.
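To make the structure concrete, here is a minimal Python sketch that reads one such internal record and groups records by merge key. It is an illustration only, not how Pazpar2 itself is implemented: the function names are hypothetical, the default rank of 1 is an assumption, and the normalization shown covers only case and white space.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

PZ2_NS = "http://www.indexdata.com/pazpar2/1.0"

SAMPLE = """<record xmlns="http://www.indexdata.com/pazpar2/1.0"
        mergekey="title The Shining author King, Stephen">
  <metadata type="title" rank="2">The Shining</metadata>
  <metadata type="author">King, Stephen</metadata>
  <metadata type="kind">ebook</metadata>
</record>"""


def parse_record(xml_text):
    """Return (mergekey, [(type, rank, text), ...]) for one internal record."""
    root = ET.fromstring(xml_text)
    fields = []
    for elem in root.iter("{%s}metadata" % PZ2_NS):
        rank = int(elem.get("rank", "1"))  # assumption: multiplier defaults to 1
        fields.append((elem.get("type"), rank, elem.text or ""))
    return root.get("mergekey"), fields


def normalize_mergekey(key):
    """Illustrative normalization only: fold case and collapse white space."""
    return " ".join(key.lower().split())


def cluster(records):
    """Group parsed records by normalized merge key; keyless records stay alone."""
    clusters = defaultdict(list)
    for index, (key, fields) in enumerate(records):
        bucket = normalize_mergekey(key) if key else "unmerged-%d" % index
        clusters[bucket].append(fields)
    return dict(clusters)
```

Calling parse_record(SAMPLE) yields the merge key and three (type, rank, text) tuples; passing several parsed records to cluster() shows how records that differ only in case or spacing of the merge key end up together.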
The next processing step is the extraction of metadata from the
intermediate representation of the record. This is governed by the
'metadata' elements in the 'service' section of the configuration
file. See <xref linkend="config-server"/> for details. The metadata
in the retrieval record ultimately drives merging, sorting, ranking,
the extraction of browse facets, and display, all configurable.
<section id="client">
<title>Client development overview</title>
You can use Pazpar2 from any environment that allows you to use
webservices. The initial goal of the software was to support
Ajax-based applications, but there are few limits to what
you can do. You can use Pazpar2 from JavaScript, Flash, Java, etc.,
on the browser side, and from any development environment on the
server side, and you can pass session tokens and record IDs freely
around between these environments to build sophisticated applications.
Use your imagination.
The webservice API of Pazpar2 is described in detail in <xref
linkend="pazpar2_protocol"/>.
In brief, you use the 'init' command to create a session, a
temporary workspace which carries information about the current
search. You start a new search using the 'search' command. Once the
search has been started, you can follow its progress using the
'stat', 'bytarget', 'termlist', or 'show' commands. Detailed records
can be fetched using the 'record' command.
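As a rough sketch of that command sequence, the Python below builds the URLs a client would request for one search. The endpoint is hypothetical, and the parameter names other than 'command' and 'session' (here 'query', 'start' and 'num') are assumptions that should be checked against the protocol reference before use.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and path to your installation.
BASE = "http://localhost:9004/search.pz2"


def pz2_url(base, command, **params):
    """Build a Pazpar2 webservice URL for the given command and parameters."""
    query = urlencode({"command": command, **params})
    return base + "?" + query


def example_urls(session_id):
    """URLs a client would request, in order, for one search."""
    return [
        pz2_url(BASE, "init"),                                  # create a session
        pz2_url(BASE, "search", session=session_id, query="shining"),
        pz2_url(BASE, "stat", session=session_id),              # poll progress
        pz2_url(BASE, "show", session=session_id, start=0, num=10),
    ]
```

In a real client the session ID comes from the response to 'init'; the 'stat' and 'show' requests are then repeated until the search has settled.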
<section id="unicode">
<title>Unicode Compliance</title>
Pazpar2 is Unicode compliant and language and locale aware, but relies
on the character encoding of the targets being specified correctly if
the targets themselves are not UTF-8 based (most aren't).
Just a few badly behaving targets can spoil the search experience
considerably if, for example, Greek, Russian or other non-7-bit-ASCII
search terms are entered. In these cases some targets return
records irrelevant to the query, and the result screens will be
cluttered with noise.
While noise from misbehaving targets cannot be removed, it can
be reduced using truly Unicode based ranking. This is an
option which is available to the system administrator if ICU
support is compiled into Pazpar2; see
<xref linkend="installation"/> for details.
In addition, the ICU tokenization and normalization rules must
be defined in the master configuration file described in
<xref linkend="config-server"/>.
<section id="load_balancing">
<title>Load balancing</title>
Just like any web server, Pazpar2 can be load balanced by a standard
hardware or software load balancer, as long as session stickiness
is ensured. If you are already running the Apache2 web server in front
of Pazpar2 and use the Apache mod_proxy module to 'relay' client
requests to Pazpar2, this setup can easily be extended to include
load balancing capabilities.
To do so you need to enable the
<ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html">
mod_proxy_balancer</ulink>
module in your Apache2 installation.
On a Debian based Apache 2 system, the relevant modules can
be enabled with:
sudo a2enmod proxy_http
The mod_proxy_balancer can pass all 'sessionsticky' requests to the
same backend worker as long as the requests are marked with the
originating worker's ID (called 'route'). If the Pazpar2 serverID is
configured (by setting an 'id' attribute on the 'server' element in
the Pazpar2 configuration file), Pazpar2 will append it to the
'session' element returned during the 'init' in a mod_proxy_balancer
compatible form.
Since the 'session' is then re-sent by the client (for all Pazpar2
requests besides 'init'), the balancer can use the marker to pass
the request to the right route. To do so the balancer needs to be
configured to inspect the 'session' parameter.
<example id="load_balancing.example">
<title>Apache 2 load balancing configuration</title>
With four Pazpar2 instances running on the same host on ports
8004-8007, with serverIDs pz1, pz2, pz3 and pz4 respectively, we
could use the following Apache 2 configuration to expose a single
Pazpar2 'endpoint' at a standard
(<filename>/pazpar2/search.pz2</filename>) location:
AddDefaultCharset off
# 'route' has to match the configured pazpar2 server ID
<Proxy balancer://pz2cluster>
  BalancerMember http://localhost:8004 route=pz1
  BalancerMember http://localhost:8005 route=pz2
  BalancerMember http://localhost:8006 route=pz3
  BalancerMember http://localhost:8007 route=pz4
</Proxy>
# route is resent in the 'session' param which has the form:
# 'sessid.serverid', understandable by mod_proxy_balancer
# this is not going to work if the client tampers with the 'session' param
ProxyPass /pazpar2/search.pz2 balancer://pz2cluster lbmethod=byrequests stickysession=session nofailover=On
The 'ProxyPass' line sets up a reverse proxy for requests to
'/pazpar2/search.pz2' and delegates all requests to the load balancer
(virtual worker) with the name 'pz2cluster'.
Sticky sessions are enabled and implemented using the 'session' parameter.
The 'Proxy' section lists all the servers (real workers) which the
load balancer can use.
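The routing decision itself is simple. The Python sketch below mimics it for a 'session' value of the form 'sessid.serverid'; it is only an illustration of the idea, not of how mod_proxy_balancer is implemented, and the function names and the backend table are hypothetical.

```python
# Hypothetical route table mirroring two of the BalancerMember lines above.
BACKENDS = {
    "pz1": "http://localhost:8004",
    "pz2": "http://localhost:8005",
}


def split_session(session):
    """Split a 'sessid.serverid' value into (session id, route or None)."""
    sessid, dot, route = session.partition(".")
    return sessid, (route if dot else None)


def pick_backend(session, backends):
    """Return the backend URL for a session, or None if no route matches."""
    _, route = split_session(session)
    return backends.get(route)
```

A session without a route suffix, or with a suffix that matches no worker, gets no backend here, which corresponds to the case the configuration comment warns about: a client that tampers with the 'session' parameter.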
<section id="relevance_ranking">
<title>Relevance ranking</title>
Pazpar2 uses a variant of the term frequency–inverse document frequency
(TF-IDF) ranking algorithm.
The TF-part is straightforward to calculate and is based on the
documents that Pazpar2 fetches. The IDF-part, however, is trickier,
since the corpus at hand is ONLY the relevant documents and not
irrelevant ones. Pazpar2 does not have the full corpus -- only the
documents that match a particular search.
Computation of the TF-part is based on the normalized documents.
The length, the position and terms are thus normalized at this point.
Also, the computation is performed for each document received from the
target, before merging takes place. The result of a TF computation is
added to the TF-total of a cluster. Thus, if a document occurs twice,
then the TF-part is doubled. That, however, can be adjusted, because the
TF-part may be divided by the number of documents in a cluster.
The algorithm used by Pazpar2 has two phases. In phase one,
Pazpar2 computes a tf-array. This is done as records are
fetched from the database. This phase uses the rank weight
<literal>w</literal> and the rank tweaks <literal>lead</literal>,
<literal>follow</literal> and <literal>length</literal>.
foreach document in a cluster
  for i = 1, .. N: (each term)
    foreach pos (where term i occurs in field)
      // w is configured weight for field
      // pos is position of term in field
      w[i] += w / (1 + log2(1+lead*pos))
      w[i] += w[i] * follow / (1+log2(d))
  // length: length of field (number of terms that is)
  if (length strategy is "linear")
    tf[i] += w[i] / length;
  else if (length strategy is "log")
    tf[i] += w[i] / log2(length);
  else if (length strategy is "none")
    tf[i] += w[i];
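For readers who prefer runnable code, here is a small Python sketch of the per-field part of phase one. It is a simplification under stated assumptions, not Pazpar2's implementation: positions are counted from 0, the 'follow' tweak is left out because the distance d is not defined here, and a one-term field is left unnormalized under the "log" strategy to avoid dividing by log2(1) = 0. The function name is hypothetical.

```python
import math


def field_tf(positions, field_length, weight, lead=0, length="linear"):
    """Term-frequency contribution of one term in one field.

    positions    -- positions of the term within the field (assumed 0-based)
    field_length -- number of terms in the field
    weight       -- configured rank weight w for the field
    lead         -- rank tweak: larger values favour early occurrences
    length       -- length strategy: "linear", "log" or "none"
    """
    w = 0.0
    for pos in positions:
        # Each occurrence adds the field weight, damped by its position.
        w += weight / (1 + math.log2(1 + lead * pos))
    if length == "linear":
        return w / field_length
    if length == "log" and field_length > 1:
        return w / math.log2(field_length)
    return w  # "none", or a one-term field under "log" (assumption)
```

With lead=0 every occurrence counts equally; raising lead makes occurrences near the start of a field contribute more than later ones.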
In phase two, the idf-array and the final score
are computed. This is done for each cluster as part of each show command.
The rank tweak <literal>cluster</literal> is used here.
// dococcur[i]: number of records where term occurs
// doctotal: number of records
for i = 1, .., N (each term)
  idf[i] = log(1 + doctotal / dococcur[i])
for i = 1, .., N: (each term)
  if (cluster is "yes")
    tf[i] = tf[i] / cluster_size;
  relevance += 100000 * tf[i] * idf[i];
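Phase two can be sketched in Python as follows. This is an illustration under assumptions rather than the actual implementation: the score combines tf and idf as a product, as in conventional TF-IDF, and a term that occurs in no record is given an idf of 0 so that nothing is divided by zero. The function names are hypothetical.

```python
import math


def idf_vector(doc_total, doc_occur):
    """idf[i] = log(1 + doc_total / doc_occur[i]); 0 for terms never seen."""
    return [math.log(1 + doc_total / n) if n else 0.0 for n in doc_occur]


def relevance(tf, idf, cluster_size=1, cluster=True):
    """Score one cluster from its summed tf vector and the shared idf vector."""
    score = 0.0
    for tf_i, idf_i in zip(tf, idf):
        if cluster:
            tf_i = tf_i / cluster_size  # average over the merged records
        score += 100000 * tf_i * idf_i  # assumption: product of tf and idf
    return score
```

With cluster set to False, a record that was returned twice keeps its doubled term frequency; with it set to True, the total is divided by the number of records in the cluster, as described above.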
</section> <!-- relevance_ranking -->
<section id="masterkey_connect">
<title>Pazpar2 and MasterKey Connect</title>
MasterKey Connect is a hosted connector, or gateway, service that exposes
whatever searchable resources you need. Since the service exposes all
resources using Z39.50 (or SRU), it is easy to set up Pazpar2 to use the
service. In particular, since all connectors expose basically the same core
behavior, it is a good use of Pazpar2's mechanism for managing default
behaviors across similar databases.
After installation of Pazpar2, the directory
<filename>/etc/pazpar2/settings/mkc</filename> (location may
vary depending on installation preferences) contains an example setup that
searches two different resources through a MasterKey Connect demo account.
The file mkc.xml contains default parameters that will work for all
MasterKey Connect resources (if you decide to become a customer of the
service, you will substitute your own account credentials for
the guest/guest demo credentials). The other files contain specific
information about a couple of demonstration resources.
To play with the demo, just create a symlink from
<filename>/etc/pazpar2/services-enabled/default.xml</filename>
to <filename>/etc/pazpar2/services-available/mkc.xml</filename>,
and restart Pazpar2. You should now be able to search the two demo
resources using JSDemo or any user interface of your choice.
If you are interested in learning more about MasterKey Connect, or want to
try out the service for free against your favorite online resource, just
contact us at <email>info@indexdata.com</email>.
</chapter> <!-- Using Pazpar2 -->
<reference id="reference">
<title>Reference</title>
<partintro id="reference-introduction">
The material in this chapter is drawn directly from the individual
manual pages.
<appendix id="license">
<title>License</title>
Copyright © &copyright-year; Index Data.
Pazpar2 is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2, or (at your option) any later
version.
Pazpar2 is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received a copy of the GNU General Public License
along with Pazpar2; see the file LICENSE. If not, write to the
Free Software Foundation,
51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA