+
+ <chapter id="using">
+ <title>Using Pazpar2</title>
+ <para>
+ This chapter provides a general introduction to the use and
+ deployment of Pazpar2.
+ </para>
+
+ <section id="architecture">
+ <title>Pazpar2 and your systems architecture</title>
+ <para>
+ Pazpar2 is designed to provide asynchronous, behind-the-scenes
+ metasearching functionality to your application, exposing this
+ functionality using a simple webservice API that can be accessed
+ from any number of development environments. In particular, it is
+ possible to combine Pazpar2 either with your server-side dynamic
+ website scripting, with scripting or code running in the browser, or
+ with any combination of the two. Pazpar2 is an excellent tool for
+ building advanced, Ajax-based user interfaces for metasearch
+ functionality, but it isn't a requirement -- you can choose to use
+ Pazpar2 entirely as a backend to your regular server-side scripting.
+ When you do use Pazpar2 in conjunction
+ with browser scripting (JavaScript/Ajax, Flash, applets,
+ etc.), there are special considerations.
+ </para>
+
+ <para>
+ Pazpar2 implements a simple but efficient HTTP server, and it is
+ designed to interact directly with scripting running in the browser
+ for the best possible performance, and to limit overhead when
+ several browser clients generate numerous webservice requests.
+ However, it is still desirable to use a conventional webserver,
+ such as Apache, to serve up graphics, HTML documents, and
+ server-side scripting. Because the security sandbox environment of
+ most browser-side programming environments only allows communication
+ with the server from which the enclosing HTML page or object
+ originated, Pazpar2 is designed so that it can act as a transparent
+ proxy in front of an existing webserver (see <xref
+ linkend="pazpar2_conf"/> for details).
+ In this mode, all regular
+ HTTP requests are transparently passed through to your webserver,
+ while Pazpar2 only intercepts search-related webservice requests.
+ </para>
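+
+ <para>
+ As a minimal sketch of the proxy mode described above, the fragment
+ below assumes that Pazpar2 listens on port 80 and that the regular
+ webserver has been moved to port 8080 on the same host. Both port
+ numbers and the host name are assumptions chosen for illustration;
+ <xref linkend="pazpar2_conf"/> remains the authoritative reference for
+ the element names and attributes.
+ <screen><![CDATA[
+ <pazpar2 xmlns="http://www.indexdata.com/pazpar2/1.0">
+   <server>
+     <!-- Pazpar2 answers on the public port (assumed: 80) -->
+     <listen port="80"/>
+     <!-- everything that is not a search request is passed on
+          to the regular webserver (assumed: localhost:8080) -->
+     <proxy host="localhost" port="8080"/>
+     <!-- service definition omitted -->
+   </server>
+ </pazpar2>
+ ]]></screen>
+ </para>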
+
+ <para>
+ If you want to expose your combined service on port 80, you can
+ run your regular webserver on a different port, on a different
+ server, or on a different IP address associated with the same server.
+ </para>
+
+ <para>
+ Pazpar2 can also work behind a reverse proxy, which allows your
+ existing HTTP server to operate on port 80 as usual while Pazpar2
+ is started on another (internal) port. Refer to
+ <xref linkend="installation.apache2proxy"/> for more information.
+ </para>
+
+ <para>
+ Sometimes, it may be necessary to implement functionality on your
+ regular webserver that makes use of search results, for example to
+ implement data import functionality, emailing results, history
+ lists, personal citation lists, interlibrary loan functionality,
+ etc. Fortunately, it is simple to exchange information between
+ Pazpar2, your browser scripting, and backend server-side scripting.
+ You can send a session ID and possibly a record ID from your browser
+ code to your server code, and from there use Pazpar2's webservice API
+ to access result sets or individual records. You could even 'hide'
+ all of Pazpar2's functionality behind your own API implemented on
+ the server side, and access that from the browser or elsewhere. The
+ possibilities are just about endless.
+ </para>
+ </section>
+
+ <section id="data_model">
+ <title>Your data model</title>
+ <para>
+ Pazpar2 does not impose a preconceived data model. There are no
+ assumptions that records have specific fields or
+ that they are organized in any particular way. The only assumption
+ is that data comes packaged in a form that the software can work
+ with (presently, that means XML or MARC), and that you can provide
+ the necessary information to massage it into Pazpar2's internal
+ record abstraction.
+ </para>
+
+ <para>
+ Handling retrieval records in Pazpar2 is a two-step process. First,
+ you decide which data elements of the source record you are
+ interested in, and you specify any desired massaging or combining of
+ elements using an XSLT stylesheet (MARC records are automatically
+ normalized to <ulink url="&url.marcxml;">MARCXML</ulink> before this step).
+ If desired, you can run multiple XSLT stylesheets in series to accomplish
+ this, but the output of the last one should be a representation of the
+ record in a schema that Pazpar2 understands.
+ </para>
+
+ <para>
+ The intermediate, internal representation of the record looks like
+ this:
+ <screen><![CDATA[
+ <record xmlns="http://www.indexdata.com/pazpar2/1.0"
+ mergekey="title The Shining author King, Stephen">
+
+ <metadata type="title" rank="2">The Shining</metadata>
+
+ <metadata type="author">King, Stephen</metadata>
+
+ <metadata type="kind">ebook</metadata>
+ <!-- ... and so on -->
+ </record>
+]]></screen>
+
+ As you can see, there isn't much to it. There are really only a few
+ important elements to this file.
+ </para>
+
+ <para>
+ Elements should belong to the namespace
+ <literal>http://www.indexdata.com/pazpar2/1.0</literal>.
+ If the root node contains the
+ attribute 'mergekey', then every record that generates the same
+ merge key (normalized for case differences, white space, and
+ truncation) will be joined into a cluster. In other words, you
+ decide how records are merged. If you don't include a merge key,
+ records are never merged. The 'metadata' elements provide the meat
+ of the record -- the content. The 'type' attribute is used to
+ match each element against processing rules that determine what
+ happens to the data element next. The 'rank' attribute
+ specifies a multiplier for ranking for this element.
+ </para>
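+
+ <para>
+ The first step could be sketched as the following stylesheet, which
+ maps two MARCXML fields to 'metadata' elements and builds a merge key
+ from them. The choice of MARC tags (245 for title, 100 for author) and
+ the layout of the merge key are illustrative assumptions, not the
+ stylesheet shipped with Pazpar2.
+ <screen><![CDATA[
+ <xsl:stylesheet version="1.0"
+     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+     xmlns:pz="http://www.indexdata.com/pazpar2/1.0"
+     xmlns:marc="http://www.loc.gov/MARC21/slim">
+
+   <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
+
+   <xsl:template match="marc:record">
+     <!-- pick the data elements we are interested in -->
+     <xsl:variable name="title"
+         select="marc:datafield[@tag='245']/marc:subfield[@code='a']"/>
+     <xsl:variable name="author"
+         select="marc:datafield[@tag='100']/marc:subfield[@code='a']"/>
+
+     <pz:record>
+       <!-- records producing the same key end up in one cluster -->
+       <xsl:attribute name="mergekey">
+         <xsl:text>title </xsl:text>
+         <xsl:value-of select="$title"/>
+         <xsl:text> author </xsl:text>
+         <xsl:value-of select="$author"/>
+       </xsl:attribute>
+
+       <pz:metadata type="title" rank="2">
+         <xsl:value-of select="$title"/>
+       </pz:metadata>
+       <pz:metadata type="author">
+         <xsl:value-of select="$author"/>
+       </pz:metadata>
+     </pz:record>
+   </xsl:template>
+
+   <!-- suppress text nodes that no template handles -->
+   <xsl:template match="text()"/>
+ </xsl:stylesheet>
+ ]]></screen>
+ </para>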
+
+ <para>
+ The next processing step is the extraction of metadata from the
+ intermediate representation of the record. This is governed by the
+ 'metadata' elements in the 'service' section of the configuration
+ file. See <xref linkend="config-server"/> for details. The metadata
+ in the retrieval record ultimately drives merging, sorting, ranking,
+ the extraction of browse facets, and display, all configurable.
+ </para>
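+
+ <para>
+ For illustration, the 'metadata' elements for the three types used in
+ the record shown earlier might look roughly as follows. The attribute
+ values are assumptions picked for this sketch; consult
+ <xref linkend="config-server"/> for the attributes that your version
+ supports.
+ <screen><![CDATA[
+ <service>
+   <!-- shown in brief records, sortable, weighted higher in ranking -->
+   <metadata name="title" brief="yes" sortkey="skiparticle"
+             merge="longest" rank="6"/>
+   <!-- collected into a browse facet -->
+   <metadata name="author" brief="yes" termlist="yes"
+             merge="longest" rank="2"/>
+   <metadata name="kind" brief="yes" merge="unique"/>
+ </service>
+ ]]></screen>
+ </para>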
+
+ <para>
+ Pazpar2 1.6.37 and later also allow already clustered records to
+ be ingested. Suppose a database already clusters records for us and we
+ would like to keep those clusters in Pazpar2. In that case we can generate a
+ <literal>cluster</literal> wrapper element that holds individual
+ <literal>record</literal> elements.
+ </para>
+ <para>
+ Cluster record example:
+ <screen><![CDATA[
+ <cluster xmlns="http://www.indexdata.com/pazpar2/1.0">
+ <record>
+ <metadata type="title" rank="2">The Shining</metadata>
+ <metadata type="author">King, Stephen</metadata>
+ <metadata type="kind">ebook</metadata>
+ </record>
+ <record>
+ <metadata type="title" rank="2">The Shining</metadata>
+ <metadata type="author">King, Stephen</metadata>
+ <metadata type="kind">audio</metadata>
+ </record>
+ </cluster>
+ ]]></screen>
+ </para>
+ </section>
+
+ <section id="client">
+ <title>Client development overview</title>
+ <para>
+ You can use Pazpar2 from any environment that allows you to use
+ webservices. The initial goal of the software was to support
+ Ajax-based applications, but it is not limited to them.
+ You can use Pazpar2 from JavaScript, Flash, Java, etc.,
+ on the browser side, and from any development environment on the
+ server side, and you can pass session tokens and record IDs freely
+ between these environments to build sophisticated applications.
+ Use your imagination.
+ </para>
+
+ <para>
+ The webservice API of Pazpar2 is described in detail in <xref
+ linkend="pazpar2_protocol"/>.
+ </para>
+
+ <para>
+ In brief, you use the 'init' command to create a session, a
+ temporary workspace which carries information about the current
+ search. You start a new search using the 'search' command. Once the
+ search has been started, you can follow its progress using the
+ 'stat', 'bytarget', 'termlist', or 'show' commands. Detailed records
+ can be fetched using the 'record' command.
+ </para>
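+
+ <para>
+ A typical exchange could look roughly like the sequence below. The
+ session number, the query and the record ID placeholder are made up
+ for this sketch, and the response is abbreviated; see
+ <xref linkend="pazpar2_protocol"/> for the exact parameters and
+ response formats.
+ <screen><![CDATA[
+ # 1. create a session
+ search.pz2?command=init
+     response: <init><status>OK</status><session>2044502273</session></init>
+
+ # 2. start a search in that session
+ search.pz2?session=2044502273&command=search&query=computer+science
+
+ # 3. poll for progress and fetch the merged result list
+ search.pz2?session=2044502273&command=stat
+ search.pz2?session=2044502273&command=show&start=0&num=20
+
+ # 4. fetch one detailed record (ID taken from the 'show' response)
+ search.pz2?session=2044502273&command=record&id=RECID
+ ]]></screen>
+ </para>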
+ </section>
+
+ &sect-ajaxdev;
+
+ <section id="unicode">
+ <title>Unicode Compliance</title>
+ <para>
+ Pazpar2 is Unicode compliant and language and locale aware, but it relies
+ on the character encoding of the targets being specified correctly if
+ the targets themselves are not UTF-8 based (most aren't).
+ Just a few badly behaved targets can spoil the search experience
+ considerably if, for example, Greek, Russian, or other search terms
+ outside 7-bit ASCII are entered. In these cases some targets return
+ records irrelevant to the query, and the result screens will be
+ cluttered with noise.
+ </para>
+ <para>
+ While noise from misbehaving targets cannot be removed, it can
+ be reduced using truly Unicode-based ranking. This is an
+ option which is available to the system administrator if ICU
+ support is compiled into Pazpar2; see
+ <xref linkend="installation"/> for details.
+ </para>
+ <para>
+ In addition, the ICU tokenization and normalization rules must
+ be defined in the master configuration file described in
+ <xref linkend="config-server"/>.
+ </para>
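+
+ <para>
+ As a rough sketch, a rule chain for relevance ranking might look like
+ the fragment below. The locale and the individual rules are
+ assumptions for illustration, and the exact placement of the chain
+ within the configuration depends on the Pazpar2 version, so treat
+ <xref linkend="config-server"/> as the authoritative reference.
+ <screen><![CDATA[
+ <relevance>
+   <icu_chain locale="en">
+     <!-- strip control characters before tokenizing -->
+     <transform rule="[:Control:] Any-Remove"/>
+     <!-- split the text into tokens -->
+     <tokenize rule="l"/>
+     <!-- drop white space and punctuation inside tokens -->
+     <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
+     <!-- fold to lower case -->
+     <casemap rule="l"/>
+   </icu_chain>
+ </relevance>
+ ]]></screen>
+ </para>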
+ </section>
+
+ <section id="load_balancing">
+ <title>Load balancing</title>
+ <para>
+ Just like any web server, Pazpar2 can be load balanced by a standard
+ hardware or software load balancer as long as session stickiness
+ is ensured. If you are already running the Apache2 web server in front
+ of Pazpar2 and use the Apache mod_proxy module to 'relay' client
+ requests to Pazpar2, this setup can easily be extended to include
+ load balancing capabilities.
+ To do so you need to enable the
+ <ulink url="http://httpd.apache.org/docs/2.2/mod/mod_proxy_balancer.html">
+ mod_proxy_balancer
+ </ulink>
+ module in your Apache2 installation.
+ </para>
+
+ <para>
+ On a Debian-based Apache 2 system, the relevant modules can
+ be enabled with:
+ <screen>
+ sudo a2enmod proxy_http proxy_balancer
+ </screen>
+ </para>
+
+ <para>
+ The mod_proxy_balancer can pass all 'sessionsticky' requests to the
+ same backend worker as long as the requests are marked with the
+ originating worker's ID (called 'route'). If the Pazpar2 server ID is
+ configured (by setting an 'id' attribute on the 'server' element in
+ the Pazpar2 configuration file), Pazpar2 will append it to the
+ 'session' element returned by the 'init' command in a
+ mod_proxy_balancer compatible manner.
+ Since the 'session' is then re-sent by the client (for all Pazpar2
+ requests besides 'init'), the balancer can use the marker to pass
+ the request to the right route. To do so, the balancer needs to be
+ configured to inspect the 'session' parameter.
+ </para>
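+
+ <para>
+ As a sketch, one of several instances could be configured as shown
+ below; the ID 'pz1', the port and the session number are assumptions
+ for illustration.
+ <screen><![CDATA[
+ <server id="pz1">
+   <listen port="8004"/>
+   <!-- service definition omitted -->
+ </server>
+ ]]></screen>
+ With this setting, the 'init' response would carry a session of the
+ form 'sessid.serverid', for example
+ <literal>2044502273.pz1</literal>, and the client would send that
+ value back unchanged in the 'session' parameter of later requests.
+ </para>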
+
+ <example id="load_balancing.example">
+ <title>Apache 2 load balancing configuration</title>
+ <para>
+ With four Pazpar2 instances running on the same host, on ports
+ 8004-8007 and with server IDs pz1, pz2, pz3 and pz4 respectively, we
+ could use the following Apache 2 configuration to expose a single
+ Pazpar2 'endpoint' at a standard location
+ (<filename>/pazpar2/search.pz2</filename>):
+
+ <screen><![CDATA[
+ <Proxy *>
+ AddDefaultCharset off
+ Order deny,allow
+ Allow from all
+ </Proxy>
+ ProxyVia Off
+
+ # 'route' has to match the configured pazpar2 server ID
+ <Proxy balancer://pz2cluster>
+ BalancerMember http://localhost:8004 route=pz1
+ BalancerMember http://localhost:8005 route=pz2
+ BalancerMember http://localhost:8006 route=pz3
+ BalancerMember http://localhost:8007 route=pz4
+ </Proxy>
+
+ # route is resent in the 'session' param, which has the form
+ # 'sessid.serverid', understandable by mod_proxy_balancer;
+ # this will not work if the client tampers with the 'session' param
+ ProxyPass /pazpar2/search.pz2 balancer://pz2cluster lbmethod=byrequests stickysession=session nofailover=On
+ ]]></screen>
+
+ The 'ProxyPass' line sets up a reverse proxy for requests to
+ '/pazpar2/search.pz2' and delegates all such requests to the load balancer
+ (virtual worker) named 'pz2cluster'.
+ Sticky sessions are enabled and implemented using the 'session' parameter.
+ The 'Proxy' section lists all the servers (real workers) which the
+ load balancer can use.
+ </para>
+
+ </example>
+
+ </section>
+
+ <section id="relevance_ranking">
+ <title>Relevance ranking</title>
+ <para>
+ Pazpar2 uses a variant of the term frequency–inverse document frequency
+ (Tf-idf) ranking algorithm.
+ </para>
+ <para>
+ The Tf-part is straightforward to calculate and is based on the
+ documents that Pazpar2 fetches. The idf-part, however, is trickier,
+ since the corpus at hand consists only of the relevant documents, not
+ the irrelevant ones. Pazpar2 does not have the full corpus -- only the
+ documents that match a particular search.
+ </para>
+ <para>
+ Computation of the Tf-part is based on the normalized documents.
+ The length, the position and the terms are thus normalized at this point.
+ Also, the computation is performed for each document received from the
+ target, before merging takes place. The result of a TF-computation is
+ added to the TF-total of a cluster. Thus, if a document occurs twice,
+ the TF-part is doubled. That, however, can be adjusted, because the
+ TF-part may be divided by the number of documents in a cluster.
+ </para>
+ <para>
+ The algorithm used by Pazpar2 has two phases. In phase one,
+ Pazpar2 computes a tf-array. This is done as records are
+ fetched from the database. In this phase, the rank weight
+ <literal>w</literal> and the rank tweaks <literal>lead</literal>,
+ <literal>follow</literal> and <literal>length</literal> are in use.
+
+ </para>
+ <screen><![CDATA[
+ tf[1,2,..N] = 0;
+ foreach document in a cluster
+   foreach field
+     w[1,2,..N] = 0;
+     for i = 1, .. N: (each term)
+       foreach pos (where term i occurs in field)
+         // w is configured weight for field
+         // pos is position of term in field
+         w[i] += w / (1 + log2(1+lead*pos))
+         if (d > 0)
+           w[i] += w[i] * follow / (1+log2(d))
+       // length: length of field (number of terms that is)
+       if (length strategy is "linear")
+         tf[i] += w[i] / length;
+       else if (length strategy is "log")
+         tf[i] += w[i] / log2(length);
+       else if (length strategy is "none")
+         tf[i] += w[i];
+ ]]></screen>
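+ <para>
+ As a worked example under stated assumptions, take the title field
+ "The Shining", a query consisting of the single term "shining", a
+ configured weight <literal>w</literal> of 2, and both
+ <literal>lead</literal> and <literal>follow</literal> set to 0. It is
+ further assumed that the field is counted as two terms after
+ normalization.
+ <screen><![CDATA[
+ lead = 0    =>  log2(1 + lead*pos) = log2(1) = 0, so pos has no effect
+ w[1]        =   2 / (1 + 0)                  = 2
+ follow = 0  =>  the follow bonus is 0 whatever d is
+
+ length = 2
+ "linear":   tf[1] += 2 / 2        = 1
+ "log":      tf[1] += 2 / log2(2)  = 2
+ "none":     tf[1] += 2            = 2
+ ]]></screen>
+ </para>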
+ <para>
+ In phase two, the idf-array and the final score are computed.
+ This is done for each cluster as part of each show command.
+ The rank tweak <literal>cluster</literal> is in use here.
+ </para>
+ <screen><![CDATA[
+ // dococcur[i]: number of records where term occurs
+ // doctotal: number of records
+ for i = 1, .., N (each term)
+   if (dococcur[i] > 0)
+     idf[i] = log(1 + doctotal / dococcur[i])
+   else
+     idf[i] = 0;
+
+ relevance = 0;
+ for i = 1, .., N: (each term)
+   if (cluster is "yes")
+     tf[i] = tf[i] / cluster_size;
+   relevance += 100000 * tf[i] * idf[i];
+ ]]></screen>
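+ <para>
+ To illustrate the idf step with made-up numbers, assume that 10
+ records have been fetched, that <literal>log</literal> is the natural
+ logarithm and that the division is not truncated to an integer.
+ A term found in only 2 of the records then receives a noticeably
+ higher idf than one found in all 10:
+ <screen><![CDATA[
+ doctotal = 10
+ dococcur[1] = 2   =>  idf[1] = log(1 + 10/2)  = log(6) ~ 1.79
+ dococcur[2] = 10  =>  idf[2] = log(1 + 10/10) = log(2) ~ 0.69
+ dococcur[3] = 0   =>  idf[3] = 0
+ ]]></screen>
+ </para>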
+ </section> <!-- relevance_ranking -->
+
+ <section id="masterkey_connect">
+ <title>Pazpar2 and MasterKey Connect</title>
+ <para>
+ MasterKey Connect is a hosted connector, or gateway, service that exposes
+ whatever searchable resources you need. Since the service exposes all
+ resources using Z39.50 (or SRU), it is easy to set up Pazpar2 to use the
+ service. In particular, since all connectors expose basically the same core
+ behavior, it is a good use of Pazpar2's mechanism for managing default
+ behaviors across similar databases.
+ </para>
+ <para>
+ After installation of Pazpar2, the directory
+ <filename>/etc/pazpar2/settings/mkc</filename> (location may
+ vary depending on installation preferences) contains an example setup that
+ searches two different resources through a MasterKey Connect demo account.
+ The file <filename>mkc.xml</filename> contains default parameters that
+ will work for all MasterKey Connect resources (if you decide to become
+ a customer of the service, you will substitute your own account
+ credentials for the guest/guest demo credentials). The other files
+ contain specific information about a couple of demonstration resources.
+ </para>
+
+ <para>
+ To play with the demo, just create a symlink from
+ <filename>/etc/pazpar2/services-enabled/default.xml</filename>
+ to <filename>/etc/pazpar2/services-available/mkc.xml</filename>
+ and restart Pazpar2. You should now be able to search the two demo
+ resources using JSDemo or any user interface of your choice.
+ If you are interested in learning more about MasterKey Connect, or to
+ try out the service for free against your favorite online resource, just
+ contact us at <email>info@indexdata.com</email>.
+ </para>
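+
+ <para>
+ On a system that uses the default paths above, the two steps might be
+ carried out as follows. The restart command is an assumption that
+ depends on how Pazpar2 was installed and on the init system in use.
+ If a <filename>default.xml</filename> is already present in
+ <filename>services-enabled</filename>, move it aside first.
+ <screen>
+ sudo ln -s /etc/pazpar2/services-available/mkc.xml /etc/pazpar2/services-enabled/default.xml
+ sudo service pazpar2 restart
+ </screen>
+ </para>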
+ </section>
+
+ </chapter> <!-- Using Pazpar2 -->
+