1 <?xml version="1.0" standalone="no"?>
2 <!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook V4.1//EN"
3 "http://www.oasis-open.org/docbook/xml/4.1/docbookx.dtd"
5 <!ENTITY % local SYSTEM "local.ent">
7 <!ENTITY % entities SYSTEM "entities.ent">
9 <!ENTITY % idcommon SYSTEM "common/common.ent">
12 <!-- $Id: pazpar2_conf.xml,v 1.31 2007-09-10 16:25:49 adam Exp $ -->
13 <refentry id="pazpar2_conf">
15 <productname>Pazpar2</productname>
16 <productnumber>&version;</productnumber>
19 <refentrytitle>Pazpar2 conf</refentrytitle>
20 <manvolnum>5</manvolnum>
24 <refname>pazpar2_conf</refname>
25 <refpurpose>Pazpar2 Configuration</refpurpose>
30 <command>pazpar2.conf</command>
34 <refsect1><title>DESCRIPTION</title>
The Pazpar2 configuration file, together with any referenced XSLT files,
governs Pazpar2's behavior as a client, and controls the normalization and
extraction of data elements from incoming result records, for the
purposes of merging, sorting, facet analysis, and display.
43 The file is specified using the option -f on the Pazpar2 command line.
44 There is not presently a way to reload the configuration file without
45 restarting Pazpar2, although this will most likely be added some time
50 <refsect1><title>FORMAT</title>
52 The configuration file is XML-structured. It must be valid XML. All
53 elements specific to Pazpar2 should belong to the namespace
54 <literal>http://www.indexdata.com/pazpar2/1.0</literal>
55 (this is assumed in the
56 following examples). The root element is named <literal>pazpar2</literal>.
57 Under the root element are a number of elements which group categories of
58 information. The categories are described below.
61 <refsect2 id="config-server"><title>server</title>
63 This section governs overall behavior of the client. The data
64 elements are described below.
66 <variablelist> <!-- level 1 -->
71 Configures the webservice -- this controls how you can connect
72 to Pazpar2 from your browser or server-side code. The
73 attributes 'host' and 'port' control the binding of the
74 server. The 'host' attribute can be used to bind the server to
75 a secondary IP address of your system, enabling you to run
76 Pazpar2 on port 80 alongside a conventional web server. You
77 can override this setting on the command line using the option -h.
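As a minimal sketch of the binding described above, a 'listen' element
restricted to one address might look as follows. The address 192.0.2.10 is a
placeholder chosen for illustration only, not a value from any real
installation.
<screen><![CDATA[
<!-- hypothetical secondary IP address; replace with one of your own -->
<listen host="192.0.2.10" port="80"/>
]]></screen>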
If this item is given, Pazpar2 will forward all incoming HTTP
requests that do not contain the filename 'search.pz2' to the
host and port specified using the 'host' and 'port'
attributes. The 'myurl' attribute is required, and should provide
the base URL of the server. Generally, this is the HTTP URL for the host
specified in the 'listen' parameter. This functionality is
crucial if you wish to use
Pazpar2 in conjunction with browser-based code (JS, Flash,
applets, etc.) which operates in a security sandbox. Such code
can only connect to the same server from which the enclosing
HTML page originated. Pazpar2's proxy functionality enables you
to host all of the main pages (plus images, CSS, etc.) of your
application on a conventional webserver, while efficiently
processing webservice requests for metasearch status, results,
106 <term>relevance</term>
109 Specifies ICU tokenization and normalization rules
110 for tokens that are used in Pazpar2's relevance ranking. The 'id'
111 attribute is currently not used, and the 'locale'
112 attribute must be set to one of the locale strings
113 defined in ICU. The child elements listed below can be
114 in any order, except the 'index' element which logically
115 belongs to the end of the list. The stated tokenization,
116 normalization and charmapping instructions are performed
117 in order from top to bottom.
119 <variablelist> <!-- Level 2 -->
120 <varlistentry><term>casemap</term>
The attribute 'rule' defines the direction of the
per-character casemapping. Allowed values are "l"
(lower), "u" (upper), and "t" (title).
129 <varlistentry><term>normalize</term>
132 Normalization and transformation of tokens follows
133 the rules defined in the 'rule' attribute. For
134 possible values we refer to the extensive ICU
135 documentation found at the
136 <ulink url="&url.icu.transform;">ICU
137 transformation</ulink> home page. Set filtering
138 principles are explained at the
139 <ulink url="&url.icu.unicode.set;">ICU set and
140 filtering</ulink> page.
144 <varlistentry><term>tokenize</term>
Tokenization is the only rule in the ICU chain
which splits one token into multiple tokens. The
'rule' attribute may have the following values:
"s" (sentence), "l" (line-break), "w" (word), and
"c" (character), the latter probably not being
very useful in a running Pazpar2 installation.
156 <varlistentry><term>index</term>
159 Finally the 'index' element instruction - without
160 any 'rule' attribute - is used to store the tokens
161 after chain processing in the relevance ranking
162 unit of Pazpar2. It will always be the last
163 instruction in the chain.
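The chain described above could be sketched as follows. The 'icu_chain'
wrapper and the two 'normalize' rules are borrowed from the EXAMPLE section of
this page; placing the chain directly inside 'relevance', and the choice of
the "en" locale and the "w" tokenizer, are assumptions made for illustration.
<screen><![CDATA[
<relevance>
  <icu_chain id="en:word" locale="en">
    <!-- instructions run from top to bottom -->
    <normalize rule="[:Control:] Any-Remove"/>
    <tokenize rule="w"/>
    <normalize rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
    <casemap rule="l"/>
    <!-- 'index' takes no 'rule' attribute and comes last -->
    <index/>
  </icu_chain>
</relevance>
]]></screen>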
Specifies ICU tokenization and normalization rules
for tokens that are used in Pazpar2's sorting. The content
is similar to that of <literal>relevance</literal>.
183 <term>mergekey</term>
Specifies ICU tokenization and normalization rules
for tokens that are used in Pazpar2's mergekey. The content
is similar to that of <literal>relevance</literal>.
197 This nested element controls the behavior of Pazpar2 with
198 respect to your data model. In Pazpar2, incoming records are
199 normalized, using XSLT, into an internal representation.
200 The 'service' section controls the further processing and
201 extraction of data from the internal representation, primarily
202 through the 'metadata' sub-element.
205 <variablelist> <!-- Level 2 -->
206 <varlistentry><term>metadata</term>
One of these elements is required for every data element in
the internal representation of the record (see
<xref linkend="data_model"/>). It governs
subsequent processing as it pertains to sorting, relevance
ranking, merging, and display of data elements. It supports
the following attributes:
217 <variablelist> <!-- level 3 -->
218 <varlistentry><term>name</term>
221 This is the name of the data element. It is matched
222 against the 'type' attribute of the
224 in the normalized record. A warning is produced if
225 metadata elements with an unknown name are
227 normalized record. This name is also used to
229 data elements in the records returned by the
230 webservice API, and to name sort lists and browse
236 <varlistentry><term>type</term>
The type of data element. This value governs any
normalization or special processing that might take
place on an element. Possible values are 'generic'
(basic string) and 'year' (a range is computed if
multiple years are found in the record). Note: This
list is likely to grow in the future.
249 <varlistentry><term>brief</term>
If this is set to 'yes', then the data element is
included in brief records in the webservice API. Note
that this only makes sense for metadata elements that
are merged (see below). The default value is 'no'.
260 <varlistentry><term>sortkey</term>
263 Specifies that this data element is to be used for
264 sorting. The possible values are 'numeric' (numeric
265 value), 'skiparticle' (string; skip common, leading
266 articles), and 'no' (no sorting). The default value is
272 <varlistentry><term>rank</term>
275 Specifies that this element is to be used to
277 records against the user's query (when ranking is
278 requested). The value is an integer, used as a
279 multiplier against the basic TF*IDF score. A value of
280 1 is the base, higher values give additional
282 elements of this type. The default is '0', which
283 excludes this element from the rank calculation.
288 <varlistentry><term>termlist</term>
291 Specifies that this element is to be used as a
292 termlist, or browse facet. Values are tabulated from
293 incoming records, and a highscore of values (with
294 their associated frequency) is made available to the
295 client through the webservice API.
297 are 'yes' and 'no' (default).
302 <varlistentry><term>merge</term>
This governs whether, and how, elements are extracted
from individual records and merged into cluster
records. The possible values are: 'unique' (include
all unique elements), 'longest' (include only the
longest element, by string length), 'range' (calculate a range
of values across all matching records), 'all' (include
all elements), or 'no' (don't merge; this is the
317 <varlistentry><term>setting</term>
320 This attribute allows you to make use of static database
321 settings in the processing of records. Three possible values
322 are allowed. 'no' is the default and doesn't do anything.
323 'postproc' copies the value of a setting with the same name
324 into the output of the normalization stylesheet(s). 'parameter'
325 makes the value of a setting with the same name available
326 as a parameter to the normalization stylesheet, so you
327 can further process the value inside of the stylesheet, or use
328 the value to decide how to deal with other data values.
The purpose of using settings in this way can either be to
control the behavior of the normalization stylesheet in a
database-dependent way, or to easily make database-dependent values
available to display logic in your user interface, without having
to implement complicated interactions between the user interface
and your configuration system.
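A minimal sketch of the 'postproc' case follows. The setting name 'category',
its value, and the wildcard target are hypothetical and serve only to show how
a 'metadata' element and a setting of the same name might be paired.
<screen><![CDATA[
<!-- in the Pazpar2 configuration file -->
<metadata name="category" setting="postproc"/>

<!-- in a settings file: a setting with the same name -->
<settings target="*">
  <set name="category" value="fulltext"/>
</settings>
]]></screen>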
340 </variablelist> <!-- attributes to metadata -->
344 </variablelist> <!-- Data elements in service directive -->
347 </variablelist> <!-- Data elements in server directive -->
352 <refsect1><title>EXAMPLE</title>
353 <para>Below is a working example configuration:
355 <?xml version="1.0" encoding="UTF-8"?>
356 <pazpar2 xmlns="http://www.indexdata.com/pazpar2/1.0">
359 <listen port="9004"/>
360 <proxy host="us1.indexdata.com" myurl="us1.indexdata.com"/>
362 <!-- optional ICU ranking configuration example -->
364 <icu_chain id="el:word" locale="el">
365 <normalize rule="[:Control:] Any-Remove"/>
367 <normalize rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
374 <metadata name="title" brief="yes" sortkey="skiparticle" merge="longest" rank="6"/>
375 <metadata name="isbn" merge="unique"/>
376 <metadata name="date" brief="yes" sortkey="numeric" type="year" merge="range"
378 <metadata name="author" brief="yes" termlist="yes" merge="longest" rank="2"/>
379 <metadata name="subject" merge="unique" termlist="yes" rank="3"/>
380 <metadata name="url" merge="unique"/>
389 <refsect1 id="target_settings"><title>TARGET SETTINGS</title>
Pazpar2 features a cunning scheme by which you can associate various
kinds of attributes, or settings, with search targets. This can be done
through XML files which are read at startup; each file can associate
one or more settings with one or more targets. The file format is generic
in nature, designed to support a wide range of application requirements. The
settings can be purely technical things, such as how to perform a title
search against a given target, or they can associate arbitrary name=value
pairs with groups of targets -- for instance, if you would like to
place all commercial full-text bases in one group for selection
purposes, or you would like to control what targets are accessible
to users by default. Per-database settings values can even be used
to drive sorting, facet/termlist generation, or end-user interface display
407 During startup, Pazpar2 will recursively read a specified directory
408 (can be identified in the pazpar2.cfg file or on the command line), and
409 process any settings files found therein.
Clients of the Pazpar2 webservice interface can selectively override
settings for individual targets within the scope of one session. This
can be used in conjunction with an external authentication system to
determine which resources are to be accessible to which users. Pazpar2
itself has no notion of end-users, and so can be used in conjunction
with any type of authentication system. Similarly, the authentication
tokens submitted to access-controlled search targets can be
overridden, to allow use of Pazpar2 in a consortial or multi-library
environment, where different end-users may need to be represented to
some search targets in different ways. This, again, can be managed
using an external database or other lookup mechanism. Setting overrides
can be performed either using the 'init' or the 'settings' webservice
429 In fact, every setting that applies to a database (except pz:id, which
430 can only be used for filtering targets to use for a search) can be overridden
431 on a per-session basis. This allows the client to override specific CCL fields
432 for searching, etc., to meet the needs of a session or user.
436 Finally, as an extreme case of this, the webservice client can
437 introduce entirely new targets, on the fly, as part of the init or
438 settings command. This is useful if you desire to manage information
439 about your search targets in a separate application such as a database.
440 You do not need any static settings file whatsoever to run Pazpar2 -- as
441 long as the webservice client is prepared to supply the necessary
442 information at the beginning of every session.
The following discussion of practical issues related to session and settings
management is cast in terms of a user interface based on Ajax/Javascript
technology. It would apply equally well to many other kinds of browser-based logic.
Typically, a Javascript client is not allowed to directly alter the parameters
of a session. There are two reasons for this. One has to do with access
to information: typically, information about a user will be stored in a
system on the server side, or it will be accessible in some way from the server.
The other has to do with trust: since the Javascript client cannot be entirely
trusted (some hostile agent might in fact 'pretend' to be a regular webservice
client), it is more robust to control session settings from scripting that you
run as part of your
webserver. Typically, this can be handled during the session initialization,
Step 1: The Javascript client loads, and asks the webserver for a new Pazpar2
session ID. This can be done using a Javascript call, for instance. Note that
it is possible to submit Ajax XMLHttpRequest calls either to Pazpar2 or to the
webserver that Pazpar2 is proxying for. See the pazpar2_protocol manual page.
Step 2: Code on the webserver authenticates the user, by database lookup,
LDAP access, NCIP, etc. It determines which resources the user has access to,
and any user-specific parameters that are to be applied during this session.
Step 3: The webserver initializes a new Pazpar2 session, and sets user-specific
parameters as necessary, using the init webservice command. A new session ID is
485 Step 4: The webserver returns this session ID to the Javascript client, which then
486 uses the session ID to submit searches, show results, etc.
490 Step 5: When the Javascript client ceases to use the session, Pazpar2 destroys
491 any session-specific information.
494 <refsect2><title>SETTINGS FILE FORMAT</title>
Each file contains a root element named &lt;settings&gt;. It may
contain one or more &lt;set&gt; elements. The settings and set
elements may contain the following attributes. Attributes in the set node
override those in the settings root element. Each set node must
specify (directly, or inherited from the parent node) at least a
target, name, and value.
This specifies the search target to which this setting should be
applied. Targets are identified by their Z39.50 URL, generally
including the host, port, and database name (e.g.
<literal>bagel.indexdata.com:210/marc</literal>).
Two wildcard forms are accepted:
<literal>*</literal> (asterisk) matches all known targets;
<literal>bagel.indexdata.com:210/*</literal> matches all
known databases on the given host.
A precedence system determines what happens if there are
overlapping values for the same setting name for the same
target. A setting for a specific target name overrides a
setting which specifies the target using a wildcard. This makes it
easy to set defaults for all targets, and then override them
for specific targets or hosts. If there are
multiple overlapping settings with the same name and target
value, the 'precedence' attribute determines what happens.
534 The name of the setting. This can be anything you like.
535 However, Pazpar2 reserves a number of setting names for
536 specific purposes, all starting with 'pz:', and it is a good
537 idea to avoid that prefix if you make up your own setting
538 names. See below for a list of reserved variables.
546 The value of the setting. Generally, this can be anything you
547 want -- however, some of the reserved settings may expect
548 specific kinds of values.
553 <term>precedence</term>
556 This should be an integer. If not provided, the default value
557 is 0. If two (or more) settings have the same content for
558 target and name, the precedence value determines the outcome.
559 If both settings have the same precedence value, they are both
560 applied to the target(s). If one has a higher value, then the
561 value of that setting is applied, and the other one is ignored.
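The two override mechanisms described above (specific target over wildcard,
and the 'precedence' attribute) might be combined as in the following sketch.
The target is the one used as an example earlier in this section; the
particular values are illustrative assumptions, not recommended settings.
<screen><![CDATA[
<settings>
  <!-- default for all known targets -->
  <set target="*" name="pz:requestsyntax" value="marc21"/>

  <!-- a specific target overrides the wildcard default -->
  <set target="bagel.indexdata.com:210/marc"
       name="pz:requestsyntax" value="xml"/>

  <!-- same target and name: the higher precedence value is applied,
       and the setting above (default precedence 0) is ignored -->
  <set target="bagel.indexdata.com:210/marc"
       name="pz:requestsyntax" value="marc21" precedence="1"/>
</settings>
]]></screen>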
568 By setting defaults for target, name, or value in the root
569 settings node, you can use the settings files in many different
570 ways. For instance, you can use a single file to set defaults for
571 many different settings, like search fields, retrieval syntaxes,
572 etc. You can have one file per server, which groups settings for
573 that server or target. You could also have one file which associates
574 a number of targets with a given setting, for instance, to associate
575 many databases with a given category or class that makes sense
576 within your application.
580 The following examples illustrate uses of the settings system to
581 associate settings with targets to meet different requirements.
585 The example below associates a set of default values that can be
586 used across many targets. Note the wildcard for targets.
587 This associates the given settings with all targets for which no
588 other information is provided.
590 <settings target="*">
592 <!-- This file introduces default settings for pazpar2 -->
593 <!-- $Id: pazpar2_conf.xml,v 1.31 2007-09-10 16:25:49 adam Exp $ -->
595 <!-- mapping for unqualified search -->
596 <set name="pz:cclmap:term" value="u=1016 t=l,r s=al"/>
598 <!-- field-specific mappings -->
599 <set name="pz:cclmap:ti" value="u=4 s=al"/>
600 <set name="pz:cclmap:su" value="u=21 s=al"/>
601 <set name="pz:cclmap:isbn" value="u=7"/>
602 <set name="pz:cclmap:issn" value="u=8"/>
603 <set name="pz:cclmap:date" value="u=30 r=r"/>
605 <!-- Retrieval settings -->
607 <set name="pz:requestsyntax" value="marc21"/>
608 <!-- <set name="pz:elements" value="F"/> NOT YET IMPLEMENTED -->
610 <!-- Result normalization settings -->
612 <set name="pz:nativesyntax" value="iso2709"/>
613 <set name="pz:xslt" value="../etc/marc21.xsl"/>
621 The next example shows certain settings overridden for one target,
622 one which returns XML records containing DublinCore elements, and
623 which furthermore requires a username/password.
625 <settings target="funkytarget.com:210/db1">
626 <set name="pz:requestsyntax" value="xml"/>
627 <set name="pz:nativesyntax" value="xml"/>
628 <set name="pz:xslt" value="../etc/dublincore.xsl"/>
630 <set name="pz:authentication" value="myuser/password"/>
636 The following example associates a specific name/value combination
637 with a number of targets. The targets below are access-restricted,
638 and can only be used by users with special credentials.
640 <settings name="pz:allow" value="0">
641 <set target="funkytarget.com:210/*"/>
642 <set target="commercial.com:2100/expensiveDb"/>
649 <refsect2><title>RESERVED SETTING NAMES</title>
651 The following setting names are reserved by Pazpar2 to control the
652 behavior of the client function.
657 <term>pz:cclmap:xxx</term>
660 This establishes a CCL field definition or other setting, for
661 the purpose of mapping end-user queries. XXX is the field or
662 setting name, and the value of the setting provides parameters
663 (e.g. parameters to send to the server, etc.). Please consult
664 the YAZ manual for a full overview of the many capabilities of
665 the powerful and flexible CCL parser.
668 Note that it is easy to establish a set of default parameters,
669 and then override them individually for a given target.
674 <term>pz:requestsyntax</term>
677 This specifies the record syntax to use when requesting
678 records from a given server. The value can be a symbolic name like
679 marc21 or xml, or it can be a Z39.50-style dot-separated OID.
684 <term>pz:elements</term>
687 The element set name to be used when retrieving records from a
688 server (not yet implemented).
693 <term>pz:piggyback</term>
Piggybacking enables Pazpar2 to retrieve records from the
server as part of the search response in Z39.50. Almost all
servers support this (or fail it gracefully), but a few
servers will produce undesirable results.
Set to '1' to enable piggybacking, '0' to disable it. The default
is '1' (piggybacking enabled).
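As a brief sketch, piggybacking could be switched off for a single
misbehaving server like this. The target name is hypothetical.
<screen><![CDATA[
<!-- hypothetical target that handles piggybacked requests badly -->
<settings target="quirky.example.org:210/db">
  <set name="pz:piggyback" value="0"/>
</settings>
]]></screen>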
706 <term>pz:nativesyntax</term>
709 The representation (syntax) of the retrieval records. Currently
710 recognized values are iso2709 and xml.
For iso2709, you can also specify a native character set, e.g. "iso2709;latin-1".
If no character set is provided, MARC-8 is assumed.
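For instance, a target that returns ISO 2709 records encoded in Latin-1 might
be configured as sketched below. The target name is a made-up placeholder; the
setting value is the one quoted above.
<screen><![CDATA[
<!-- hypothetical target returning Latin-1 encoded MARC records -->
<settings target="latin1.example.org:210/catalog">
  <set name="pz:nativesyntax" value="iso2709;latin-1"/>
</settings>
]]></screen>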
722 Provides the path of an XSLT stylesheet which will be used to
723 map incoming records to the internal representation.
728 <term>pz:authentication</term>
731 Sets an authentication string for a given server. See the section on
732 authorization and authentication for discussion.
737 <term>pz:allow</term>
740 Allows or denies access to the resources it is applied to. Possible
741 values are '0' and '1'. The default is '1' (allow access to this resource).
742 See the manual section on authorization and authentication for discussion
743 about how to use this setting.
748 <term>pz:maxrecs</term>
751 Controls the maximum number of records to be retrieved from a
752 server. The default is 100 (not yet implemented).
760 This setting can't be 'set' -- it contains the ID (normally
761 ZURL) for a given target, and is useful for filtering --
762 specifically when you want to select one or more specific
763 targets in the search command.
768 <term>pz:zproxy</term>
The 'pz:zproxy' setting has the value syntax
'host.internet.address:port'. It is used to tunnel Z39.50
requests through the named Z39.50 proxy.
779 <term>pz:apdulog</term>
If the 'pz:apdulog' setting is defined and has a value other than 0,
then Z39.50 APDUs are written to the log.
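A short sketch combining 'pz:zproxy' and 'pz:apdulog' for one target, as might
be useful while debugging a connection. Both the target and the proxy address
are hypothetical placeholders.
<screen><![CDATA[
<settings target="remote.example.org:210/db">
  <!-- tunnel Z39.50 requests through a proxy (made-up address) -->
  <set name="pz:zproxy" value="zproxy.example.org:9000"/>
  <!-- any value other than 0 writes Z39.50 APDUs to the log -->
  <set name="pz:apdulog" value="1"/>
</settings>
]]></screen>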
791 <refsect1><title>SEE ALSO</title>
795 <refentrytitle>pazpar2</refentrytitle>
796 <manvolnum>8</manvolnum>
802 <refentrytitle>pazpar2_protocol</refentrytitle>
803 <manvolnum>7</manvolnum>
808 <!-- Keep this comment at the end of the file
813 sgml-minimize-attributes:nil
814 sgml-always-quote-attributes:t
817 sgml-parent-document:nil
818 sgml-local-catalogs: nil
819 sgml-namecase-general:t