1 <chapter id="quick-start">
<title>Quick Start</title>
5 In this section, we will test the system by indexing a small set of sample
6 GILS records that are included with the software distribution. Go to the
<literal remap="tt">test/gils</literal> subdirectory of the distribution archive. There you will
find a file named <literal remap="tt">zebra.cfg</literal> with the following contents:
# Where are the YAZ tables located.
profilePath: ../../../yaz/tab ../../tab
# Files that describe the attribute sets supported.
Now, edit the file and set <literal remap="tt">profilePath</literal> to the path of the
YAZ profile tables (subdirectory <literal remap="tt">tab</literal> of the YAZ distribution).
The 48 test records are located in the subdirectory <literal remap="tt">records</literal>.
$ ../../index/zebraidx -t grs.sgml update records
In the command above, the option <literal remap="tt">-t</literal> specifies the record
type, in this case <literal remap="tt">grs.sgml</literal>. The word <literal remap="tt">update</literal> followed
by a directory root updates all files below that directory node.
45 If your indexing command was successful, you are now ready to
46 fire up a server. To start a server on port 2100, type:
$ ../../index/zebrasrv tcp:@:2100
55 The Zebra index that you have just created has a single database
56 named <literal remap="tt">Default</literal>. The database contains records structured according to
57 the GILS profile, and the server will
return records in USMARC, GRS-1, or SUTRS format, depending
on what your client asks for.
64 To test the server, you can use any Z39.50 client (1992 or later). For
65 instance, you can use the demo client that comes with YAZ: Just cd to
66 the <literal remap="tt">client</literal> subdirectory of the YAZ distribution and type:
$ client tcp:localhost:2100
78 When the client has connected, you can type:
91 The default retrieval syntax for the client is USMARC. To try other
92 formats for the same record, try:
109 <emphasis remap="it">NOTE: You may notice that more fields are returned when your
110 client requests SUTRS or GRS-1 records. When retrieving GILS records,
111 this is normal - not all of the GILS data elements have mappings in
112 the USMARC record format.</emphasis>
116 If you've made it this far, there's a good chance that
117 you've got through the compilation OK.
122 <chapter id="administration">
123 <title>Administrating Zebra</title>
126 Unlike many simpler retrieval systems, Zebra supports safe, incremental
127 updates to an existing index.
131 Normally, when Zebra modifies the index it reads a number of records
Depending on your specifications and on the contents of each record,
one of the following events takes place for each record:
141 The record is indexed as if it never occurred
142 before. Either the Zebra system doesn't know how to identify the record or
143 Zebra can identify the record but didn't find it to be already indexed.
151 The record has already been indexed. In this case
152 either the contents of the record or the location (file) of the record
153 indicates that it has been indexed before.
The record is deleted from the index. As in the
modify case, Zebra must be able to identify the record.
Please note that in both the modify and delete cases the Zebra
indexer must be able to generate a unique key that identifies the record in
question (more on this below).
To administer the Zebra retrieval system, you run the
<literal remap="tt">zebraidx</literal> program. This program supports a number of options
which are preceded by a minus, and a few commands (not preceded by
a minus).
183 Both the Zebra administrative tool and the Z39.50 server share a
184 set of index files and a global configuration file. The
185 name of the configuration file defaults to <literal remap="tt">zebra.cfg</literal>.
186 The configuration file includes specifications on how to index
187 various kinds of records and where the other configuration files
188 are located. <literal remap="tt">zebrasrv</literal> and <literal remap="tt">zebraidx</literal> <emphasis>must</emphasis>
189 be run in the directory where the configuration file lives unless you
190 indicate the location of the configuration file by option
191 <literal remap="tt">-c</literal>.
194 <sect1 id="record-types">
195 <title>Record Types</title>
Indexing is a per-record process in which an insert, modify, or delete
operation occurs. Before a record is indexed, search keys are extracted from
the original record, whatever its layout (SGML, HTML, text, etc.).
The Zebra system currently supports two fundamental types of records:
structured and simple text.
To specify a particular extraction process, use either the
command line option <literal remap="tt">-t</literal> or a
<literal remap="tt">recordType</literal> setting in the configuration file.
210 <sect1 id="configuration-file">
211 <title>The Zebra Configuration File</title>
The Zebra configuration file, read by <literal remap="tt">zebraidx</literal> and
<literal remap="tt">zebrasrv</literal>, defaults to <literal remap="tt">zebra.cfg</literal> unless another file is specified
with the <literal remap="tt">-c</literal> option.
You can edit the configuration file with a normal text editor.
Parameter names and values are separated by colons in the file. Lines
starting with a hash sign (<literal remap="tt">#</literal>) are treated as comments.
If you manage different sets of records that share common
characteristics, you can organize the configuration settings for each
set into groups.
229 When <literal remap="tt">zebraidx</literal> is run and you wish to address a given group
230 you specify the group name with the <literal remap="tt">-g</literal> option. In this case
231 settings that have the group name as their prefix will be used
232 by <literal remap="tt">zebraidx</literal>. If no <literal remap="tt">-g</literal> option is specified, the settings
233 with no prefix are used.
237 In the configuration file, the group name is placed before the option
238 name itself, separated by a dot (.). For instance, to set the record type
239 for group <literal remap="tt">public</literal> to <literal remap="tt">grs.sgml</literal> (the SGML-like format for structured
240 records) you would write:
public.recordType: grs.sgml
252 To set the default value of the record type to <literal remap="tt">text</literal> write:
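<para>
For instance, the default setting and the group setting shown earlier might
sit side by side in one file, as in this sketch. The group name
<literal remap="tt">public</literal> is the one used above; the comments restate the
selection rule described in this section and are not output from Zebra.
</para>
<screen>
# Illustrative zebra.cfg fragment (a sketch, not a complete configuration).

# Used when zebraidx is run with "-g public":
public.recordType: grs.sgml

# Used when zebraidx is run without -g (no group prefix, i.e. the default):
recordType: text
</screen>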
264 The available configuration settings are summarized below. They will be
265 explained further in the following sections.
272 <term><emphasis remap="it">group</emphasis>.recordType[<emphasis remap="it">.name</emphasis>]</term>
275 Specifies how records with the file extension <emphasis remap="it">name</emphasis> should
276 be handled by the indexer. This option may also be specified
277 as a command line option (<literal remap="tt">-t</literal>). Note that if you do not
278 specify a <emphasis remap="it">name</emphasis>, the setting applies to all files. In general,
279 the record type specifier consists of the elements (each
280 element separated by dot), <emphasis remap="it">fundamental-type</emphasis>,
281 <emphasis remap="it">file-read-type</emphasis> and arguments. Currently, two
282 fundamental types exist, <literal remap="tt">text</literal> and <literal remap="tt">grs</literal>.
287 <term><emphasis remap="it">group</emphasis>.recordId</term>
290 Specifies how the records are to be identified when updated. See
291 section <xref linkend="locating-records"/>.
296 <term><emphasis remap="it">group</emphasis>.database</term>
299 Specifies the Z39.50 database name.
304 <term><emphasis remap="it">group</emphasis>.storeKeys</term>
307 Specifies whether key information should be saved for a given
308 group of records. If you plan to update/delete this type of
309 records later this should be specified as 1; otherwise it
310 should be 0 (default), to save register space. See section
311 <xref linkend="file-ids"/>.
316 <term><emphasis remap="it">group</emphasis>.storeData</term>
319 Specifies whether the records should be stored internally
320 in the Zebra system files. If you want to maintain the raw records yourself,
321 this option should be false (0). If you want Zebra to take care of the records
for you, it should be true (1).
327 <term>register</term>
330 Specifies the location of the various register files that Zebra uses
331 to represent your databases. See section
332 <xref linkend="register-location"/>.
340 Enables the <emphasis remap="it">safe update</emphasis> facility of Zebra, and tells the system
341 where to place the required, temporary files. See section
342 <xref linkend="shadow-registers"/>.
350 Directory in which various lock files are stored.
355 <term>keyTmpDir</term>
358 Directory in which temporary files used during zebraidx' update
364 <term>setTmpDir</term>
367 Specifies the directory that the server uses for temporary result sets.
368 If not specified <literal remap="tt">/tmp</literal> will be used.
373 <term>profilePath</term>
376 Specifies the location of profile specification files.
384 Specifies the filename(s) of attribute set files for use in
385 searching. At least the Bib-1 set should be loaded (<literal remap="tt">bib1.att</literal>).
386 The <literal remap="tt">profilePath</literal> setting is used to look for the specified files.
387 See section <xref linkend="attset-files"/>
395 Specifies size of internal memory to use for the zebraidx program. The
396 amount is given in megabytes - default is 4 (4 MB).
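<para>
As a rough illustration of how several of these settings might be combined,
consider the sketch below. The group name <literal remap="tt">reports</literal> and
all paths and sizes are hypothetical, and the fragment is not a complete,
tested configuration; only setting names that are named in this chapter are used.
</para>
<screen>
# Illustrative zebra.cfg sketch. The group name "reports" and every
# path and size below are hypothetical values chosen for the example.

# Global settings (no group prefix)
profilePath: /usr/local/yaz/tab
register: /d1:200M
keyTmpDir: /tmp
setTmpDir: /tmp

# Settings that apply when zebraidx is run with "-g reports"
reports.recordType: grs.sgml
reports.recordId: file
reports.database: reports
reports.storeKeys: 1
reports.storeData: 0
</screen>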
405 <sect1 id="locating-records">
406 <title>Locating Records</title>
409 The default behaviour of the Zebra system is to reference the
410 records from their original location, i.e. where they were found when you
411 ran <literal remap="tt">zebraidx</literal>. That is, when a client wishes to retrieve a record
412 following a search operation, the files are accessed from the place
where you originally put them. If you remove the files (without
running <literal remap="tt">zebraidx</literal> again), the client will receive a diagnostic
message.
419 If your input files are not permanent - for example if you retrieve
420 your records from an outside source, or if they were temporarily
421 mounted on a CD-ROM drive,
422 you may want Zebra to make an internal copy of them. To do this,
423 you specify 1 (true) in the <literal remap="tt">storeData</literal> setting. When
424 the Z39.50 server retrieves the records they will be read from the
425 internal file structures of the system.
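<para>
A minimal sketch of such a setup, assuming a group of text files read from
removable media. The group name <literal remap="tt">cdrom</literal> is hypothetical.
</para>
<screen>
# Sketch: keep an internal copy of records whose source files may disappear.
# The group name "cdrom" is a hypothetical example.
cdrom.recordType: text
cdrom.storeData: 1
</screen>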
430 <sect1 id="simple-indexing">
431 <title>Indexing with no Record IDs (Simple Indexing)</title>
434 If you have a set of records that are not expected to change over time
you can build your database without record IDs.
436 This indexing method uses less space than the other methods and
441 To use this method, you simply omit the <literal remap="tt">recordId</literal> entry
442 for the group of files that you index. To add a set of records you use
443 <literal remap="tt">zebraidx</literal> with the <literal remap="tt">update</literal> command. The
444 <literal remap="tt">update</literal> command will always add all of the records that it
445 encounters to the index - whether they have already been indexed or
not. If the set of indexed files changes, you should delete all of the
447 index files, and build a new index from scratch.
451 Consider a system in which you have a group of text files called
452 <literal remap="tt">simple</literal>. That group of records should belong to a Z39.50 database
453 called <literal remap="tt">textbase</literal>. The following <literal remap="tt">zebra.cfg</literal> file will suffice:
profilePath: /usr/local/yaz
simple.recordType: text
simple.database: textbase
Since the existing records in an index cannot be addressed by their
IDs, it is impossible to delete or modify records when using this method.
474 <sect1 id="file-ids">
475 <title>Indexing with File Record IDs</title>
If you have a set of files that changes regularly over time (old files
are deleted, new ones are added, or existing files are modified), you
can benefit from using the <emphasis remap="it">file ID</emphasis> indexing methodology. Examples
481 of this type of database might include an index of WWW resources, or a
482 USENET news spool area. Briefly speaking, the file key methodology
483 uses the directory paths of the individual records as a unique
484 identifier for each record. To perform indexing of a directory with
485 file keys, again, you specify the top-level directory after the
486 <literal remap="tt">update</literal> command. The command will recursively traverse the
directories and compare each one with whatever has been indexed before in
488 that same directory. If a file is new (not in the previous version of
489 the directory) it is inserted into the registers; if a file was
490 already indexed and it has been modified since the last update,
491 the index is also modified; if a file has been removed since the last
492 visit, it is deleted from the index.
The resulting system is easy to administer. To delete a record you
497 simply have to delete the corresponding file (say, with the <literal remap="tt">rm</literal>
498 command). And to add records you create new files (or directories with
499 files). For your changes to take effect in the register you must run
500 <literal remap="tt">zebraidx update</literal> with the same directory root again. This mode
501 of operation requires more disk space than simpler indexing methods,
502 but it makes it easier for you to keep the index in sync with a
503 frequently changing set of data. If you combine this system with the
504 <emphasis remap="it">safe update</emphasis> facility (see below), you never have to take your
505 server offline for maintenance or register updating purposes.
509 To enable indexing with pathname IDs, you must specify <literal remap="tt">file</literal> as
510 the value of <literal remap="tt">recordId</literal> in the configuration file. In addition,
511 you should set <literal remap="tt">storeKeys</literal> to <literal remap="tt">1</literal>, since the Zebra
512 indexer must save additional information about the contents of each record
513 in order to modify the indices correctly at a later time.
517 For example, to update records of group <literal remap="tt">esdd</literal> located below
518 <literal remap="tt">/data1/records/</literal> you should type:
$ zebraidx -g esdd update /data1/records
527 The corresponding configuration file includes:
esdd.recordType: grs.sgml
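<para>
Going by the two requirements stated above (<literal remap="tt">recordId</literal> set to
<literal remap="tt">file</literal> and <literal remap="tt">storeKeys</literal> set to
<literal remap="tt">1</literal>), the <literal remap="tt">esdd</literal> group would
presumably also carry the following entries. This is a sketch assembled from
the text, not a verified excerpt.
</para>
<screen>
# Sketch of the remaining entries for group esdd, derived from the
# settings described above rather than from a tested configuration.
esdd.recordId: file
esdd.storeKeys: 1
</screen>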
<emphasis>Important note: You cannot start out with a group of records with simple
indexing (no record IDs as in the previous section) and then later
enable file record IDs. Zebra must know from the first time that you
index the group that
the files should be indexed with file record IDs.</emphasis>
You cannot explicitly delete records when using this method (using the
<emphasis remap="bf">delete</emphasis> command to <literal remap="tt">zebraidx</literal>). Instead
you have to delete the files from the file system (or move them to a
different location)
and then run <literal remap="tt">zebraidx</literal> with the <emphasis remap="bf">update</emphasis> command.
555 <sect1 id="generic-ids">
556 <title>Indexing with General Record IDs</title>
When using this method you construct an (almost) arbitrary, internal
record key based on the contents of the record itself and other system
information. If you have a group of records that explicitly associates
an ID with each record, this method is convenient. For example, the
record format may contain a title or an ID number that is unique within the group.
564 In either case you specify the Z39.50 attribute set and use-attribute
565 location in which this information is stored, and the system looks at
566 that field to determine the identity of the record.
570 As before, the record ID is defined by the <literal remap="tt">recordId</literal> setting
571 in the configuration file. The value of the record ID specification
consists of one or more tokens separated by whitespace. The resulting ID is
represented in the index by concatenating the tokens and separating them by
579 There are three kinds of tokens:
583 <term>Internal record info</term>
586 The token refers to a key that is
587 extracted from the record. The syntax of this token is
<literal remap="tt">(</literal> <emphasis>set</emphasis> <literal remap="tt">,</literal> <emphasis>use</emphasis> <literal remap="tt">)</literal>, where <emphasis>set</emphasis> is the
attribute set name and <emphasis>use</emphasis> is the name or value of the attribute.
594 <term>System variable</term>
597 The system variables are preceded by
602 and immediately followed by the system variable name, which
615 <term>database</term>
618 Current database specified.
635 <term>Constant string</term>
A string used as part of the ID, surrounded
by single or double quotes.
647 For instance, the sample GILS records that come with the Zebra
648 distribution contain a unique ID in the data tagged Control-Identifier.
649 The data is mapped to the Bib-1 use attribute Identifier-standard
650 (code 1007). To use this field as a record id, specify
651 <literal remap="tt">(bib1,Identifier-standard)</literal> as the value of the
652 <literal remap="tt">recordId</literal> in the configuration file.
If you have other record types that use the same field for a
different purpose, you might add the record type
(or group or database name) to the record ID of the GILS
records as well, to prevent matches with other types of records.
In this case the recordId might be set like this:
gils.recordId: $type (bib1,Identifier-standard)
666 (see section <xref linkend="data-model"/>
667 for details of how the mapping between elements of your records and
668 searchable attributes is established).
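<para>
The three kinds of tokens can also be mixed in a single ID. The sketch below
combines a system variable, a constant string, and a key extracted from the
record. The spelling <literal remap="tt">$database</literal> is an assumption
modelled on the <literal remap="tt">$type</literal> form shown above, and the
constant <literal remap="tt">"gils-"</literal> is an arbitrary illustrative value.
</para>
<screen>
# Sketch: a record ID built from three tokens separated by whitespace.
# "$database" is assumed to follow the same form as "$type" above;
# the constant string "gils-" is an arbitrary example.
gils.recordId: $database "gils-" (bib1,Identifier-standard)

# Key information is kept so that the records can be updated or deleted
# later (see the storeKeys setting).
gils.storeKeys: 1
</screen>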
672 As for the file record ID case described in the previous section,
673 updating your system is simply a matter of running <literal remap="tt">zebraidx</literal>
674 with the <literal remap="tt">update</literal> command. However, the update with general
675 keys is considerably slower than with file record IDs, since all files
676 visited must be (re)read to discover their IDs.
As you might expect, when using the general record IDs
method, the <literal remap="tt">update</literal> command can only add new records
or modify existing ones. If you wish to delete records, you must use the
<literal remap="tt">delete</literal> command, with a directory as a parameter.
684 This will remove all records that match the files below that root
690 <sect1 id="register-location">
691 <title>Register Location</title>
694 Normally, the index files that form dictionaries, inverted
695 files, record info, etc., are stored in the directory where you run
696 <literal remap="tt">zebraidx</literal>. If you wish to store these, possibly large, files
697 somewhere else, you must add the <literal remap="tt">register</literal> entry to the
<literal remap="tt">zebra.cfg</literal> file. Furthermore, the Zebra system allows its files to
span multiple file systems, which is useful for managing very large
705 The value of the <literal remap="tt">register</literal> setting is a sequence of tokens.
706 Each token takes the form:
709 <emphasis>dir</emphasis><literal remap="tt">:</literal><emphasis>size</emphasis>.
712 The <emphasis>dir</emphasis> specifies a directory in which index files will be
713 stored and the <emphasis>size</emphasis> specifies the maximum size of all
files in that directory. The Zebra indexer system fills each directory
in the order specified and uses the next specified directories as needed.
716 The <emphasis>size</emphasis> is an integer followed by a qualifier
717 code, <literal remap="tt">M</literal> for megabytes, <literal remap="tt">k</literal> for kilobytes.
For instance, if you have allocated two disks for your register, and
the first disk is mounted
on <literal remap="tt">/d1</literal> and has 200 MB of free space and the
second, mounted on <literal remap="tt">/d2</literal>, has 300 MB, you could
put this entry in your configuration file:
register: /d1:200M /d2:300M
734 Note that Zebra does not verify that the amount of space specified is
735 actually available on the directory (file system) specified - it is
736 your responsibility to ensure that enough space is available, and that
other applications do not attempt to use the free space. In a large production system,
it is recommended that you allocate one or more file systems exclusively
to the Zebra register files.
744 <sect1 id="shadow-registers">
745 <title>Safe Updating - Using Shadow Registers</title>
748 <title>Description</title>
751 The Zebra server supports <emphasis remap="it">updating</emphasis> of the index structures. That is,
752 you can add, modify, or remove records from databases managed by Zebra
753 without rebuilding the entire index. Since this process involves
754 modifying structured files with various references between blocks of
755 data in the files, the update process is inherently sensitive to
756 system crashes, or to process interruptions: Anything but a
757 successfully completed update process will leave the register files in
758 an unknown state, and you will essentially have no recourse but to
759 re-index everything, or to restore the register files from a backup
760 medium. Further, while the update process is active, users cannot be
761 allowed to access the system, as the contents of the register files
762 may change unpredictably.
766 You can solve these problems by enabling the shadow register system in
767 Zebra. During the updating procedure, <literal remap="tt">zebraidx</literal> will temporarily
768 write changes to the involved files in a set of "shadow
769 files", without modifying the files that are accessed by the
770 active server processes. If the update procedure is interrupted by a
771 system crash or a signal, you simply repeat the procedure - the
772 register files have not been changed or damaged, and the partially
773 written shadow files are automatically deleted before the new updating
778 At the end of the updating procedure (or in a separate operation, if
779 you so desire), the system enters a "commit mode". First,
780 any active server processes are forced to access those blocks that
781 have been changed from the shadow files rather than from the main
782 register files; the unmodified blocks are still accessed at their
783 normal location (the shadow files are not a complete copy of the
784 register files - they only contain those parts that have actually been
modified). If the commit process is interrupted at any point,
the server processes will continue to access the
787 shadow files until you can repeat the commit procedure and complete
788 the writing of data to the main register files. You can perform
789 multiple update operations to the registers before you commit the
790 changes to the system files, or you can execute the commit operation
791 at the end of each update operation. When the commit phase has
792 completed successfully, any running server processes are instructed to
793 switch their operations to the new, operational register, and the
794 temporary shadow files are deleted.
800 <title>How to Use Shadow Register Files</title>
803 The first step is to allocate space on your system for the shadow
804 files. You do this by adding a <literal remap="tt">shadow</literal> entry to the <literal remap="tt">zebra.cfg</literal>
805 file. The syntax of the <literal remap="tt">shadow</literal> entry is exactly the same as for
806 the <literal remap="tt">register</literal> entry (see section <xref linkend="register-location"/>). The location of the shadow area should be
807 <emphasis remap="it">different</emphasis> from the location of the main register area (if you
808 have specified one - remember that if you provide no <literal remap="tt">register</literal>
809 setting, the default register area is the
810 working directory of the server and indexing processes).
814 The following excerpt from a <literal remap="tt">zebra.cfg</literal> file shows one example of
815 a setup that configures both the main register location and the shadow
816 file area. Note that two directories or partitions have been set aside
817 for the shadow file area. You can specify any number of directories
818 for each of the file areas, but remember that there should be no
819 overlaps between the directories used for the main registers and the
820 shadow files, respectively.
shadow: /scratch1:100M /scratch2:200M
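<para>
Since the main register must not share a directory with the shadow area, a
complete pair of entries might look like the sketch below. The
<literal remap="tt">register</literal> path and size are hypothetical values chosen
for illustration; the <literal remap="tt">shadow</literal> line repeats the one above.
</para>
<screen>
# Sketch: main register and shadow area on separate, non-overlapping
# directories. The register path and size are hypothetical.
register: /d1:500M
shadow: /scratch1:100M /scratch2:200M
</screen>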
834 When shadow files are enabled, an extra command is available at the
835 <literal remap="tt">zebraidx</literal> command line. In order to make changes to the system
836 take effect for the users, you'll have to submit a
837 "commit" command after a (sequence of) update
838 operation(s). You can ask the indexer to commit the changes
839 immediately after the update operation:
$ zebraidx update /d1/records update /d2/more-records commit
851 Or you can execute multiple updates before committing the changes:
$ zebraidx -g books update /d1/records update /d2/more-records
$ zebraidx -g fun update /d3/fun-records
$ zebraidx commit
865 If one of the update operations above had been interrupted, the commit
866 operation on the last line would fail: <literal remap="tt">zebraidx</literal> will not let you
867 commit changes that would destroy the running register. You'll have to
868 rerun all of the update operations since your last commit operation,
869 before you can commit the new changes.
873 Similarly, if the commit operation fails, <literal remap="tt">zebraidx</literal> will not let
874 you start a new update operation before you have successfully repeated
875 the commit operation. The server processes will keep accessing the
876 shadow files rather than the (possibly damaged) blocks of the main
877 register files until the commit operation has successfully completed.
881 You should be aware that update operations may take slightly longer
882 when the shadow register system is enabled, since more file access
883 operations are involved. Further, while the disk space required for
884 the shadow register data is modest for a small update operation, you
885 may prefer to disable the system if you are adding a very large number
886 of records to an already very large database (we use the terms
887 <emphasis remap="it">large</emphasis> and <emphasis remap="it">modest</emphasis> very loosely here, since every
888 application will have a different perception of size). To update the system
without the use of the shadow files, simply run <literal remap="tt">zebraidx</literal> with
890 the <literal remap="tt">-n</literal> option (note that you do not have to execute the
891 <emphasis remap="bf">commit</emphasis> command of <literal remap="tt">zebraidx</literal> when you temporarily disable the
892 use of the shadow registers in this fashion. Note also that, just as
893 when the shadow registers are not enabled, server processes will be
894 barred from accessing the main register while the update procedure
904 <chapter id="zebraidx">
905 <title>Running the Maintenance Interface (zebraidx)</title>
908 The following is a complete reference to the command line interface to
909 the <literal remap="tt">zebraidx</literal> application.
913 <emphasis remap="bf">Syntax</emphasis>
$ zebraidx [options] command [directory] ...
919 <emphasis remap="bf">Options</emphasis>
923 <term>-t <emphasis remap="it">type</emphasis></term>
926 Update all files as <emphasis remap="it">type</emphasis>. Currently, the
927 types supported are <literal remap="tt">text</literal> and <literal remap="tt">grs</literal><emphasis remap="it">.subtype</emphasis>. If no
928 <emphasis remap="it">subtype</emphasis> is provided for the GRS (General Record Structure) type,
929 the canonical input format is assumed (see section <xref linkend="local-representation"/>). Generally, it
930 is probably advisable to specify the record types in the
931 <literal remap="tt">zebra.cfg</literal> file
932 (see section <xref linkend="record-types"/>), to avoid
933 confusion at subsequent updates.
938 <term>-c <emphasis remap="it">config-file</emphasis></term>
941 Read the configuration file
942 <emphasis remap="it">config-file</emphasis> instead of <literal remap="tt">zebra.cfg</literal>.
947 <term>-g <emphasis remap="it">group</emphasis></term>
950 Update the files according to the group
951 settings for <emphasis remap="it">group</emphasis> (see section
952 <xref linkend="configuration-file"/>).
957 <term>-d <emphasis remap="it">database</emphasis></term>
960 The records located should be associated
961 with the database name <emphasis remap="it">database</emphasis> for access through the Z39.50
967 <term>-m <emphasis remap="it">mbytes</emphasis></term>
Use <emphasis remap="it">mbytes</emphasis> megabytes of memory before flushing
971 keys to background storage. This setting affects performance when
972 updating large databases.
980 Disable the use of shadow registers for this operation
981 (see section <xref linkend="shadow-registers"/>).
989 Show analysis of the indexing process. The maintenance
990 program works in a read-only mode and doesn't change the state
of the index. This option is very useful when you wish to test a
1005 <term>-v <emphasis remap="it">level</emphasis></term>
1008 Set the log level to <emphasis remap="it">level</emphasis>. <emphasis remap="it">level</emphasis>
1009 should be one of <literal remap="tt">none</literal>, <literal remap="tt">debug</literal>, and <literal remap="tt">all</literal>.
1017 <emphasis remap="bf">Commands</emphasis>
1021 <term>Update <emphasis remap="it">directory</emphasis></term>
1024 Update the register with the files
1025 contained in <emphasis remap="it">directory</emphasis>. If no directory is provided, a list of
1026 files is read from <literal remap="tt">stdin</literal>.
1027 See section <xref linkend="administration"/>.
1032 <term>Delete <emphasis remap="it">directory</emphasis></term>
1035 Remove the records corresponding to
1036 the files found under <emphasis remap="it">directory</emphasis> from the register.
1044 Write the changes resulting from the last <emphasis remap="bf">update</emphasis>
1045 commands to the register. This command is only available if the use of
1046 shadow register files is enabled (see section
1047 <xref linkend="shadow-registers"/>).
1056 <chapter id="server">
1057 <title>The Z39.50 Server</title>
1059 <sect1 id="zebrasrv">
1060 <title>Running the Z39.50 Server (zebrasrv)</title>
1063 <emphasis remap="bf">Syntax</emphasis>
zebrasrv [options] [listener-address ...]
1072 <emphasis remap="bf">Options</emphasis>
1076 <term>-a <emphasis remap="it">APDU file</emphasis></term>
1079 Specify a file for dumping PDUs (for diagnostic purposes).
1080 The special name "-" sends output to <literal>stderr</literal>.
1085 <term>-c <emphasis remap="it">config-file</emphasis></term>
Read configuration information from <emphasis remap="it">config-file</emphasis>. The default configuration file is <literal remap="tt">./zebra.cfg</literal>.
1096 Don't fork on connection requests. This can be useful for
1097 symbolic-level debugging. The server can only accept a single
1098 connection in this mode.
1106 Use the SR protocol.
Use the Z39.50 protocol (default). These two options complement
each other. You can use both multiple times on the same command
1116 line, between listener-specifications (see below). This way, you
1117 can set up the server to listen for connections in both protocols
1118 concurrently, on different local ports.
1123 <term>-l <emphasis remap="it">logfile</emphasis></term>
1126 Specify an output file for the diagnostic
1127 messages. The default is to write this information to <literal remap="tt">stderr</literal>.
1132 <term>-v <emphasis remap="it">log-level</emphasis></term>
1135 The log level. Use a comma-separated list of members of the set
1136 {fatal,debug,warn,log,all,none}.
1141 <term>-u <emphasis remap="it">username</emphasis></term>
1144 Set user ID. Sets the real UID of the server process to that of the
1145 given <emphasis remap="it">username</emphasis>. It's useful if you aren't comfortable with having the
1146 server run as root, but you need to start it as such to bind a
1152 <term>-w <emphasis remap="it">working-directory</emphasis></term>
1155 Change working directory.
1163 Run under the Internet superserver, <literal remap="tt">inetd</literal>. Make
1164 sure you use the logfile option <literal remap="tt">-l</literal> in conjunction with this
1165 mode and specify the <literal remap="tt">-l</literal> option before any other options.
1170 <term>-t <emphasis remap="it">timeout</emphasis></term>
1173 Set the idle session timeout (default 60 minutes).
1178 <term>-k <emphasis remap="it">kilobytes</emphasis></term>
1181 Set the (approximate) maximum size of
1182 present response messages. Default is 1024 KB (1 MB).
1190 A <emphasis remap="it">listener-address</emphasis> consists of a transport mode followed by a
1191 colon (:) followed by a listener address. The transport mode is
1192 either <literal remap="tt">osi</literal> or <literal remap="tt">tcp</literal>.
1196 For TCP, an address has the form
1202 hostname | IP-number [: portnumber]
1208 The port number defaults to 210 (standard Z39.50 port).
1212 For OSI (only available if the server is compiled with XTI/mOSI
1213 support enabled), the address form is
1219 [t-selector /] hostname | IP-number [: portnumber]
1225 The transport selector is given as a string of hex digits (with an even
1226 number of digits). The default port number is 102 (RFC1006 port).
1238 osi:0402/dbserver.osiworld.com:3000
1244 In both cases, the special hostname "@" is mapped to
1245 the address INADDR_ANY, which causes the server to listen on any local
1246 interface. To start the server listening on the registered ports for
1247 Z39.50 and SR over OSI/RFC1006, and to drop root privileges once the
1248 ports are bound, execute the server like this (from a root shell):
1254 zebrasrv -u daemon tcp:@ -s osi:@
1260 You can replace <literal remap="tt">daemon</literal> with another user, e.g. your own account, or
1261 a dedicated IR server account.
1265 The default behavior for <literal remap="tt">zebrasrv</literal> is to establish a single TCP/IP
1266 listener, for the Z39.50 protocol, on port 9999.
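The listener-address rules above can be summarized in a short Python sketch. It only illustrates the syntax described in this section and is not code taken from zebrasrv; the names Listener and parse_listener are hypothetical, and IPv6 literals are not handled.

```python
import re
from typing import NamedTuple, Optional

# Default ports stated in the text: 210 for Z39.50 over TCP, 102 for OSI/RFC1006.
DEFAULT_PORTS = {"tcp": 210, "osi": 102}


class Listener(NamedTuple):
    mode: str
    host: str
    port: int
    t_selector: Optional[str]


def parse_listener(spec: str) -> Listener:
    """Split 'mode:[t-selector/]host[:port]' into its parts (illustrative only)."""
    mode, sep, rest = spec.partition(":")
    if not sep or mode not in DEFAULT_PORTS:
        raise ValueError(f"unknown transport mode in {spec!r}")
    t_selector = None
    if mode == "osi" and "/" in rest:
        t_selector, rest = rest.split("/", 1)
        # The selector is a string of hex digits with an even number of digits.
        if len(t_selector) % 2 or not re.fullmatch(r"[0-9A-Fa-f]+", t_selector):
            raise ValueError(f"bad transport selector in {spec!r}")
    host, sep, port_text = rest.partition(":")
    port = int(port_text) if sep else DEFAULT_PORTS[mode]
    if host == "@":
        host = "0.0.0.0"  # INADDR_ANY: listen on any local interface
    return Listener(mode, host, port, t_selector)
```

With this sketch, tcp:@ resolves to port 210 on any interface, and osi:0402/dbserver.osiworld.com:3000 yields the transport selector 0402 together with an explicit port.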
1271 <sect1 id="protocol-support">
1272 <title>Z39.50 Protocol Support and Behavior</title>
1275 <title>Initialization</title>
1278 During initialization, the server will negotiate to version 3 of the
1279 Z39.50 protocol, and the option bits for Search, Present, Scan,
1280 NamedResultSets, and concurrentOperations will be set, if requested by
1281 the client. The maximum PDU size is negotiated down to a maximum of
1288 <title>Search</title>
1291 The supported query types are 1 and 101. All operators are currently
1292 supported with the restriction that only proximity units of type "word" are
1293 supported for the proximity operator.
1294 Queries can be arbitrarily complex.
1295 Named result sets are supported, and result sets can be used as operands
1296 without limitations.
1297 Searches may span multiple databases.
1301 The server has full support for piggy-backed present requests (see
1302 also the following section).
1306 <emphasis remap="bf">Use</emphasis> attributes are interpreted according to the attribute sets which
1307 have been loaded in the <literal remap="tt">zebra.cfg</literal> file, and are matched against
1308 specific fields as specified in the <literal remap="tt">.abs</literal> file which describes the
1309 profile of the records which have been loaded. If no <emphasis remap="bf">Use</emphasis>
1310 attribute is provided, a default of Bib-1 <emphasis remap="bf">Any</emphasis> is assumed.
1314 If a <emphasis remap="bf">Structure</emphasis> attribute of <emphasis remap="bf">Phrase</emphasis> is used in conjunction with a
1315 <emphasis remap="bf">Completeness</emphasis> attribute of <emphasis remap="bf">Complete (Sub)field</emphasis>, the term is
1316 matched against the contents of the phrase (long word) register, if one
1317 exists for the given <emphasis remap="bf">Use</emphasis> attribute.
1318 A phrase register is created for those fields in the <literal remap="tt">.abs</literal>
1319 file that contain a <literal remap="tt">p</literal>-specifier.
1323 If <emphasis remap="bf">Structure</emphasis>=<emphasis remap="bf">Phrase</emphasis> is used in conjunction with
1324 <emphasis remap="bf">Incomplete Field</emphasis> (the default value for <emphasis remap="bf">Completeness</emphasis>), the
1325 search is directed against the normal word registers, but if the term
1326 contains multiple words, the term will only match if all of the words
1327 are found immediately adjacent, and in the given order.
1328 The word search is performed on those fields that are indexed as
1329 type <literal remap="tt">w</literal> in the <literal remap="tt">.abs</literal> file.
1333 If the <emphasis remap="bf">Structure</emphasis> attribute is <emphasis remap="bf">Word List</emphasis>,
1334 <emphasis remap="bf">Free-form Text</emphasis>, or <emphasis remap="bf">Document Text</emphasis>, the term is treated as a
1335 natural-language, relevance-ranked query.
1336 This search type uses the word register, i.e. those fields
1337 that are indexed as type <literal remap="tt">w</literal> in the <literal remap="tt">.abs</literal> file.
1341 If the <emphasis remap="bf">Structure</emphasis> attribute is <emphasis remap="bf">Numeric String</emphasis> the
1342 term is treated as an integer. The search is performed on those
1343 fields that are indexed as type <literal remap="tt">n</literal> in the <literal remap="tt">.abs</literal> file.
1347 If the <emphasis remap="bf">Structure</emphasis> attribute is <emphasis remap="bf">URx</emphasis> the
1348 term is treated as a URX (URL) entity. The search is performed on those
1349 fields that are indexed as type <literal remap="tt">u</literal> in the <literal remap="tt">.abs</literal> file.
1353 If the <emphasis remap="bf">Structure</emphasis> attribute is <emphasis remap="bf">Local Number</emphasis> the
1354 term is treated as a native Zebra Record Identifier.
1358 If the <emphasis remap="bf">Relation</emphasis> attribute is <emphasis remap="bf">Equals</emphasis> (default), the term is
1359 matched in a normal fashion (modulo truncation and processing of
1360 individual words, if required). If <emphasis remap="bf">Relation</emphasis> is <emphasis remap="bf">Less Than</emphasis>,
1361 <emphasis remap="bf">Less Than or Equal</emphasis>, <emphasis remap="bf">Greater than</emphasis>, or <emphasis remap="bf">Greater than or
1362 Equal</emphasis>, the term is assumed to be numerical, and a standard regular
1363 expression is constructed to match the given expression. If
1364 <emphasis remap="bf">Relation</emphasis> is <emphasis remap="bf">Relevance</emphasis>, the standard natural-language query
1365 processor is invoked.
1369 For the <emphasis remap="bf">Truncation</emphasis> attribute, <emphasis remap="bf">No Truncation</emphasis> is the default.
1370 <emphasis remap="bf">Left Truncation</emphasis> is not supported. <emphasis remap="bf">Process #</emphasis> is supported, as
1371 is <emphasis remap="bf">Regxp-1</emphasis>. <emphasis remap="bf">Regxp-2</emphasis> enables the fault-tolerant (fuzzy)
1372 search. As a default, a single error (deletion, insertion,
1373 replacement) is accepted when terms are matched against the register
1378 <title>Regular expressions</title>
1381 Each term in a query is interpreted as a regular expression if
1382 the truncation value is either <emphasis remap="bf">Regxp-1</emphasis> (102) or <emphasis remap="bf">Regxp-2</emphasis> (103).
1383 Both query types follow the same syntax with the operands:
1390 Matches the character <emphasis remap="it">x</emphasis>.
1398 Matches any character.
1403 <term><literal remap="tt">[</literal>..<literal remap="tt">]</literal></term>
1406 Matches the set of characters specified;
1407 such as <literal remap="tt">[abc]</literal> or <literal remap="tt">[a-c]</literal>.
1419 Matches <emphasis remap="it">x</emphasis> zero or more times. Priority: high.
1427 Matches <emphasis remap="it">x</emphasis> one or more times. Priority: high.
1435 Matches <emphasis remap="it">x</emphasis> zero or one time. Priority: high.
1443 Matches <emphasis remap="it">x</emphasis>, then <emphasis remap="it">y</emphasis>. Priority: medium.
1448 <term>x|y</term>
1451 Matches either <emphasis remap="it">x</emphasis> or <emphasis remap="it">y</emphasis>. Priority: low.
1456 The order of evaluation may be changed by using parentheses.
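The operator priorities can be tried out with Python's re module, which happens to use the same notation for the operators listed here. This is only a sketch for experimenting with the syntax; it assumes nothing about how Zebra itself evaluates the expressions, and the helper name matches is hypothetical.

```python
import re


def matches(pattern: str, term: str) -> bool:
    """Return True if the pattern matches the whole term."""
    return re.fullmatch(pattern, term) is not None


# (pattern, term, expected result)
EXAMPLES = [
    ("informat.*", "information", True),  # . followed by *: any characters may follow
    ("[a-c]at", "bat", True),             # character set
    ("ab*", "abbb", True),                # * binds tighter than concatenation
    ("ab*", "abab", False),
    ("(ab)*", "abab", True),              # parentheses change the grouping
    ("ab|cd", "cd", True),                # | has the lowest priority
    ("ab|cd", "abd", False),
    ("a(b|c)d", "abd", True),
]

for pattern, term, expected in EXAMPLES:
    assert matches(pattern, term) == expected
```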
1460 If the first character of the <emphasis remap="bf">Regxp-2</emphasis> query is a plus character
1461 (<literal remap="tt">+</literal>) it marks the beginning of a section with non-standard
1462 specifiers. The next plus character marks the end of the section.
1463 Currently Zebra only supports one specifier, the error tolerance,
1464 which consists of one digit.
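As a rough illustration, the specifier section can be separated from the rest of the term as in the Python sketch below. The function names are hypothetical. The edit-distance helper only demonstrates what counting deletions, insertions and replacements means for two literal words; Zebra applies the tolerance while matching a regular expression against the register, which this sketch does not attempt.

```python
def split_tolerance(term: str, default: int = 1):
    """Return (error tolerance, remaining expression) for a Regxp-2 term."""
    if term.startswith("+"):
        end = term.find("+", 1)  # the next plus character ends the section
        if end == -1:
            raise ValueError("unterminated specifier section")
        spec = term[1:end]
        if len(spec) != 1 or spec not in "0123456789":
            raise ValueError("the error tolerance must be a single digit")
        return int(spec), term[end + 1:]
    return default, term  # a single error is accepted by default


def edit_distance(a: str, b: str) -> int:
    """Count the deletions, insertions and replacements needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # replacement
        prev = cur
    return prev[-1]
```

For example, the term +2+retreival asks for a tolerance of two errors, which is what it takes to reach the word retrieval from the misspelling.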
1468 Since the plus operator is normally a suffix operator the addition to
1469 the query syntax doesn't violate the syntax for standard regular
1476 <title>Query examples</title>
1479 Phrase search for <emphasis remap="bf">information retrieval</emphasis> in the title-register:
1482 @attr 1=4 "information retrieval"
1488 Ranked search for the same thing:
1491 @attr 1=4 @attr 2=102 "Information retrieval"
1497 Phrase search with a regular expression:
1500 @attr 1=4 @attr 5=102 "informat.* retrieval"
1506 Ranked search with a regular expression:
1509 @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval"
1515 In the GILS schema (<literal remap="tt">gils.abs</literal>), the west-bounding-coordinate is
1516 indexed as type <literal remap="tt">n</literal>, and is therefore searched by specifying
1517 <emphasis remap="bf">structure</emphasis>=<emphasis remap="bf">Numeric String</emphasis>.
1518 To match all those records with west-bounding-coordinate greater
1519 than -114 we use the following query:
1522 @attr 4=109 @attr 2=5 @attr gils 1=2038 -114
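The queries above are written in the prefix notation accepted by the YAZ client, where each @attr carries a Bib-1 attribute type number (1 for Use, 2 for Relation, 4 for Structure, 5 for Truncation) and a value. The following Python sketch shows how such strings are put together; the helper pqf and its argument layout are hypothetical and exist only for illustration.

```python
# Bib-1 attribute type numbers used in the examples of this section.
ATTRIBUTE_TYPES = {"use": 1, "relation": 2, "structure": 4, "truncation": 5}


def pqf(term: str, *attrs) -> str:
    """Build a prefix query string.

    Each attribute is (type name, value) or (type name, value, attribute set).
    """
    parts = []
    for attr in attrs:
        name, value = attr[0], attr[1]
        attrset = attr[2] + " " if len(attr) > 2 else ""
        parts.append(f"@attr {attrset}{ATTRIBUTE_TYPES[name]}={value}")
    if " " in term:
        term = f'"{term}"'  # multi-word terms are quoted
    parts.append(term)
    return " ".join(parts)


# Reproduces the west-bounding-coordinate query shown above.
query = pqf("-114", ("structure", 109), ("relation", 5), ("use", 2038, "gils"))
```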
1532 <title>Present</title>
1535 The present facility is supported in a standard fashion. The requested
1536 record syntax is matched against the ones supported by the profile of
1537 each record retrieved. If no record syntax is given, SUTRS is the
1538 default. The requested element set name, again, is matched against any
1539 provided by the relevant record profiles.
1548 The attribute combinations provided with the termListAndStartPoint are
1549 processed in the same way as operands in a query (see above).
1550 Currently, only the term and the globalOccurrences are returned with
1551 the termInfo structure.
1560 Z39.50 specifies three different types of sort criteria.
1561 Of these Zebra supports the attribute specification type in which
1562 case the use attribute specifies the "Sort register".
1563 Sort registers are created for those fields that are of type "sort" in
1564 the default.idx file.
1565 The corresponding character mapping file in default.idx specifies the
1566 ordinal of each character used in the actual sort.
1570 Z39.50 allows the client to specify sorting on one or more input
1571 result sets and one output result set.
1572 Zebra supports sorting on one result set only which may or may not
1573 be the same as the output result set.
1579 <title>Close</title>
1582 If a Close PDU is received, the server will respond with a Close PDU
1583 with reason=FINISHED, no matter which protocol version was negotiated
1584 during initialization. If the protocol version is 3 or more, the
1585 server will generate a Close PDU under certain circumstances,
1586 including a session timeout (60 minutes by default), and certain kinds of
1587 protocol errors. Once a Close PDU has been sent, the protocol
1588 association is considered broken, and the transport connection will be
1589 closed immediately upon receipt of further data, or following a short
1599 <chapter id="record-model">
1600 <title>The Record Model</title>
1603 The Zebra system is designed to support a wide range of data management
1604 applications. The system can be configured to handle virtually any
1605 kind of structured data. Each record in the system is associated with
1606 a <emphasis remap="it">record schema</emphasis> which lends context to the data elements of the
1607 record. Any number of record schemas can coexist in the system.
1608 Although it may be wise to use only a single schema within
1609 one database, the system imposes no such restriction.
1613 The record model described in this chapter applies to the fundamental,
1615 record type <literal remap="tt">grs</literal> as introduced in
1616 section <xref linkend="record-types"/>.
1620 Records pass through three different states during processing in the
1630 When records are accessed by the system, they are represented
1631 in their local, or native format. This might be SGML or HTML files,
1632 News or Mail archives, or MARC records. If the system doesn't already
1633 know how to read the type of data you need to store, you can set up an
1634 input filter by preparing conversion rules based on regular
1635 expressions and possibly augmented by a flexible scripting language (Tcl). The input filter
1636 produces as output an internal representation:
1643 When records are processed by the system, they are represented
1644 in a tree-structure, constructed by tagged data elements hanging off a
1645 root node. The tagged elements may contain data or yet more tagged
1646 elements in a recursive structure. The system performs various
1647 actions on this tree structure (indexing, element selection, schema
1655 Before transmitting records to the client, they are first
1656 converted from the internal structure to a form suitable for exchange
1657 over the network - according to the Z39.50 standard.
1665 <sect1 id="local-representation">
1666 <title>Local Representation</title>
1669 As mentioned earlier, Zebra places few restrictions on the type of
1670 data that you can index and manage. Generally, whatever the form of
1671 the data, it is parsed by an input filter specific to that format, and
1672 turned into an internal structure that Zebra knows how to handle. This
1673 process takes place whenever the record is accessed - for indexing and
1678 The RecordType parameter in the <literal remap="tt">zebra.cfg</literal> file, or the <literal remap="tt">-t</literal>
1679 option to the indexer tells Zebra how to process input records. Two
1680 basic types of processing are available - raw text and structured
1681 data. Raw text is just that, and it is selected by providing the
1682 argument <emphasis remap="bf">text</emphasis> to Zebra. Structured records are all handled
1683 internally using the basic mechanisms described in the subsequent
1684 sections. Zebra can read structured records in many different formats.
1685 How this is done is governed by additional parameters after the
1686 "grs" keyword, separated by "." characters.
1690 Three basic subtypes to the <emphasis remap="bf">grs</emphasis> type are currently available:
1697 <term>grs.sgml</term>
1700 This is the canonical input format,
1701 described below. It is a simple SGML-like syntax.
1706 <term>grs.regx.<emphasis remap="it">filter</emphasis></term>
1709 This enables a user-supplied input
1710 filter. The mechanisms of these filters are described below.
1715 <term>grs.marc.<emphasis remap="it">abstract syntax</emphasis></term>
1718 This allows Zebra to read
1719 records in the ISO2709 (MARC) encoding standard. In this case, the
1720 last parameter <emphasis remap="it">abstract syntax</emphasis> names the .abs file (see below)
1721 which describes the specific MARC structure of the input record as
1722 well as the indexing rules.
1730 <title>Canonical Input Format</title>
1733 Although input data can take any form, it is sometimes useful to
1734 describe the record processing capabilities of the system in terms of
1735 a single, canonical input format that gives access to the full
1736 spectrum of structure and flexibility in the system. In Zebra, this
1737 canonical format is an "SGML-like" syntax.
1741 To use the canonical format specify <literal remap="tt">grs.sgml</literal> as the record
1746 Consider a record describing an information resource (such a record is
1747 sometimes known as a <emphasis remap="it">locator record</emphasis>). It might contain a field
1748 describing the distributor of the information resource, which might in
1749 turn be partitioned into various fields providing details about the
1750 distributor, like this:
1756 <Distributor>
1757 <Name> USGS/WRD </Name>
1758 <Organization> USGS/WRD </Organization>
1759 <Street-Address>
1760 U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
1761 </Street-Address>
1762 <City> ALBUQUERQUE </City>
1763 <State> NM </State>
1764 <Zip-Code> 87102 </Zip-Code>
1765 <Country> USA </Country>
1766 <Telephone> (505) 766-5560 </Telephone>
1767 </Distributor>
1773 <emphasis remap="it">NOTE: The indentation used above is used to illustrate how Zebra
1774 interprets the markup. The indentation, in itself, has no
1775 significance to the parser for the canonical input format, which
1776 discards superfluous whitespace.</emphasis>
1780 The keywords surrounded by <...> are <emphasis remap="it">tags</emphasis>, while the
1781 sections of text in between are the <emphasis remap="it">data elements</emphasis>. A data element
1782 is characterized by its location in the tree that is made up by the
1783 nested elements. Each element is terminated by a closing tag -
1784 beginning with <literal remap="tt"><</literal>/, and containing the same symbolic tag-name as
1785 the corresponding opening tag. The general closing tag - <literal remap="tt"><</literal>/> -
1786 terminates the element started by the last opening tag. The
1787 structuring of elements is significant. The element <emphasis remap="bf">Telephone</emphasis>,
1788 for instance, may be indexed and presented to the client differently,
1789 depending on whether it appears inside the <emphasis remap="bf">Distributor</emphasis> element,
1790 or some other, structured data element such as a <emphasis remap="bf">Supplier</emphasis> element.
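The nesting rules can be made concrete with a small Python sketch that turns the tag syntax into a tree. It is an illustration under simplifying assumptions, not Zebra's parser: variant tags are not treated specially, data may not contain angle brackets, and the dictionary layout of the nodes is invented for the example.

```python
import re

# Either a tag (optionally a closing tag) or a run of data between tags.
TOKEN = re.compile(r"<(/?)([^<>]*)>|([^<>]+)")


def parse(text: str) -> dict:
    """Build a tree of {'tag': name, 'children': [...]} nodes from the markup."""
    root = {"tag": None, "children": []}
    stack = [root]
    for m in TOKEN.finditer(text):
        closing, name, data = m.groups()
        if data is not None:
            data = " ".join(data.split())  # superfluous whitespace is discarded
            if data:
                stack[-1]["children"].append(data)
        elif closing:
            name = name.strip()
            if len(stack) == 1:
                raise ValueError("closing tag without an open element")
            # An empty name is the general closing tag: it ends the last opened element.
            if name and stack[-1]["tag"] != name:
                raise ValueError(f"unexpected closing tag {name!r}")
            stack.pop()
        else:
            node = {"tag": name.strip(), "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed element")
    return root["children"][0]


record = parse("""
<Distributor>
  <Name> USGS/WRD </Name>
  <City> ALBUQUERQUE </>
</Distributor>
""")
```

Here the City element is ended by the general closing tag, and the resulting tree has Distributor as its root with Name and City as children.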
1794 <title>Record Root</title>
1797 The first tag in a record describes the root node of the tree that
1798 makes up the total record. In the canonical input format, the root tag
1799 should contain the name of the schema that lends context to the
1800 elements of the record (see section
1801 <xref linkend="internal-representation"/>).
1802 The following is a GILS record that
1803 contains only a single element (strictly speaking, that makes it an
1804 illegal GILS record, since the GILS profile includes several mandatory
1805 elements. Zebra does not validate the contents of a record against
1806 the Z39.50 profile, however; it merely attempts to match up elements
1807 of a local representation with the given schema):
1814 <title>Zen and the Art of Motorcycle Maintenance</title>
1823 <title>Variants</title>
1826 Zebra allows you to provide individual data elements in a number of
1827 <emphasis remap="it">variant forms</emphasis>. Examples of variant forms are textual data
1828 elements which might appear in different languages, and images which
1829 may appear in different formats or layouts. The variant system in
1831 essentially a representation of the variant mechanism of
1836 The following is an example of a title element which occurs in two
1837 different languages.
1844 <var lang lang "eng">
1845 Zen and the Art of Motorcycle Maintenance</>
1846 <var lang lang "dan">
1847 Zen og Kunsten at Vedligeholde en Motorcykel</>
1854 The syntax of the <emphasis remap="it">variant element</emphasis> is <literal remap="tt"><var class
1855 type value></literal>. The available values for the <emphasis remap="it">class</emphasis> and
1856 <emphasis remap="it">type</emphasis> fields are given by the variant set that is associated with the
1857 current schema (see section <xref linkend="variant-set"/>).
1861 Variant elements are terminated by the general end-tag </>, by
1862 the variant end-tag </var>, by the appearance of another variant
1863 tag with the same <emphasis remap="it">class</emphasis> and <emphasis remap="it">value</emphasis> settings, or by the
1864 appearance of another, normal tag. In other words, the end-tags for
1865 the variants used in the example above could have been saved.
1869 Variant elements can be nested. The element
1876 <var lang lang "eng"><var body iana "text/plain">
1877 Zen and the Art of Motorcycle Maintenance
1884 associates two variant components with the variant list for the title
1889 Given the nesting rules described above, we could write
1896 <var body iana "text/plain">
1897 <var lang lang "eng">
1898 Zen and the Art of Motorcycle Maintenance
1899 <var lang lang "dan">
1900 Zen og Kunsten at Vedligeholde en Motorcykel
1907 The title element above comes in two variants. Both have the IANA body
1908 type "text/plain", but one is in English, and the other in
1909 Danish. The client, using the element selection mechanism of Z39.50,
1910 can retrieve information about the available variant forms of data
1911 elements, or it can select specific variants based on the requirements
1920 <title>Input Filters</title>
1923 In order to handle general input formats, Zebra allows the
1924 operator to define filters which read individual records in their native format
1925 and produce an internal representation that the system can
1930 Input filters are ASCII files, generally with the suffix <literal remap="tt">.flt</literal>.
1931 The system looks for the files in the directories given in the
1932 <emphasis remap="bf">profilePath</emphasis> setting in the <literal remap="tt">zebra.cfg</literal> file. The record type
1933 for the filter is <literal remap="tt">grs.regx.</literal><emphasis remap="it">filter-filename</emphasis>
1934 (fundamental type <literal remap="tt">grs</literal>, file read type <literal remap="tt">regx</literal>, argument
1935 <emphasis remap="it">filter-filename</emphasis>).
1939 Generally, an input filter consists of a sequence of rules, where each
1940 rule consists of a sequence of expressions, followed by an action. The
1941 expressions are evaluated against the contents of the input record,
1942 and the actions normally contribute to the generation of an internal
1943 representation of the record.
1947 An expression can be any of the following:
1957 The action associated with this expression is evaluated
1958 exactly once in the lifetime of the application, before any records
1959 are read. It can be used in conjunction with an action that
1960 initializes tables or other resources that are used in the processing
1969 Matches the beginning of the record. It can be used to
1970 initialize variables, etc. Typically, the <emphasis remap="bf">BEGIN</emphasis> rule is also used
1971 to establish the root node of the record.
1979 Matches the end of the record - when all of the contents
1980 of the record have been processed.
1985 <term>/pattern/</term>
1988 Matches a string of characters from the input
1997 This keyword may only be used between two patterns. It
1998 matches everything between (not including) those patterns.
2006 The action associated with this expression is evaluated
2007 once, before the application terminates. It can be used to release
2008 system resources - typically ones allocated in the <emphasis remap="bf">INIT</emphasis> step.
2016 An action is surrounded by curly braces ({...}), and consists of a
2017 sequence of statements. Statements may be separated by newlines or
2018 semicolons (;). Within actions, the strings that matched the
2019 expressions immediately preceding the action can be referred to as
2020 $0, $1, $2, etc.
2024 The available statements are:
2031 <term>begin <emphasis remap="it">type [parameter ... ]</emphasis></term>
2035 data element. The type is one of the following:
2042 Begin a new record. The following parameter should be the
2043 name of the schema that describes the structure of the record, e.g.
2044 <literal remap="tt">gils</literal> or <literal remap="tt">wais</literal> (see below). The <literal remap="tt">begin record</literal> call should
2046 any other use of the <emphasis remap="bf">begin</emphasis> statement.
2051 <term>element</term>
2054 Begin a new tagged element. The parameter is the
2055 name of the tag. If the tag is not matched anywhere in the tagsets
2056 referenced by the current schema, it is treated as a local string
2062 <term>variant</term>
2065 Begin a new node in a variant tree. The parameters are
2066 <emphasis remap="it">class type value</emphasis>.
2078 Create a data element. The concatenated arguments make
2079 up the value of the data element. The option <literal remap="tt">-text</literal> signals that
2080 the layout (whitespace) of the data should be retained for
2081 transmission. The option <literal remap="tt">-element</literal> <emphasis remap="it">tag</emphasis> wraps the data up in
2082 the <emphasis remap="it">tag</emphasis>. The use of the <literal remap="tt">-element</literal> option is equivalent to
2083 preceding the command with a <emphasis remap="bf">begin element</emphasis> command, and following
2084 it with the <emphasis remap="bf">end</emphasis> command.
2089 <term>end <emphasis remap="it">[type]</emphasis></term>
2092 Close a tagged element. If no parameter is given,
2093 the last element on the stack is terminated. The first parameter, if
2094 any, is a type name, similar to the <emphasis remap="bf">begin</emphasis> statement. For the
2095 <emphasis remap="bf">element</emphasis> type, a tag name can be provided to terminate a specific tag.
2103 The following input filter reads a Usenet news file, producing a
2104 record in the WAIS schema. Note that the body of a news posting is
2105 separated from the list of headers by a blank line (or rather a
2106 sequence of two newline characters).
2112 BEGIN { begin record wais }
2114 /^From:/ BODY /$/ { data -element name $1 }
2115 /^Subject:/ BODY /$/ { data -element title $1 }
2116 /^Date:/ BODY /$/ { data -element lastModified $1 }
2118 begin element bodyOfDisplay
2119 begin variant body iana "text/plain"
2128 If Zebra is compiled with support for Tcl (Tool Command Language)
2129 enabled, the statements described above are supplemented with a complete
2130 scripting environment, including control structures (conditional
2131 expressions and loop constructs), and powerful string manipulation
2132 mechanisms for modifying the elements of a record. Tcl is a popular
2133 scripting environment, with several tutorials available both online
2138 <emphasis remap="it">NOTE: Tcl support is not currently available, but will be
2139 included with one of the next alpha or beta releases.</emphasis>
2143 <emphasis remap="it">NOTE: Variant support is not currently available in the input
2144 filter, but will be included with one of the next alpha or beta
2145 releases.</emphasis>
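To see what the three header rules of the news filter do, their effect can be imitated in Python. The sketch below is not how Zebra evaluates filters; it merely treats each rule as a start pattern, a BODY and an end pattern, and emits the BODY text (the filter's $1) as a tagged data element. Stripping the surrounding whitespace is an assumption made for readability, and unlike the real filter the sketch makes no attempt to stop at the blank line that separates headers from the body.

```python
import re

# (start pattern, end pattern, element tag), mirroring the three header rules.
RULES = [
    (r"^From:", r"$", "name"),
    (r"^Subject:", r"$", "title"),
    (r"^Date:", r"$", "lastModified"),
]


def apply_rules(record: str):
    """Return (tag, data) pairs in the order the headers appear in the record."""
    found = []
    for start, end, tag in RULES:
        # Group 1 plays the role of $0, group 2 is BODY ($1), group 3 is $2.
        pattern = re.compile(f"({start})(.*?)({end})", re.MULTILINE)
        for m in pattern.finditer(record):
            found.append((m.start(), tag, m.group(2).strip()))
    return [(tag, data) for _, tag, data in sorted(found)]


news = (
    "From: alice@example.org\n"
    "Subject: Zebra indexing\n"
    "Date: 1 Jan 1996\n"
    "\n"
    "First line of the posting.\n"
)
elements = apply_rules(news)
```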
2152 <sect1 id="internal-representation">
2153 <title>Internal Representation</title>
2156 When records are manipulated by the system, they're represented in a
2157 tree-structure, with data elements at the leaf nodes, and tags or
2158 variant components at the non-leaf nodes. The root-node identifies the
2159 schema that lends context to the tagging and structuring of the
2160 record. Imagine a simple record, consisting of a 'title' element and
2161 an 'author' element:
2167 TITLE "Zen and the Art of Motorcycle Maintenance"
2169 AUTHOR "Robert Pirsig"
2175 A slightly more complex record would have the author element consist
2176 of two elements, a surname and a first name:
2182 TITLE "Zen and the Art of Motorcycle Maintenance"
2192 The root of the record will refer to the record schema that describes
2193 the structuring of this particular record. The schema defines the
2194 element tags (TITLE, FIRST-NAME, etc.) that may occur in the record, as
2195 well as the structuring (SURNAME should appear below AUTHOR, etc.). In
2196 addition, the schema establishes element set names that are used by
2197 the client to request a subset of the elements of a given record. The
2198 schema may also establish rules for converting the record to a
2199 different schema, by stating, for each element, a mapping to a
2204 <title>Tagged Elements</title>
2207 A data element is characterized by its tag, and its position in the
2208 structure of the record. For instance, while the tag "telephone
2209 number" may be used different places in a record, we may need to
2210 distinguish between these occurrences, both for searching and
2211 presentation purposes. For instance, while the phone numbers for the
2212 "customer" and the "service provider" are both
2213 instances of the same type of resource (a telephone number), it
2214 is essential that they be kept separate. The record schema provides
2215 the structure of the record, and names each data element (defined by
2216 the sequence of tags - the tag path - by which the element can be
2217 reached from the root of the record).
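The notion of a tag path can be illustrated with a few lines of Python. The record below is hypothetical (the tag names and phone numbers are invented), nodes are modelled as (tag, children) pairs, and the sketch includes the root tag in each path, which is a presentation choice rather than a statement about Zebra's internals.

```python
def tag_paths(node, prefix=()):
    """Yield (tag path, data) for every data element below the given node."""
    tag, children = node
    path = prefix + (tag,)
    for child in children:
        if isinstance(child, str):
            yield "/".join(path), child       # a data element at a leaf
        else:
            yield from tag_paths(child, path)  # descend into a tagged element


# Hypothetical record: the same tag occurs in two different contexts.
record = ("ORDER", [
    ("CUSTOMER", [("TELEPHONE", ["555-0100"])]),
    ("SERVICE-PROVIDER", [("TELEPHONE", ["555-0199"])]),
])

paths = list(tag_paths(record))
```

Both data elements carry the tag TELEPHONE, but their tag paths differ, which is what allows them to be indexed and presented separately.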
2223 <title>Variants</title>
2226 The children of a tag node may be either more tag nodes, a data node
2227 (possibly accompanied by tag nodes),
2228 or a tree of variant nodes. The children of variant nodes are either
2229 more variant nodes or a data node (possibly accompanied by more
2230 variant nodes). Each leaf node, which is normally a
2231 data node, corresponds to a <emphasis remap="it">variant form</emphasis> of the tagged element
2232 identified by the tag which parents the variant tree. The following
2233 title element occurs in two different languages:
2239 VARIANT LANG=ENG "War and Peace"
2241 VARIANT LANG=DAN "Krig og Fred"
2247 Which of the two elements is transmitted to the client by the server
2248 depends on the specifications provided by the client, if any.
2252 In practice, each variant node is associated with a triple of class,
2253 type, value, corresponding to the variant mechanism of Z39.50.
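A minimal Python sketch of variant selection follows. Each variant form is modelled as a set of (class, type, value) triples plus its data. Returning the first variant when the client gives no specification is an assumption made for the example, as are the function name and the data layout.

```python
def select_variant(variants, wanted=()):
    """Return the data of the first variant that carries all requested triples.

    variants: list of (triples, data), where triples is a set of
    (class, type, value) tuples. Returns None if nothing matches.
    """
    wanted = set(wanted)
    for triples, data in variants:
        if wanted <= triples:  # every requested triple is present
            return data
    return None


# The title element from the example above, in two languages.
title = [
    ({("lang", "lang", "eng")}, "War and Peace"),
    ({("lang", "lang", "dan")}, "Krig og Fred"),
]

danish = select_variant(title, [("lang", "lang", "dan")])
```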
2259 <title>Data Elements</title>
2262 Data nodes have no children (they are always leaf nodes in the record
2267 <emphasis remap="it">NOTE: Documentation needs extension here about types of nodes - numerical,
2268 textual, etc., plus the various types of inclusion notes.</emphasis>
2275 <sect1 id="data-model">
2276 <title>Configuring Your Data Model</title>
2279 The following sections describe the configuration files that govern
2280 the internal management of data records. The system searches for the files
2281 in the directories specified by the <emphasis remap="bf">profilePath</emphasis> setting in the
2282 <literal remap="tt">zebra.cfg</literal> file.
2286 <title>The Abstract Syntax</title>
2289 The abstract syntax definition (also known as an Abstract Record
2290 Structure, or ARS) is the focal point of the
2291 record schema description. For a given schema, the ABS file may state any
2292 or all of the following:
2301 The object identifier of the Z39.50 schema associated
2302 with the ARS, so that it can be referred to by the client.
2309 The attribute set (which can possibly be a compound of multiple
2310 sets) which applies in the profile. This is used when indexing and
2311 searching the records belonging to the given profile.
2318 The Tag set (again, this can consist of several different sets).
2319 This is used when reading the records from a file, to recognize the
2320 different tags, and when transmitting the record to the client -
2321 mapping the tags to their numerical representation, if they are
2329 The variant set which is used in the profile. This provides a
2330 vocabulary for specifying the <emphasis remap="it">forms</emphasis> of data that appear inside
2338 Element set names, which are a shorthand way for the client to
2339 ask for a subset of the data elements contained in a record. Element
2340 set names, in the retrieval module, are mapped to <emphasis remap="it">element
2341 specifications</emphasis>, which contain information equivalent to the
2342 <emphasis remap="it">Espec-1</emphasis> syntax of Z39.50.
2349 Map tables, which may specify mappings to <emphasis remap="it">other</emphasis> database
2350 profiles, if desired.
2357 Possibly, a set of rules describing the mapping of elements to a
2358 MARC representation.
2365 A list of element descriptions (this is the actual ARS of the
2366 schema, in Z39.50 terms), which lists the ways in which the various
2367 tags can be used and organized hierarchically.
2376 Several of the entries above simply refer to other files, which
2377 describe the given objects.
2383 <title>The Configuration Files</title>
2386 This section describes the syntax and use of the various tables which
2387 are used by the retrieval module.
2391 The number of different file types may appear daunting at first, but
2392 each type corresponds fairly clearly to a single aspect of the Z39.50
2393 retrieval facilities. Further, the average database administrator,
2394 who is simply reusing an existing profile for which tables already
2395 exist, shouldn't have to worry too much about the contents of these tables.
2399 Generally, the files are simple ASCII files, which can be maintained
2400 using any text editor. Blank lines and lines beginning with a hash sign (#)
2401 are ignored. Any characters on a line that follow a (#) are also ignored.
2403 All other lines contain <emphasis remap="it">directives</emphasis>, which provide some setting or value
2404 to the system. Generally, settings are characterized by a single
2405 keyword, identifying the setting, followed by a number of parameters.
2406 Some settings are repeatable (r), while others may occur only once in a
2407 file. Some settings are optional (o), while others are mandatory (m).
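<para>
As a rough illustration of these conventions, the fragment below shows comment lines, a blank line, a trailing comment, and a repeated directive. The directive names are taken from the <literal remap="tt">.abs</literal> file type described in the next section; the profile and file names (<literal remap="tt">example</literal>, <literal remap="tt">example-b.est</literal>) are hypothetical.
</para>
<screen>
# A line beginning with a hash sign is ignored, as is the blank line below.

name example                # the rest of this line is ignored as well

# esetname is repeatable (r), so it may occur several times in one file.
esetname B example-b.est
esetname F @
</screen>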
2414 <title>The Abstract Syntax (.abs) Files</title>
2417 The name of this file type is slightly misleading in Z39.50 terms,
2418 since, apart from the actual abstract syntax of the profile, it also
2419 includes most of the other definitions that go into a database
2424 When a record in the canonical, SGML-like format is read from a file
2425 or from the database, the first tag of the file should reference the
2426 profile that governs the layout of the record. If the first tag of the
2427 record is, say, <literal remap="tt">&lt;gils&gt;</literal>, the system will look for the profile
2428 definition in the file <literal remap="tt">gils.abs</literal>. Profile definitions are cached,
2429 so they only have to be read once during the lifespan of the current
2434 When writing your own input filters, the <emphasis remap="bf">record-begin</emphasis> command
2435 introduces the profile, and should always be called first thing when
2436 introducing a new record.
2440 The file may contain the following directives:
2447 <term>name <emphasis remap="it">symbolic-name</emphasis></term>
2450 (m) This provides a shorthand name or
2451 description for the profile. Mostly useful for diagnostic purposes.
2456 <term>reference <emphasis remap="it">OID-name</emphasis></term>
2459 (m) The reference name of the OID for
2460 the profile. The reference names can be found in the <emphasis remap="bf">util</emphasis>
2461 module of <emphasis remap="bf">YAZ</emphasis>.
2466 <term>attset <emphasis remap="it">filename</emphasis></term>
2469 (m) The attribute set that is used for
2470 indexing and searching records belonging to this profile.
2475 <term>tagset <emphasis remap="it">filename</emphasis></term>
2478 (o) The tag set (if any) that describes
2479 the fields of the records.
2484 <term>varset <emphasis remap="it">filename</emphasis></term>
2487 (o) The variant set used in the profile.
2492 <term>maptab <emphasis remap="it">filename</emphasis></term>
2495 (o,r) This points to a
2496 conversion table that might be used if the client asks for the record
2497 in a different schema from the native one.
2499 </listitem></varlistentry>
2501 <term>marc <emphasis remap="it">filename</emphasis></term>
2504 (o) Points to a file containing parameters
2505 for representing the record contents in the ISO2709 syntax. Read the
2506 description of the MARC representation facility below.
2508 </listitem></varlistentry>
2510 <term>esetname <emphasis remap="it">name filename</emphasis></term>
2513 (o,r) Associates the
2514 given element set name with an element selection file. If an (@) is
2515 given in place of the filename, this corresponds to a null mapping for
2516 the given element set name.
2518 </listitem></varlistentry>
2520 <term>any <emphasis remap="it">tags</emphasis></term>
2523 (o) This directive specifies a list of
2524 attributes which should be appended to the attribute list given for each
2525 element. The effect is to make every single element in the abstract
2526 syntax searchable by way of the given attributes. This directive
2527 provides an efficient way of supporting free-text searching across all
2528 elements. However, it does increase the size of the index
2529 significantly. The attributes can be qualified with a structure, as in
2530 the <emphasis remap="bf">elm</emphasis> directive below.
2532 </listitem></varlistentry>
2534 <term>elm <emphasis remap="it">path name attributes</emphasis></term>
2537 (o,r) Adds an element
2538 to the abstract record syntax of the schema. The <emphasis remap="it">path</emphasis> follows the
2539 syntax which is suggested by the Z39.50 document - that is, a sequence
2540 of tags separated by slashes (/). Each tag is given as a
2541 comma-separated pair of tag type and tag value, enclosed in parentheses.
2542 The <emphasis remap="it">name</emphasis> is the name of the element, and the <emphasis remap="it">attributes</emphasis>
2543 parameter is a comma-separated list of the attributes to use when
2544 indexing the element. A ! in
2545 place of the attribute name is equivalent to specifying an attribute
2546 name identical to the element name. A - in place of the attribute name
2547 specifies that no indexing is to take place for the given element. The
2548 attributes can be qualified with <emphasis remap="it">field types</emphasis> to specify which
2549 character set should govern the indexing procedure for that field. The
2550 same data element may be indexed into several different fields, using
2551 different character set definitions. See the section
2552 <xref linkend="field-structure-and-character-sets"/>.
2553 The default field type is "w" for
2554 <emphasis remap="it">word</emphasis>.
2556 </listitem></varlistentry>
2561 <emphasis remap="it">NOTE: The mechanism for controlling indexing is not adequate for
2562 complex databases, and will probably be moved into a separate
2563 configuration table eventually.</emphasis>
2567 The following is an excerpt from the abstract syntax file for the GILS
2575 reference GILS-schema
2580 maptab gils-usmarc.map
2584 esetname VARIANT gils-variant.est # for WAIS-compliance
2585 esetname B gils-b.est
2586 esetname G gils-g.est
2591 elm (1,14) localControlNumber Local-number
2592 elm (1,16) dateOfLastModification Date/time-last-modified
2593 elm (2,1) title w:!,p:!
2594 elm (4,1) controlIdentifier Identifier-standard
2595 elm (2,6) abstract Abstract
2596 elm (4,51) purpose !
2597 elm (4,52) originator -
2598 elm (4,53) accessConstraints !
2599 elm (4,54) useConstraints !
2600 elm (4,70) availability -
2601 elm (4,70)/(4,90) distributor -
2602 elm (4,70)/(4,90)/(2,7) distributorName !
2603 elm (4,70)/(4,90)/(2,10) distributorOrganization !
2604 elm (4,70)/(4,90)/(4,2) distributorStreetAddress !
2605 elm (4,70)/(4,90)/(4,3) distributorCity !
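<para>
The attribute column of the <emphasis remap="bf">elm</emphasis> directive is the part that is easiest to misread, so here is an annotated sketch of the forms described above. It is a fragment rather than a complete file, the element names and paths are borrowed from the GILS excerpt, and it assumes that the field types <literal remap="tt">w</literal> and <literal remap="tt">p</literal> are both defined in the <literal remap="tt">.idx</literal> file.
</para>
<screen>
# Explicit attribute name: index abstract under the attribute Abstract.
elm (2,6)               abstract         Abstract

# !  : use an attribute with the same name as the element (purpose).
elm (4,51)              purpose          !

# -  : keep the element in the record structure, but do not index it.
elm (4,70)              availability     -

# Field types: index the same element twice, once as type w and once
# as type p, each time under the attribute named title.
elm (2,1)               title            w:!,p:!

# Nested path: tags are separated by slashes, outermost first.
elm (4,70)/(4,90)/(2,7) distributorName  !
</screen>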
2612 <sect2 id="attset-files">
2613 <title>The Attribute Set (.att) Files</title>
2616 This file type describes the <emphasis remap="bf">Use</emphasis> elements of an attribute set.
2617 It contains the following directives.
2624 <term>name <emphasis remap="it">symbolic-name</emphasis></term>
2627 (m) This provides a shorthand name or
2628 description for the attribute set. Mostly useful for diagnostic purposes.
2630 </listitem></varlistentry>
2632 <term>reference <emphasis remap="it">OID-name</emphasis></term>
2635 (m) The reference name of the OID for
2636 the attribute set. The reference names can be found in the <emphasis remap="bf">util</emphasis>
2637 module of <emphasis remap="bf">YAZ</emphasis>.
2639 </listitem></varlistentry>
2641 <term>ordinal <emphasis remap="it">integer</emphasis></term>
2644 (m) This value will be used to represent the
2645 attribute set in the index. Care should be taken that each attribute
2646 set has a unique ordinal value.
2648 </listitem></varlistentry>
2650 <term>include <emphasis remap="it">filename</emphasis></term>
2653 (o,r) This directive is used to
2654 include another attribute set as a part of the current one. This is
2655 used when a new attribute set is defined as an extension to another
2656 set. For instance, many new attribute sets are defined as extensions
2657 to the <emphasis remap="bf">bib-1</emphasis> set. This is an important feature of the retrieval
2658 system of Z39.50, as it ensures the highest possible level of
2659 interoperability, as those access points of your database which are
2660 derived from the external set (say, bib-1) can be used even by clients
2661 who are unaware of the new set.
2663 </listitem></varlistentry>
2665 <term>att <emphasis remap="it">att-value att-name [local-value]</emphasis></term>
2669 This repeatable directive introduces a new attribute to the set. The
2670 attribute value is stored in the index (unless a <emphasis remap="it">local-value</emphasis> is
2671 given, in which case this is stored). The name is used to refer to the
2672 attribute from the <emphasis remap="it">abstract syntax</emphasis>.
2674 </listitem></varlistentry>
2679 This is an excerpt from the GILS attribute set definition. Notice how
2680 the file describing the <emphasis remap="it">bib-1</emphasis> attribute set is referenced.
2687 reference GILS-attset
2691 att 2001 distributorName
2692 att 2002 indextermsControlled
2694 att 2004 accessConstraints
2695 att 2005 useConstraints
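<para>
For comparison, a complete attribute set file for a hypothetical local extension might look like the sketch below. The set name, the ordinal, the attribute numbers and names, and the local value on the last line are all invented for illustration. The <emphasis remap="bf">reference</emphasis> line reuses the OID name from the excerpt above only to keep the sketch self-contained (a real extension set would need its own OID name), and the file name <literal remap="tt">bib1.att</literal> is an assumption about how the <emphasis remap="it">bib-1</emphasis> definition is stored in your distribution.
</para>
<screen>
name example-attset
reference GILS-attset
# Must differ from the ordinal of every other attribute set in use.
ordinal 9

# Pull in the bib-1 definitions, so that bib-1 access points keep working.
include bib1.att

# att att-value att-name [local-value]
att 3001 projectCode
att 3002 reviewDate
# Stored in the index as 1021 rather than 3003.
att 3003 sponsor 1021
</screen>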
2703 <title>The Tag Set (.tag) Files</title>
2706 This file type defines the tagset of the profile, possibly by
2707 referencing other tag sets (most tag sets, for instance, will include
2708 tagsetG and tagsetM from the Z39.50 specification). The file may
2709 contain the following directives.
2716 <term>name <emphasis remap="it">symbolic-name</emphasis></term>
2719 (m) This provides a shorthand name or
2720 description for the tag set. Mostly useful for diagnostic purposes.
2722 </listitem></varlistentry>
2724 <term>reference <emphasis remap="it">OID-name</emphasis></term>
2727 (o) The reference name of the OID for
2728 the tag set. The reference names can be found in the <emphasis remap="bf">util</emphasis>
2729 module of <emphasis remap="bf">YAZ</emphasis>. The directive is optional, since not all tag sets
2730 are registered outside of their schema.
2732 </listitem></varlistentry>
2734 <term>type <emphasis remap="it">integer</emphasis></term>
2737 (m) The type number of the tagset within the schema
2738 profile (note: this specification really should belong to the .abs
2739 file. This will be fixed in a future release).
2741 </listitem></varlistentry>
2743 <term>include <emphasis remap="it">filename</emphasis></term>
2746 (o,r) This directive is used
2747 to include the definitions of other tag sets into the current one.
2749 </listitem></varlistentry>
2751 <term>tag <emphasis remap="it">number names type</emphasis></term>
2754 (o,r) Introduces a new
2755 tag to the set. The <emphasis remap="it">number</emphasis> is the tag number as used in the protocol
2756 (there is currently no mechanism for specifying string tags,
2757 but this would be quick work to add). The <emphasis remap="it">names</emphasis> parameter
2758 is a list of names by which the tag should be recognized in the input
2759 file format. The names should be separated by slashes (/). The
2760 <emphasis remap="it">type</emphasis> is the recommended datatype of the tag. It should be one of
2828 </listitem></varlistentry>
2833 The following is an excerpt from the TagsetG definition file.
2845 tag 3 publicationPlace string
2846 tag 4 publicationDate string
2847 tag 5 documentId string
2848 tag 6 abstract string
2850 tag 8 date generalizedtime
2851 tag 9 bodyOfDisplay string
2852 tag 10 organization string
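<para>
A tag set that extends an existing one could be sketched along the following lines. All names and numbers are hypothetical, the include file name <literal remap="tt">tagsetg.tag</literal> is an assumption about your distribution, and only datatypes that appear elsewhere in this chapter (<literal remap="tt">string</literal>, <literal remap="tt">generalizedtime</literal>) are used.
</para>
<screen>
name example-tagset
# Type number of this tag set within the schema.
type 4

# Reuse the tagsetG definitions instead of repeating them.
include tagsetg.tag

# tag number names type
tag 1 projectCode string
# Two input names, separated by a slash, map to the same tag.
tag 2 reviewDate/dateOfReview generalizedtime
tag 3 sponsor string
</screen>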
2859 <sect2 id="variant-set">
2860 <title>The Variant Set (.var) Files</title>
2863 The variant set file is a straightforward representation of the
2864 variant set definitions associated with the protocol. At present, only
2865 the <emphasis remap="it">Variant-1</emphasis> set is known.
2869 These are the directives allowed in the file.
2876 <term>name <emphasis remap="it">symbolic-name</emphasis></term>
2879 (m) This provides a shorthand name or
2880 description for the variant set. Mostly useful for diagnostic purposes.
2882 </listitem></varlistentry>
2884 <term>reference <emphasis remap="it">OID-name</emphasis></term>
2887 (o) The reference name of the OID for
2888 the variant set, if one is required. The reference names can be found
2889 in the <emphasis remap="bf">util</emphasis> module of <emphasis remap="bf">YAZ</emphasis>.
2891 </listitem></varlistentry>
2893 <term>class <emphasis remap="it">integer class-name</emphasis></term>
2896 (m,r) Introduces a new
2897 class to the variant set.
2899 </listitem></varlistentry>
2901 <term>type <emphasis remap="it">integer type-name datatype</emphasis></term>
2905 Introduces a new type to the current class (the one introduced by the most recent
2906 <emphasis remap="bf">class</emphasis> directive). The type names belong to the same name space as
2907 the one used in the tag set definition file.
2909 </listitem></varlistentry>
2914 The following is an excerpt from the file describing the variant set
2915 <emphasis remap="it">Variant-1</emphasis>.
2926 type 1 variantId octetstring
2931 type 2 z39.50 string
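<para>
Because each <emphasis remap="bf">type</emphasis> directive attaches to the most recent <emphasis remap="bf">class</emphasis> directive, the order of lines in the file matters. The sketch below makes that grouping explicit. The set name is hypothetical, and the class names and the <literal remap="tt">iana</literal> type are assumptions; only the two type lines shown in the excerpt above come from the distributed file.
</para>
<screen>
name example-variant

# class integer class-name
class 1 variantId
# Belongs to class 1, the most recent class directive.
type 1 variantId octetstring

class 2 body
# Both of these belong to class 2.
type 1 iana string
type 2 z39.50 string
</screen>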
2940 <title>The Element Set (.est) Files</title>
2943 The element set specification files describe a selection of a subset
2944 of the elements of a database record. The element selection mechanism
2945 is equivalent to the one supplied by the <emphasis remap="it">Espec-1</emphasis> syntax of the
2946 Z39.50 specification. In fact, the internal representation of an
2947 element set specification is identical to the <emphasis remap="it">Espec-1</emphasis> structure,
2948 and we'll refer you to the description of that structure for most of
2949 the detailed semantics of the directives below.
2953 <emphasis remap="it">NOTE: Not all of the Espec-1 functionality has been implemented yet.
2954 The fields that are mentioned below all work as expected, unless
2955 otherwise noted.</emphasis>
2959 The directives available in the element set file are as follows:
2966 <term>defaultVariantSetId <emphasis remap="it">OID-name</emphasis></term>
2969 (o) If variants are used in
2970 the following, this should provide the name of the variant set used
2971 (it's not currently possible to specify a different set in the
2972 individual variant request). In almost all cases (certainly all
2973 profiles known to us), the name <literal remap="tt">Variant-1</literal> should be given here.
2975 </listitem></varlistentry>
2977 <term>defaultVariantRequest <emphasis remap="it">variant-request</emphasis></term>
2981 This directive provides a default variant request for
2982 use when the individual element requests (see below) do not contain a
2983 variant request. Variant requests consist of a blank-separated list of
2984 variant components. A variant component is a comma-separated,
2985 parenthesized triple of variant class, type, and value (the first two
2986 values being represented as integers). The value can currently only be
2987 entered as a string (this will change to depend on the definition of
2988 the variant in question). The special value (@) is interpreted as a
2989 null value, however.
2991 </listitem></varlistentry>
2993 <term>simpleElement <emphasis remap="it">path ['variant' variant-request]</emphasis></term>
2996 (o,r) This corresponds to a simple element request in <emphasis remap="it">Espec-1</emphasis>. The
2997 path consists of a sequence of tag-selectors, where each of these can
3007 A simple tag, consisting of a comma-separated type-value pair in
3008 parentheses, possibly followed by a colon (:) followed by an
3009 occurrences-specification (see below). The tag-value can be a number
3010 or a string. If the first character is an apostrophe ('), this forces
3011 the value to be interpreted as a string, even if it appears to be numerical.
3018 A WildThing, represented as a question mark (?), possibly
3019 followed by a colon (:) followed by an occurrences specification (see
3027 A WildPath, represented as an asterisk (*). Note that the last
3028 element of the path should not be a WildPath (WildPaths don't work in
3038 The occurrences-specification can be either the string <literal remap="tt">all</literal>, the
3039 string <literal remap="tt">last</literal>, or an explicit value-range. The value-range is
3040 represented as an integer (the starting point), possibly followed by a
3041 plus (+) and a second integer (the number of elements, default being
3046 The variant-request has the same syntax as the defaultVariantRequest
3047 above. Note that it may sometimes be useful to give an empty variant
3048 request, simply to disable the default for a specific set of fields
3049 (we aren't certain if this is proper <emphasis remap="it">Espec-1</emphasis>, but it works in
3050 this implementation).
3052 </listitem></varlistentry>
3057 The following is an example of an element specification belonging to
3064 simpleelement (1,10)
3065 simpleelement (1,12)
3067 simpleelement (1,14)
3069 simpleelement (4,52)
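<para>
The path, occurrence, and variant syntax described above can be combined as in the following sketch. It is a hypothetical element set file, not one shipped with the distribution: the paths are borrowed from the GILS excerpt, the variant triples are invented, and the directive names are written in lowercase to match the excerpt above.
</para>
<screen>
defaultvariantsetid Variant-1
# Used by every request below that has no variant part of its own.
defaultvariantrequest (1,1,@)

# A plain element, identified by (type,value).
simpleelement (2,1)
# Only the last occurrence of the element.
simpleelement (1,16):last
# Two occurrences, starting at position 1.
simpleelement (4,51):1+2
# A nested path, with every occurrence of the final element.
simpleelement (4,70)/(4,90)/(2,7):all
# A WildThing: any element directly below (4,70).
simpleelement (4,70)/?
# Override the default variant request for this element only.
simpleelement (2,6) variant (2,1,text)
</screen>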
3076 <sect2 id="schema-mapping">
3077 <title>The Schema Mapping (.map) Files</title>
3080 Sometimes, the client might want to receive a database record in
3081 a schema that differs from the native schema of the record. For
3082 instance, a client might only know how to process WAIS records, while
3083 the database record is represented in a more specific schema, such as
3084 GILS. In this module, a mapping of data to one of the MARC formats is
3085 also thought of as a schema mapping (mapping the elements of the
3086 record into fields consistent with the given MARC specification, prior
3087 to actually converting the data to the ISO2709 format). This use of the
3088 object identifier for USMARC as a schema identifier represents an
3089 overloading of the OID which might not be entirely proper. However,
3090 it represents the dual role of schema and record syntax which
3091 is assumed by the MARC family in Z39.50.
3095 <emphasis remap="it">NOTE: The schema-mapping functions are so far limited to a
3096 straightforward mapping of elements. This should be extended with
3097 mechanisms for conversions of the element contents, and conditional
3098 mappings of elements based on the record contents.</emphasis>
3102 These are the directives of the schema mapping file format:
3109 <term>targetName <emphasis remap="it">name</emphasis></term>
3112 (m) A symbolic name for the target schema
3113 of the table. Useful mostly for diagnostic purposes.
3115 </listitem></varlistentry>
3117 <term>targetRef <emphasis remap="it">OID-name</emphasis></term>
3120 (m) An OID name for the target schema.
3121 This is used, for instance, by a server receiving a request to present
3122 a record in a different schema from the native one. The name, again,
3123 is found in the <emphasis remap="bf">oid</emphasis> module of <emphasis remap="bf">YAZ</emphasis>.
3125 </listitem></varlistentry>
3127 <term>map <emphasis remap="it">element-name target-path</emphasis></term>
3131 Adds an element mapping rule to the table.
3133 </listitem></varlistentry>
3140 <title>The MARC (ISO2709) Representation (.mar) Files</title>
3143 This file provides rules for representing a record in the ISO2709
3144 format. The rules pertain mostly to the values of the constant-length
3145 header of the record.
3149 <emphasis remap="it">NOTE: This will be described better. We're in the process of
3150 re-evaluating and most likely changing the way that MARC records are
3151 handled by the system.</emphasis>
3156 <sect2 id="field-structure-and-character-sets">
3157 <title>Field Structure and Character Sets</title>
3161 In order to provide a flexible approach to national character set
3162 handling, Zebra allows the administrator to configure the
3163 system to handle any 8-bit character set, including sets that
3164 require multi-octet diacritics or other multi-octet characters. The
3165 definition of a character set includes a specification of the
3166 permissible values, their sort order (this affects the display in the
3167 SCAN function), and relationships between upper- and lowercase
3168 characters. Finally, the definition includes the specification of
3169 space characters for the set.
3173 The operator can define different character sets for different fields,
3174 typical examples being standard text fields, numerical fields, and
3175 special-purpose fields such as WWW-style linkages (URx).
3179 The field types, and hence character sets, are associated with data
3180 elements by the .abs files (see above). The file <literal remap="tt">default.idx</literal>
3181 provides the association between field type codes (as used in the .abs
3182 files) and the character map files (with the .chr suffix). The format
3183 of the .idx file is as follows:
3190 <term>index <emphasis remap="it">field type code</emphasis></term>
3193 This directive introduces a new
3194 search index code. The argument is a one-character code to be used in the
3195 .abs files to select this particular index type. An index, roughly,
3196 corresponds to a particular structure attribute during search. Refer
3197 to section <xref linkend="search"/>.
3199 </listitem></varlistentry>
3201 <term>sort <emphasis remap="it">field code type</emphasis></term>
3204 This directive introduces a
3205 sort index. The argument is a one-character code to be used in the
3206 .abs file to select this particular index type. The corresponding
3207 use attribute must be used in the sort request to refer to this
3208 particular sort index. The corresponding character map (see below)
3209 is used in the sort process.
3211 </listitem></varlistentry>
3213 <term>completeness <emphasis remap="it">boolean</emphasis></term>
3216 This directive enables or disables
3217 complete field indexing. The value of the <emphasis remap="it">boolean</emphasis> should be 0
3218 (disable) or 1 (enable). If completeness is enabled, the index entry will
3219 contain the complete contents of the field (up to a limit), with words
3220 (non-space characters) separated by single space characters
3221 (normalized to " " on display). When completeness is
3222 disabled, each word is indexed as a separate entry. Complete subfield
3223 indexing is most useful for fields which are typically browsed (e.g.,
3224 titles, authors, or subjects), or instances where a match on a
3225 complete subfield is essential (e.g., exact title searching). For fields
3226 where completeness is disabled, the search engine will interpret a
3227 search containing space characters as a word proximity search.
3229 </listitem></varlistentry>
3231 <term>charmap <emphasis remap="it">filename</emphasis></term>
3234 This is the filename of the character
3235 map to be used for this index or field type.
3237 </listitem></varlistentry>
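<para>
Putting these directives together, an <literal remap="tt">.idx</literal> file with two search indexes and one sort index might be sketched as follows. The type codes <literal remap="tt">p</literal> and <literal remap="tt">s</literal> and the file name <literal remap="tt">string.chr</literal> are assumptions made for the example; only <literal remap="tt">w</literal> is named in this chapter as the default field type.
</para>
<screen>
# Word index: selected with w in the .abs file; each word is an entry.
index w
completeness 0
charmap string.chr

# Phrase index: selected with p; the whole field is one entry.
index p
completeness 1
charmap string.chr

# Sort index, selected with s.
sort s
completeness 1
charmap string.chr
</screen>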
3242 The contents of the character map files are structured as follows:
3249 <term>lowercase <emphasis remap="it">value-set</emphasis></term>
3252 This directive introduces the basic
3253 value set of the field type. The format is an ordered list (without
3254 spaces) of the characters which may occur in "words" of
3255 the given type. The order of the entries in the list determines the
3256 sort order of the index. In addition to single characters, the
3257 following combinations are legal:
3266 Backslashes may be used to introduce three-digit octal, or
3267 two-digit hex representations of single characters (preceded by <literal remap="tt">x</literal>).
3268 In addition, the combinations
3269 \\, \r, \n, \t, and \s (space; remember that real space characters
3270 may not occur in the value definition) are recognised,
3271 with their usual interpretation.
3278 Curly braces {} may be used to enclose ranges of single
3279 characters (possibly using the escape convention described in the
3280 preceding point), e.g. {a-z} to introduce the standard range of ASCII
3281 characters. Note that the interpretation of such a range depends on
3282 the concrete representation in your local, physical character set.
3289 Parentheses () may be used to enclose multi-byte characters,
3290 e.g. diacritics or special national combinations (e.g. Spanish
3291 "ll"). When found in the input stream (or a search term),
3292 these characters are viewed and sorted as a single character, with a
3293 sorting value depending on the position of the group in the value
3301 </listitem></varlistentry>
3303 <term>uppercase <emphasis remap="it">value-set</emphasis></term>
3306 This directive introduces the
3307 uppercase equivalents of the value set (if any). The number and
3308 order of the entries in the list should be the same as in the
3309 <literal remap="tt">lowercase</literal> directive.
3311 </listitem></varlistentry>
3313 <term>space <emphasis remap="it">value-set</emphasis></term>
3316 This directive introduces the characters
3317 which separate words in the input stream. Depending on the
3318 completeness mode of the field in question, these characters either
3319 terminate an index entry, or delimit individual "words" in
3320 the input stream. The order of the elements is not significant;
3321 otherwise the representation is the same as for the <literal remap="tt">uppercase</literal> and
3322 <literal remap="tt">lowercase</literal> directives.
3324 </listitem></varlistentry>
3326 <term>map <emphasis remap="it">value-set</emphasis> <emphasis remap="it">target</emphasis></term>
3329 This directive introduces a
3330 mapping between each of the members of the value-set on the left to
3331 the character on the right. The character on the right must occur in
3332 the value set (the <literal remap="tt">lowercase</literal> directive) of the character set, but
3333 it may be a parenthesis-enclosed multi-octet character. This directive
3334 may be used to map diacritics to their base characters, or to map
3335 HTML-style character-representations to their natural form, etc.
3337 </listitem></varlistentry>
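<para>
A small character map using each of the four directives could look like the sketch below. It is illustrative only: it assumes an ISO 8859-1 input encoding for the octal escapes in the last line, and it treats the Spanish <literal remap="tt">ll</literal> as a single character that sorts directly after <literal remap="tt">l</literal>.
</para>
<screen>
# Digits sort first, then letters; (ll) sorts as one character after l.
lowercase {0-9}{a-l}(ll){m-z}
# Same number and order of entries as the lowercase line.
uppercase {0-9}{A-L}(LL){M-Z}
# Control characters and the space character (octal 001 to 040), plus
# some punctuation. \s could be used for the space character instead.
space {\001-\040}.,;:!?
# Map three accented letters, given as octal escapes, to plain e.
map \351\350\352 e
</screen>
<para>
The target of the <emphasis remap="bf">map</emphasis> line, <literal remap="tt">e</literal>, is valid because it falls inside the <literal remap="tt">{a-l}</literal> range of the <literal remap="tt">lowercase</literal> value set.
</para>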
3345 <sect1 id="formats">
3346 <title>Exchange Formats</title>
3349 Converting records from the internal structure to an exchange format
3350 is largely an automatic process. Currently, the following exchange
3351 formats are supported:
3360 GRS-1. The internal representation is based on GRS-1, so the
3361 conversion here is straightforward. The system will create
3362 applied variant and supported variant lists as required, if a record
3363 contains variant information.
3370 SUTRS. Again, the mapping is fairly straightforward. Indentation
3371 is used to show the hierarchical structure of the record. All
3372 "GRS" type records support both the GRS-1 and SUTRS
3380 ISO2709-based formats (USMARC, etc.). Only records with a
3381 two-level structure (corresponding to fields and subfields) can be
3382 directly mapped to ISO2709. For records with a different structuring
3383 (e.g., GILS), the representation in a structure like USMARC involves a
3384 schema-mapping (see section <xref linkend="schema-mapping"/>), to an
3385 "implied" USMARC schema (implied,
3386 because there is no formal schema which specifies the use of the
3387 USMARC fields outside of ISO2709). The resultant, two-level record is
3388 then mapped directly from the internal representation to ISO2709. See
3389 the GILS schema definition files for a detailed example of this
3397 Explain. This representation is only available for records
3398 belonging to the Explain schema.
3405 Summary. This ASN.1-based structure is only available for records
3406 belonging to the Summary schema, or schemas which provide a mapping
3407 to this schema (see the description of the schema mapping facility
3415 SOIF. Support for this syntax is experimental, and is currently
3416 keyed to a private Index Data OID (1.2.840.10003.5.1000.81.2). All
3417 abstract syntaxes can be mapped to the SOIF format, although nested
3418 elements are represented by concatenation of the tag names at each