1 <chapter id="administration">
2 <!-- $Id: administration.xml,v 1.4 2002-04-09 19:20:22 adam Exp $ -->
3 <title>Administrating Zebra</title>
6 Unlike many simpler retrieval systems, Zebra supports safe, incremental
7 updates to an existing index.
11 Normally, when Zebra modifies the index it reads a number of records
13 Depending on your specifications and on the contents of each record,
14 one of the following events takes place for each record:
21 The record is indexed as if it never occurred before.
22 Either Zebra does not know how to identify the record, or
23 it can identify the record but finds that it has not been indexed already.
31 The record has already been indexed.
32 In this case either the contents of the record or the location
33 (file) of the record indicates that it has been indexed before.
41 The record is deleted from the index. As in the
42 update case, Zebra must be able to identify the record.
50 Please note that in both the modify and delete cases the Zebra
51 indexer must be able to generate a unique key that identifies the record
52 in question (more on this below).
56 To administrate the Zebra retrieval system, you run the
57 <literal>zebraidx</literal> program.
58 This program supports a number of options which are preceded by a dash,
59 and a few commands (not preceded by a dash).
63 Both the Zebra administrative tool and the Z39.50 server share a
64 set of index files and a global configuration file.
65 The name of the configuration file defaults to
66 <literal>zebra.cfg</literal>.
67 The configuration file includes specifications on how to index
68 various kinds of records and where the other configuration files
69 are located. <literal>zebrasrv</literal> and <literal>zebraidx</literal>
70 <emphasis>must</emphasis> be run in the directory where the
71 configuration file lives unless you give the location of the
72 configuration file with the <literal>-c</literal> option.
75 <sect1 id="record-types">
76 <title>Record Types</title>
79 Indexing is a per-record process in which an insert, modify, or delete
80 will occur. Before a record is indexed, search keys are extracted from
81 the original record, whatever its layout (SGML, HTML, text, etc.).
82 The Zebra system currently supports two fundamental types of records:
83 structured and simple text.
84 To specify a particular extraction process, use either the
85 command line option <literal>-t</literal> or specify a
86 <literal>recordType</literal> setting in the configuration file.
91 <sect1 id="configuration-file">
92 <title>The Zebra Configuration File</title>
95 The Zebra configuration file, read by <literal>zebraidx</literal> and
96 <literal>zebrasrv</literal>, defaults to <literal>zebra.cfg</literal>
97 unless another file is specified with the <literal>-c</literal> option.
101 You can edit the configuration file with a normal text editor.
102 Parameter names and values are separated by colons in the file. Lines
103 starting with a hash sign (<literal>#</literal>) are treated as comments.
108 If you manage different sets of records that share common
109 characteristics, you can organize the configuration settings for each set into a group.
111 When <literal>zebraidx</literal> is run and you wish to address a
112 given group, you specify the group name with the <literal>-g</literal> option.
114 In this case settings that have the group name as their prefix
115 will be used by <literal>zebraidx</literal>.
116 If no <literal>-g</literal> option is specified, the settings
117 without prefix are used.
121 In the configuration file, the group name is placed before the option
122 name itself, separated by a dot (.). For instance, to set the record type
123 for group <literal>public</literal> to <literal>grs.sgml</literal>
124 (the SGML-like format for structured records) you would write:
129 public.recordType: grs.sgml
134 To set the default value of the record type to <literal>text</literal>, write <literal>recordType: text</literal>.
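The group lookup described above can be sketched in a few lines of Python. This is only an illustration of the rules stated in this section, not Zebra's own code: the function names <literal>parse_config</literal> and <literal>lookup</literal> are hypothetical, and the fallback from a group-prefixed setting to the unprefixed one is an assumption based on the unprefixed entry being described as the default value.

```python
def parse_config(text):
    """Parse 'name: value' lines; lines starting with '#' are comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        settings[name.strip()] = value.strip()
    return settings


def lookup(settings, name, group=None):
    """Return the setting for a group, or the unprefixed setting."""
    if group is not None:
        prefixed = f"{group}.{name}"
        if prefixed in settings:
            return settings[prefixed]
    # Assumption: the unprefixed entry acts as the default for every group.
    return settings.get(name)


SAMPLE = """\
# record types
public.recordType: grs.sgml
recordType: text
"""

settings = parse_config(SAMPLE)
print(lookup(settings, "recordType", group="public"))  # group-specific value
print(lookup(settings, "recordType"))                  # value without prefix
```

Splitting each line at the first colon keeps values such as <literal>/d1:200M</literal> intact.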
145 The available configuration settings are summarized below. They will be
146 explained further in the following sections.
154 <emphasis>group</emphasis>
155 .recordType[<emphasis>.name</emphasis>]:
156 <replaceable>type</replaceable>
160 Specifies how records with the file extension
161 <emphasis>name</emphasis> should be handled by the indexer.
162 This option may also be specified as a command line option
163 (<literal>-t</literal>). Note that if you do not specify a
164 <emphasis>name</emphasis>, the setting applies to all files.
165 In general, the record type specifier consists of the elements (each
166 element separated by dot), <emphasis>fundamental-type</emphasis>,
167 <emphasis>file-read-type</emphasis> and arguments. Currently, two
168 fundamental types exist, <literal>text</literal> and
169 <literal>grs</literal>.
174 <term><emphasis>group</emphasis>.recordId:
175 <replaceable>record-id-spec</replaceable></term>
178 Specifies how the records are to be identified when updated. See
179 <xref linkend="locating-records"/>.
184 <term><emphasis>group</emphasis>.database:
185 <replaceable>database</replaceable></term>
188 Specifies the Z39.50 database name.
193 <term><emphasis>group</emphasis>.storeKeys:
194 <replaceable>boolean</replaceable></term>
197 Specifies whether key information should be saved for a given
198 group of records. If you plan to update or delete records of this
199 type later, this should be specified as 1; otherwise it
200 should be 0 (the default), to save register space.
201 See <xref linkend="file-ids"/>.
206 <term><emphasis>group</emphasis>.storeData:
207 <replaceable>boolean</replaceable></term>
210 Specifies whether the records should be stored internally
211 in the Zebra system files.
212 If you want to maintain the raw records yourself,
213 this option should be false (0).
214 If you want Zebra to take care of the records for you, it should be true (1).
220 <term>register: <replaceable>register-location</replaceable></term>
223 Specifies the location of the various register files that Zebra uses
224 to represent your databases.
225 See <xref linkend="register-location"/>.
230 <term>shadow: <replaceable>register-location</replaceable></term>
233 Enables the <emphasis>safe update</emphasis> facility of Zebra, and
234 tells the system where to place the required, temporary files.
235 See <xref linkend="shadow-registers"/>.
240 <term>lockDir: <replaceable>directory</replaceable></term>
243 Directory in which various lock files are stored.
248 <term>keyTmpDir: <replaceable>directory</replaceable></term>
251 Directory in which temporary files used during the <literal>zebraidx</literal> update phase are stored.
257 <term>setTmpDir: <replaceable>directory</replaceable></term>
260 Specifies the directory that the server uses for temporary result sets.
261 If not specified <literal>/tmp</literal> will be used.
266 <term>profilePath: <literal>path</literal></term>
269 Specifies a path of profile specification files.
270 The path is composed of one or more directories separated by
271 colons, similar to PATH on UNIX systems.
276 <term>attset: <replaceable>filename</replaceable></term>
279 Specifies the filename(s) of attribute set files for use in
280 searching. At least the Bib-1 set should be loaded
281 (<literal>bib1.att</literal>).
282 The <literal>profilePath</literal> setting is used to look for the specified files.
284 See <xref linkend="attset-files"/>
289 <term>memMax: <replaceable>size</replaceable></term>
292 Specifies <replaceable>size</replaceable> of internal memory
293 to use for the zebraidx program.
294 The amount is given in megabytes; the default is 4 (4 MB).
300 <term>root: <replaceable>dir</replaceable></term>
303 Specifies a directory base for Zebra. All relative paths
304 given (in profilePath, register, shadow) are based on this
305 directory. This setting is useful if your Zebra server
306 is running in a different directory from where
307 <literal>zebra.cfg</literal> is located.
317 <sect1 id="locating-records">
318 <title>Locating Records</title>
321 The default behavior of the Zebra system is to reference the
322 records from their original location, i.e. where they were found when you
323 ran <literal>zebraidx</literal>.
324 That is, when a client wishes to retrieve a record
325 following a search operation, the files are accessed from the place
326 where you originally put them. If you remove the files (without
327 running <literal>zebraidx</literal> again), the client
328 will receive a diagnostic message.
332 If your input files are not permanent - for example if you retrieve
333 your records from an outside source, or if they were temporarily
334 mounted on a CD-ROM drive,
335 you may want Zebra to make an internal copy of them. To do this,
336 you specify 1 (true) in the <literal>storeData</literal> setting. When
337 the Z39.50 server retrieves the records they will be read from the
338 internal file structures of the system.
343 <sect1 id="simple-indexing">
344 <title>Indexing with no Record IDs (Simple Indexing)</title>
347 If you have a set of records that are not expected to change over time
348 you can build your database without record IDs.
349 This indexing method uses less space than the other methods.
354 To use this method, you simply omit the <literal>recordId</literal> entry
355 for the group of files that you index. To add a set of records you use
356 <literal>zebraidx</literal> with the <literal>update</literal> command. The
357 <literal>update</literal> command will always add all of the records that it
358 encounters to the index - whether they have already been indexed or
359 not. If the set of indexed files changes, you should delete all of the
360 index files, and build a new index from scratch.
364 Consider a system in which you have a group of text files called
365 <literal>simple</literal>.
366 That group of records should belong to a Z39.50 database called
367 <literal>textbase</literal>.
368 The following <literal>zebra.cfg</literal> file will suffice:
373 profilePath: /usr/local/yaz
375 simple.recordType: text
376 simple.database: textbase
382 Since the existing records in an index cannot be addressed by their
383 IDs, it is impossible to delete or modify records when using this method.
388 <sect1 id="file-ids">
389 <title>Indexing with File Record IDs</title>
392 If you have a set of files that regularly change over time (old files
393 are deleted, new ones are added, or existing files are modified), you
394 can benefit from using the <emphasis>file ID</emphasis>
395 indexing methodology.
396 Examples of this type of database might include an index of WWW
397 resources, or a USENET news spool area.
398 Briefly speaking, the file key methodology uses the directory paths
399 of the individual records as a unique identifier for each record.
400 To perform indexing of a directory with file keys, again, you specify
401 the top-level directory after the <literal>update</literal> command.
402 The command will recursively traverse the directories and compare
403 each one with whatever has been indexed before in that same directory.
404 If a file is new (not in the previous version of the directory) it
405 is inserted into the registers; if a file was already indexed and
406 it has been modified since the last update, the index is also
407 modified; if a file has been removed since the last
408 visit, it is deleted from the index.
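The three outcomes described above (insert, modify, delete) can be illustrated with a small Python sketch. It is a simplified model, not Zebra's implementation: the function name <literal>classify_changes</literal> is hypothetical, and detecting a modified file by comparing modification times is an assumption, since this section does not say how changes are detected.

```python
def classify_changes(previous, current):
    """Compare two {path: mtime} snapshots and return an action per file.

    `previous` describes what was indexed during the last update and
    `current` describes what is found in the directory tree now.
    """
    actions = {}
    for path, mtime in current.items():
        if path not in previous:
            actions[path] = "insert"      # new file
        elif previous[path] != mtime:
            actions[path] = "modify"      # assumed: mtime change means modified
    for path in previous:
        if path not in current:
            actions[path] = "delete"      # file removed since the last visit
    return actions


# Hypothetical snapshots of /data1/records before and after some changes.
before = {"/data1/records/a.sgml": 100, "/data1/records/b.sgml": 100}
after = {"/data1/records/a.sgml": 250, "/data1/records/c.sgml": 300}

changes = classify_changes(before, after)
for path in sorted(changes):
    print(path, changes[path])
```

Files that are present in both snapshots with the same timestamp produce no action, which matches the idea that unchanged files leave the index untouched.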
412 The resulting system is easy to administrate. To delete a record you
413 simply have to delete the corresponding file (say, with the
414 <literal>rm</literal> command). And to add records you create new
415 files (or directories with files). For your changes to take effect
416 in the register you must run <literal>zebraidx update</literal> with
417 the same directory root again. This mode of operation requires more
418 disk space than simpler indexing methods, but it makes it easier for
419 you to keep the index in sync with a frequently changing set of data.
420 If you combine this system with the <emphasis>safe update</emphasis>
421 facility (see below), you never have to take your server off-line for
422 maintenance or register updating purposes.
426 To enable indexing with pathname IDs, you must specify
427 <literal>file</literal> as the value of <literal>recordId</literal>
428 in the configuration file. In addition, you should set
429 <literal>storeKeys</literal> to <literal>1</literal>, since the Zebra
430 indexer must save additional information about the contents of each record
431 in order to modify the indices correctly at a later time.
435 For example, to update records of group <literal>esdd</literal> located below
437 <literal>/data1/records/</literal> you should type:
439 $ zebraidx -g esdd update /data1/records
444 The corresponding configuration file includes:
447 esdd.recordType: grs.sgml
453 <para>You cannot start out with a group of records with simple
454 indexing (no record IDs as in the previous section) and then later
455 enable file record IDs. Zebra must know from the first time that you index the group that
457 the files should be indexed with file record IDs.
462 You cannot explicitly delete records when using this method (using the
463 <literal>delete</literal> command to <literal>zebraidx</literal>). Instead
464 you have to delete the files from the file system (or move them to a different location)
466 and then run <literal>zebraidx</literal> with the
467 <literal>update</literal> command.
471 <sect1 id="generic-ids">
472 <title>Indexing with General Record IDs</title>
475 When using this method you construct an (almost) arbitrary, internal
476 record key based on the contents of the record itself and other system
477 information. If you have a group of records that explicitly associates
478 an ID with each record, this method is convenient. For example, the
479 record format may contain a title or an ID number that is unique within the group.
480 In either case you specify the Z39.50 attribute set and use-attribute
481 location in which this information is stored, and the system looks at
482 that field to determine the identity of the record.
486 As before, the record ID is defined by the <literal>recordId</literal>
487 setting in the configuration file. The value of the record ID specification
488 consists of one or more tokens separated by whitespace. The resulting
489 ID is represented in the index by concatenating the tokens and
490 separating them with the character whose ASCII value is 1.
494 There are three kinds of tokens:
498 <term>Internal record info</term>
501 The token refers to a key that is
502 extracted from the record. The syntax of this token is
503 <literal>(</literal> <emphasis>set</emphasis> <literal>,</literal>
504 <emphasis>use</emphasis> <literal>)</literal>,
505 where <emphasis>set</emphasis> is the
506 attribute set name and <emphasis>use</emphasis> is the
507 name or value of the attribute.
512 <term>System variable</term>
515 The system variables are preceded by a dollar sign (<literal>$</literal>)
520 and immediately followed by the system variable name, which may be one of the following:
533 <term>database</term>
536 Current database specified.
553 <term>Constant string</term>
556 A string used as part of the ID, surrounded
557 by single or double quotes.
565 For instance, the sample GILS records that come with the Zebra
566 distribution contain a unique ID in the data tagged Control-Identifier.
567 The data is mapped to the Bib-1 use attribute Identifier-standard
568 (code 1007). To use this field as a record id, specify
569 <literal>(bib1,Identifier-standard)</literal> as the value of the
570 <literal>recordId</literal> in the configuration file.
571 If you have other record types that use the same field for a
572 different purpose, you might add the record type
573 (or group or database name) to the record ID of the GILS
574 records as well, to prevent matches with other types of records.
575 In this case the recordId might be set like this:
578 gils.recordId: $type (bib1,Identifier-standard)
584 (see <xref linkend="data-model"/>
585 for details of how the mapping between elements of your records and
586 searchable attributes is established).
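To make the token rules concrete, the following Python sketch builds an ID from a <literal>recordId</literal> specification. It is a hedged illustration only: the function <literal>build_record_id</literal>, the dictionaries passed to it, and the sample values are hypothetical, and the tokenizer simply follows the three token kinds and the ASCII 1 separator described in this section.

```python
import re

# One alternative per token kind described above.
TOKEN_RE = re.compile(
    r"\(\s*([^,()\s]+)\s*,\s*([^()]+?)\s*\)"  # (set,use): key from the record
    r"|\$(\w+)"                               # $name: system variable
    r"|\"([^\"]*)\""                          # "constant string"
    r"|'([^']*)'"                             # 'constant string'
)

SEPARATOR = "\x01"  # tokens are joined by the character with ASCII value 1


def build_record_id(spec, record_keys, variables):
    """Resolve each token of a recordId spec and join the results."""
    parts = []
    for match in TOKEN_RE.finditer(spec):
        set_name, use, var, dq, sq = match.groups()
        if set_name is not None:
            parts.append(record_keys[(set_name, use)])
        elif var is not None:
            parts.append(variables[var])
        else:
            parts.append(dq if dq is not None else sq)
    return SEPARATOR.join(parts)


# Hypothetical values for one GILS record.
record_keys = {("bib1", "Identifier-standard"): "ABC-123"}
variables = {"type": "grs.sgml"}

record_id = build_record_id(
    "$type (bib1,Identifier-standard)", record_keys, variables
)
print(repr(record_id))
```

Because the record type is part of the ID, two records of different types that share the same identifier value end up with different IDs.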
590 As for the file record ID case described in the previous section,
591 updating your system is simply a matter of running
592 <literal>zebraidx</literal>
593 with the <literal>update</literal> command. However, the update with general
594 keys is considerably slower than with file record IDs, since all files
595 visited must be (re)read to discover their IDs.
599 As you might expect, when using the general record IDs
600 method, you can only add or modify existing records with the
601 <literal>update</literal> command.
602 If you wish to delete records, you must use the
603 <literal>delete</literal> command, with a directory as a parameter.
604 This will remove all records that match the files below that root directory.
610 <sect1 id="register-location">
611 <title>Register Location</title>
614 Normally, the index files that form dictionaries, inverted
615 files, record info, etc., are stored in the directory where you run
616 <literal>zebraidx</literal>. If you wish to store these, possibly large,
617 files somewhere else, you must add the <literal>register</literal>
618 entry to the <literal>zebra.cfg</literal> file.
619 Furthermore, the Zebra system allows its file
620 structures to span multiple file systems, which is useful for
621 managing very large databases.
625 The value of the <literal>register</literal> setting is a sequence
626 of tokens. Each token takes the form:
629 <emphasis>dir</emphasis><literal>:</literal><emphasis>size</emphasis>.
632 The <emphasis>dir</emphasis> specifies a directory in which index files
633 will be stored and the <emphasis>size</emphasis> specifies the maximum
634 size of all files in that directory. The Zebra indexer system fills
635 each directory in the order specified and uses the next specified
636 directories as needed.
637 The <emphasis>size</emphasis> is an integer followed by a qualifier
638 code, <literal>M</literal> for megabytes,
639 <literal>k</literal> for kilobytes.
643 For instance, if you have allocated two disks for your register, and
644 the first disk is mounted
645 on <literal>/d1</literal> and has 200 MB of free space and the
646 second, mounted on <literal>/d2</literal>, has 300 MB, you could
647 put this entry in your configuration file:
650 register: /d1:200M /d2:300M
656 Note that Zebra does not verify that the amount of space specified is
657 actually available on the directory (file system) specified - it is
658 your responsibility to ensure that enough space is available, and that
659 other applications do not attempt to use the free space. In a large
660 production system, it is recommended that you allocate one or more
661 file systems exclusively to the Zebra register files.
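The <literal>register</literal> syntax and the fill order can be sketched in Python as follows. This is an illustrative model rather than Zebra's code: the function names are hypothetical, the unit multipliers are assumed to be binary (1 k = 1024 bytes), and real register files are not necessarily placed one whole file at a time.

```python
# Assumption: qualifiers are binary multiples.
UNITS = {"k": 1024, "M": 1024 * 1024}


def parse_register(value):
    """Turn '/d1:200M /d2:300M' into [(directory, limit_in_bytes), ...]."""
    areas = []
    for token in value.split():
        directory, _, size = token.rpartition(":")
        areas.append((directory, int(size[:-1]) * UNITS[size[-1]]))
    return areas


def choose_directory(areas, used, needed):
    """Return the first directory, in the order given, with room left."""
    for directory, limit in areas:
        if limit - used.get(directory, 0) >= needed:
            return directory
    raise RuntimeError("all register areas are full")


areas = parse_register("/d1:200M /d2:300M")
print(areas)

# With /d1 already filled to its limit, the next directory is used.
full_d1 = {"/d1": 200 * 1024 * 1024}
print(choose_directory(areas, full_d1, needed=4096))
```

Note that the limits are only bookkeeping values taken from the configuration; as stated above, nothing checks that the space actually exists on disk.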
666 <sect1 id="shadow-registers">
667 <title>Safe Updating - Using Shadow Registers</title>
670 <title>Description</title>
673 The Zebra server supports <emphasis>updating</emphasis> of the index
674 structures. That is, you can add, modify, or remove records from
675 databases managed by Zebra without rebuilding the entire index.
676 Since this process involves modifying structured files with various
677 references between blocks of data in the files, the update process
678 is inherently sensitive to system crashes, or to process interruptions:
679 Anything but a successfully completed update process will leave the
680 register files in an unknown state, and you will essentially have no
681 recourse but to re-index everything, or to restore the register files
682 from a backup medium.
683 Further, while the update process is active, users cannot be
684 allowed to access the system, as the contents of the register files
685 may change unpredictably.
689 You can solve these problems by enabling the shadow register system in Zebra.
691 During the updating procedure, <literal>zebraidx</literal> will temporarily
692 write changes to the involved files in a set of "shadow
693 files", without modifying the files that are accessed by the
694 active server processes. If the update procedure is interrupted by a
695 system crash or a signal, you simply repeat the procedure - the
696 register files have not been changed or damaged, and the partially
697 written shadow files are automatically deleted before the new updating procedure begins.
702 At the end of the updating procedure (or in a separate operation, if
703 you so desire), the system enters a "commit mode". First,
704 any active server processes are forced to access those blocks that
705 have been changed from the shadow files rather than from the main
706 register files; the unmodified blocks are still accessed at their
707 normal location (the shadow files are not a complete copy of the
708 register files - they only contain those parts that have actually been
709 modified). If the commit process is interrupted at any point,
710 the server processes will continue to access the
711 shadow files until you can repeat the commit procedure and complete
712 the writing of data to the main register files. You can perform
713 multiple update operations to the registers before you commit the
714 changes to the system files, or you can execute the commit operation
715 at the end of each update operation. When the commit phase has
716 completed successfully, any running server processes are instructed to
717 switch their operations to the new, operational register, and the
718 temporary shadow files are deleted.
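The update and commit behaviour described above can be modelled with a toy in-memory structure. The Python sketch below is a deliberate simplification, not a description of Zebra's file formats or locking: the class <literal>ShadowedRegister</literal> and its methods are hypothetical, and blocks are plain dictionary entries.

```python
class ShadowedRegister:
    """Toy model: main register blocks plus a shadow area of changes."""

    def __init__(self, blocks):
        self.main = dict(blocks)   # block number -> data, read by servers
        self.shadow = {}           # holds modified blocks only
        self.committing = False

    def write(self, block, data):
        """Updates go to the shadow area; the main register is untouched."""
        self.shadow[block] = data

    def read(self, block):
        """Servers read changed blocks from the shadow only during commit."""
        if self.committing and block in self.shadow:
            return self.shadow[block]
        return self.main[block]

    def discard(self):
        """After an interrupted update, drop the partial shadow data."""
        if not self.committing:
            self.shadow.clear()

    def commit(self):
        """Copy shadow blocks into the main register, then drop them."""
        self.committing = True          # servers now prefer shadow blocks
        for block, data in self.shadow.items():
            self.main[block] = data     # safe to repeat if interrupted
        self.shadow.clear()
        self.committing = False         # servers return to the main register


reg = ShadowedRegister({0: "a", 1: "b"})
reg.write(1, "B")
print(reg.read(1))   # still the old block: the update is not committed
reg.commit()
print(reg.read(1))   # the new block, now in the main register
```

The point of the model is that readers never see a half-written main register: before commit they see only old blocks, and during commit every changed block is available from the shadow area.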
724 <title>How to Use Shadow Register Files</title>
727 The first step is to allocate space on your system for the shadow files.
729 You do this by adding a <literal>shadow</literal> entry to the
730 <literal>zebra.cfg</literal> file.
731 The syntax of the <literal>shadow</literal> entry is exactly the
732 same as for the <literal>register</literal> entry
733 (see <xref linkend="register-location"/>).
734 The location of the shadow area should be
735 <emphasis>different</emphasis> from the location of the main register
736 area (if you have specified one - remember that if you provide no
737 <literal>register</literal> setting, the default register area is the
738 working directory of the server and indexing processes).
742 The following excerpt from a <literal>zebra.cfg</literal> file shows
743 one example of a setup that configures both the main register
744 location and the shadow file area.
745 Note that two directories or partitions have been set aside
746 for the shadow file area. You can specify any number of directories
747 for each of the file areas, but remember that there should be no
748 overlaps between the directories used for the main registers and the
749 shadow files, respectively.
756 shadow: /scratch1:100M /scratch2:200M
762 When shadow files are enabled, an extra command is available at the
763 <literal>zebraidx</literal> command line.
764 In order to make changes to the system take effect for the
765 users, you'll have to submit a "commit" command after a
766 (sequence of) update operation(s).
767 You can ask the indexer to commit the changes immediately
768 after the update operation:
774 $ zebraidx update /d1/records update /d2/more-records commit
780 Or you can execute multiple updates before committing the changes:
786 $ zebraidx -g books update /d1/records update /d2/more-records
787 $ zebraidx -g fun update /d3/fun-records
788 $ zebraidx commit
794 If one of the update operations above had been interrupted, the commit
795 operation on the last line would fail: <literal>zebraidx</literal>
796 will not let you commit changes that would destroy the running register.
797 You'll have to rerun all of the update operations since your last
798 commit operation, before you can commit the new changes.
802 Similarly, if the commit operation fails, <literal>zebraidx</literal>
803 will not let you start a new update operation before you have
804 successfully repeated the commit operation.
805 The server processes will keep accessing the shadow files rather
806 than the (possibly damaged) blocks of the main register files
807 until the commit operation has successfully completed.
811 You should be aware that update operations may take slightly longer
812 when the shadow register system is enabled, since more file access
813 operations are involved. Further, while the disk space required for
814 the shadow register data is modest for a small update operation, you
815 may prefer to disable the system if you are adding a very large number
816 of records to an already very large database (we use the terms
817 <emphasis>large</emphasis> and <emphasis>modest</emphasis>
818 very loosely here, since every application will have a
819 different perception of size).
820 To update the system without the use of the shadow files,
821 simply run <literal>zebraidx</literal> with the <literal>-n</literal>
822 option (note that you do not have to execute the
823 <emphasis>commit</emphasis> command of <literal>zebraidx</literal>
824 when you temporarily disable the use of the shadow registers in this way).
826 Note also that, just as when the shadow registers are not enabled,
827 server processes will be barred from accessing the main register
828 while the update procedure takes place.
836 <!-- Keep this comment at the end of the file
841 sgml-minimize-attributes:nil
842 sgml-always-quote-attributes:t
845 sgml-parent-document: "zebra.xml"
846 sgml-local-catalogs: nil
847 sgml-namecase-general:t