1 <chapter id="architecture">
2 <title>Overview of &zebra; Architecture</title>
4 <section id="architecture-representation">
5 <title>Local Representation</title>
8 As mentioned earlier, &zebra; places few restrictions on the type of
9 data that you can index and manage. Generally, whatever the form of
10 the data, it is parsed by an input filter specific to that format, and
11 turned into an internal structure that &zebra; knows how to handle. This
12 process takes place whenever the record is accessed - for indexing and
The RecordType parameter in the <literal>zebra.cfg</literal> file, or
the <literal>-t</literal> option to the indexer, tells &zebra; how to
process input records.
20 Two basic types of processing are available - raw text and structured
21 data. Raw text is just that, and it is selected by providing the
22 argument <emphasis>text</emphasis> to &zebra;. Structured records are
23 all handled internally using the basic mechanisms described in the
25 &zebra; can read structured records in many different formats.
27 How this is done is governed by additional parameters after the
28 "grs" keyword, separated by "." characters.
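<para>
As a minimal sketch of the two mechanisms mentioned above, the record
type can be set either in <literal>zebra.cfg</literal> or on the
<command>zebraidx</command> command line. The
<literal>grs.sgml</literal> filter name is only an example, and the
directory <filename>records/</filename> is a hypothetical placeholder:
</para>
<screen>
# zebra.cfg: process all input records with the grs.sgml filter
recordType: grs.sgml

# the same choice given on the indexer command line
$ zebraidx -t grs.sgml update records/
</screen>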
33 <section id="architecture-maincomponents">
34 <title>Main Components</title>
36 The &zebra; system is designed to support a wide range of data management
37 applications. The system can be configured to handle virtually any
38 kind of structured data. Each record in the system is associated with
39 a <emphasis>record schema</emphasis> which lends context to the data
40 elements of the record.
41 Any number of record schemas can coexist in the system.
42 Although it may be wise to use only a single schema within
43 one database, the system poses no such restrictions.
46 The &zebra; indexer and information retrieval server consists of the
47 following main applications: the <command>zebraidx</command>
48 indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server. Both use some of the
same main components, which are presented here.
53 The virtual Debian package <literal>idzebra-2.0</literal>
54 installs all the necessary packages to start
55 working with &zebra; - including utility programs, development libraries,
56 documentation and modules.
59 <section id="componentcore">
60 <title>Core &zebra; Libraries Containing Common Functionality</title>
The core &zebra; module is the meat of the <command>zebraidx</command>
indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server binaries. In short, the core
libraries are responsible for
68 <term>Dynamic Loading</term>
70 <para>of external filter modules, in case the application is
71 not compiled statically. These filter modules define indexing,
72 search and retrieval capabilities of the various input formats.
77 <term>Index Maintenance</term>
79 <para> &zebra; maintains Term Dictionaries and ISAM index
80 entries in inverted index structures kept on disk. These are
optimized for fast insert, update and delete, as well as good
87 <term>Search Evaluation</term>
89 <para>by execution of search requests expressed in &acro.pqf;/&acro.rpn;
90 data structures, which are handed over from
91 the &yaz; server frontend &acro.api;. Search evaluation includes
92 construction of hit lists according to boolean combinations
of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluating specific index hit
lists in the correct order.
100 <term>Ranking and Sorting</term>
103 components call resorting/re-ranking algorithms on the hit
104 sets. These might also be pre-sorted not only using the
assigned document IDs, but also using assigned static rank
111 <term>Record Presentation</term>
<para>returns - possibly ranked - result sets, hit
numbers, and similar internal data to the &yaz; server backend &acro.api;
for shipping to the client. Each individual filter module
implements its own specific presentation formats.
The Debian package <literal>libidzebra-2.0</literal>
contains all run-time libraries for &zebra;. The
documentation in PDF and HTML is found in
<literal>idzebra-2.0-doc</literal>, and
<literal>idzebra-2.0-common</literal>
includes common essential &zebra; configuration files.
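<para>
To illustrate the boolean combination of simpler searches described
under Search Evaluation above, a &acro.pqf; query can join two
sub-queries with an operator such as <literal>@and</literal> or
<literal>@or</literal>. The following <command>yaz-client</command>
sketch assumes a <literal>title</literal> index; the search terms
are arbitrary:
</para>
<screen>
Z> f @and @attr 1=title my @attr 1=title life
Z> f @or @attr 1=title my @attr 1=title life
</screen>
<para>
Each sub-query yields its own hit list, and the operator determines
how the two lists are merged into the final result set.
</para>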
133 <section id="componentindexer">
134 <title>&zebra; Indexer</title>
136 The <command>zebraidx</command>
137 indexing maintenance utility
loads external filter modules used for indexing data records of
different types, and creates, updates and drops databases and
140 indexes according to the rules defined in the filter modules.
143 The Debian package <literal>idzebra-2.0-utils</literal> contains
144 the <command>zebraidx</command> utility.
148 <section id="componentsearcher">
149 <title>&zebra; Searcher/Retriever</title>
151 This is the executable which runs the &acro.z3950;/&acro.sru;/&acro.srw; server and
glues together the core libraries and the filter modules into one
153 great Information Retrieval server application.
156 The Debian package <literal>idzebra-2.0-utils</literal> contains
157 the <command>zebrasrv</command> utility.
161 <section id="componentyazserver">
162 <title>&yaz; Server Frontend</title>
The &yaz; server frontend is
a full-fledged stateful &acro.z3950; server taking client
166 connections, and forwarding search and scan requests to the
167 &zebra; core indexer.
In addition to &acro.z3950; requests, the &yaz; server frontend acts
as an HTTP server, honoring
172 <ulink url="&url.sru;">&acro.sru; &acro.soap;</ulink>
174 &acro.sru; &acro.rest;
175 requests. Moreover, it can
177 <ulink url="&url.cql;">&acro.cql;</ulink>
179 <ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>
181 correctly configured.
184 <ulink url="&url.yaz;">&yaz;</ulink>
186 toolkit that allows you to develop software using the
187 &acro.ansi; &acro.z3950;/ISO23950 standard for information retrieval.
188 It is packaged in the Debian packages
189 <literal>yaz</literal> and <literal>libyaz</literal>.
193 <section id="componentmodules">
194 <title>Record Models and Filter Modules</title>
196 The hard work of knowing <emphasis>what</emphasis> to index,
197 <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
198 part of the records to send in a search/retrieve response is
200 various filter modules. It is their responsibility to define the
201 exact indexing and record display filtering rules.
204 The virtual Debian package
205 <literal>libidzebra-2.0-modules</literal> installs all base filter
209 <section id="componentmodulesdom">
210 <title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
212 The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
213 internal data model, and can thus parse, index, and display
214 any &acro.xml; document.
A parser for binary &acro.marc; records based on the ISO2709 library
standard is provided; it transforms these to the internal
&acro.marcxml; &acro.dom; representation.
222 The internal &acro.dom; &acro.xml; representation can be fed into four
223 different pipelines, consisting of arbitrarily many successive
224 &acro.xslt; transformations; these are for
<listitem><para>input parsing and initial
transformations,</para></listitem>
<listitem><para>indexing term extraction
transformations,</para></listitem>
<listitem><para>transformations before internal document
storage, and</para></listitem>
<listitem><para>retrieve transformations from storage to output
format.</para></listitem>
The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on
your platform, even &acro.exslt;). This brings full &acro.xpath;
support to the indexing, storage and display rules of not only
&acro.xml; documents, but also binary &acro.marc; records.
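<para>
As a rough sketch, the four pipelines listed above might be declared
in a &acro.dom; filter configuration file along the following lines.
The element names and the namespace shown here are assumptions that
should be verified against <xref linkend="record-model-domxml"/>, and
all stylesheet file names are hypothetical:
</para>
<screen><![CDATA[
<dom xmlns="http://indexdata.com/zebra-2.0">
  <!-- input parsing and initial transformations -->
  <input>
    <xmlreader level="1"/>
  </input>
  <!-- indexing term extraction transformations -->
  <extract>
    <xslt stylesheet="record2index.xsl"/>
  </extract>
  <!-- transformations before internal document storage -->
  <store>
    <xslt stylesheet="record2store.xsl"/>
  </store>
  <!-- retrieve transformations from storage to an output format -->
  <retrieve name="dc">
    <xslt stylesheet="record2dc.xsl"/>
  </retrieve>
</dom>
]]></screen>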
Finally, the &acro.dom; &acro.xml; filter allows for static ranking at index
time, and for sorting hit lists according to predefined
248 Details on the experimental &acro.dom; &acro.xml; filter are found in
249 <xref linkend="record-model-domxml"/>.
252 The Debian package <literal>libidzebra-2.0-mod-dom</literal>
253 contains the &acro.dom; filter module.
257 <section id="componentmodulesalvis">
258 <title>ALVIS &acro.xml; Record Model and Filter Module</title>
261 The functionality of this record model has been improved and
262 replaced by the &acro.dom; &acro.xml; record model. See
263 <xref linkend="componentmodulesdom"/>.
268 The Alvis filter for &acro.xml; files is an &acro.xslt; based input
It indexes element and attribute content of any conceivable &acro.xml; format
271 using full &acro.xpath; support, a feature which the standard &zebra;
272 &acro.grs1; &acro.sgml; and &acro.xml; filters lacked. The indexed documents are
273 parsed into a standard &acro.xml; &acro.dom; tree, which restricts record size
274 according to availability of memory.
278 uses &acro.xslt; display stylesheets, which let
279 the &zebra; DB administrator associate multiple, different views on
280 the same &acro.xml; document type. These views are chosen on-the-fly in
284 In addition, the Alvis filter configuration is not bound to the
285 arcane &acro.bib1; &acro.z3950; library catalogue indexing traditions and
286 folklore, and is therefore easier to understand.
Finally, the Alvis filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks. This imposes no overhead at all: both
search and indexing still perform in
<emphasis>O(1)</emphasis>, irrespective of document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
298 Details on the experimental Alvis &acro.xslt; filter are found in
299 <xref linkend="record-model-alvisxslt"/>.
302 The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
303 contains the Alvis filter module.
307 <section id="componentmodulesgrs">
308 <title>&acro.grs1; Record Model and Filter Modules</title>
311 The functionality of this record model has been improved and
312 replaced by the &acro.dom; &acro.xml; record model. See
313 <xref linkend="componentmodulesdom"/>.
317 The &acro.grs1; filter modules described in
318 <xref linkend="grs"/>
are all based on the &acro.z3950; specifications, and it is absolutely
mandatory to have the reference pages on &acro.bib1; attribute sets
at hand when configuring &acro.grs1; filters. The GRS filters come in
different flavors, and a short introduction is needed here.
&acro.grs1; filters of various kinds have also been called ABS filters due
to the <filename>*.abs</filename> configuration file suffix.
327 The <emphasis>grs.marc</emphasis> and
328 <emphasis>grs.marcxml</emphasis> filters are suited to parse and
329 index binary and &acro.xml; versions of traditional library &acro.marc; records
330 based on the ISO2709 standard. The Debian package for both
332 <literal>libidzebra-2.0-mod-grs-marc</literal>.
&acro.grs1; TCL scriptable filters for extensive user configuration come
in two flavors: a regular expression filter
<emphasis>grs.regx</emphasis> using TCL regular expressions, and
a general scriptable TCL filter called
<emphasis>grs.tcl</emphasis>.
Both are included in the
<literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
344 A general purpose &acro.sgml; filter is called
<emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
but is planned to be in the
347 <literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
351 <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
352 <emphasis>grs.xml</emphasis> filter which uses <ulink
353 url="&url.expat;">Expat</ulink> to
parse records in &acro.xml; and turn them into ID&zebra;'s internal &acro.grs1; node
trees. Also have a look at the Alvis &acro.xml;/&acro.xslt; filter described in
360 <section id="componentmodulestext">
361 <title>TEXT Record Model and Filter Module</title>
363 Plain ASCII text filter. TODO: add information here.
368 <section id="componentmodulessafari">
369 <title>SAFARI Record Model and Filter Module</title>
371 SAFARI filter module TODO: add information here.
381 <section id="architecture-workflow">
382 <title>Indexing and Retrieval Workflow</title>
385 Records pass through three different states during processing in the
395 When records are accessed by the system, they are represented
in their local, or native format. This might be &acro.sgml; or HTML files,
News or Mail archives, or &acro.marc; records. If the system doesn't already
398 know how to read the type of data you need to store, you can set up an
399 input filter by preparing conversion rules based on regular
400 expressions and possibly augmented by a flexible scripting language
402 The input filter produces as output an internal representation,
410 When records are processed by the system, they are represented
in a tree-structure, constructed from tagged data elements hanging off a
412 root node. The tagged elements may contain data or yet more tagged
413 elements in a recursive structure. The system performs various
414 actions on this tree structure (indexing, element selection, schema
Before records are transmitted to the client, they are first
converted from the internal structure to a form suitable for exchange
over the network - according to the &acro.z3950; standard.
433 <section id="special-retrieval">
434 <title>Retrieval of &zebra; internal record data</title>
Starting with <literal>&zebra;</literal> version 2.0.5, it is
possible to use a special element set which has the prefix
<literal>zebra::</literal>.
Using this element set will, regardless of record type, return
&zebra;'s internal index structure/data for a record.
443 In particular, the regular record filters are not invoked when
445 This can in some cases make the retrieval faster than regular
446 retrieval operations (for &acro.marc;, &acro.xml; etc).
448 <table id="special-retrieval-types">
449 <title>Special Retrieval Elements</title>
453 <entry>Element Set</entry>
454 <entry>Description</entry>
455 <entry>Syntax</entry>
460 <entry><literal>zebra::meta::sysno</literal></entry>
461 <entry>Get &zebra; record system ID</entry>
462 <entry>&acro.xml; and &acro.sutrs;</entry>
465 <entry><literal>zebra::data</literal></entry>
466 <entry>Get raw record</entry>
470 <entry><literal>zebra::meta</literal></entry>
471 <entry>Get &zebra; record internal metadata</entry>
472 <entry>&acro.xml; and &acro.sutrs;</entry>
475 <entry><literal>zebra::index</literal></entry>
476 <entry>Get all indexed keys for record</entry>
477 <entry>&acro.xml; and &acro.sutrs;</entry>
481 <literal>zebra::index::</literal><replaceable>f</replaceable>
484 Get indexed keys for field <replaceable>f</replaceable> for record
486 <entry>&acro.xml; and &acro.sutrs;</entry>
490 <literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
493 Get indexed keys for field <replaceable>f</replaceable>
494 and type <replaceable>t</replaceable> for record
496 <entry>&acro.xml; and &acro.sutrs;</entry>
500 <literal>zebra::snippet</literal>
Get snippet for record for one or more indexes (f1,f2,..).
This includes a phrase from the original
record at the point where a match occurs (for a query). By default,
a few terms before and after the match are included in the snippet. The
matching terms are enclosed within element
<literal>&lt;s&gt;</literal>. The snippet facility requires
Zebra 2.0.16 or later.
511 <entry>&acro.xml; and &acro.sutrs;</entry>
515 <literal>zebra::facet::</literal><replaceable>f1</replaceable>:<replaceable>t1</replaceable>,<replaceable>f2</replaceable>:<replaceable>t2</replaceable>,..
Get facet of a result set. The facet result is returned
as if it was a normal record, while in reality it is a
recap of the most "important" terms in a result set for the fields
522 The facet facility first appeared in Zebra 2.0.20.
524 <entry>&acro.xml;</entry>
For example, to fetch the raw binary record data stored in the
&zebra; internal storage, or on the filesystem, the following
commands can be issued:
534 Z> f @attr 1=title my
536 Z> elements zebra::data
546 <literal>zebra::data</literal> element set name is
547 defined for any record syntax, but will always fetch
548 the raw record data in exactly the original form. No record syntax
549 specific transformations will be applied to the raw record data.
552 Also, &zebra; internal metadata about the record can be accessed:
554 Z> f @attr 1=title my
556 Z> elements zebra::meta::sysno
displays only the internal record system number, in <literal>&acro.xml;</literal>
record syntax, whereas
562 Z> f @attr 1=title my
564 Z> elements zebra::meta
displays all available metadata on the record. These include the system
number, database name, indexed filename, filter used for indexing,
score and static ranking information, and finally the byte size of the record.
Sometimes, it is very hard to figure out exactly what has been
indexed, how, and in which indexes. Using the indexing stylesheet of
the Alvis filter, one can at least see which portion of the record
went into which index, but no similar aid exists for the
other indexing filters.
580 <literal>zebra::index</literal> element set names are provided to
581 access information on per record indexed fields. For example, the
584 Z> f @attr 1=title my
586 Z> elements zebra::index
will display all indexed tokens from all indexed fields of the
first record in <literal>&acro.sutrs;</literal>
record syntax, whereas
593 Z> f @attr 1=title my
595 Z> elements zebra::index::title
597 Z> elements zebra::index::title:p
600 displays in <literal>&acro.xml;</literal> record syntax only the content
601 of the zebra string index <literal>title</literal>, or
602 even only the type <literal>p</literal> phrase indexed part of it.
Trying to access numeric <literal>&acro.bib1;</literal> use
attributes or trying to access non-existent &zebra; internal string
access points will result in a Diagnostic 25: Specified element set
name not valid for specified database.
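<para>
The snippet and facet element sets listed in the table above are
requested in the same way as the other special element sets. A
minimal sketch, assuming a <literal>title</literal> index with a type
<literal>p</literal> phrase index as in the earlier examples (the
search term and the <literal>s 1+1</literal> show command are
illustrative):
</para>
<screen>
Z> f @attr 1=title my
Z> format xml
Z> elements zebra::snippet
Z> s 1+1
Z> elements zebra::facet::title:p
Z> s 1+1
</screen>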
616 <!-- Keep this comment at the end of the file
621 sgml-minimize-attributes:nil
622 sgml-always-quote-attributes:t
625 sgml-parent-document: "idzebra.xml"
626 sgml-local-catalogs: nil
627 sgml-namecase-general:t