1 <chapter id="architecture">
2 <!-- $Id: architecture.xml,v 1.22 2007-05-24 13:44:09 adam Exp $ -->
3 <title>Overview of &zebra; Architecture</title>
5 <section id="architecture-representation">
6 <title>Local Representation</title>
9 As mentioned earlier, &zebra; places few restrictions on the type of
10 data that you can index and manage. Generally, whatever the form of
11 the data, it is parsed by an input filter specific to that format, and
12 turned into an internal structure that &zebra; knows how to handle. This
process takes place whenever the record is accessed, for both indexing and retrieval.
18 The RecordType parameter in the <literal>zebra.cfg</literal> file, or
the <literal>-t</literal> option to the indexer, tells &zebra; how to
20 process input records.
21 Two basic types of processing are available - raw text and structured
22 data. Raw text is just that, and it is selected by providing the
23 argument <emphasis>text</emphasis> to &zebra;. Structured records are
24 all handled internally using the basic mechanisms described in the
26 &zebra; can read structured records in many different formats.
28 How this is done is governed by additional parameters after the
29 "grs" keyword, separated by "." characters.
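<para>
As a minimal sketch of the naming scheme described above, the following
Python fragment splits a record type value into its leading keyword and
the "."-separated parameters that follow it. The function name
<literal>split_record_type</literal> is hypothetical and not part of
&zebra;; the sample values are filter names mentioned elsewhere in this
chapter.
</para>

```python
def split_record_type(value):
    """Split a record type such as "grs.sgml" into (keyword, parameters)."""
    # The first component is the basic processing type ("text" or "grs");
    # anything after it is treated as an additional parameter.
    keyword, *params = value.strip().split(".")
    return keyword, params


if __name__ == "__main__":
    for value in ("text", "grs.sgml", "grs.marcxml"):
        print(value, "->", split_record_type(value))
```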
34 <section id="architecture-maincomponents">
35 <title>Main Components</title>
37 The &zebra; system is designed to support a wide range of data management
38 applications. The system can be configured to handle virtually any
39 kind of structured data. Each record in the system is associated with
40 a <emphasis>record schema</emphasis> which lends context to the data
41 elements of the record.
42 Any number of record schemas can coexist in the system.
43 Although it may be wise to use only a single schema within
44 one database, the system poses no such restrictions.
47 The &zebra; indexer and information retrieval server consists of the
48 following main applications: the <command>zebraidx</command>
49 indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server. Both use some of the
same main components, which are presented here.
54 The virtual Debian package <literal>idzebra-2.0</literal>
55 installs all the necessary packages to start
56 working with &zebra; - including utility programs, development libraries,
57 documentation and modules.
60 <section id="componentcore">
61 <title>Core &zebra; Libraries Containing Common Functionality</title>
63 The core &zebra; module is the meat of the <command>zebraidx</command>
64 indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server binaries. In short, the core
libraries are responsible for
69 <term>Dynamic Loading</term>
71 <para>of external filter modules, in case the application is
72 not compiled statically. These filter modules define indexing,
73 search and retrieval capabilities of the various input formats.
78 <term>Index Maintenance</term>
80 <para> &zebra; maintains Term Dictionaries and ISAM index
81 entries in inverted index structures kept on disk. These are
optimized for fast insert, update and delete, as well as good
88 <term>Search Evaluation</term>
90 <para>by execution of search requests expressed in &acro.pqf;/&acro.rpn;
91 data structures, which are handed over from
92 the &yaz; server frontend &acro.api;. Search evaluation includes
93 construction of hit lists according to boolean combinations
94 of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluating specific index hit
lists in the correct order.
101 <term>Ranking and Sorting</term>
104 components call resorting/re-ranking algorithms on the hit
105 sets. These might also be pre-sorted not only using the
assigned document IDs, but also using assigned static rank
112 <term>Record Presentation</term>
<para>returns (possibly ranked) result sets, hit
counts, and similar internal data to the &yaz; server backend &acro.api;
for shipping to the client. Each individual filter module
implements its own specific presentation formats.
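<para>
The boolean combination of hit lists mentioned under Search Evaluation
can be illustrated with a small Python sketch. It is not &zebra;'s
actual implementation: it only assumes that each simple search yields a
sorted list of document IDs, and the function names
<literal>intersect</literal> and <literal>union</literal> are
hypothetical.
</para>

```python
def intersect(a, b):
    """AND: document IDs present in both sorted hit lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b any more
        else:
            j += 1  # b[j] cannot occur in a any more
    return out


def union(a, b):
    """OR: document IDs present in either sorted hit list, no duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # At most one of the two lists still has entries left.
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

<para>
Because both inputs are sorted, each combination needs only a single
pass over the two lists, which is one reason why the order in which hit
lists are evaluated matters.
</para>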
124 The Debian package <literal>libidzebra-2.0</literal>
contains all run-time libraries for &zebra;. The
126 documentation in PDF and HTML is found in
127 <literal>idzebra-2.0-doc</literal>, and
128 <literal>idzebra-2.0-common</literal>
129 includes common essential &zebra; configuration files.
134 <section id="componentindexer">
135 <title>&zebra; Indexer</title>
137 The <command>zebraidx</command>
138 indexing maintenance utility
139 loads external filter modules used for indexing data records of
140 different type, and creates, updates and drops databases and
141 indexes according to the rules defined in the filter modules.
144 The Debian package <literal>idzebra-2.0-utils</literal> contains
145 the <command>zebraidx</command> utility.
149 <section id="componentsearcher">
150 <title>&zebra; Searcher/Retriever</title>
152 This is the executable which runs the &acro.z3950;/&acro.sru;/&acro.srw; server and
glues together the core libraries and the filter modules into one
great Information Retrieval server application.
157 The Debian package <literal>idzebra-2.0-utils</literal> contains
158 the <command>zebrasrv</command> utility.
162 <section id="componentyazserver">
163 <title>&yaz; Server Frontend</title>
165 The &yaz; server frontend is
a full-fledged, stateful &acro.z3950; server taking client
167 connections, and forwarding search and scan requests to the
168 &zebra; core indexer.
171 In addition to &acro.z3950; requests, the &yaz; server frontend acts
as an HTTP server, honoring
173 <ulink url="&url.srw;">&acro.sru; &acro.soap;</ulink>
175 <ulink url="&url.sru;">&acro.sru; &acro.rest;</ulink>
176 requests. Moreover, it can
178 <ulink url="&url.cql;">&acro.cql;</ulink>
180 <ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>
182 correctly configured.
185 <ulink url="&url.yaz;">&yaz;</ulink>
187 toolkit that allows you to develop software using the
188 &acro.ansi; &acro.z3950;/ISO23950 standard for information retrieval.
189 It is packaged in the Debian packages
190 <literal>yaz</literal> and <literal>libyaz</literal>.
194 <section id="componentmodules">
195 <title>Record Models and Filter Modules</title>
197 The hard work of knowing <emphasis>what</emphasis> to index,
198 <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
199 part of the records to send in a search/retrieve response is
201 various filter modules. It is their responsibility to define the
202 exact indexing and record display filtering rules.
205 The virtual Debian package
206 <literal>libidzebra-2.0-modules</literal> installs all base filter
210 <section id="componentmodulesdom">
211 <title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
213 The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
214 internal data model, and can thus parse, index, and display
215 any &acro.xml; document.
218 A parser for binary &acro.marc; records based on the ISO2709 library
standard is provided; it transforms these to the internal
220 &acro.marcxml; &acro.dom; representation.
223 The internal &acro.dom; &acro.xml; representation can be fed into four
different pipelines, consisting of arbitrarily many successive
225 &acro.xslt; transformations; these are for
227 <listitem><para>input parsing and initial
228 transformations,</para></listitem>
<listitem><para>indexing term extraction
transformations,</para></listitem>
<listitem><para>transformations before internal document
storage, and</para></listitem>
<listitem><para>retrieval transformations from storage to output
format.</para></listitem>
The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on
your platform, even &acro.exslt;), and thus bring full &acro.xpath;
240 support to the indexing, storage and display rules of not only
241 &acro.xml; documents, but also binary &acro.marc; records.
Finally, the &acro.dom; &acro.xml; filter allows for static ranking at index
time, and for sorting hit lists according to predefined
249 Details on the experimental &acro.dom; &acro.xml; filter are found in
250 <xref linkend="record-model-domxml"/>.
253 The Debian package <literal>libidzebra-2.0-mod-dom</literal>
254 contains the &acro.dom; filter module.
258 <section id="componentmodulesalvis">
259 <title>ALVIS &acro.xml; Record Model and Filter Module</title>
262 The functionality of this record model has been improved and
263 replaced by the &acro.dom; &acro.xml; record model. See
264 <xref linkend="componentmodulesdom"/>.
269 The Alvis filter for &acro.xml; files is an &acro.xslt; based input
It indexes element and attribute content of any conceivable &acro.xml; format
272 using full &acro.xpath; support, a feature which the standard &zebra;
273 &acro.grs1; &acro.sgml; and &acro.xml; filters lacked. The indexed documents are
274 parsed into a standard &acro.xml; &acro.dom; tree, which restricts record size
275 according to availability of memory.
279 uses &acro.xslt; display stylesheets, which let
280 the &zebra; DB administrator associate multiple, different views on
281 the same &acro.xml; document type. These views are chosen on-the-fly in
285 In addition, the Alvis filter configuration is not bound to the
286 arcane &acro.bib1; &acro.z3950; library catalogue indexing traditions and
287 folklore, and is therefore easier to understand.
Finally, the Alvis filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks. This imposes no overhead at all; both
search and indexing still perform in
<emphasis>O(1)</emphasis>, irrespective of document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
299 Details on the experimental Alvis &acro.xslt; filter are found in
300 <xref linkend="record-model-alvisxslt"/>.
303 The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
304 contains the Alvis filter module.
308 <section id="componentmodulesgrs">
309 <title>&acro.grs1; Record Model and Filter Modules</title>
312 The functionality of this record model has been improved and
313 replaced by the &acro.dom; &acro.xml; record model. See
314 <xref linkend="componentmodulesdom"/>.
318 The &acro.grs1; filter modules described in
319 <xref linkend="grs"/>
are all based on the &acro.z3950; specifications, and it is absolutely
mandatory to have the reference pages on &acro.bib1; attribute sets
at hand when configuring &acro.grs1; filters. The GRS filters come in
different flavors, and a short introduction is needed here.
&acro.grs1; filters of various kinds have also been called ABS filters due
325 to the <filename>*.abs</filename> configuration file suffix.
328 The <emphasis>grs.marc</emphasis> and
329 <emphasis>grs.marcxml</emphasis> filters are suited to parse and
330 index binary and &acro.xml; versions of traditional library &acro.marc; records
331 based on the ISO2709 standard. The Debian package for both
333 <literal>libidzebra-2.0-mod-grs-marc</literal>.
336 &acro.grs1; TCL scriptable filters for extensive user configuration come
337 in two flavors: a regular expression filter
338 <emphasis>grs.regx</emphasis> using TCL regular expressions, and
a general scriptable TCL filter called
<emphasis>grs.tcl</emphasis>.
Both are included in the
342 <literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
345 A general purpose &acro.sgml; filter is called
<emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
but is planned for the
348 <literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
352 <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
353 <emphasis>grs.xml</emphasis> filter which uses <ulink
354 url="&url.expat;">Expat</ulink> to
355 parse records in &acro.xml; and turn them into ID&zebra;'s internal &acro.grs1; node
trees. See also the Alvis &acro.xml;/&acro.xslt; filter described in
361 <section id="componentmodulestext">
362 <title>TEXT Record Model and Filter Module</title>
364 Plain ASCII text filter. TODO: add information here.
369 <section id="componentmodulessafari">
370 <title>SAFARI Record Model and Filter Module</title>
372 SAFARI filter module TODO: add information here.
382 <section id="architecture-workflow">
383 <title>Indexing and Retrieval Workflow</title>
386 Records pass through three different states during processing in the
396 When records are accessed by the system, they are represented
397 in their local, or native format. This might be &acro.sgml; or HTML files,
News or Mail archives, or &acro.marc; records. If the system doesn't already
399 know how to read the type of data you need to store, you can set up an
400 input filter by preparing conversion rules based on regular
401 expressions and possibly augmented by a flexible scripting language
403 The input filter produces as output an internal representation,
411 When records are processed by the system, they are represented
in a tree structure, constructed from tagged data elements hanging off a
413 root node. The tagged elements may contain data or yet more tagged
414 elements in a recursive structure. The system performs various
415 actions on this tree structure (indexing, element selection, schema
Before records are transmitted to the client, they are first
424 converted from the internal structure to a form suitable for exchange
425 over the network - according to the &acro.z3950; standard.
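<para>
The tree-structured internal representation described above can be
pictured with a short Python sketch. It only illustrates the idea of
tagged elements that hold either data or further tagged elements; the
<literal>Node</literal> class, the <literal>leaf_paths</literal> helper
and the sample record are hypothetical and do not reflect &zebra;'s
internal data types.
</para>

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """A tagged element holding data, child elements, or both."""
    tag: str
    data: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf_paths(node: Node, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Walk the tree recursively and yield (path, data) for data-bearing nodes."""
    path = prefix + "/" + node.tag
    if node.data is not None:
        yield path, node.data
    for child in node.children:
        yield from leaf_paths(child, path)


# Illustrative record only: a root node with tagged elements hanging off it.
record = Node("record", children=[
    Node("title", data="An example title"),
    Node("author", children=[Node("name", data="A. Author")]),
])

if __name__ == "__main__":
    for path, data in leaf_paths(record):
        print(path, "=", data)
```

<para>
A traversal of this kind is the sort of operation that indexing or
element selection would perform on such a tree.
</para>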
434 <section id="special-retrieval">
435 <title>Retrieval of &zebra; internal record data</title>
Starting with <literal>&zebra;</literal> version 2.0.5, it is
438 possible to use a special element set which has the prefix
439 <literal>zebra::</literal>.
442 Using this element will, regardless of record type, return
443 &zebra;'s internal index structure/data for a record.
444 In particular, the regular record filters are not invoked when
This can in some cases make retrieval faster than regular
447 retrieval operations (for &acro.marc;, &acro.xml; etc).
449 <table id="special-retrieval-types">
450 <title>Special Retrieval Elements</title>
454 <entry>Element Set</entry>
455 <entry>Description</entry>
456 <entry>Syntax</entry>
461 <entry><literal>zebra::meta::sysno</literal></entry>
462 <entry>Get &zebra; record system ID</entry>
463 <entry>&acro.xml; and &acro.sutrs;</entry>
466 <entry><literal>zebra::data</literal></entry>
467 <entry>Get raw record</entry>
471 <entry><literal>zebra::meta</literal></entry>
472 <entry>Get &zebra; record internal metadata</entry>
473 <entry>&acro.xml; and &acro.sutrs;</entry>
476 <entry><literal>zebra::index</literal></entry>
477 <entry>Get all indexed keys for record</entry>
478 <entry>&acro.xml; and &acro.sutrs;</entry>
482 <literal>zebra::index::</literal><replaceable>f</replaceable>
485 Get indexed keys for field <replaceable>f</replaceable> for record
487 <entry>&acro.xml; and &acro.sutrs;</entry>
491 <literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
494 Get indexed keys for field <replaceable>f</replaceable>
495 and type <replaceable>t</replaceable> for record
497 <entry>&acro.xml; and &acro.sutrs;</entry>
503 For example, to fetch the raw binary record data stored in the
&zebra; internal storage, or on the filesystem, the following
505 commands can be issued:
507 Z> f @attr 1=title my
509 Z> elements zebra::data
519 <literal>zebra::data</literal> element set name is
520 defined for any record syntax, but will always fetch
521 the raw record data in exactly the original form. No record syntax
522 specific transformations will be applied to the raw record data.
525 Also, &zebra; internal metadata about the record can be accessed:
527 Z> f @attr 1=title my
529 Z> elements zebra::meta::sysno
displays only the internal record system number, in
<literal>&acro.xml;</literal> record syntax, whereas
535 Z> f @attr 1=title my
537 Z> elements zebra::meta
displays all available metadata on the record. These include the system
number, database name, indexed filename, filter used for indexing,
score and static ranking information, and finally the byte size of the record.
545 Sometimes, it is very hard to figure out what exactly has been
546 indexed how and in which indexes. Using the indexing stylesheet of
547 the Alvis filter, one can at least see which portion of the record
went into which index, but a similar aid does not exist for the
other indexing filters.
553 <literal>zebra::index</literal> element set names are provided to
554 access information on per record indexed fields. For example, the
557 Z> f @attr 1=title my
559 Z> elements zebra::index
562 will display all indexed tokens from all indexed fields of the
563 first record, and it will display in <literal>&acro.sutrs;</literal>
564 record syntax, whereas
566 Z> f @attr 1=title my
568 Z> elements zebra::index::title
570 Z> elements zebra::index::title:p
573 displays in <literal>&acro.xml;</literal> record syntax only the content
574 of the zebra string index <literal>title</literal>, or
575 even only the type <literal>p</literal> phrase indexed part of it.
Trying to access numeric <literal>&acro.bib1;</literal> use
attributes or trying to access non-existent &zebra; internal string
access points will result in a Diagnostic 25: Specified element set
name not valid for specified database.
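<para>
The structure of the special element set names listed in the table
above can be summarized with a small Python sketch. It is purely
illustrative: the function <literal>parse_element_set</literal> and the
dictionary keys it returns are hypothetical, and the sketch does not
check whether a field or type actually exists in a given database.
</para>

```python
def parse_element_set(name):
    """Split a zebra:: element set name into kind, field and index type.

    Returns None when the name does not carry the zebra:: prefix.
    """
    prefix = "zebra::"
    if not name.startswith(prefix):
        return None
    # "index::title:p" -> kind "index", rest "title:p"
    kind, _, rest = name[len(prefix):].partition("::")
    # "title:p" -> field "title", index type "p"
    field, _, index_type = rest.partition(":")
    return {
        "kind": kind,
        "field": field or None,
        "type": index_type or None,
    }


if __name__ == "__main__":
    for name in ("zebra::data", "zebra::meta::sysno",
                 "zebra::index::title", "zebra::index::title:p"):
        print(name, "->", parse_element_set(name))
```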
589 <!-- Keep this comment at the end of the file
594 sgml-minimize-attributes:nil
595 sgml-always-quote-attributes:t
598 sgml-parent-document: "zebra.xml"
599 sgml-local-catalogs: nil
600 sgml-namecase-general:t