<chapter id="architecture">
<!-- $Id: architecture.xml,v 1.12 2006-09-03 21:37:26 adam Exp $ -->
<title>Overview of Zebra Architecture</title>

<section id="architecture-representation">
<title>Local Representation</title>
<para>
As mentioned earlier, Zebra places few restrictions on the type of
data that you can index and manage. Generally, whatever the form of
the data, it is parsed by an input filter specific to that format and
turned into an internal structure that Zebra knows how to handle. This
process takes place whenever the record is accessed - for indexing
and retrieval.
</para>
<para>
The RecordType parameter in the <literal>zebra.cfg</literal> file, or
the <literal>-t</literal> option to the indexer, tells Zebra how to
process input records. Two basic types of processing are available -
raw text and structured data. Raw text is just that, and it is
selected by providing the argument <emphasis>text</emphasis> to Zebra.
Structured records are all handled internally using the basic
mechanisms described in the following sections.
</para>
<para>
Zebra can read structured records in many different formats. How this
is done is governed by additional parameters after the "grs" keyword,
separated by "." characters.
</para>
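<para>
As an illustrative sketch (the profile name and record directory below
are assumptions, not part of any shipped configuration), selecting the
SGML variant of the GRS filter could be done either in
<literal>zebra.cfg</literal> or on the indexer command line:
</para>
<screen>
# zebra.cfg: parse input records with the grs filter, sgml variant
recordType: grs.sgml

# equivalently, on the command line while indexing:
zebraidx -t grs.sgml update records/
</screen>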
<section id="architecture-maincomponents">
<title>Main Components</title>
<para>
The Zebra system is designed to support a wide range of data management
applications. The system can be configured to handle virtually any
kind of structured data. Each record in the system is associated with
a <emphasis>record schema</emphasis> which lends context to the data
elements of the record.
Any number of record schemas can coexist in the system.
Although it may be wise to use only a single schema within
one database, the system poses no such restrictions.
</para>
<para>
The Zebra indexer and information retrieval server consist of the
following main applications: the <command>zebraidx</command>
indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server. Both use several of the
same core components, which are presented here.
</para>
<para>
The virtual Debian package <literal>idzebra-2.0</literal>
installs all the necessary packages to start
working with Zebra, including utility programs, development libraries,
documentation and modules.
</para>
<section id="componentcore">
<title>Core Zebra Libraries Containing Common Functionality</title>
<para>
The core Zebra module is the heart of the <command>zebraidx</command>
indexing maintenance utility and the <command>zebrasrv</command>
information query and retrieval server binaries. In brief, the core
libraries are responsible for:
</para>
<variablelist>
<varlistentry>
<term>Dynamic Loading</term>
<listitem>
<para>of external filter modules, in case the application is
not compiled statically. These filter modules define the indexing,
search and retrieval capabilities of the various input formats.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Index Maintenance</term>
<listitem>
<para>Zebra maintains Term Dictionaries and ISAM index
entries in inverted index structures kept on disk. These are
optimized for fast insert, update and delete, as well as good
search performance.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Search Evaluation</term>
<listitem>
<para>by execution of search requests expressed in PQF/RPN
data structures, which are handed over from
the YAZ server frontend API. Search evaluation includes
construction of hit lists according to boolean combinations
of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluating specific index hit
lists in the correct order.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Ranking and Sorting</term>
<listitem>
<para>components call resorting/re-ranking algorithms on the hit
sets. These might also be pre-sorted not only using the
assigned document IDs, but also using assigned static rank
information.
</para>
</listitem>
</varlistentry>
<varlistentry>
<term>Record Presentation</term>
<listitem>
<para>returns - possibly ranked - result sets, hit
counts, and similar internal data to the YAZ server backend API
for shipping to the client. Each individual filter module
implements its own specific presentation formats.
</para>
</listitem>
</varlistentry>
</variablelist>
<para>
The Debian package <literal>libidzebra-2.0</literal>
contains all run-time libraries for Zebra, the
documentation in PDF and HTML is found in
<literal>idzebra-2.0-doc</literal>, and
<literal>idzebra-2.0-common</literal>
includes common essential Zebra configuration files.
</para>
<section id="componentindexer">
<title>Zebra Indexer</title>
<para>
The <command>zebraidx</command> indexing maintenance utility
loads external filter modules used for indexing data records of
different types, and creates, updates and drops databases and
indexes according to the rules defined in the filter modules.
</para>
<para>
The Debian package <literal>idzebra-2.0-utils</literal> contains
the <command>zebraidx</command> utility.
</para>
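<para>
A typical maintenance cycle might look as follows; the configuration
file name and the record directory are illustrative:
</para>
<screen>
zebraidx -c zebra.cfg update records/   # index new and changed records
zebraidx -c zebra.cfg commit            # commit changes (when shadow registers are enabled)
</screen>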
<section id="componentsearcher">
<title>Zebra Searcher/Retriever</title>
<para>
This is the executable which runs the Z39.50/SRU/SRW server and
glues the core libraries and the filter modules together into one
Information Retrieval server application.
</para>
<para>
The Debian package <literal>idzebra-2.0-utils</literal> contains
the <command>zebrasrv</command> utility.
</para>
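<para>
As a sketch, a server instance can be started against the same
configuration that was used for indexing; the port number is
illustrative:
</para>
<screen>
zebrasrv -c zebra.cfg @:2100
</screen>
<para>
Z39.50 clients can then connect on port 2100; the same listening port
also answers SRU/SRW requests over HTTP.
</para>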
<section id="componentyazserver">
<title>YAZ Server Frontend</title>
<para>
The YAZ server frontend is
a full fledged stateful Z39.50 server taking client
connections and forwarding search and scan requests to the
Zebra core indexer.
</para>
<para>
In addition to Z39.50 requests, the YAZ server frontend acts
as an HTTP server, honoring
<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> SOAP
requests and
<ulink url="&url.sru;">SRU</ulink>
REST requests. Moreover, it can translate incoming
<ulink url="&url.cql;">CQL</ulink>
queries to
<ulink url="http://indexdata.com/yaz/doc/tools.tkl#PQF">PQF</ulink>
queries, if correctly configured.
</para>
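<para>
As an illustrative sketch (host, port and database name are
assumptions), a CQL search could be expressed as the following SRU GET
request:
</para>
<screen>
http://localhost:2100/mydatabase?version=1.1&amp;operation=searchRetrieve&amp;query=zebra&amp;maximumRecords=5
</screen>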
<para>
<ulink url="http://www.indexdata.com/yaz">YAZ</ulink>
is an Open Source
toolkit that allows you to develop software using the
ANSI Z39.50/ISO23950 standard for information retrieval.
It is packaged in the Debian packages
<literal>yaz</literal> and <literal>libyaz</literal>.
</para>
<section id="componentmodules">
<title>Record Models and Filter Modules</title>
<para>
The hard work of knowing <emphasis>what</emphasis> to index,
<emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
part of the records to send in a search/retrieve response is
implemented in the various filter modules. It is their responsibility
to define the exact indexing and record display filtering rules.
</para>
<para>
The virtual Debian package
<literal>libidzebra-2.0-modules</literal> installs all base filter
modules.
</para>
<section id="componentmodulestext">
<title>TEXT Record Model and Filter Module</title>
<para>
Plain ASCII text filter. TODO: add information here.
</para>
<section id="componentmodulesgrs">
<title>GRS Record Model and Filter Modules</title>
<para>
The GRS filter modules described in
<xref linkend="grs"/>
are all based on the Z39.50 specifications, and it is absolutely
mandatory to have the reference pages on BIB-1 attribute sets at
hand when configuring GRS filters. The GRS filters come in
different flavors, and a short introduction is needed here.
GRS filters of various kinds have also been called ABS filters due
to the <filename>*.abs</filename> configuration file suffix.
</para>
<para>
The <emphasis>grs.marc</emphasis> and
<emphasis>grs.marcxml</emphasis> filters are suited to parse and
index binary and XML versions of traditional library MARC records
based on the ISO2709 standard. The Debian package for both filters is
<literal>libidzebra-2.0-mod-grs-marc</literal>.
</para>
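<para>
A minimal configuration sketch: the <emphasis>usmarc</emphasis>
profile name below is an assumption and must correspond to an
installed <filename>*.abs</filename> profile:
</para>
<screen>
# binary ISO2709 MARC records, interpreted via the usmarc profile
recordType: grs.marc.usmarc

# MARCXML records via the same profile
# recordType: grs.marcxml.usmarc
</screen>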
<para>
GRS TCL scriptable filters for extensive user configuration come
in two flavors: a regular expression filter
<emphasis>grs.regx</emphasis> using TCL regular expressions, and
a general scriptable TCL filter called
<emphasis>grs.tcl</emphasis>. Both are included in the
<literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
</para>
<para>
A general purpose SGML filter is called
<emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
but is planned to be included in the
<literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
</para>
<para>
The Debian package
<literal>libidzebra-2.0-mod-grs-xml</literal> includes the
<emphasis>grs.xml</emphasis> filter which uses <ulink
url="http://expat.sourceforge.net/">Expat</ulink> to
parse records in XML and turn them into IDZebra's internal GRS node
trees. Also have a look at the Alvis XML/XSLT filter described in
<xref linkend="record-model-alvisxslt"/>.
</para>
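<para>
As a small illustration, with <literal>recordType: grs.xml</literal>
in <literal>zebra.cfg</literal>, a record file such as the following
(contents purely invented) is parsed by Expat into a GRS node tree
whose elements and attributes can then be indexed:
</para>
<screen>
&lt;book id="42"&gt;
  &lt;title&gt;Zebra for Librarians&lt;/title&gt;
  &lt;author&gt;Jane Doe&lt;/author&gt;
&lt;/book&gt;
</screen>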
<section id="componentmodulesalvis">
<title>ALVIS Record Model and Filter Module</title>
<para>
The Alvis filter for XML files is an XSLT based input filter.
It indexes element and attribute content of any thinkable XML format
using full XPATH support, a feature which the standard Zebra
GRS SGML and XML filters lacked. The indexed documents are
parsed into a standard XML DOM tree, which restricts record size
according to the availability of memory.
</para>
<para>
The Alvis filter uses XSLT display stylesheets, which let
the Zebra DB administrator associate multiple, different views on
the same XML document type. These views are chosen on-the-fly at
search time.
</para>
<para>
In addition, the Alvis filter configuration is not bound to the
arcane BIB-1 Z39.50 library catalogue indexing traditions and
folklore, and is therefore easier to understand.
</para>
<para>
Finally, the Alvis filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks. This imposes no overhead at all: both
search and indexing still perform in
<emphasis>O(1)</emphasis> irrespective of document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
</para>
<para>
Details on the experimental Alvis XSLT filter are found in
<xref linkend="record-model-alvisxslt"/>.
</para>
<para>
The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
contains the Alvis filter module.
</para>
<section id="componentmodulessafari">
<title>SAFARI Record Model and Filter Module</title>
<para>
SAFARI filter module. TODO: add information here.
</para>
<section id="architecture-workflow">
<title>Indexing and Retrieval Workflow</title>
<para>
Records pass through three different states during processing in the
system.
</para>
<para>
When records are accessed by the system, they are represented
in their local, or native format. This might be SGML or HTML files,
News or Mail archives, or MARC records. If the system doesn't already
know how to read the type of data you need to store, you can set up an
input filter by preparing conversion rules based on regular
expressions and possibly augmented by a flexible scripting language
(Tcl). The input filter produces as output an internal representation,
a tree structure.
</para>
<para>
When records are processed by the system, they are represented
in a tree-structure, constructed by tagged data elements hanging off a
root node. The tagged elements may contain data or yet more tagged
elements in a recursive structure. The system performs various
actions on this tree structure (indexing, element selection, schema
mapping, and so forth).
</para>
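<para>
As a conceptual sketch (the record content is invented), an XML record
such as
<literal>&lt;book&gt;&lt;title&gt;Zebra&lt;/title&gt;&lt;author&gt;Doe&lt;/author&gt;&lt;/book&gt;</literal>
would be represented internally along these lines:
</para>
<screen>
root
  book
    title  -> "Zebra"
    author -> "Doe"
</screen>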
<para>
Before transmitting records to the client, they are first
converted from the internal structure to a form suitable for exchange
over the network - according to the Z39.50 standard.
</para>