1 <chapter id="architecture">
2 <!-- $Id: architecture.xml,v 1.8 2006-04-25 12:26:26 marc Exp $ -->
3 <title>Overview of Zebra Architecture</title>
6 <sect1 id="architecture-representation">
7 <title>Local Representation</title>
10 As mentioned earlier, Zebra places few restrictions on the type of
11 data that you can index and manage. Generally, whatever the form of
12 the data, it is parsed by an input filter specific to that format, and
13 turned into an internal structure that Zebra knows how to handle. This
14 process takes place whenever the record is accessed - for indexing and
19 The RecordType parameter in the <literal>zebra.cfg</literal> file, or
20 the <literal>-t</literal> option to the indexer tells Zebra how to
21 process input records.
22 Two basic types of processing are available - raw text and structured
23 data. Raw text is just that, and it is selected by providing the
24 argument <emphasis>text</emphasis> to Zebra. Structured records are
25 all handled internally using the basic mechanisms described in the
27 Zebra can read structured records in many different formats.
29 How this is done is governed by additional parameters after the
30 "grs" keyword, separated by "." characters.
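<para>
As an illustration (a sketch only - the filter names available depend on
which modules are actually installed), record type selection in
<literal>zebra.cfg</literal> might look like this:
</para>
```
# zebra.cfg fragment: choose the input filter for records.
# Raw, unstructured text:
recordType: text
# Or a structured GRS filter; additional parameters follow the
# "grs" keyword, separated by "." characters, e.g.
# recordType: grs.sgml
# recordType: grs.xml
```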
35 <sect1 id="architecture-maincomponents">
36 <title>Main Components</title>
38 The Zebra system is designed to support a wide range of data management
39 applications. The system can be configured to handle virtually any
40 kind of structured data. Each record in the system is associated with
41 a <emphasis>record schema</emphasis> which lends context to the data
42 elements of the record.
43 Any number of record schemas can coexist in the system.
44 Although it may be wise to use only a single schema within
one database, the system imposes no such restriction.
48 The Zebra indexer and information retrieval server consists of the
49 following main applications: the <command>zebraidx</command>
50 indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server. Both share some of the
same main components, which are presented here.
55 The virtual Debian package <literal>idzebra1.4</literal>
56 installs all the necessary packages to start
57 working with Zebra - including utility programs, development libraries,
58 documentation and modules.
61 <sect2 id="componentcore">
62 <title>Core Zebra Libraries Containing Common Functionality</title>
64 The core Zebra module is the meat of the <command>zebraidx</command>
65 indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server binaries. In short, the core
libraries are responsible for
70 <term>Dynamic Loading</term>
72 <para>of external filter modules, in case the application is
73 not compiled statically. These filter modules define indexing,
74 search and retrieval capabilities of the various input formats.
79 <term>Index Maintenance</term>
81 <para> Zebra maintains Term Dictionaries and ISAM index
82 entries in inverted index structures kept on disk. These are
optimized for fast insert, update and delete, as well as good
89 <term>Search Evaluation</term>
91 <para>by execution of search requests expressed in PQF/RPN
92 data structures, which are handed over from
93 the YAZ server frontend API. Search evaluation includes
94 construction of hit lists according to boolean combinations
95 of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluating specific index hit
lists in the correct order.
102 <term>Ranking and Sorting</term>
105 components call resorting/re-ranking algorithms on the hit
sets. These might also be pre-sorted not only using the
assigned document IDs, but also using the assigned static rank
113 <term>Record Presentation</term>
<para>returns - possibly ranked - result sets, hit
counts, and similar internal data to the YAZ server backend API
for shipping to the client. Each individual filter module
implements its own specific presentation formats.
125 The Debian package <literal>libidzebra1.4</literal>
126 contains all run-time libraries for Zebra, the
127 documentation in PDF and HTML is found in
128 <literal>idzebra1.4-doc</literal>, and
129 <literal>idzebra1.4-common</literal>
130 includes common essential Zebra configuration files.
135 <sect2 id="componentindexer">
136 <title>Zebra Indexer</title>
138 The <command>zebraidx</command>
139 indexing maintenance utility
140 loads external filter modules used for indexing data records of
141 different type, and creates, updates and drops databases and
142 indexes according to the rules defined in the filter modules.
145 The Debian package <literal>idzebra1.4-utils</literal> contains
146 the <command>zebraidx</command> utility.
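<para>
For example (a sketch only; the directory path is hypothetical), a
typical indexing session with <command>zebraidx</command> could look
like this:
</para>
```
# index the XML files in records/ using the grs.xml filter,
# overriding any recordType setting in zebra.cfg via -t
zebraidx -t grs.xml update records/
# make the updated index visible to the running server
zebraidx commit
```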
150 <sect2 id="componentsearcher">
151 <title>Zebra Searcher/Retriever</title>
153 This is the executable which runs the Z39.50/SRU/SRW server and
glues together the core libraries and the filter modules into one
complete Information Retrieval server application.
158 The Debian package <literal>idzebra1.4-utils</literal> contains
159 the <command>zebrasrv</command> utility.
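<para>
A minimal sketch of starting the server and connecting to it (the port
number and configuration file path are assumptions, not defaults to
rely on):
</para>
```
# start the server, reading zebra.cfg, listening on port 2100
zebrasrv -c zebra.cfg @:2100
# in another terminal, connect with the YAZ command line client
yaz-client localhost:2100
```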
163 <sect2 id="componentyazserver">
164 <title>YAZ Server Frontend</title>
166 The YAZ server frontend is
a full-fledged stateful Z39.50 server taking client
168 connections, and forwarding search and scan requests to the
172 In addition to Z39.50 requests, the YAZ server frontend acts
as an HTTP server, honoring
174 <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>
176 <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>
177 REST requests. Moreover, it can
179 <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>
181 <ulink url="http://indexdata.com/yaz/doc/tools.tkl#PQF">PQF</ulink>
183 correctly configured.
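<para>
For instance, CQL-to-PQF translation is enabled by pointing Zebra at a
mapping file. The directive and file fragment below are a sketch; the
file name and the chosen attribute values are assumptions for
illustration:
</para>
```
# in zebra.cfg: location of the CQL-to-PQF mapping file
cql2rpn: cql2pqf.txt

# cql2pqf.txt fragment: map CQL indexes to BIB-1 use attributes
index.cql.serverChoice = 1=1016
index.dc.title         = 1=4
```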
186 <ulink url="http://www.indexdata.com/yaz">YAZ</ulink>
188 toolkit that allows you to develop software using the
189 ANSI Z39.50/ISO23950 standard for information retrieval.
190 It is packaged in the Debian packages
191 <literal>yaz</literal> and <literal>libyaz</literal>.
195 <sect2 id="componentmodules">
196 <title>Record Models and Filter Modules</title>
198 The hard work of knowing <emphasis>what</emphasis> to index,
199 <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
200 part of the records to send in a search/retrieve response is
202 various filter modules. It is their responsibility to define the
203 exact indexing and record display filtering rules.
206 The virtual Debian package
207 <literal>libidzebra1.4-modules</literal> installs all base filter
211 <sect3 id="componentmodulestext">
212 <title>TEXT Record Model and Filter Module</title>
214 Plain ASCII text filter. TODO: add information here.
<literal>text module missing as deb file</literal>
221 <sect3 id="componentmodulesgrs">
222 <title>GRS Record Model and Filter Modules</title>
224 The GRS filter modules described in
225 <xref linkend="record-model-grs"/>
226 are all based on the Z39.50 specifications, and it is absolutely
mandatory to have the reference pages on BIB-1 attribute sets at
hand when configuring GRS filters. The GRS filters come in
229 different flavors, and a short introduction is needed here.
GRS filters of various kinds have also been called ABS filters due
231 to the <filename>*.abs</filename> configuration file suffix.
234 The <emphasis>grs.danbib</emphasis> filter is developed for
236 DanBib is the Danish Union Catalogue hosted by DBC
237 (Danish Bibliographic Center). This filter is found in the
239 <literal>libidzebra1.4-mod-grs-danbib</literal>.
242 The <emphasis>grs.marc</emphasis> and
243 <emphasis>grs.marcxml</emphasis> filters are suited to parse and
244 index binary and XML versions of traditional library MARC records
245 based on the ISO2709 standard. The Debian package for both
247 <literal>libidzebra1.4-mod-grs-marc</literal>.
GRS TCL scriptable filters for extensive user configuration come
in two flavors: a regular expression filter
<emphasis>grs.regx</emphasis> using TCL regular expressions, and
a general scriptable TCL filter called
<emphasis>grs.tcl</emphasis>. Both are included in the
<literal>libidzebra1.4-mod-grs-regx</literal> Debian package.
259 A general purpose SGML filter is called
260 <emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
but is planned to be included in the
262 <literal>libidzebra1.4-mod-grs-sgml</literal> Debian package.
266 <literal>libidzebra1.4-mod-grs-xml</literal> includes the
267 <emphasis>grs.xml</emphasis> filter which uses <ulink
268 url="http://expat.sourceforge.net/">Expat</ulink> to
269 parse records in XML and turn them into IDZebra's internal GRS node
trees. Also have a look at the Alvis XML/XSLT filter described in
275 <sect3 id="componentmodulesalvis">
276 <title>ALVIS Record Model and Filter Module</title>
The Alvis filter for XML files is an XSLT-based input
It indexes element and attribute content of any conceivable XML format
using full XPath support, a feature which the standard Zebra
GRS SGML and XML filters lacked. The indexed documents are
parsed into a standard XML DOM tree, which restricts the record size
according to the available memory.
288 uses XSLT display stylesheets, which let
289 the Zebra DB administrator associate multiple, different views on
290 the same XML document type. These views are chosen on-the-fly in
294 In addition, the Alvis filter configuration is not bound to the
295 arcane BIB-1 Z39.50 library catalogue indexing traditions and
296 folklore, and is therefore easier to understand.
Finally, the Alvis filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks. This imposes no overhead at all: both
search and indexing still perform in
<emphasis>O(1)</emphasis> time, irrespective of the document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
308 Details on the experimental Alvis XSLT filter are found in
309 <xref linkend="record-model-alvisxslt"/>.
312 The Debian package <literal>libidzebra1.4-mod-alvis</literal>
313 contains the Alvis filter module.
317 <sect3 id="componentmodulessafari">
318 <title>SAFARI Record Model and Filter Module</title>
320 SAFARI filter module TODO: add information here.
<literal>safari module missing as deb file</literal>
332 <sect1 id="architecture-workflow">
333 <title>Indexing and Retrieval Workflow</title>
336 Records pass through three different states during processing in the
346 When records are accessed by the system, they are represented
347 in their local, or native format. This might be SGML or HTML files,
News or Mail archives, or MARC records. If the system doesn't already
349 know how to read the type of data you need to store, you can set up an
350 input filter by preparing conversion rules based on regular
351 expressions and possibly augmented by a flexible scripting language
353 The input filter produces as output an internal representation,
361 When records are processed by the system, they are represented
in a tree structure constructed from tagged data elements hanging off a
363 root node. The tagged elements may contain data or yet more tagged
364 elements in a recursive structure. The system performs various
365 actions on this tree structure (indexing, element selection, schema
373 Before transmitting records to the client, they are first
374 converted from the internal structure to a form suitable for exchange
375 over the network - according to the Z39.50 standard.
386 <sect1 id="architecture-querylanguage">
387 <title>Query Languages</title>
391 http://www.loc.gov/z3950/agency/document.html
393 PQF and BIB-1 stuff to be explained
394 <ulink url="http://www.loc.gov/z3950/agency/defns/bib1.html">
395 http://www.loc.gov/z3950/agency/defns/bib1.html</ulink>
397 <ulink url="http://www.loc.gov/z3950/agency/bib1.html">
398 http://www.loc.gov/z3950/agency/bib1.html</ulink>
400 http://www.loc.gov/z3950/agency/markup/13.html
These attribute types are recognized regardless of the attribute set.
Some are recognized for search, others for scan.
The embedded sort is a way to specify a sort within a query - thus
removing the need to send a separate Sort Request. It is both faster
and does not require the client to support the Sort Facility.
The value of attribute type 7 is 1=ascending or 2=descending. The
attributes+term (APT) node is separate from the rest of the query and
must be @or'ed in. The term associated with the APT is the sort level:
0=primary sort, 1=secondary sort, etc. Example:
423 Search for water, sort by title (ascending):
425 @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
427 Search for water, sort by title ascending, then date descending:
429 @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
The Term Set feature is a facility that allows a search to store the
hitting terms in a "pseudo" result set; it combines an ordinary search
with a scan-like facility. It requires a client that can handle named
result sets, since the search generates two result sets. The value of
attribute 8 is the name of a result set (a string). The terms in the
term set are returned as SUTRS records.
Search for u in the title, right truncated. Store the result in a
result set named uset.
437 @attr 5=1 @attr 1=4 @attr 8=uset u
The model has one serious flaw: we don't know the size of the term set.
Rank weight is a way to pass a value to a ranking algorithm, so that
one APT has one value while another has a different one.
Search for utah in the title with weight 30, as well as in any index
with weight 20.
447 @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
Newer Zebra versions normally estimate the hit count for every APT
(leaf) of the query tree. These hit counts are returned as part of the
searchResult-1 facility.
By setting a limit for an APT, we can make Zebra switch to an
approximate hit count once a certain hit count limit is reached. A
value of zero means an exact hit count.
We are interested in an exact hit count for a, but for b we allow
estimates once 1000 hits are reached.
457 @and a @attr 9=1000 b
459 This facility clashes with rank weight! Fortunately this is a Zebra 1.4 thing so we can change this without upsetting anybody!
463 Zebra supports the searchResult-1 facility.
If attribute 10 is given, it specifies a subqueryId value returned as
part of the search result. It is a way for a client to name an APT
part of a query.
Attribute type 8: Result set narrow (Zebra 1.3).
If attribute 8 is given for scan, the value is the name of a result
set. Each hit count in the scan is @and'ed with the given result set.
The approximate limit (as for search) is a way to enable approximate
hit counts for scan. However, it does NOT appear to work at the moment.
482 AdamDickmeiss - 19 Dec 2005
490 <!-- Keep this comment at the end of the file
495 sgml-minimize-attributes:nil
496 sgml-always-quote-attributes:t
499 sgml-parent-document: "zebra.xml"
500 sgml-local-catalogs: nil
501 sgml-namecase-general:t