1 <chapter id="architecture">
2 <!-- $Id: architecture.xml,v 1.2 2006-01-19 09:27:00 marc Exp $ -->
3 <title>Overview of Zebra Architecture</title>
6 <sect1 id="local-representation">
7 <title>Local Representation</title>
10 As mentioned earlier, Zebra places few restrictions on the type of
11 data that you can index and manage. Generally, whatever the form of
12 the data, it is parsed by an input filter specific to that format, and
13 turned into an internal structure that Zebra knows how to handle. This
14 process takes place whenever the record is accessed - for indexing and
The RecordType parameter in the <literal>zebra.cfg</literal> file, or
the <literal>-t</literal> option to the indexer, tells Zebra how to
process input records.
22 Two basic types of processing are available - raw text and structured
23 data. Raw text is just that, and it is selected by providing the
24 argument <emphasis>text</emphasis> to Zebra. Structured records are
25 all handled internally using the basic mechanisms described in the
27 Zebra can read structured records in many different formats.
29 How this is done is governed by additional parameters after the
30 "grs" keyword, separated by "." characters.
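The dotted notation can be illustrated with a minimal Python sketch. It only shows how a record type string such as "grs.xml" breaks down into the leading keyword and its additional parameters. The function name is hypothetical, and this is not how Zebra itself parses the setting.

```python
def parse_record_type(record_type):
    """Split a record type string into (keyword, extra parameters).

    Hypothetical helper for illustration only; Zebra's own parsing
    of the RecordType setting is not shown here.
    """
    keyword, *parameters = record_type.split(".")
    return keyword, parameters


# "text" has no extra parameters; "grs.xml" has one.
text_type = parse_record_type("text")
grs_type = parse_record_type("grs.xml")
```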
36 <title>Indexing and Retrieval Workflow</title>
39 Records pass through three different states during processing in the
49 When records are accessed by the system, they are represented
in their local, or native format. This might be SGML or HTML files,
News or Mail archives, or MARC records. If the system doesn't already
52 know how to read the type of data you need to store, you can set up an
53 input filter by preparing conversion rules based on regular
54 expressions and possibly augmented by a flexible scripting language
56 The input filter produces as output an internal representation,
64 When records are processed by the system, they are represented
in a tree structure, built from tagged data elements hanging off a
66 root node. The tagged elements may contain data or yet more tagged
67 elements in a recursive structure. The system performs various
68 actions on this tree structure (indexing, element selection, schema
Before records are transmitted to the client, they are first
converted from the internal structure to a form suitable for exchange
over the network, according to the Z39.50 standard.
88 <sect1 id="maincomponents">
89 <title>Main Components</title>
91 The Zebra system is designed to support a wide range of data management
92 applications. The system can be configured to handle virtually any
93 kind of structured data. Each record in the system is associated with
94 a <emphasis>record schema</emphasis> which lends context to the data
95 elements of the record.
96 Any number of record schemas can coexist in the system.
97 Although it may be wise to use only a single schema within
98 one database, the system poses no such restrictions.
The Zebra indexer and information retrieval server consists of the
following main applications: the <literal>zebraidx</literal>
indexing maintenance utility, and the <literal>zebrasrv</literal>
information query and retrieval server. Both use some of the
same main components, which are presented here.
108 This virtual package installs all the necessary packages to start
109 working with Zebra - including utility programs, development libraries,
110 documentation and modules.
111 <literal>idzebra1.4</literal>
114 <sect2 id="componentcore">
115 <title>Core Zebra Module Containing Common Functionality</title>
- loads external filter modules used for presenting
the records in a search response
- executes search requests in PQF/RPN, which are handed over from
the YAZ server frontend API
- calls resorting/reranking algorithms on the hit sets
- returns result sets (possibly ranked), hit
counts, and similar internal data to the YAZ server backend API
126 This package contains all run-time libraries for Zebra.
127 <literal>libidzebra1.4</literal>
128 This package includes documentation for Zebra in PDF and HTML.
129 <literal>idzebra1.4-doc</literal>
This package includes common essential Zebra configuration files.
131 <literal>idzebra1.4-common</literal>
136 <sect2 id="componentindexer">
137 <title>Zebra Indexer</title>
139 the core Zebra indexer which
140 - loads external filter modules used for indexing data records of
142 - creates, updates and drops databases and indexes
145 This package contains Zebra utilities such as the zebraidx indexer
146 utility and the zebrasrv server.
147 <literal>idzebra1.4-utils</literal>
151 <sect2 id="componentsearcher">
152 <title>Zebra Searcher/Retriever</title>
154 the core Zebra searcher/retriever which
157 This package contains Zebra utilities such as the zebraidx indexer
158 utility and the zebrasrv server, and their associated man pages.
159 <literal>idzebra1.4-utils</literal>
163 <sect2 id="componentyazserver">
164 <title>YAZ Server Frontend</title>
The YAZ server frontend is
a full-fledged stateful Z39.50 server taking client
168 connections, and forwarding search and scan requests to the
In addition to Z39.50 requests, the YAZ server frontend acts
as an HTTP server, honouring
<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> SOAP requests, and <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink> REST requests. Moreover, it can
translate incoming <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> queries to PQF/RPN queries, if
correctly configured.
179 YAZ is a toolkit that allows you to develop software using the
180 ANSI Z39.50/ISO23950 standard for information retrieval.
181 <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/ <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>
182 <literal>libyazthread.so</literal>
183 <literal>libyaz.so</literal>
184 <literal>libyaz</literal>
188 <sect2 id="componentmodules">
189 <title>Record Models and Filter Modules</title>
191 all filter modules which do indexing and record display filtering:
192 This virtual package contains all base IDZebra filter modules. EMPTY ???
193 <literal>libidzebra1.4-modules</literal>
196 <sect3 id="componentmodulestext">
197 <title>TEXT Record Model and Filter Module</title>
199 Plain ASCII text filter
<literal>text module missing as deb file</literal>
206 <sect3 id="componentmodulesgrs">
207 <title>GRS Record Model and Filter Modules</title>
209 <xref linkend="grs-record-model"/>
- grs.danbib GRS filters of various kinds (*.abs files)
212 IDZebra filter grs.danbib (DBC DanBib records)
213 This package includes grs.danbib filter which parses DanBib records.
214 DanBib is the Danish Union Catalogue hosted by DBC
215 (Danish Bibliographic Centre).
216 <literal>libidzebra1.4-mod-grs-danbib</literal>
This package includes the grs.marc and grs.marcxml filters that allow
IDZebra to read MARC records based on ISO2709.
224 <literal>libidzebra1.4-mod-grs-marc</literal>
227 - grs.tcl GRS TCL scriptable filter
228 This package includes the grs.regx and grs.tcl filters.
229 <literal>libidzebra1.4-mod-grs-regx</literal>
233 <literal>libidzebra1.4-mod-grs-sgml not packaged yet ??</literal>
236 This package includes the grs.xml filter which uses <ulink url="http://expat.sourceforge.net/">Expat</ulink> to
237 parse records in XML and turn them into IDZebra's internal grs node.
238 <literal>libidzebra1.4-mod-grs-xml</literal>
242 <sect3 id="componentmodulesalvis">
243 <title>ALVIS Record Model and Filter Module</title>
245 - alvis Experimental Alvis XSLT filter
246 <literal>mod-alvis.so</literal>
247 <literal>libidzebra1.4-mod-alvis</literal>
251 <sect3 id="componentmodulessafari">
252 <title>SAFARI Record Model and Filter Module</title>
<literal>safari module missing as deb file</literal>
264 <sect2 id="componentconfig">
265 <title>Configuration Files</title>
- yazserver XML-based config file
- core Zebra ASCII-based config files
- filter module config files in many flavours
- <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> to PQF ASCII-based config file
279 <sect1 id="cqltopqf">
280 <title>Server Side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> To PQF Conversion</title>
282 The cql2pqf.txt yaz-client config file, which is also used in the
yaz-server <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF process, is used to drive
284 org.z3950.zing.cql.<ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>Node's toPQF() back-end and the YAZ <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF
285 converter. This specifies the interpretation of various <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>
286 indexes, relations, etc. in terms of Type-1 query attributes.
288 This configuration file generates queries using BIB-1 attributes.
289 See http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html
290 for the Maintenance Agency's work-in-progress mapping of Dublin Core
291 indexes to Attribute Architecture (util, XD and BIB-2)
294 a) <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> set prefixes are specified using the correct <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>/ <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/U
295 prefixes for the required index sets, or user-invented prefixes for
special index sets. An index set in <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> is, roughly speaking, equivalent to a
297 namespace specifier in XML.
b) The default index set to be used if none is explicitly mentioned
301 c) Index mapping definitions of the form
303 index.cql.all = 1=text
which means that the index "all" from the set "cql" is mapped to the
bib-1 RPN query "@attr 1=text" (where "text" is some existing index
in Zebra, see indexing stylesheet)
d) Relation mapping from <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> relations to bib-1 RPN "@attr 2= " attributes
311 e) Relation modifier mapping from <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> relations to bib-1 RPN "@attr
314 f) Position attributes
g) Structure attributes
h) Truncation attributes
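The index mapping form in item c) can be sketched in Python. This is a minimal illustration of how a line such as "index.cql.all = 1=text" relates to the PQF prefix "@attr 1=text". The function name is hypothetical, the handling of several whitespace-separated attributes on one line is an assumption, and the real conversion is done by YAZ, not by this code.

```python
def index_mapping_to_pqf(line):
    """Translate one 'index.<set>.<name> = <attrs>' line.

    Returns (index_set, index_name, pqf_prefix). Hypothetical helper
    for illustration; it does not reproduce the YAZ converter.
    """
    key, _, value = line.partition("=")
    prefix, index_set, index_name = key.strip().split(".", 2)
    if prefix != "index":
        raise ValueError("not an index mapping line: " + line)
    # Assumption: attributes are separated by whitespace, and each
    # one becomes its own "@attr type=value" element.
    pqf_prefix = " ".join("@attr " + attr for attr in value.split())
    return index_set, index_name, pqf_prefix


mapping = index_mapping_to_pqf("index.cql.all = 1=text")
```

Appending the search term to the returned prefix gives a PQF query of the kind shown in item c).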
321 http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map for config
330 <title>Static and Dynamic Ranking</title>
Internally, Zebra uses inverted indexes to look up term occurrences
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations AND, OR and/or NOT (which
is in fact a binary AND NOT operation). To ensure fast query
execution, all indexes have to be sorted in the same order.
338 The indexes are normally sorted according to document ID in
339 ascending order, and any query which does not invoke a special
340 re-ranking function will therefore retrieve the result set in document ID
347 directive in the main core Zebra config file, the internal document
keys used for ordering are augmented by a preceding integer, which
349 contains the static rank of a given document, and the index lists
351 - first by ascending static rank
352 - then by ascending document ID.
354 This implies that the default rank "0" is the best rank at the
355 beginning of the list, and "max int" is the worst static rank.
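The ordering described above can be sketched in Python as a sort on a two-part key. The class name and the sample values are made up for illustration, and Zebra's real key layout is not shown here.

```python
from typing import NamedTuple


class IndexKey(NamedTuple):
    """Hypothetical index key: static rank first, then document ID."""
    static_rank: int
    doc_id: int


# Sample keys in arbitrary order (values invented for the example).
keys = [IndexKey(3, 10), IndexKey(0, 42), IndexKey(3, 7), IndexKey(0, 5)]

# Tuples compare element by element, so a plain sort orders by
# ascending static rank and then by ascending document ID.
ordered = sorted(keys)
```

Documents with static rank 0 come first, and documents sharing a rank are listed by ascending document ID.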
The "alvis" and the experimental "xslt" filters provide a
directive to fetch static rank information out of the indexed XML
records, thus making _all_ hit sets ordered by ascending static
rank, and, for those documents which have the same static rank, ordered
by ascending document ID.
If one wants to adjust the static rank order,
one has to invoke additional re-ranking/re-ordering using dynamic
reranking or score functions. These functions return positive
integer scores, where the _highest_ score is best, which means that the
hit sets will be sorted according to _descending_ scores (in contrast
to the index lists, which are sorted according to _ascending_ rank
number and document ID).
371 Those are defined in the zebra C source files
373 "rank-1" : zebra/index/rank1.c
374 default TF/IDF like zebra dynamic ranking
375 "rank-static" : zebra/index/rankstatic.c
376 do-nothing dummy static ranking (this is just to prove
377 that the static rank can be used in dynamic ranking functions)
378 "zvrank" : zebra/index/zvrank.c
379 many different dynamic TF/IDF ranking functions
They are enabled in the Zebra config file by a directive like:
Notice that "rank-1" and "zvrank" do not use the static rank
information in the list keys, and will produce the same ordering
with or without static ranking enabled.
389 The dummy "rank-static" reranking/scoring function returns just
390 score = max int - staticrank
in order to preserve the ordering of hit sets with and without its
Obviously, one wants to make a new ranking function that combines
static and dynamic ranking; this is left as an exercise for the
reader .. (Wray, this is yours ...)
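The behaviour of the dummy "rank-static" function described above can be sketched in Python. The value chosen for "max int" and the sample hits are assumptions, and the code only mirrors the formula score = max int - staticrank; it is not taken from rankstatic.c.

```python
MAX_INT = 2**31 - 1  # assumption: stand-in for "max int"


def rank_static_score(static_rank):
    """Score from the formula above: a lower static rank scores higher."""
    return MAX_INT - static_rank


# Hits as (static_rank, doc_id), already in index order:
# ascending static rank, then ascending document ID.
hits = [(0, 5), (0, 42), (3, 7), (3, 10)]

# Dynamic ranking sorts by descending score. Python's sort is stable,
# so hits with equal scores keep their document ID order.
reranked = sorted(hits, key=lambda hit: rank_static_score(hit[0]), reverse=True)
```

Sorting by descending score leaves the list unchanged, which is the property the dummy function is meant to demonstrate.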
403 yazserver frontend config file
407 Setup of listening ports, and virtual zebra servers.
408 Note path to server-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF config file, and to
409 <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> explain config section.
411 The <directory> path is relative to the directory where zebra.init is placed
and is started up. The other paths are relative to <directory>,
413 which in this case is the same.
415 see: http://www.indexdata.com/yaz/doc/server.vhosts.tkl
422 search like this (using client-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF conversion):
424 yaz-client -q db/cql2pqf.txt localhost:9999
427 > f text=(plant and soil)
439 search like this (using server-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF conversion):
440 (the only difference is "querytype cql" instead of
441 "querytype cql2rpn" and the call without specifying a local
444 yaz-client localhost:9999
447 > f text=(plant and soil)
458 NEW: static relevance ranking - see examples in alvis2index.xsl
460 > f text = /relevant (plant and soil)
464 > f title = /relevant a
470 <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/U searching
471 Surf into http://localhost:9999
473 firefox http://localhost:9999
gives you an explain record. Unfortunately, the data found in the
<ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF text file must be added by hand to the explain
section of the yazserver.xml file. Too bad, but this is all extremely
new alpha stuff, and a lot of work has yet to be done ..
480 Searching via <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>: surf into the URL (lines broken here - concat on
483 - see number of hits:
484 http://localhost:9999/?version=1.1&operation=searchRetrieve
485 &query=text=(plant%20and%20soil)
- fetch records 5-6 in DC format
489 http://localhost:9999/?version=1.1&operation=searchRetrieve
490 &query=text=(plant%20and%20soil)
491 &startRecord=5&maximumRecords=2&recordSchema=dc
494 - even search using PQF queries using the extended verb "x-pquery",
495 which is special to YAZ/Zebra
497 http://localhost:9999/?version=1.1&operation=searchRetrieve
498 &x-pquery=@attr%201=text%20@and%20plant%20soil
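The URLs above can also be assembled programmatically. The following Python sketch builds them with the standard library. The function name is hypothetical, and the set of characters left unescaped is chosen only so that the output matches the examples above.

```python
from urllib.parse import quote, urlencode


def sru_url(base, params):
    """Build an SRU 1.1 searchRetrieve URL (hypothetical helper).

    `params` holds the request-specific parameters, for example
    {"query": "text=(plant and soil)"}.
    """
    query = {"version": "1.1", "operation": "searchRetrieve"}
    query.update(params)
    # quote() encodes spaces as %20; "()=@" are left as-is to match
    # the hand-written URLs shown above.
    return base + "?" + urlencode(query, quote_via=quote, safe="()=@")


hits_url = sru_url("http://localhost:9999/", {"query": "text=(plant and soil)"})
pqf_url = sru_url(
    "http://localhost:9999/",
    {"x-pquery": "@attr 1=text @and plant soil"},
)
```

Further parameters such as startRecord, maximumRecords and recordSchema can be passed in the same dictionary.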
500 More info: read the fine manuals at http://www.loc.gov/z3950/agency/zing/srw/
502 Search via <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>:
503 read the fine manual at
504 http://www.loc.gov/z3950/agency/zing/srw/
507 and so on. The list of available indexes is found in db/cql2pqf.txt
510 7) How do you add to the index attributes of any other type than "w"?
511 I mean, in the context of making <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> queries. Let's say I want a date
512 attribute in there, so that one could do date > 20050101 in <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>.
514 Currently for example 'date-modified' is of type 'w'.
The 2-seconds-of-thought solution:
520 <z:index name="date-modified" type="d">
522 select="acquisition/acquisitionData/modifiedDate"/>
525 But here's the catch...doesn't the use of the 'd' type require
526 structure type 'date' (@attr 4=5) in PQF? But then...how does that
527 reflect in the <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>->RPN/PQF mapping - does it really work if I just
change the type of an element in alvis2index.xsl? I would think not...?
538 f @attr 4=5 @attr 1=date-modified 20050713
545 f @attr 4=5 @attr 1=date-modified 20050713
551 f date-modified=20050713
553 f date-modified=20050713
Search ERROR 121 4 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=1 @attr 2=3 @attr "1=date-modified" 20050713
560 f date-modified eq 20050713
562 Search OK 23 3 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=5
563 @attr 2=3 @attr "1=date-modified" 20050713
569 E) EXTENDED SERVICE LIFE UPDATES
The extended services are not enabled by default in Zebra, because
they modify the system.
574 In order to allow anybody to update, use
578 Or, even better, allow only updates for a particular admin user. For
579 user 'admin', you could use:
583 And in passwordfile, specify users and passwords ..
586 We can now start a yaz-client admin session and create a database:
588 $ yaz-client localhost:9999 -u admin/secret
589 Authentication set to Open (admin/secret)
592 Connection accepted by v3 target.
594 Name : Zebra Information Server/GFS/YAZ
595 Version: Zebra 1.4.0/1.63/2.1.9
596 Options: search present delSet triggerResourceCtrl scan sort
597 extendedServices namedResultSets
601 Got extended services response
Now Default has been created. We can now insert an XML file (esdd0006.grs
606 from example/gils/records) and index it:
608 Z> update insert 1 esdd0006.grs
609 Got extended services response
The 3rd parameter (1 here) is the opaque record ID from Ext update.
It is a record ID that _we_ assign to the record in question. If we do not
assign one, the usual rules for match apply (recordId: from zebra.cfg).
617 Actually, we should have a way to specify "no opaque record id" for
618 yaz-client's update command.. We'll fix that.
623 Received SearchResponse.
624 Search was a success.
625 Number of hits: 1, setno 1
626 SearchResult-1: term=utah cnt=1
630 Let's delete the beast:
632 No last record (update ignored)
633 Z> update delete 1 esdd0006.grs
634 Got extended services response
639 Received SearchResponse.
640 Search was a success.
641 Number of hits: 0, setno 2
642 SearchResult-1: term=utah cnt=0
If shadow register is enabled, you must run the adm-commit command in
order to write your changes.
658 <!-- Keep this comment at the end of the file
663 sgml-minimize-attributes:nil
664 sgml-always-quote-attributes:t
667 sgml-parent-document: "zebra.xml"
668 sgml-local-catalogs: nil
669 sgml-namecase-general:t