1 <chapter id="architecture">
2 <!-- $Id: architecture.xml,v 1.8 2006-04-25 12:26:26 marc Exp $ -->
3 <title>Overview of Zebra Architecture</title>
6 <sect1 id="architecture-representation">
7 <title>Local Representation</title>
10 As mentioned earlier, Zebra places few restrictions on the type of
11 data that you can index and manage. Generally, whatever the form of
12 the data, it is parsed by an input filter specific to that format, and
13 turned into an internal structure that Zebra knows how to handle. This
14 process takes place whenever the record is accessed - for indexing and
19 The RecordType parameter in the <literal>zebra.cfg</literal> file, or
20 the <literal>-t</literal> option to the indexer tells Zebra how to
21 process input records.
22 Two basic types of processing are available - raw text and structured
23 data. Raw text is just that, and it is selected by providing the
24 argument <emphasis>text</emphasis> to Zebra. Structured records are
25 all handled internally using the basic mechanisms described in the
27 Zebra can read structured records in many different formats.
29 How this is done is governed by additional parameters after the
30 "grs" keyword, separated by "." characters.
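<para>
As an illustration (a sketch only - the filter names available depend on
which modules are actually installed), record type selection in
<literal>zebra.cfg</literal> might look like this:
</para>
```
# zebra.cfg fragment: choose the input filter for records.
# Raw, unstructured text:
recordType: text
# Or a structured GRS filter; additional parameters follow the
# "grs" keyword, separated by "." characters, e.g.
# recordType: grs.sgml
# recordType: grs.xml
```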
35 <sect1 id="architecture-maincomponents">
36 <title>Main Components</title>
38 The Zebra system is designed to support a wide range of data management
39 applications. The system can be configured to handle virtually any
40 kind of structured data. Each record in the system is associated with
41 a <emphasis>record schema</emphasis> which lends context to the data
42 elements of the record.
43 Any number of record schemas can coexist in the system.
44 Although it may be wise to use only a single schema within
one database, the system imposes no such restriction.
48 The Zebra indexer and information retrieval server consists of the
49 following main applications: the <command>zebraidx</command>
50 indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server. Both share some of the
same main components, which are presented here.
55 The virtual Debian package <literal>idzebra1.4</literal>
56 installs all the necessary packages to start
57 working with Zebra - including utility programs, development libraries,
58 documentation and modules.
61 <sect2 id="componentcore">
62 <title>Core Zebra Libraries Containing Common Functionality</title>
64 The core Zebra module is the meat of the <command>zebraidx</command>
65 indexing maintenance utility, and the <command>zebrasrv</command>
information query and retrieval server binaries. In short, the core
libraries are responsible for
70 <term>Dynamic Loading</term>
72 <para>of external filter modules, in case the application is
73 not compiled statically. These filter modules define indexing,
74 search and retrieval capabilities of the various input formats.
79 <term>Index Maintenance</term>
81 <para> Zebra maintains Term Dictionaries and ISAM index
82 entries in inverted index structures kept on disk. These are
optimized for fast insert, update and delete, as well as good
89 <term>Search Evaluation</term>
91 <para>by execution of search requests expressed in PQF/RPN
92 data structures, which are handed over from
93 the YAZ server frontend API. Search evaluation includes
94 construction of hit lists according to boolean combinations
95 of simpler searches. Fast performance is achieved by careful
use of index structures, and by evaluating specific index hit
lists in the correct order.
102 <term>Ranking and Sorting</term>
105 components call resorting/re-ranking algorithms on the hit
sets. These might also be pre-sorted not only using the
assigned document IDs, but also using the assigned static rank
113 <term>Record Presentation</term>
<para>returns - possibly ranked - result sets, hit
counts, and similar internal data to the YAZ server backend API
for shipping to the client. Each individual filter module
implements its own specific presentation formats.
125 The Debian package <literal>libidzebra1.4</literal>
126 contains all run-time libraries for Zebra, the
127 documentation in PDF and HTML is found in
128 <literal>idzebra1.4-doc</literal>, and
129 <literal>idzebra1.4-common</literal>
130 includes common essential Zebra configuration files.
135 <sect2 id="componentindexer">
136 <title>Zebra Indexer</title>
138 The <command>zebraidx</command>
139 indexing maintenance utility
140 loads external filter modules used for indexing data records of
141 different type, and creates, updates and drops databases and
142 indexes according to the rules defined in the filter modules.
145 The Debian package <literal>idzebra1.4-utils</literal> contains
146 the <command>zebraidx</command> utility.
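<para>
For example (a sketch only; the directory path is hypothetical), a
typical indexing session with <command>zebraidx</command> could look
like this:
</para>
```
# index the XML files in records/ using the grs.xml filter,
# overriding any recordType setting in zebra.cfg via -t
zebraidx -t grs.xml update records/
# make the updated index visible to the running server
zebraidx commit
```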
150 <sect2 id="componentsearcher">
151 <title>Zebra Searcher/Retriever</title>
153 This is the executable which runs the Z39.50/SRU/SRW server and
glues together the core libraries and the filter modules into one
complete Information Retrieval server application.
158 The Debian package <literal>idzebra1.4-utils</literal> contains
159 the <command>zebrasrv</command> utility.
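<para>
A minimal sketch of starting the server and connecting to it (the port
number and configuration file path are assumptions, not defaults to
rely on):
</para>
```
# start the server, reading zebra.cfg, listening on port 2100
zebrasrv -c zebra.cfg @:2100
# in another terminal, connect with the YAZ command line client
yaz-client localhost:2100
```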
163 <sect2 id="componentyazserver">
164 <title>YAZ Server Frontend</title>
166 The YAZ server frontend is
a full-fledged stateful Z39.50 server taking client
168 connections, and forwarding search and scan requests to the
172 In addition to Z39.50 requests, the YAZ server frontend acts
as an HTTP server, honoring
174 <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>
176 <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>
177 REST requests. Moreover, it can
179 <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>
181 <ulink url="http://indexdata.com/yaz/doc/tools.tkl#PQF">PQF</ulink>
183 correctly configured.
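<para>
For instance, CQL-to-PQF translation is enabled by pointing Zebra at a
mapping file. The directive and file fragment below are a sketch; the
file name and the chosen attribute values are assumptions for
illustration:
</para>
```
# in zebra.cfg: location of the CQL-to-PQF mapping file
cql2rpn: cql2pqf.txt

# cql2pqf.txt fragment: map CQL indexes to BIB-1 use attributes
index.cql.serverChoice = 1=1016
index.dc.title         = 1=4
```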
186 <ulink url="http://www.indexdata.com/yaz">YAZ</ulink>
188 toolkit that allows you to develop software using the
189 ANSI Z39.50/ISO23950 standard for information retrieval.
190 It is packaged in the Debian packages
191 <literal>yaz</literal> and <literal>libyaz</literal>.
195 <sect2 id="componentmodules">
196 <title>Record Models and Filter Modules</title>
198 The hard work of knowing <emphasis>what</emphasis> to index,
199 <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
200 part of the records to send in a search/retrieve response is
202 various filter modules. It is their responsibility to define the
203 exact indexing and record display filtering rules.
206 The virtual Debian package
207 <literal>libidzebra1.4-modules</literal> installs all base filter
211 <sect3 id="componentmodulestext">
212 <title>TEXT Record Model and Filter Module</title>
214 Plain ASCII text filter. TODO: add information here.
<literal>text module missing as deb file</literal>
221 <sect3 id="componentmodulesgrs">
222 <title>GRS Record Model and Filter Modules</title>
224 The GRS filter modules described in
225 <xref linkend="record-model-grs"/>
226 are all based on the Z39.50 specifications, and it is absolutely
mandatory to have the reference pages on BIB-1 attribute sets at
hand when configuring GRS filters. The GRS filters come in
229 different flavors, and a short introduction is needed here.
GRS filters of various kinds have also been called ABS filters due
231 to the <filename>*.abs</filename> configuration file suffix.
234 The <emphasis>grs.danbib</emphasis> filter is developed for
236 DanBib is the Danish Union Catalogue hosted by DBC
237 (Danish Bibliographic Center). This filter is found in the
239 <literal>libidzebra1.4-mod-grs-danbib</literal>.
242 The <emphasis>grs.marc</emphasis> and
243 <emphasis>grs.marcxml</emphasis> filters are suited to parse and
244 index binary and XML versions of traditional library MARC records
245 based on the ISO2709 standard. The Debian package for both
247 <literal>libidzebra1.4-mod-grs-marc</literal>.
GRS TCL scriptable filters for extensive user configuration come
in two flavors: a regular expression filter
<emphasis>grs.regx</emphasis> using TCL regular expressions, and
a general scriptable TCL filter called
<emphasis>grs.tcl</emphasis>. Both are included in the
<literal>libidzebra1.4-mod-grs-regx</literal> Debian package.
259 A general purpose SGML filter is called
260 <emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
but is planned to be included in the
262 <literal>libidzebra1.4-mod-grs-sgml</literal> Debian package.
266 <literal>libidzebra1.4-mod-grs-xml</literal> includes the
267 <emphasis>grs.xml</emphasis> filter which uses <ulink
268 url="http://expat.sourceforge.net/">Expat</ulink> to
269 parse records in XML and turn them into IDZebra's internal GRS node
trees. Also have a look at the Alvis XML/XSLT filter described in
275 <sect3 id="componentmodulesalvis">
276 <title>ALVIS Record Model and Filter Module</title>
The Alvis filter for XML files is an XSLT-based input
It indexes element and attribute content of any conceivable XML format
using full XPath support, a feature which the standard Zebra
GRS SGML and XML filters lacked. The indexed documents are
parsed into a standard XML DOM tree, which restricts the record size
according to the available memory.
288 uses XSLT display stylesheets, which let
289 the Zebra DB administrator associate multiple, different views on
290 the same XML document type. These views are chosen on-the-fly in
294 In addition, the Alvis filter configuration is not bound to the
295 arcane BIB-1 Z39.50 library catalogue indexing traditions and
296 folklore, and is therefore easier to understand.
Finally, the Alvis filter allows for static ranking at index
time, and for sorting hit lists according to predefined
static ranks. This imposes no overhead at all: both
search and indexing still perform in
<emphasis>O(1)</emphasis> time, irrespective of the document
collection size. This feature resembles Google's pre-ranking using
their PageRank algorithm.
308 Details on the experimental Alvis XSLT filter are found in
309 <xref linkend="record-model-alvisxslt"/>.
312 The Debian package <literal>libidzebra1.4-mod-alvis</literal>
313 contains the Alvis filter module.
317 <sect3 id="componentmodulessafari">
318 <title>SAFARI Record Model and Filter Module</title>
320 SAFARI filter module TODO: add information here.
<literal>safari module missing as deb file</literal>
332 <sect1 id="architecture-workflow">
333 <title>Indexing and Retrieval Workflow</title>
336 Records pass through three different states during processing in the
346 When records are accessed by the system, they are represented
347 in their local, or native format. This might be SGML or HTML files,
News or Mail archives, or MARC records. If the system doesn't already
349 know how to read the type of data you need to store, you can set up an
350 input filter by preparing conversion rules based on regular
351 expressions and possibly augmented by a flexible scripting language
353 The input filter produces as output an internal representation,
361 When records are processed by the system, they are represented
in a tree structure constructed from tagged data elements hanging off a
363 root node. The tagged elements may contain data or yet more tagged
364 elements in a recursive structure. The system performs various
365 actions on this tree structure (indexing, element selection, schema
373 Before transmitting records to the client, they are first
374 converted from the internal structure to a form suitable for exchange
375 over the network - according to the Z39.50 standard.
386 <sect1 id="architecture-querylanguage">
387 <title>Query Languages</title>
391 http://www.loc.gov/z3950/agency/document.html
393 PQF and BIB-1 stuff to be explained
394 <ulink url="http://www.loc.gov/z3950/agency/defns/bib1.html">
395 http://www.loc.gov/z3950/agency/defns/bib1.html</ulink>
397 <ulink url="http://www.loc.gov/z3950/agency/bib1.html">
398 http://www.loc.gov/z3950/agency/bib1.html</ulink>
400 http://www.loc.gov/z3950/agency/markup/13.html
These attribute types are recognized regardless of the attribute set.
Some are recognized for search, others for scan.
The embedded sort is a way to specify a sort within a query - thus
removing the need to send a separate Sort Request. It is both faster
and does not require the client to support the Sort Facility.
The value of attribute type 7 is 1=ascending or 2=descending. The
attributes+term (APT) node is separate from the rest of the query and
must be @or'ed in. The term associated with the APT is the sort level:
0=primary sort, 1=secondary sort, etc. Example:
423 Search for water, sort by title (ascending):
425 @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
427 Search for water, sort by title ascending, then date descending:
429 @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
The Term Set feature is a facility that allows a search to store the
hitting terms in a "pseudo" result set; it combines an ordinary search
with a scan-like facility. It requires a client that can handle named
result sets, since the search generates two result sets. The value of
attribute 8 is the name of a result set (a string). The terms in the
term set are returned as SUTRS records.
Search for u in the title, right truncated. Store the result in a
result set named uset.
437 @attr 5=1 @attr 1=4 @attr 8=uset u
The model has one serious flaw: we don't know the size of the term set.
Rank weight is a way to pass a value to a ranking algorithm, so that
one APT has one value while another has a different one.
Search for utah in the title with weight 30, as well as in any index
with weight 20.
447 @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
Newer Zebra versions normally estimate the hit count for every APT
(leaf) of the query tree. These hit counts are returned as part of the
searchResult-1 facility.
By setting a limit for an APT, we can make Zebra switch to an
approximate hit count once a certain hit count limit is reached. A
value of zero means an exact hit count.
We are interested in an exact hit count for a, but for b we allow
estimates once 1000 hits are reached.
457 @and a @attr 9=1000 b
459 This facility clashes with rank weight! Fortunately this is a Zebra 1.4 thing so we can change this without upsetting anybody!
463 Zebra supports the searchResult-1 facility.
If attribute 10 is given, it specifies a subqueryId value returned as
part of the search result. It is a way for a client to name an APT
part of a query.
Attribute type 8: Result set narrow (Zebra 1.3).
If attribute 8 is given for scan, the value is the name of a result
set. Each hit count in the scan is @and'ed with the given result set.
The approximate limit (as for search) is a way to enable approximate
hit counts for scan. However, it does NOT appear to work at the moment.
482 AdamDickmeiss - 19 Dec 2005
490 <!-- Keep this comment at the end of the file
495 sgml-minimize-attributes:nil
496 sgml-always-quote-attributes:t
499 sgml-parent-document: "zebra.xml"
500 sgml-local-catalogs: nil
501 sgml-namecase-general:t