doc/architecture.xml

   1  <chapter id="architecture">
   2   <title>Overview of &zebra; Architecture</title>
   3
   4   <section id="architecture-representation">
   5    <title>Local Representation</title>
   6
   7    <para>
   8     As mentioned earlier, &zebra; places few restrictions on the type of
   9     data that you can index and manage. Generally, whatever the form of
  10     the data, it is parsed by an input filter specific to that format, and
  11     turned into an internal structure that &zebra; knows how to handle. This
  12     process takes place whenever the record is accessed - for indexing and
  13     retrieval.
  14    </para>
  15
  16    <para>
  17     The <literal>recordType</literal> parameter in the <literal>zebra.cfg</literal>
  18     file, or the <literal>-t</literal> option to the indexer, tells &zebra; how to
  19     process input records.
  20     Two basic types of processing are available: raw text and structured
  21     data. Raw text is just that, and it is selected by providing the
  22     argument <emphasis>text</emphasis> to &zebra;. Structured records are
  23     all handled internally using the basic mechanisms described in the
  24     subsequent sections.
  25     &zebra; can read structured records in many different formats.
  26     <!--
  27     How this is done is governed by additional parameters after the
  28     "grs" keyword, separated by "." characters.
  29     -->
  30    </para>
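   <para>
    As a minimal sketch of the two ways to select the record type, the
    following assumes a configuration file named
    <literal>zebra.cfg</literal> and a hypothetical directory
    <literal>records</literal> holding plain text files; adapt both
    names to your installation.
   </para>
   <screen>
    # in zebra.cfg: treat all input records as raw text
    recordType: text

    # or select the type on the indexer command line instead
    zebraidx -c zebra.cfg -t text update records
   </screen>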
  31   </section>
  32
  33   <section id="architecture-maincomponents">
  34    <title>Main Components</title>
  35    <para>
  36     The &zebra; system is designed to support a wide range of data management
  37     applications. The system can be configured to handle virtually any
  38     kind of structured data. Each record in the system is associated with
  39     a <emphasis>record schema</emphasis> which lends context to the data
  40     elements of the record.
  41     Any number of record schemas can coexist in the system.
  42     Although it may be wise to use only a single schema within
  43     one database, the system poses no such restrictions.
  44    </para>
  45    <para>
  46     The &zebra; indexer and information retrieval server consists of the
  47     following main applications: the <command>zebraidx</command>
  48     indexing maintenance utility, and the <command>zebrasrv</command>
  49     information query and retrieval server. Both use some of the
  50     same main components, which are presented here.
  51    </para>
  52    <para>
  53     The virtual Debian package <literal>idzebra-2.0</literal>
  54     installs all the necessary packages to start
  55     working with &zebra; - including utility programs, development libraries,
  56     documentation and modules.
  57   </para>
  58
  59    <section id="componentcore">
  60     <title>Core &zebra; Libraries Containing Common Functionality</title>
  61     <para>
  62      The core &zebra; module is the meat of the <command>zebraidx</command>
  63     indexing maintenance utility, and the <command>zebrasrv</command>
  64     information query and retrieval server binaries. In short, the core
  65     libraries are responsible for
  66      <variablelist>
  67       <varlistentry>
  68        <term>Dynamic Loading</term>
  69        <listitem>
  70         <para>of external filter modules, in case the application is
  71         not compiled statically. These filter modules define indexing,
  72         search and retrieval capabilities of the various input formats.
  73         </para>
  74        </listitem>
  75       </varlistentry>
  76       <varlistentry>
  77        <term>Index Maintenance</term>
  78        <listitem>
  79         <para> &zebra; maintains Term Dictionaries and ISAM index
  80         entries in inverted index structures kept on disk. These are
  81         optimized for fast insert, update and delete, as well as good
  82         search performance.
  83         </para>
  84        </listitem>
  85       </varlistentry>
  86       <varlistentry>
  87        <term>Search Evaluation</term>
  88        <listitem>
  89         <para>by execution of search requests expressed in &acro.pqf;/&acro.rpn;
  90          data structures, which are handed over from
  91          the &yaz; server frontend &acro.api;. Search evaluation includes
  92          construction of hit lists according to boolean combinations
  93          of simpler searches. Fast performance is achieved by careful
  94          use of index structures, and by evaluating specific index hit
  95          lists in the correct order.
  96         </para>
  97        </listitem>
  98       </varlistentry>
  99       <varlistentry>
 100        <term>Ranking and Sorting</term>
 101        <listitem>
 102         <para>
 103          components call resorting/re-ranking algorithms on the hit
 104          sets. These might also be pre-sorted not only using the
 105          assigned document IDs, but also using assigned static rank
 106          information.
 107         </para>
 108        </listitem>
 109       </varlistentry>
 110       <varlistentry>
 111        <term>Record Presentation</term>
 112        <listitem>
 113         <para>returns result sets (possibly ranked), hit
 114          counts, and similar internal data to the &yaz; server backend &acro.api;
 115          for shipping to the client. Each individual filter module
 116          implements its own specific presentation formats.
 117         </para>
 118        </listitem>
 119       </varlistentry>
 120      </variablelist>
 121      </para>
 122     <para>
 123      The Debian package <literal>libidzebra-2.0</literal>
 124      contains all run-time libraries for &zebra;. The
 125      documentation in PDF and HTML is found in
 126      <literal>idzebra-2.0-doc</literal>, and
 127      <literal>idzebra-2.0-common</literal>
 128      includes common essential &zebra; configuration files.
 129     </para>
 130    </section>
 131
 132
 133    <section id="componentindexer">
 134     <title>&zebra; Indexer</title>
 135     <para>
 136      The <command>zebraidx</command>
 137      indexing maintenance utility
 138      loads external filter modules used for indexing data records of
 139      different types, and creates, updates and drops databases and
 140      indexes according to the rules defined in the filter modules.
 141     </para>
 142     <para>
 143      The Debian  package <literal>idzebra-2.0-utils</literal> contains
 144      the  <command>zebraidx</command> utility.
 145     </para>
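    <para>
     A typical maintenance session might look like the sketch below. It
     assumes a configuration file <literal>zebra.cfg</literal> in the
     current directory and a hypothetical <literal>records</literal>
     directory. Note that <literal>init</literal> clears any existing
     register, and that the <literal>commit</literal> step is only
     relevant when shadow registers are enabled.
    </para>
    <screen>
     zebraidx -c zebra.cfg init
     zebraidx -c zebra.cfg update records
     zebraidx -c zebra.cfg commit
    </screen>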
 146    </section>
 147
 148    <section id="componentsearcher">
 149     <title>&zebra; Searcher/Retriever</title>
 150     <para>
 151      This is the executable which runs the &acro.z3950;/&acro.sru;/&acro.srw; server and
 152      glues together the core libraries and the filter modules into one
 153      great Information Retrieval server application.
 154     </para>
 155     <para>
 156      The Debian  package <literal>idzebra-2.0-utils</literal> contains
 157      the  <command>zebrasrv</command> utility.
 158     </para>
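    <para>
     As an illustration, the server can be started on a port of your
     choice and queried with the <command>yaz-client</command> program
     from the &yaz; toolkit. The port number 9999 and the configuration
     file name are assumptions made for this sketch.
    </para>
    <screen>
     zebrasrv -c zebra.cfg @:9999
     yaz-client localhost:9999
    </screen>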
 159    </section>
 160
 161    <section id="componentyazserver">
 162     <title>&yaz; Server Frontend</title>
 163     <para>
 164      The &yaz; server frontend is
 165      a full-fledged stateful &acro.z3950; server taking client
 166      connections, and forwarding search and scan requests to the
 167      &zebra; core indexer.
 168     </para>
 169     <para>
 170      In addition to &acro.z3950; requests, the &yaz; server frontend acts
 171      as an HTTP server, honoring
 172       <ulink url="&url.sru;">&acro.sru; &acro.soap;</ulink>
 173      requests, and
 174      &acro.sru; &acro.rest;
 175      requests. Moreover, it can
 176      translate incoming
 177      <ulink url="&url.cql;">&acro.cql;</ulink>
 178      queries to
 179      <ulink url="&url.yaz.pqf;">&acro.pqf;</ulink>
 180       queries, if
 181      correctly configured.
 182     </para>
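    <para>
     For instance, assuming a server listening on port 9999 of the
     local host and a working &acro.cql; to &acro.pqf; configuration
     that defines a <literal>title</literal> index, an &acro.sru;
     &acro.rest; search might be issued with any HTTP client using a
     URL like the one below. The host, port, and index name are
     placeholders.
    </para>
    <screen>
     http://localhost:9999/?version=1.1&amp;operation=searchRetrieve&amp;query=title=my&amp;maximumRecords=1
    </screen>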
 183     <para>
 184      <ulink url="&url.yaz;">&yaz;</ulink>
 185      is an Open Source
 186      toolkit that allows you to develop software using the
 187      &acro.ansi; &acro.z3950;/ISO23950 standard for information retrieval.
 188      It is packaged in the Debian packages
 189      <literal>yaz</literal> and <literal>libyaz</literal>.
 190     </para>
 191    </section>
 192
 193    <section id="componentmodules">
 194     <title>Record Models and Filter Modules</title>
 195     <para>
 196      The hard work of knowing <emphasis>what</emphasis> to index,
 197      <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
 198      part of the records to send in a search/retrieve response is
 199      implemented in
 200      various filter modules. It is their responsibility to define the
 201      exact indexing and record display filtering rules.
 202      </para>
 203      <para>
 204      The virtual Debian package
 205      <literal>libidzebra-2.0-modules</literal> installs all base filter
 206      modules.
 207     </para>
 208
 209    <section id="componentmodulesdom">
 210     <title>&acro.dom; &acro.xml; Record Model and Filter Module</title>
 211      <para>
 212       The &acro.dom; &acro.xml; filter uses a standard &acro.dom; &acro.xml; structure as
 213       its internal data model, and can thus parse, index, and display
 214       any &acro.xml; document.
 215     </para>
 216     <para>
 217       A parser for binary &acro.marc; records based on the ISO2709 library
 218       standard is provided; it transforms these to the internal
 219       &acro.marcxml; &acro.dom; representation.
 220     </para>
 221     <para>
 222       The internal &acro.dom; &acro.xml; representation can be fed into four
 223       different pipelines, consisting of arbitrarily many successive
 224       &acro.xslt; transformations; these are for
 225      <itemizedlist>
 226        <listitem><para>input parsing and initial
 227           transformations,</para></listitem>
 228        <listitem><para>indexing term extraction
 229           transformations,</para></listitem>
 230        <listitem><para>transformations before internal document
 231           storage, and</para></listitem>
 232        <listitem><para>retrieval transformations from storage to output
 233           format.</para></listitem>
 234       </itemizedlist>
 235     </para>
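    <para>
     The four pipelines correspond to four element types in the filter
     configuration file. The following is only a sketch: the stylesheet
     file names are hypothetical, and
     <xref linkend="record-model-domxml"/> remains the authoritative
     description of the syntax.
    </para>
    <screen><![CDATA[
     <dom xmlns="http://indexdata.com/zebra-2.0">
      <!-- input parsing and initial transformations -->
      <input>
       <xmlreader level="1"/>
      </input>
      <!-- indexing term extraction -->
      <extract>
       <xslt stylesheet="my2index.xsl"/>
      </extract>
      <!-- transformations before internal storage -->
      <store>
       <xslt stylesheet="my2store.xsl"/>
      </store>
      <!-- retrieval from storage to an output format -->
      <retrieve name="dc">
       <xslt stylesheet="my2dc.xsl"/>
      </retrieve>
     </dom>
    ]]></screen>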
 236     <para>
 237       The &acro.dom; &acro.xml; filter pipelines use &acro.xslt; (and, if supported on
 238       your platform, even &acro.exslt;), and thus bring full &acro.xpath;
 239       support to the indexing, storage and display rules of not only
 240       &acro.xml; documents, but also binary &acro.marc; records.
 241     </para>
 242     <para>
 243       Finally, the &acro.dom; &acro.xml; filter allows for static ranking at index
 244       time, and for sorting hit lists according to predefined
 245       static ranks.
 246     </para>
 247     <para>
 248       Details on the experimental &acro.dom; &acro.xml; filter are found in
 249       <xref linkend="record-model-domxml"/>.
 250       </para>
 251      <para>
 252       The Debian package <literal>libidzebra-2.0-mod-dom</literal>
 253       contains the &acro.dom; filter module.
 254      </para>
 255     </section>
 256
 257    <section id="componentmodulesalvis">
 258     <title>ALVIS &acro.xml; Record Model and Filter Module</title>
 259      <note>
 260       <para>
 261         The functionality of this record model has been improved and
 262         replaced by the &acro.dom; &acro.xml; record model. See
 263         <xref linkend="componentmodulesdom"/>.
 264       </para>
 265      </note>
 266
 267      <para>
 268       The Alvis filter for &acro.xml; files is an &acro.xslt; based input
 269       filter.
 270       It indexes element and attribute content of any conceivable &acro.xml; format
 271       using full &acro.xpath; support, a feature which the standard &zebra;
 272       &acro.grs1; &acro.sgml; and &acro.xml; filters lacked. The indexed documents are
 273       parsed into a standard &acro.xml; &acro.dom; tree, which restricts record size
 274       according to availability of memory.
 275     </para>
 276     <para>
 277       The Alvis filter
 278       uses &acro.xslt; display stylesheets, which let
 279       the &zebra; DB administrator associate multiple, different views with
 280       the same &acro.xml; document type. These views are chosen on the fly at
 281       search time.
 282      </para>
 283     <para>
 284       In addition, the Alvis filter configuration is not bound to the
 285       arcane  &acro.bib1; &acro.z3950; library catalogue indexing traditions and
 286       folklore, and is therefore easier to understand.
 287     </para>
 288     <para>
 289       Finally, the Alvis filter allows for static ranking at index
 290       time, and for sorting hit lists according to predefined
 291       static ranks. This imposes no overhead at all: both
 292       search and indexing still perform in
 293       <emphasis>O(1)</emphasis> time, irrespective of document
 294       collection size. This feature resembles Google's pre-ranking using
 295       their PageRank algorithm.
 296     </para>
 297     <para>
 298       Details on the experimental Alvis &acro.xslt; filter are found in
 299       <xref linkend="record-model-alvisxslt"/>.
 300       </para>
 301      <para>
 302       The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
 303       contains the Alvis filter module.
 304      </para>
 305     </section>
 306
 307    <section id="componentmodulesgrs">
 308     <title>&acro.grs1; Record Model and Filter Modules</title>
 309      <note>
 310       <para>
 311         The functionality of this record model has been improved and
 312         replaced by the &acro.dom; &acro.xml; record model. See
 313         <xref linkend="componentmodulesdom"/>.
 314       </para>
 315      </note>
 316     <para>
 317     The &acro.grs1; filter modules described in
 318     <xref linkend="grs"/>
 319     are all based on the &acro.z3950; specifications, and it is absolutely
 320     mandatory to have the reference pages on &acro.bib1; attribute sets
 321     at hand when configuring &acro.grs1; filters. The &acro.grs1; filters come in
 322     different flavors, and a short introduction is needed here.
 323     &acro.grs1; filters of various kinds have also been called ABS filters due
 324     to the <filename>*.abs</filename> configuration file suffix.
 325     </para>
 326     <para>
 327       The <emphasis>grs.marc</emphasis> and
 328       <emphasis>grs.marcxml</emphasis> filters are suited to parse and
 329       index binary and &acro.xml; versions of traditional library &acro.marc; records
 330       based on the ISO2709 standard. The Debian package for both
 331       filters is
 332      <literal>libidzebra-2.0-mod-grs-marc</literal>.
 333     </para>
 334     <para>
 335       &acro.grs1; TCL scriptable filters for extensive user configuration come
 336      in two flavors: a regular expression filter
 337      <emphasis>grs.regx</emphasis> using TCL regular expressions, and
 338      a general scriptable TCL filter called
 339      <emphasis>grs.tcl</emphasis>.
 340      Both are included in the
 341      <literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
 342     </para>
 343     <para>
 344       A general purpose &acro.sgml; filter is called
 345      <emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
 346      but is planned for the
 347      <literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
 348     </para>
 349     <para>
 350       The Debian  package
 351       <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
 352       <emphasis>grs.xml</emphasis> filter which uses <ulink
 353       url="&url.expat;">Expat</ulink> to
 354       parse records in &acro.xml; and turn them into ID&zebra;'s internal &acro.grs1; node
 355       trees. See also the Alvis &acro.xml;/&acro.xslt; filter described in
 356       <xref linkend="componentmodulesalvis"/>.
 357     </para>
 358    </section>
 359
 360    <section id="componentmodulestext">
 361     <title>TEXT Record Model and Filter Module</title>
 362     <para>
 363       Plain ASCII text filter. TODO: add information here.
 364     </para>
 365    </section>
 366
 367     <!--
 368    <section id="componentmodulessafari">
 369     <title>SAFARI Record Model and Filter Module</title>
 370     <para>
 371      SAFARI filter module TODO: add information here.
 372     </para>
 373    </section>
 374     -->
 375
 376    </section>
 377
 378   </section>
 379
 380
 381   <section id="architecture-workflow">
 382    <title>Indexing and Retrieval Workflow</title>
 383
 384   <para>
 385    Records pass through three different states during processing in the
 386    system.
 387   </para>
 388
 389   <para>
 390
 391    <itemizedlist>
 392     <listitem>
 393
 394      <para>
 395       When records are accessed by the system, they are represented
 396       in their local, or native format. This might be &acro.sgml; or HTML files,
 397       News or Mail archives, or &acro.marc; records. If the system doesn't already
 398       know how to read the type of data you need to store, you can set up an
 399       input filter by preparing conversion rules based on regular
 400       expressions and possibly augmented by a flexible scripting language
 401       (Tcl).
 402       The input filter produces as output an internal representation,
 403       a tree structure.
 404
 405      </para>
 406     </listitem>
 407     <listitem>
 408
 409      <para>
 410       When records are processed by the system, they are represented
 411       in a tree structure, constructed from tagged data elements hanging off a
 412       root node. The tagged elements may contain data or yet more tagged
 413       elements in a recursive structure. The system performs various
 414       actions on this tree structure (indexing, element selection, schema
 415       mapping, etc.).
 416
 417      </para>
 418     </listitem>
 419     <listitem>
 420
 421      <para>
 422       Before records are transmitted to the client, they are first
 423       converted from the internal structure to a form suitable for exchange
 424       over the network, according to the &acro.z3950; standard.
 425      </para>
 426     </listitem>
 427
 428    </itemizedlist>
 429
 430   </para>
 431   </section>
 432
 433   <section id="special-retrieval">
 434    <title>Retrieval of &zebra; internal record data</title>
 435    <para>
 436     Starting with <literal>&zebra;</literal> version 2.0.5, it is
 437     possible to use a special element set which has the prefix
 438     <literal>zebra::</literal>.
 439    </para>
 440    <para>
 441     Using this element set will, regardless of record type, return
 442     &zebra;'s internal index structure/data for a record.
 443     In particular, the regular record filters are not invoked when
 444     these are in use.
 445     This can in some cases make the retrieval faster than regular
 446     retrieval operations (for &acro.marc;, &acro.xml; etc).
 447    </para>
 448    <table id="special-retrieval-types">
 449     <title>Special Retrieval Elements</title>
 450     <tgroup cols="3">
 451      <thead>
 452       <row>
 453        <entry>Element Set</entry>
 454        <entry>Description</entry>
 455        <entry>Syntax</entry>
 456       </row>
 457      </thead>
 458      <tbody>
 459       <row>
 460        <entry><literal>zebra::meta::sysno</literal></entry>
 461        <entry>Get &zebra; record system ID</entry>
 462        <entry>&acro.xml; and &acro.sutrs;</entry>
 463       </row>
 464       <row>
 465        <entry><literal>zebra::data</literal></entry>
 466        <entry>Get raw record</entry>
 467        <entry>all</entry>
 468       </row>
 469       <row>
 470        <entry><literal>zebra::meta</literal></entry>
 471        <entry>Get &zebra; record internal metadata</entry>
 472        <entry>&acro.xml; and &acro.sutrs;</entry>
 473       </row>
 474       <row>
 475        <entry><literal>zebra::index</literal></entry>
 476        <entry>Get all indexed keys for record</entry>
 477        <entry>&acro.xml; and &acro.sutrs;</entry>
 478       </row>
 479       <row>
 480        <entry>
 481         <literal>zebra::index::</literal><replaceable>f</replaceable>
 482        </entry>
 483        <entry>
 484         Get indexed keys for field <replaceable>f</replaceable> for record
 485        </entry>
 486        <entry>&acro.xml; and &acro.sutrs;</entry>
 487       </row>
 488       <row>
 489        <entry>
 490         <literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
 491        </entry>
 492        <entry>
 493         Get indexed keys for field <replaceable>f</replaceable>
 494           and type <replaceable>t</replaceable> for record
 495        </entry>
 496        <entry>&acro.xml; and &acro.sutrs;</entry>
 497       </row>
 498       <row>
 499        <entry>
 500         <literal>zebra::snippet</literal>
 501        </entry>
 502        <entry>
 503         Get snippet for record for one or more indexes (f1,f2,..).
 504         This includes a phrase from the original
 505         record at the point where a match occurs (for a query). By default,
 506         terms before and after the match are included in the snippet. The
 507         matching terms are enclosed within the element
 508         <literal>&lt;s&gt;</literal>. The snippet facility requires
 509         &zebra; 2.0.16 or later.
 510        </entry>
 511        <entry>&acro.xml; and &acro.sutrs;</entry>
 512       </row>
 513       <row>
 514        <entry>
 515         <literal>zebra::facet::</literal><replaceable>f1</replaceable>:<replaceable>t1</replaceable>,<replaceable>f2</replaceable>:<replaceable>t2</replaceable>,..
 516        </entry>
 517        <entry>
 518         Get facet of a result set. The facet result is returned
 519         as if it were a normal record, while in reality it is a
 520         summary of the most "important" terms in a result set for the fields
 521         given.
 522         The facet facility first appeared in &zebra; 2.0.20.
 523        </entry>
 524        <entry>&acro.xml;</entry>
 525       </row>
 526      </tbody>
 527     </tgroup>
 528    </table>
 529    <para>
 530     For example, to fetch the raw binary record data stored in the
 531     &zebra; internal storage, or on the filesystem, the following
 532     commands can be issued:
 533     <screen>
 534       Z> f @attr 1=title my
 535       Z> format xml
 536       Z> elements zebra::data
 537       Z> s 1+1
 538       Z> format sutrs
 539       Z> s 1+1
 540       Z> format usmarc
 541       Z> s 1+1
 542     </screen>
 543     </para>
 544    <para>
 545     The special
 546     <literal>zebra::data</literal> element set name is
 547     defined for any record syntax, but will always fetch
 548     the raw record data in exactly the original form. No record syntax
 549     specific transformations will be applied to the raw record data.
 550    </para>
 551    <para>
 552     Also, &zebra; internal metadata about the record can be accessed:
 553     <screen>
 554       Z> f @attr 1=title my
 555       Z> format xml
 556       Z> elements zebra::meta::sysno
 557       Z> s 1+1
 558     </screen>
 559     displays only the internal record system number, in
 560     <literal>&acro.xml;</literal> record syntax, whereas
 561     <screen>
 562       Z> f @attr 1=title my
 563       Z> format xml
 564       Z> elements zebra::meta
 565       Z> s 1+1
 566     </screen>
 567     displays all available metadata on the record. These include the system
 568     number, database name, indexed filename, filter used for indexing,
 569     score and static ranking information, and finally the byte size of the record.
 570    </para>
 571    <para>
 572     Sometimes it is very hard to figure out exactly what has been
 573     indexed, how, and in which indexes. Using the indexing stylesheet of
 574     the Alvis filter, one can at least see which portion of the record
 575     went into which index, but a similar aid does not exist for the
 576     other indexing filters.
 577    </para>
 578    <para>
 579     The special
 580     <literal>zebra::index</literal> element set names are provided to
 581     access information on per-record indexed fields. For example, the
 582     queries
 583     <screen>
 584       Z> f @attr 1=title my
 585       Z> format sutrs
 586       Z> elements zebra::index
 587       Z> s 1+1
 588     </screen>
 589     will display all indexed tokens from all indexed fields of the
 590     first record, and it will display in <literal>&acro.sutrs;</literal>
 591     record syntax, whereas
 592     <screen>
 593       Z> f @attr 1=title my
 594       Z> format xml
 595       Z> elements zebra::index::title
 596       Z> s 1+1
 597       Z> elements zebra::index::title:p
 598       Z> s 1+1
 599     </screen>
 600     displays in <literal>&acro.xml;</literal> record syntax only the content
 601       of the &zebra; string index <literal>title</literal>, or
 602       even only the phrase-indexed part of it (type <literal>p</literal>).
 603    </para>
 604    <note>
 605     <para>
 606      Trying to access numeric <literal>&acro.bib1;</literal> use
 607      attributes or trying to access non-existent &zebra; internal string
 608      access points will result in Diagnostic 25: Specified element set
 609      name not valid for specified database.
 610     </para>
 611    </note>
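   <para>
    The snippet and facet element sets listed in
    <xref linkend="special-retrieval-types"/> are requested in the same
    way. The session below is a sketch that assumes a
    <literal>title</literal> index of type <literal>p</literal> exists
    in the target database; the output depends on the indexed data and
    is therefore not shown.
   </para>
   <screen>
     Z> f @attr 1=title my
     Z> format xml
     Z> elements zebra::snippet
     Z> s 1+1
     Z> elements zebra::facet::title:p
     Z> s 1+1
   </screen>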
 612   </section>
 613
 614  </chapter>
 615
 616  <!-- Keep this comment at the end of the file
 617  Local variables:
 618  mode: sgml
 619  sgml-omittag:t
 620  sgml-shorttag:t
 621  sgml-minimize-attributes:nil
 622  sgml-always-quote-attributes:t
 623  sgml-indent-step:1
 624  sgml-indent-data:t
 625  sgml-parent-document: "idzebra.xml"
 626  sgml-local-catalogs: nil
 627  sgml-namecase-general:t
 628  End:
 629  -->