doc/architecture.xml

   1  <chapter id="architecture">
   2   <!-- $Id: architecture.xml,v 1.2 2006-01-19 09:27:00 marc Exp $ -->
   3   <title>Overview of Zebra Architecture</title>
   4
   5
   6   <sect1 id="local-representation">
   7    <title>Local Representation</title>
   8
   9    <para>
  10     As mentioned earlier, Zebra places few restrictions on the type of
  11     data that you can index and manage. Generally, whatever the form of
  12     the data, it is parsed by an input filter specific to that format, and
  13     turned into an internal structure that Zebra knows how to handle. This
  14     process takes place whenever the record is accessed - for indexing and
  15     retrieval.
  16    </para>
  17
  18    <para>
  19     The <literal>recordType</literal> parameter in the
  20     <literal>zebra.cfg</literal> file, or the <literal>-t</literal> option
  21     to the indexer, tells Zebra how to process input records.
  22     Two basic types of processing are available - raw text and structured
  23     data. Raw text is just that, and it is selected by providing the
  24     argument <emphasis>text</emphasis> to Zebra. Structured records are
  25     all handled internally using the basic mechanisms described in the
  26     subsequent sections.
  27     Zebra can read structured records in many different formats.
  28     <!--
  29     How this is done is governed by additional parameters after the
  30     "grs" keyword, separated by "." characters.
  31     -->
  32    </para>
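   <para>
    As a minimal sketch of the two processing types described above, a
    <literal>zebra.cfg</literal> file might select one or the other as
    follows. The <literal>grs.xml</literal> filter name is taken from the
    filter list later in this chapter; which structured types are actually
    available is an assumption that depends on the installed filter
    modules. The same value can be given to the indexer with the
    <literal>-t</literal> option instead.
   </para>
   <screen>
    # zebra.cfg: treat every input record as raw text
    recordType: text

    # alternative: treat input records as structured XML data
    # recordType: grs.xml
   </screen>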
  33   </sect1>
  34
  35   <sect1 id="workflow">
  36    <title>Indexing and Retrieval Workflow</title>
  37
  38   <para>
  39    Records pass through three different states during processing in the
  40    system.
  41   </para>
  42
  43   <para>
  44
  45    <itemizedlist>
  46     <listitem>
  47
  48      <para>
  49       When records are accessed by the system, they are represented
  50       in their local, or native format. This might be SGML or HTML files,
  51       News or Mail archives, or MARC records. If the system doesn't already
  52       know how to read the type of data you need to store, you can set up an
  53       input filter by preparing conversion rules based on regular
  54       expressions and possibly augmented by a flexible scripting language
  55       (Tcl).
  56       The input filter produces as output an internal representation,
  57       a tree structure.
  58
  59      </para>
  60     </listitem>
  61     <listitem>
  62
  63      <para>
  64       When records are processed by the system, they are represented
  65       in a tree structure, built from tagged data elements hanging off a
  66       root node. The tagged elements may contain data or yet more tagged
  67       elements in a recursive structure. The system performs various
  68       actions on this tree structure (indexing, element selection, schema
  69       mapping, etc.).
  70
  71      </para>
  72     </listitem>
  73     <listitem>
  74
  75      <para>
  76       Before records are transmitted to the client, they are
  77       converted from the internal structure to a form suitable for exchange
  78       over the network, according to the Z39.50 standard.
  79      </para>
  80     </listitem>
  81
  82    </itemizedlist>
  83
  84   </para>
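  <para>
   To illustrate the second state, consider a small XML record such as the
   one below. The element names and content are hypothetical, chosen only
   for this sketch. It can be pictured as a tree in which
   <literal>record</literal> is the root node, <literal>title</literal> and
   <literal>author</literal> are tagged elements hanging off it, and
   <literal>author</literal> in turn contains further tagged elements.
   How a given input filter actually lays out its tree depends on the
   filter and its configuration.
  </para>
  <programlisting><![CDATA[
<record>
 <title>Soil and plant growth</title>
 <author>
  <name>Jane Doe</name>
  <affiliation>Example University</affiliation>
 </author>
</record>
]]></programlisting>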
  85   </sect1>
  86
  87
  88   <sect1 id="maincomponents">
  89    <title>Main Components</title>
  90    <para>
  91     The Zebra system is designed to support a wide range of data management
  92     applications. The system can be configured to handle virtually any
  93     kind of structured data. Each record in the system is associated with
  94     a <emphasis>record schema</emphasis> which lends context to the data
  95     elements of the record.
  96     Any number of record schemas can coexist in the system.
  97     Although it may be wise to use only a single schema within
  98     one database, the system imposes no such restriction.
  99    </para>
 100    <para>
 101     The Zebra indexer and information retrieval server consists of the
 102     following main applications: the <literal>zebraidx</literal>
 103     index maintenance utility, and the <literal>zebrasrv</literal>
 104     information query and retrieval server. Both use some of the
 105     same main components, which are presented here.
 106    </para>
 107    <para>
 108     The virtual package <literal>idzebra1.4</literal> installs all
 109     the necessary packages to start working with Zebra, including
 110     utility programs, development libraries, documentation and
 111     modules.
 112   </para>
 113
 114    <sect2 id="componentcore">
 115     <title>Core Zebra Module Containing Common Functionality</title>
 116     <para>
 117      The core Zebra module loads the external filter modules used for
 118      presenting the records in a search response, executes search requests
 119      in PQF/RPN that are handed over from the YAZ server frontend API,
 120      and calls resorting and reranking algorithms on the hit sets.
 121      It returns result sets (possibly ranked), hit counts, and
 122      similar internal data to the YAZ server backend
 123      API.
 124     </para>
 125     <para>
 126      The <literal>libidzebra1.4</literal> package contains all run-time
 127      libraries for Zebra.
 128      The <literal>idzebra1.4-doc</literal> package includes documentation
 129      for Zebra in PDF and HTML.
 130      The <literal>idzebra1.4-common</literal> package includes common
 131      essential Zebra configuration files.
 132     </para>
 133    </sect2>
 134
 135
 136    <sect2 id="componentindexer">
 137     <title>Zebra Indexer</title>
 138     <para>
 139      The core Zebra indexer
 140      loads the external filter modules used for indexing data records of
 141      different types, and
 142      creates, updates and drops databases and indexes.
 143     </para>
 144     <para>
 145      The <literal>idzebra1.4-utils</literal> package contains Zebra
 146      utilities such as the <literal>zebraidx</literal> indexer
 147      utility and the <literal>zebrasrv</literal> server.
 148     </para>
 149    </sect2>
 150
 151    <sect2 id="componentsearcher">
 152     <title>Zebra Searcher/Retriever</title>
 153     <para>
 154      the core Zebra searcher/retriever which
 155     </para>
 156     <para>
 157      The <literal>idzebra1.4-utils</literal> package contains Zebra
 158      utilities such as the <literal>zebraidx</literal> indexer utility and
 159      the <literal>zebrasrv</literal> server, and their associated man pages.
 160     </para>
 161    </sect2>
 162
 163    <sect2 id="componentyazserver">
 164     <title>YAZ Server Frontend</title>
 165     <para>
 166      The YAZ server frontend is
 167      a full-fledged stateful Z39.50 server that accepts client
 168      connections and forwards search and scan requests to the
 169      Zebra core.
 170     </para>
 171     <para>
 172      In addition to Z39.50 requests, the YAZ server frontend acts
 173      as an HTTP server, honouring
 174       <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> SOAP requests and <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink> REST requests. Moreover, it can
 175      translate incoming <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> queries to PQF/RPN queries, if
 176      correctly configured.
 177     </para>
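    <para>
     As a hedged sketch of what "correctly configured" can look like, the
     frontend reads an XML configuration file that declares listening
     ports and virtual servers, including the path to the server-side CQL
     to PQF mapping file. The element names below follow the YAZ frontend
     server virtual hosts configuration as best understood here and should
     be checked against the YAZ documentation. The port
     <literal>9999</literal>, the server id, and the file names
     <literal>zebra.cfg</literal> and <literal>cql2pqf.txt</literal> are
     illustrative assumptions.
    </para>
    <programlisting><![CDATA[
<yazgfs>
 <!-- one listening port, referenced by the virtual server below -->
 <listen id="public">tcp:@:9999</listen>
 <server id="zebra" listenref="public">
  <!-- other paths are resolved relative to this directory -->
  <directory>.</directory>
  <!-- core Zebra configuration file -->
  <config>zebra.cfg</config>
  <!-- mapping file used for server-side CQL to PQF/RPN conversion -->
  <cql2rpn>cql2pqf.txt</cql2rpn>
 </server>
</yazgfs>
]]></programlisting>
    <para>
     Assuming the file is saved as <literal>yazserver.xml</literal>, it
     would typically be passed to <literal>zebrasrv</literal> with the
     <literal>-f</literal> option.
    </para>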
 178     <para>
 179     YAZ is a toolkit that allows you to develop software using the
 180     ANSI Z39.50/ISO 23950 standard for information retrieval, as well as
 181      <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/<ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>.
 182     It is shipped in the <literal>libyaz</literal> package, which
 183     provides the <literal>libyaz.so</literal> and
 184     <literal>libyazthread.so</literal> libraries.
 185     </para>
 186    </sect2>
 187
 188    <sect2 id="componentmodules">
 189     <title>Record Models and Filter Modules</title>
 190     <para>
 191       The filter modules perform indexing and record display filtering.
 192       The virtual package <literal>libidzebra1.4-modules</literal>
 193       contains all base IDZebra filter modules. <!-- EMPTY ??? -->
 194     </para>
 195
 196    <sect3 id="componentmodulestext">
 197     <title>TEXT Record Model and Filter Module</title>
 198     <para>
 199       Plain ASCII text filter
 200      <!--
 201      <literal>text module missing as deb file<literal>
 202      -->
 203     </para>
 204    </sect3>
 205
 206    <sect3 id="componentmodulesgrs">
 207     <title>GRS Record Model and Filter Modules</title>
 208     <para>
 209      The GRS filters come in various kinds and are configured through
 210      <literal>*.abs</literal> files. See
 211      <xref linkend="grs-record-model"/> for the record model itself.
 212     </para>
 213     <itemizedlist>
 214      <listitem><para>
 215       <literal>grs.danbib</literal> parses DanBib records.
 216       DanBib is the Danish Union Catalogue hosted by DBC
 217       (Danish Bibliographic Centre).
 218       Package: <literal>libidzebra1.4-mod-grs-danbib</literal>.
 219      </para></listitem>
 220      <listitem><para>
 221       <literal>grs.marc</literal> and <literal>grs.marcxml</literal>
 222       allow IDZebra to read MARC records based on ISO2709.
 223       Package: <literal>libidzebra1.4-mod-grs-marc</literal>.
 224      </para></listitem>
 225      <listitem><para>
 226       <literal>grs.regx</literal> and <literal>grs.tcl</literal>
 227       (the GRS Tcl scriptable filter).
 228       Package: <literal>libidzebra1.4-mod-grs-regx</literal>.
 229      </para></listitem>
 230      <listitem><para>
 231       <literal>grs.sgml</literal>. The package
 232       <literal>libidzebra1.4-mod-grs-sgml</literal> may not be packaged yet.
 233      </para></listitem>
 234      <listitem><para>
 235       <literal>grs.xml</literal> uses <ulink url="http://expat.sourceforge.net/">Expat</ulink> to
 236       parse records in XML and turn them into IDZebra's internal grs node.
 237       Package: <literal>libidzebra1.4-mod-grs-xml</literal>.
 238      </para></listitem>
 239     </itemizedlist>
 240    </sect3>
 241
 242    <sect3 id="componentmodulesalvis">
 243     <title>ALVIS Record Model and Filter Module</title>
 244      <para>
 245       The experimental Alvis XSLT filter <literal>alvis</literal> is
 246       provided as <literal>mod-alvis.so</literal> in the
 247       <literal>libidzebra1.4-mod-alvis</literal> package.
 248      </para>
 249     </sect3>
 250
 251    <sect3 id="componentmodulessafari">
 252     <title>SAFARI Record Model and Filter Module</title>
 253     <para>
 254      The <literal>safari</literal> filter.
 255      <!--
 256      <literal>safari module missing as deb file<literal>
 257      -->
 258     </para>
 259    </sect3>
 260
 261    </sect2>
 262
 263    <!--
 264    <sect2 id="componentconfig">
 265     <title>Configuration Files</title>
 266     <para>
 267      - yazserver XML based config file
 268      - core Zebra ascii based config files
 269      - filter module config files in many flavours
 270      - <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> to PQF ascii based config file
 271     </para>
 272    </sect2>
 273    -->
 274   </sect1>
 275
 276   <!--
 277
 278
 279   <sect1 id="cqltopqf">
 280    <title>Server Side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> To PQF Conversion</title>
 281    <para>
 282   The cql2pqf.txt yaz-client config file, which is also used in the
 283   yaz-server <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF process, is used to drive
 284   org.z3950.zing.cql.<ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>Node's toPQF() back-end and the YAZ <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF
 285   converter.  This specifies the interpretation of various <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>
 286   indexes, relations, etc. in terms of Type-1 query attributes.
 287
 288   This configuration file generates queries using BIB-1 attributes.
 289   See http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html
 290   for the Maintenance Agency's work-in-progress mapping of Dublin Core
 291   indexes to Attribute Architecture (util, XD and BIB-2)
 292   attributes.
 293
 294   a) <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> set prefixes  are specified using the correct <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>/ <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/U
 295   prefixes for the required index sets, or user-invented prefixes for
 296   special index sets. An index set in <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> is roughly speaking equivalent to a
 297   namespace specifier in XML.
 298
 299   b) The default index set to be used if none is explicitly mentioned
 300
 301   c) Index mapping definitions of the form
 302
 303       index.cql.all  = 1=text
 304
 305   which means that the index "all" from the set "cql" is mapped on the
 306   bib-1 RPN query "@attr 1=text" (where "text" is some existing index
 307   in zebra, see indexing stylesheet)
 308
 309   d) Relation mapping from <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> relations to bib-1 RPN "@attr 2= " stuff
 310
 311   e) Relation modifier mapping from <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> relations to bib-1 RPN "@attr
 312   2= " stuff
 313
 314   f) Position attributes
 315
 316   g) structure attributes
 317
 318   h) truncation attributes
 319
 320   See
 321   http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map for config
 322   file details.
 323
 324
 325    </para>
 326   </sect1>
 327
 328
 329   <sect1 id="ranking">
 330    <title>Static and Dynamic Ranking</title>
 331    <para>
 332       Zebra internally uses inverted indexes to look up term occurrences
 333   in documents. Multiple queries from different indexes can be
 334   combined by the binary boolean operations AND, OR and NOT (which
 335   is in fact a binary AND NOT operation). To ensure fast query
 336   execution, all indexes have to be sorted in the same order.
 337
 338   The indexes are normally sorted according to document ID in
 339   ascending order, and any query which does not invoke a special
 340   re-ranking function will therefore retrieve the result set in document ID
 341   order.
 342
 343   If one defines the
 344
 345     staticrank: 1
 346
 347   directive in the main core Zebra config file, the internal document
 348   keys used for ordering are augmented by a preceding integer, which
 349   contains the static rank of a given document, and the index lists
 350   are ordered
 351     - first by ascending static rank
 352     - then by ascending document ID.
 353
 354   This implies that the default rank "0" is the best rank at the
 355   beginning of the list, and "max int" is the worst static rank.
 356
 357   The "alvis" and the experimental "xslt" filters provide a
 358   directive to fetch static rank information out of the indexed XML
 359   records, thus making _all_ hit sets ordered by ascending static
 360   rank, and for those docs which have the same static rank, ordered
 361   by ascending doc ID.
 362   If one wants to fiddle a little with the static rank order,
 363   one has to invoke additional re-ranking/re-ordering using dynamic
 364   reranking or score functions. These functions return positive
 365   integer scores, where the _highest_ score is best, which means that the
 366   hit sets will be sorted according to _descending_ scores (in contrast
 367   to the index lists, which are sorted according to _ascending_ rank
 368   number and document ID).
 369
 370
 371   Those are defined in the zebra C source files
 372
 373    "rank-1" : zebra/index/rank1.c
 374               default TF/IDF like zebra dynamic ranking
 375    "rank-static" : zebra/index/rankstatic.c
 376               do-nothing dummy static ranking (this is just to prove
 377               that the static rank can be used in dynamic ranking functions)
 378    "zvrank" : zebra/index/zvrank.c
 379               many different dynamic TF/IDF ranking functions
 380
 381    They are enabled in the zebra config file by a directive like:
 382
 383    rank: rank-static
 384
 385    Notice that "rank-1" and "zvrank" do not use the static rank
 386    information in the list keys, and will produce the same ordering
 387    with or without static ranking enabled.
 388
 389    The dummy "rank-static" reranking/scoring function returns just
 390      score = max int - staticrank
 391    in order to preserve the ordering of hit sets with and without its
 392    call.
 393
 394    Obviously, one wants to make a new ranking function, which combines
 395    static and dynamic ranking, which is left as an exercise for the
 396    reader .. (Wray, this is yours ...)
 397
 398
 399    </para>
 400
 401
 402    <para>
 403     yazserver frontend config file
 404
 405   db/yazserver.xml
 406
 407   Setup of listening ports, and virtual zebra servers.
 408   Note path to server-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF config file, and to
 409    <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> explain config section.
 410
 411   The <directory> path is relative to the directory where zebra.init is placed
 412   and is started up. The other paths are relative to <directory>,
 413   which in this case is the same.
 414
 415   see: http://www.indexdata.com/yaz/doc/server.vhosts.tkl
 416
 417    </para>
 418
 419    <para>
 420  Z39.50 searching:
 421
 422   search like this (using client-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF conversion):
 423
 424   yaz-client -q db/cql2pqf.txt localhost:9999
 425   > format xml
 426   > querytype cql2rpn
 427   > f text=(plant and soil)
 428   > s 1
 429   > elements dc
 430   > s 1
 431   > elements index
 432   > s 1
 433   > elements alvis
 434   > s 1
 435   > elements snippet
 436   > s 1
 437
 438
 439   search like this (using server-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF conversion):
 440   (the only difference is "querytype cql" instead of
 441    "querytype cql2rpn" and the call without specifying a local
 442   conversion file)
 443
 444   yaz-client localhost:9999
 445  > format xml
 446   > querytype cql
 447   > f text=(plant and soil)
 448   > s 1
 449   > elements dc
 450   > s 1
 451   > elements index
 452   > s 1
 453   > elements alvis
 454   > s 1
 455   > elements snippet
 456   > s 1
 457
 458   NEW: static relevance ranking - see examples in alvis2index.xsl
 459
 460   > f text = /relevant (plant and soil)
 461   > elem dc
 462   > s 1
 463
 464   > f title = /relevant a
 465   > elem dc
 466   > s 1
 467
 468
 469
 470  <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/U searching
 471  Surf into http://localhost:9999
 472
 473  firefox http://localhost:9999
 474
 475  gives you an explain record. Unfortunately, the data found in the
 476  <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF text file must be added by hand to the explain
 477  section of the yazserver.xml file. Too bad, but this is all extremely
 478  new alpha stuff, and a lot of work has yet to be done ..
 479
 480  Searching via  <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>: surf into the URL (lines broken here - concat on
 481  URL line)
 482
 483  - see number of hits:
 484  http://localhost:9999/?version=1.1&operation=searchRetrieve
 485                        &query=text=(plant%20and%20soil)
 486
 487
 488  - fetch records 5-6 in DC format
 489  http://localhost:9999/?version=1.1&operation=searchRetrieve
 490                        &query=text=(plant%20and%20soil)
 491                        &startRecord=5&maximumRecords=2&recordSchema=dc
 492
 493
 494  - even search using PQF queries using the extended verb "x-pquery",
 495    which is special to YAZ/Zebra
 496
 497  http://localhost:9999/?version=1.1&operation=searchRetrieve
 498                        &x-pquery=@attr%201=text%20@and%20plant%20soil
 499
 500  More info: read the fine manuals at http://www.loc.gov/z3950/agency/zing/srw/
 502  Search via  <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>:
 503  read the fine manual at
 504  http://www.loc.gov/z3950/agency/zing/srw/
 505
 506
 507 and so on. The list of available indexes is found in db/cql2pqf.txt
 508
 509
 510 7) How do you add to the index attributes of any other type than "w"?
 511 I mean, in the context of making <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> queries. Let's say I want a date
 512 attribute in there, so that one could do date > 20050101 in <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>.
 513
 514 Currently for example 'date-modified' is of type 'w'.
 515
 516 The 2-seconds-of-thought solution:
 517
 518      in alvis2index.xsl:
 519
 520   <z:index name="date-modified" type="d">
 521       <xsl:value-of
 522            select="acquisition/acquisitionData/modifiedDate"/>
 523     </z:index>
 524
 525 But here's the catch...doesn't the use of the 'd' type require
 526 structure type 'date' (@attr 4=5) in PQF? But then...how does that
 527 reflect in the <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>->RPN/PQF mapping - does it really work if I just
 528 change the type of an element in alvis2index.xsl? I would think not...?
 529
 530
 531
 532
 533               Kimmo
 534
 535
 536 Either do:
 537
 538    f @attr 4=5 @attr 1=date-modified 20050713
 539
 540 or do
 541
 542
 549 querytype cql
 550
 551  f date-modified=20050713
 552
 553  f date-modified=20050713
 554
 555  Search ERROR 121 4 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @att
 556 r 4=1 @attr 2=3 @attr "1=date-modified" 20050713
 557
 558
 559
 560  f date-modified eq 20050713
 561
 562 Search OK 23 3 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=5
 563  @attr 2=3 @attr "1=date-modified" 20050713
 564
 565
 566    </para>
 567
 568    <para>
 569 E) EXTENDED SERVICE LIVE UPDATES
 570
 571 The extended services are not enabled by default in zebra - due to the
 572 fact that they modify the system.
 573
 574 In order to allow anybody to update, use
 575 perm.anonymous: rw
 576 in zebra.cfg.
 577
 578 Or, even better, allow only updates for a particular admin user. For
 579 user 'admin', you could use:
 580 perm.admin: rw
 581 passwd: passwordfile
 582
 583 And in passwordfile, specify users and passwords ..
 584 admin:secret
 585
 586 We can now start a yaz-client admin session and create a database:
 587
 588 $ yaz-client localhost:9999 -u admin/secret
 589 Authentication set to Open (admin/secret)
 590 Connecting...OK.
 591 Sent initrequest.
 592 Connection accepted by v3 target.
 593 ID     : 81
 594 Name   : Zebra Information Server/GFS/YAZ
 595 Version: Zebra 1.4.0/1.63/2.1.9
 596 Options: search present delSet triggerResourceCtrl scan sort
 597 extendedServices namedResultSets
 598 Elapsed: 0.007046
 599 Z> adm-create
 600 Admin request
 601 Got extended services response
 602 Status: done
 603 Elapsed: 0.045009
 604 :
 605 Now Default was created..  We can now insert an XML file (esdd0006.grs
 606 from example/gils/records) and index it:
 607
 608 Z> update insert 1 esdd0006.grs
 609 Got extended services response
 610 Status: done
 611 Elapsed: 0.438016
 612
 613 The 3rd parameter.. 1 here .. is the opaque record id from Ext update.
 614 It is a record ID that _we_ assign to the record in question. If we do not
 615 assign one, the usual rules for match apply (recordId: from zebra.cfg).
 616
 617 Actually, we should have a way to specify "no opaque record id" for
 618 yaz-client's update command.. We'll fix that.
 619
 620 Elapsed: 0.438016
 621 Z> f utah
 622 Sent searchRequest.
 623 Received SearchResponse.
 624 Search was a success.
 625 Number of hits: 1, setno 1
 626 SearchResult-1: term=utah cnt=1
 627 records returned: 0
 628 Elapsed: 0.014179
 629
 630 Let's delete the beast:
 631 Z> update delete 1
 632 No last record (update ignored)
 633 Z> update delete 1 esdd0006.grs
 634 Got extended services response
 635 Status: done
 636 Elapsed: 0.072441
 637 Z> f utah
 638 Sent searchRequest.
 639 Received SearchResponse.
 640 Search was a success.
 641 Number of hits: 0, setno 2
 642 SearchResult-1: term=utah cnt=0
 643 records returned: 0
 644 Elapsed: 0.013610
 645
 646 If shadow register is enabled you must run the adm-commit command in
 647 order to write your changes..
 648
 649    </para>
 650
 651
 652
 653   </sect1>
 654 -->
 655
 656  </chapter>
 657
 658  <!-- Keep this comment at the end of the file
 659  Local variables:
 660  mode: sgml
 661  sgml-omittag:t
 662  sgml-shorttag:t
 663  sgml-minimize-attributes:nil
 664  sgml-always-quote-attributes:t
 665  sgml-indent-step:1
 666  sgml-indent-data:t
 667  sgml-parent-document: "zebra.xml"
 668  sgml-local-catalogs: nil
 669  sgml-namecase-general:t
 670  End:
 671  -->