doc/introduction.xml

   1 <chapter id="introduction">
   2  <!-- $Id: introduction.xml,v 1.12 2002-08-30 01:17:10 mike Exp $ -->
   3  <title>Introduction</title>
   4
   5  <sect1>
   6   <title>Overview</title>
   7
   8   <para>
   9    <ulink url="http://www.indexdata.dk/zebra/">
  10      Zebra</ulink>
  11    is a high-performance, general-purpose structured text
  12    indexing and retrieval engine. It reads structured records in a
  13    variety of input formats (e.g. email, XML, MARC) and provides access
  14    to them through a powerful combination of boolean search
  15    expressions and relevance-ranked free-text queries.
  16   </para>
  17
  18   <para>
  19    Zebra supports large databases (tens of millions of records,
  20    tens of gigabytes of data). It allows safe, incremental
  21    database updates on live systems. Because Zebra supports
  22    the industry-standard information retrieval protocol, Z39.50,
  23    you can search Zebra databases using an enormous variety of
  24    programs and toolkits, both commercial and free, which understand
  25    this protocol.  Application libraries are available to allow
  26    bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
  27    Basic, Python, PHP and more - see
  28    <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
  29    for more information on some of these client toolkits.
  30   </para>
  31
  32   <para>
  33    This document is an introduction to the Zebra system. It explains
  34    how to compile the software, how to prepare your first database,
  35    and how to configure the server to give you the
  36    functionality that you need.
  37   </para>
  38
  39   <para>
  40    If you use Zebra, you should visit its
  41    <ulink url="http://www.indexdata.dk/zebra/">web site</ulink>,
  42    where you can join the
  43    <ulink url="http://www.indexdata.dk/mailman/listinfo/zebralist">
  44    mailing-list</ulink>
  45    by sending email to
  46    <email>zebra-subscribe@mailman.indexdata.dk</email>.
  47   </para>
  48
  49  </sect1>
  50
  51  <sect1 id="features">
  52   <title>Features</title>
  53
  54   <para>
  55    This is an overview of some of Zebra's most important features:
  56   </para>
  57
  58   <para>
  59    <itemizedlist>
  60
  61     <listitem>
  62      <para>
  63       Very large databases: files for indexes, etc. can be
  64       automatically partitioned over multiple disks.
  65      </para>
  66     </listitem>
  67
  68     <listitem>
  69      <para>
  70       Arbitrarily complex records.  The internal data format
  71       is a structured format conceptually similar to XML or GRS-1,
  72       which allows nested structured data elements and
  73       variant forms of data.
  74      </para>
  75     </listitem>
  76
  77     <listitem>
  78      <para>
  79       Robust updating: records can be added and deleted "on the fly"
  80       without rebuilding the index from scratch.
  81       Records can be safely updated even while users are accessing
  82       the server.
  83       The update procedure is tolerant of crashes or hard interrupts
  84       during database updating: data can be reconstructed following
  85       a crash.
  86      </para>
  87     </listitem>
  88
  89     <listitem>
  90      <para>
  91       Configurable to understand many input formats.
  92       A system of input filters driven by
  93       regular expressions allows you to easily process most ASCII-based
  94       data formats. SGML, XML, ISO2709 (MARC), and raw text are also
  95       supported.
  96      </para>
  97     </listitem>
  98
  99     <listitem>
 100      <para>
 101       Searching supports a powerful combination of boolean queries as
 102       well as relevance-ranking (free-text) queries.  Truncation,
 103       masking, full regular expression matching and "approximate
 104       matching" (e.g. spelling mistakes) are all supported.
 105      </para>
 106     </listitem>
 107
 108     <listitem>
 109       <para>
 110         Index-only databases: data can be, and usually is, imported
 111         into Zebra's own storage, but Zebra can also refer to
 112         external files, building and maintaining indexes of "live"
 113         collections.
 114       </para>
 115     </listitem>
 116
 117     <listitem>
 118      <para>
 119       Zebra is written in portable C, so it runs on most Unix-like systems
 120       as well as Windows NT.  A binary distribution for Windows NT is
 121       available.
 122      </para>
 123     </listitem>
 124
 125    </itemizedlist>
 126
 127   </para>
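
  <para>
   As a rough illustration of how boolean and relevance-ranked
   searching can be combined in a single request, the following
   Python sketch builds a query string in PQF (Prefix Query Format),
   the notation used by YAZ-based Z39.50 clients.  The helper names
   (<literal>term</literal> and <literal>both</literal>) are
   hypothetical and are not part of Zebra.  The attribute numbers are
   taken from the Bib-1 attribute set (use 4 = title, use 1016 = any,
   relation 102 = relevance, truncation 1 = right truncation);
   whether a particular database honours them is an assumption that
   depends on its configuration.
  </para>

  <programlisting>
```python
# Minimal sketch: compose PQF (Prefix Query Format) query strings.
# The function names are hypothetical helpers, not a Zebra or YAZ API.


def term(text, use=None, relation=None, truncation=None):
    """Return one PQF term, prefixed by optional Bib-1 attributes."""
    # Assumption: keep quoting simple by refusing awkward characters.
    if '"' in text or "\\" in text:
        raise ValueError("term must not contain quotes or backslashes")
    parts = []
    if use is not None:
        parts.append(f"@attr 1={use}")         # type 1: use (access point)
    if relation is not None:
        parts.append(f"@attr 2={relation}")    # type 2: relation
    if truncation is not None:
        parts.append(f"@attr 5={truncation}")  # type 5: truncation
    parts.append(f'"{text}"')
    return " ".join(parts)


def both(left, right):
    """Combine two PQF sub-queries with boolean AND (prefix notation)."""
    return f"@and {left} {right}"


# Boolean part: title words starting with "zebra" (right truncation).
# Ranked part: free-text terms anywhere, with relevance ranking requested.
query = both(
    term("zebra", use=4, truncation=1),
    term("indexing retrieval", use=1016, relation=102),
)
print(query)
```
  </programlisting>

  <para>
   The resulting string could then be passed to any Z39.50 client
   that accepts PQF; building it separately from the network code
   keeps the example independent of any particular client toolkit.
  </para>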
 128
 129   <para>
 130    Z39.50 protocol support:
 131   </para>
 132
 133   <para>
 134    <itemizedlist>
 135     <listitem>
 136      <para>
 137       Protocol facilities: Init, Search, Present (retrieval), Delete,
 138       Scan (index browsing) and Sort.
 139      </para>
 140     </listitem>
 141
 142     <listitem>
 143      <para>
 144       Piggy-backed presents are honored in the search-request.
 145      </para>
 146     </listitem>
 147
 148     <listitem>
 149      <para>
 150       Named result sets are supported.
 151      </para>
 152     </listitem>
 153
 154     <listitem>
 155      <para>
 156       Easily configured to support different application profiles, with
 157       tables for attribute sets, tag sets, and abstract syntaxes.
 158       Additional tables control facilities such as element mappings to
 159       different schemas (e.g. GILS-to-USMARC).
 160      </para>
 161     </listitem>
 162
 163     <listitem>
 164      <para>
 165       Complex composition specifications using Espec-1 (partial support).
 166       Element sets are defined using the Espec-1 capability,
 167       and are specified in configuration files as simple element
 168       requests (and, optionally, variant requests).
 169      </para>
 170     </listitem>
 171
 172     <listitem>
 173      <para>
 174       Multiple record syntaxes
 175       for data retrieval: GRS-1, SUTRS,
 176       XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
 177       and schemas on the fly.
 178      </para>
 179     </listitem>
 180
 181    </itemizedlist>
 182
 183   </para>
 184
 185  </sect1>
 186
 187  <sect1 id="apps">
 188   <title>Applications</title>
 189   <para>
 190    Zebra has been deployed in numerous applications, in both the
 191    academic and commercial worlds, in application domains as diverse
 192    as bibliographic catalogues, geospatial information, structured
 193    vocabulary browsing, government information locators, civic
 194    information systems, environmental observations, museum information
 195    and web indexes.
 196   </para>
 197   <para>
 198    Notable applications include the following:
 199   </para>
 200
 201   <sect2>
 202    <title>DADS - the DTV Article Database Service</title>
 203    <para>
 204     DADS is a huge database of more than ten million records, totalling
 205     over ten gigabytes of data.  The records are metadata about academic
 206     journal articles, primarily scientific; about 10% of these
 207     metadata records link to the full text of the articles they
 208     describe, a body of about a terabyte of information (although the
 209     full text is not indexed).
 210    </para>
 211    <para>
 212     It allows students and researchers at DTU (Danmarks Tekniske
 213     Universitet, the Technical University of Denmark) to find and order
 214     articles from multiple databases in a single query.  The database
 215     contains literature on all engineering subjects.  It's available
 216     on-line through a web gateway, though currently only to registered
 217     users.
 218    </para>
 219    <para>
 220     More information can be found at
 221     <ulink url="http://www.dtv.dk/help/dads/index_e.htm"/>.
 222    </para>
 223   </sect2>
 224
 225 <!--
 259 Interesting?
 260 We have developed a natural language interface (NLI-Z39.50) for access
 261 to library databases at the Fernuniversität Hagen, Germany
 262 (http://ki212.fernuni-hagen.de/nli/NLI.html).
 263 To prepare formal information retrieval evaluation,
 264 we chose the Zebra server as the basis for
 265 evaluating retrieval effectiveness (measuring recall
 266 and precision for the GIRT database). The Zebra database
 267 consists of more than 76000 records in SGML format (bibliographic
 268 records from social science), which are mapped to MARC for presentation.
 269 Evaluation will take place as part of the TREC/CLEF campaign 2003
 270 (see http://clef.iei.pi.cnr.it or http://www4.eurospider.ch/CLEF/).
 271
 272
 273 Johannes Leveling        Praktische Informatik VII/KI
 274                          FernUniversität Hagen
 275
 276 Email : Johannes.Leveling@FernUni-Hagen.De
 277 Tel.  : +49 2331 987-4525
 278
 279 -->
 280
 281   <sect2>
 282    <title>Various web indexes</title>
 283    <para>
 284     Zebra has been used by a variety of institutions to construct
 285     indexes of large web sites, typically in the region of tens of
 286     millions of pages.  In this role, it functions somewhat similarly
 287     to the engine of Google or AltaVista, but for a selected intranet
 288     or subset of the whole Web.
 289    </para>
 290    <!--
 291     ### examples, details and numbers, please!
 292    -->
 293   </sect2>
 294  </sect1>
 295
 296  <sect1 id="future">
 297   <title>Future Directions</title>
 298
 299   <para>
 300    These are some of the plans that we have for the software in the near
 301    and far future, ordered approximately as we expect to work on them.
 302   </para>
 303
 304   <para>
 305    <itemizedlist>
 306
 307     <listitem>
 308      <para>
 309        Improved support for XML in search and retrieval. Eventually,
 310        the goal is for Zebra to pull double duty as a flexible
 311        information retrieval engine and high-performance XML
 312        repository.
 313      </para>
 314     </listitem>
 315
 316     <listitem>
 317      <para>
 318        Access to the search engine through a SOAP/RPC API, to allow the
 319        construction of applications without requiring Z39.50 tools.
 320      </para>
 321     </listitem>
 322
 323     <listitem>
 324      <para>
 325        Finalisation and documentation of Zebra's C programming
 326        API, allowing updates, database management and other functions
 327        not readily expressed in Z39.50.  We will also consider
 328        exposing the API through SOAP.
 329      </para>
 330     </listitem>
 331
 332     <listitem>
 333      <para>
 334        Improved free-text searching. We're first and foremost octet jockeys and
 335        we're actively looking for organisations or people who'd like
 336        to contribute experience in relevance ranking and text
 337        searching.
 338      </para>
 339     </listitem>
 340
 341    </itemizedlist>
 342   </para>
 343
 344   <para>
 345    Programmers thrive on user feedback. If you are interested in a
 346    facility that you don't see mentioned here, or if there's something
 347    you think we could do better, please drop us a mail.  Better still,
 348    implement it and send us the patches.
 349   </para>
 350   <para>
 351    If you think it's all really neat, you're welcome to drop us a line
 352    saying that, too. You'll find contact info at the end of this file.
 353   </para>
 354
 355  </sect1>
 356 </chapter>
 357  <!-- Keep this comment at the end of the file
 358  Local variables:
 359  mode: sgml
 360  sgml-omittag:t
 361  sgml-shorttag:t
 362  sgml-minimize-attributes:nil
 363  sgml-always-quote-attributes:t
 364  sgml-indent-step:1
 365  sgml-indent-data:t
 366  sgml-parent-document: "zebra.xml"
 367  sgml-local-catalogs: nil
 368  sgml-namecase-general:t
 369  End:
 370  -->