<chapter id="introduction">
<!-- $Id: introduction.xml,v 1.33 2006-06-13 13:45:08 marc Exp $ -->
<title>Introduction</title>
<title>Overview</title>
<ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
is a high-performance, general-purpose structured text
indexing and retrieval engine. It reads records in a
variety of input formats (e.g. email, XML, MARC) and provides access
to them through a powerful combination of boolean search
expressions and relevance-ranked free-text queries.
Zebra supports large databases (tens of millions of records,
tens of gigabytes of data). It allows safe, incremental
database updates on live systems. Because Zebra supports
the industry-standard information retrieval protocol, Z39.50,
you can search Zebra databases using an enormous variety of
programs and toolkits, both commercial and free, which understand
this protocol. Application libraries are available to allow
bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
Basic, Python, PHP and more - see
<ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
for more information on some of these client toolkits.
This document is an introduction to the Zebra system. It explains
how to compile the software, how to prepare your first database,
and how to configure the server to give you the
functionality that you need.
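As a rough illustration of how boolean expressions and relevance-ranked terms can be combined in one request, the sketch below composes a query string in the Prefix Query Format (PQF) used by the YAZ toolkit. The helper names (term, both) are hypothetical and exist only for this example. The attribute numbers are taken from the Bib-1 attribute set (use 4 = title, use 1003 = author, relation 102 = relevance, truncation 1 = right truncation). Whether a given Zebra database honours each attribute depends on its configuration, so treat this as a sketch rather than a recipe.

```python
# Hypothetical helpers that only build PQF strings; nothing here contacts a server.

TITLE = 4             # Bib-1 use attribute (type 1): title
AUTHOR = 1003         # Bib-1 use attribute (type 1): author
RELEVANCE = 102       # Bib-1 relation attribute (type 2): relevance
RIGHT_TRUNCATION = 1  # Bib-1 truncation attribute (type 5): right truncation


def term(text, use=None, relation=None, truncation=None):
    """Return one PQF operand, prefixed by any requested attributes."""
    if '"' in text:
        # Quoting rules for embedded quotes are left out of this sketch.
        raise ValueError("terms containing double quotes are not handled here")
    parts = []
    if use is not None:
        parts.append("@attr 1=%d" % use)
    if relation is not None:
        parts.append("@attr 2=%d" % relation)
    if truncation is not None:
        parts.append("@attr 5=%d" % truncation)
    parts.append('"%s"' % text)
    return " ".join(parts)


def both(left, right):
    """Combine two PQF operands with boolean AND (prefix notation)."""
    return "@and %s %s" % (left, right)


# A relevance-ranked title search combined with a right-truncated author search.
query = both(
    term("information retrieval", use=TITLE, relation=RELEVANCE),
    term("smit", use=AUTHOR, truncation=RIGHT_TRUNCATION),
)
print(query)
```

A string built this way can be tried out with the find command of the YAZ command-line client, which accepts PQF.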
<title>Features</title>
This is an overview of some of Zebra's most important features:
Very large databases: logical files can be
automatically partitioned over multiple disks.
Arbitrarily complex records. The internal data format
is a structured format conceptually similar to XML or GRS-1,
which allows lists, nested structured data elements and
variant forms of data.
Robust updating - records can be added and deleted "on the fly"
without rebuilding the index from scratch.
Records can be safely updated even while users are accessing
The update procedure is tolerant to crashes or hard interrupts
during database updating - data can be reconstructed following
Configurable to understand many input formats.
A system of input filters driven by
regular expressions allows most ASCII-based
data formats to be easily processed.
SGML, XML, ISO2709 (MARC), and raw text are also
Searching supports a powerful combination of boolean queries as
well as relevance-ranking (free-text) queries. Truncation,
masking, full regular expression matching and "approximate
matching" (e.g. spelling mistakes) are all handled.
Index-only databases: data can be, and usually is, imported
into Zebra's own storage, but Zebra can also refer to
external files, building and maintaining indexes of "live"
Zebra is written in portable C, so it runs on most Unix-like systems
as well as Windows NT. A binary distribution for Windows NT is
available at
<ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>,
and pre-built packages are available for Red Hat at
<ulink url="http://ftp.indexdata.dk/pub/zebra/RedHat7.X/"/>
and for
<literal>Debian GNU/Linux</literal> at
<ulink url="http://ftp.indexdata.dk/pub/zebra/debian/"/>.
Z39.50 protocol support:
Protocol facilities: Init, Search, Present (retrieval),
Segmentation (support for very large records), Delete, Scan
(index browsing), Sort, Close and support for the "update"
Extended Service to add or replace an existing XML record.
You can insert/delete/replace an XML record given an
"external" ID. This way of doing ES Update was originally
meant for an OAI application that Ian Ibbotson had in
mind to implement. The "update" command in the YAZ client
implements this on the client side. The plan is to make
this available in ZOOM "extended" soon.
Piggy-backed presents are honored in the search request - that
is, a subset of the found records can be returned directly with
a search response, enabling search and retrieval to happen in a
Named result sets are supported.
Easily configured to support different application profiles, with
tables for attribute sets, tag sets, and abstract syntaxes.
Additional tables control facilities such as element mappings to
different schemas (e.g. GILS-to-USMARC).
Complex composition specifications using Espec-1 (partial support).
Element sets are defined using the Espec-1 capability,
and are specified in configuration files as simple element
requests (and, optionally, variant requests).
Multiple record syntaxes
for data retrieval: GRS-1, SUTRS,
XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
and schemas on the fly.
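The piggy-back behaviour mentioned in the list above is governed by three parameters of the Z39.50 Search request: smallSetUpperBound, largeSetLowerBound and mediumSetPresentNumber. The function below is a minimal sketch of the decision rule as we read it in the standard, not Zebra's actual code. The function name and argument names are made up for the example.

```python
def piggyback_count(hits, small_set_upper_bound, large_set_lower_bound,
                    medium_set_present_number):
    """Return how many records would accompany the search response."""
    if small_set_upper_bound >= large_set_lower_bound:
        # Assumption: the small-set bound is expected to lie below the large-set bound.
        raise ValueError("small_set_upper_bound must be below large_set_lower_bound")
    if small_set_upper_bound >= hits:
        return hits  # small set: every hit is returned with the response
    if hits >= large_set_lower_bound:
        return 0     # large set: no records are returned
    # Medium set: at most medium_set_present_number records are returned.
    return min(hits, medium_set_present_number)


# Example: 3 hits fall in the small set, 50 in the medium set, 500 in the large set.
for hits in (3, 50, 500):
    print(hits, piggyback_count(hits, 5, 100, 10))
```

With a small-set bound of 0, a large-set bound of 1 and a medium-set number of 0, the function returns 0 for any hit count, which corresponds to a client that does not ask for piggy-backed records at all.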
<title>Applications</title>
Zebra has been deployed in numerous applications, in both the
academic and commercial worlds, in application domains as diverse
as bibliographic catalogues, geospatial information, structured
vocabulary browsing, government information locators, civic
information systems, environmental observations, museum information
Notable applications include the following:
<title>DADS - the DTV Article Database Service</title>
DADS is a huge database of more than ten million records, totalling
over ten gigabytes of data. The records are metadata about academic
journal articles, primarily scientific; about 10% of these
metadata records link to the full text of the articles they
describe, a body of about a terabyte of information (although the
full text is not indexed).
It allows students and researchers at DTU (Danmarks Tekniske
Universitet, the Technical University of Denmark) to find and order
articles from multiple databases in a single query. The database
contains literature on all engineering subjects. It's available
on-line through a web gateway, though currently only to registered
More information can be found at
<ulink url="http://www.dtv.dk/"/> and
<ulink url="http://dads.dtv.dk"/>
<title>Infonet Eprints</title>
The Infonet Eprints service from the
<ulink url="http://www.dtv.dk/">
Technical Knowledge Center of Denmark</ulink>
provides access to documents stored in
eprint/preprint servers and institutional research archives around
the world. The service is based on Open Archives Initiative metadata
harvesting of selected scientific archives. These
open archives offer free and unrestricted access to their contents.
Infonet Eprints currently holds 1.4 million records from 16 archives.
The online search facility is found at
<ulink url="http://preprints.cvt.dk"/>.
<title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
Fernuniversität Hagen in Germany have developed a natural
language interface for access to library databases.
<!-- <ulink url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/> -->
In order to evaluate this interface for recall and precision, they
chose Zebra as the basis for retrieval effectiveness. The Zebra
server contains a copy of the GIRT database, consisting of more
than 76,000 records in SGML format (bibliographic records from
social science), which are mapped to MARC for presentation.
(GIRT is the German Indexing and Retrieval Testdatabase. It is a
standard German-language test database for intelligent indexing
and retrieval systems. See
<ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>.)
Evaluation will take place as part of the TREC/CLEF campaign 2003
<ulink url="http://clef.iei.pi.cnr.it"/>.
<!-- or <ulink url="http://www4.eurospider.ch/CLEF/"/> -->
For more information, contact Johannes Leveling
<email>Johannes.Leveling@FernUni-Hagen.De</email>
<title>ULS (Union List of Serials)</title>
has created a union catalogue for the periodicals of the
twenty-one constituent libraries of the University of London and
the University of Westminster
(<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
They have achieved this using an
unusual architecture, which they describe as a
"non-distributed virtual union catalogue".
The member libraries send in data files representing their
periodicals, including both brief bibliographic data and summary
holdings. Then 21 individual Z39.50 targets are created, each
using Zebra, and all mounted on a single hardware server.
The live service provides a web gateway allowing Z39.50 searching
of all of the targets or a selection of them. Zebra's small
footprint allows a relatively modest system to comfortably host
More information can be found at
<ulink url="http://www.m25lib.ac.uk/ULS/"/>
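The gateway arrangement described above, one front end fanning a search out to all targets or to a chosen subset, can be pictured with a toy model. Everything in the sketch is invented for illustration: the target names, the titles and the holdings strings are placeholders, and plain dictionaries stand in for the Zebra-backed Z39.50 targets.

```python
# Toy stand-ins for per-library targets: each maps a serial title to a holdings summary.
TARGETS = {
    "library-a": {"Nature": "v.1-", "Mind": "v.10-v.40"},
    "library-b": {"Nature": "v.200-"},
    "library-c": {"Brain": "v.1-"},
}


def search_targets(title, selected=None):
    """Search every target, or only the selected ones, and merge the holdings."""
    names = sorted(TARGETS) if selected is None else list(selected)
    holdings = {}
    for name in names:
        summary = TARGETS[name].get(title)
        if summary is not None:
            holdings[name] = summary  # keep track of which library holds the title
    return holdings


print(search_targets("Nature"))
print(search_targets("Nature", selected=["library-b"]))
```

In the real service each lookup would be a Z39.50 search against a separate target. The point of the sketch is only that the union view is assembled at query time rather than stored as one merged catalogue.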
<title>Various web indexes</title>
Zebra has been used by a variety of institutions to construct
indexes of large web sites, typically in the region of tens of
millions of pages. In this role, it functions somewhat similarly
to the engine of Google or AltaVista, but for a selected intranet
or a subset of the whole Web.
For example, Liverpool University's web-search facility (see on
<ulink url="http://www.liv.ac.uk/"/>
and many sub-pages) works by relevance-searching a Zebra database
which is populated by the Harvest-NG web-crawling software.
For more information on Liverpool University's intranet search
architecture, contact John Gilbertson
<email>jgilbert@liverpool.ac.uk</email>
<email>lee@arco.de</email>,
has recently modified the Harvest web indexer to use Zebra as
its native repository engine. His comments on the switch over
from the old engine are revealing:
The first results after some testing with Zebra are very
promising. The tests were done with around 220,000 SOIF files,
which occupies 1.6GB of disk space.
Building the index from scratch takes around one hour with Zebra
where [old-engine] needs around five hours. While [old-engine]
blocks search requests when updating its index, Zebra can still
answer search requests.
Zebra supports incremental indexing which will speed up indexing
While the search time of [old-engine] varies from some seconds
to some minutes depending how expensive the query is, Zebra
usually takes around one to three seconds, even for expensive
Zebra can search more than 100 times faster than [old-engine]
and can process multiple search requests simultaneously
I am very happy to see such nice software available under GPL.
<title>Support</title>
You can get support for Zebra from at least three sources.
First, there's the Zebra web site at
<ulink url="http://indexdata.dk/zebra/"/>,
which always has the most recent version available for download.
If you have a problem with Zebra, the first thing to do is see
whether it's fixed in the current release.
Second, there's the Zebra mailing list. Its home page at
<ulink url="http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist"/>
includes a complete archive of all messages that have ever been
posted on the list. The Zebra mailing list is used both for
announcements from the authors (new
releases, bug fixes, etc.) and general discussion. You are welcome
to seek support there. Join by filling in the form on the list home page.
Third, it's possible to buy a commercial support contract, with
well-defined service levels and response times, from Index Data.
<ulink url="http://indexdata.dk/support/"/>
<title>Future Directions</title>
These are some of the plans that we have for the software in the near
and far future, ordered approximately as we expect to work on them.
Improved support for XML in search and retrieval. Eventually,
the goal is for Zebra to pull double duty as a flexible
information retrieval engine and high-performance XML
repository. The recent addition of XPath searching is one
example of the kind of enhancement we're working on.
There is also the experimental <literal>ALVIS XSLT</literal>
XML input filter, which unleashes the full power of DOM-based
XSLT transformations during indexing and record retrieval. Work
on this filter has been sponsored by the ALVIS EU project
<ulink url="http://www.alvis.info/alvis/"/>. We expect this filter to
mature soon, as it is planned to be included in the version 1.4
Access to the search engine through a SOAP/RPC API to allow the
construction of applications without requiring Z39.50 tools.
This will shortly be available by means of Index Data's
<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>-to-Z39.50 gateway, currently in beta test.
Experimental support for the
Search/Retrieve Via URL (<ulink url="&url.sru;">SRU</ulink>)
<ulink url="&url.sru;"/>
REST web service, and the
Search/Retrieve Web Service (<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>)
<ulink url="http://www.loc.gov/standards/sru/srw/"/>
SOAP web service has recently been added to the YAZ/Zebra
combo - including server-side Common Query Language (<ulink url="&url.cql;">CQL</ulink>)
<ulink url="&url.cql;"/> parsing
and configuration. It remains to find a sponsor for further testing,
documentation and packaging of this exciting component.
Finalisation and documentation of Zebra's C programming
API, allowing updates, database management and other functions
not readily expressed in Z39.50. We will also consider
exposing the API through SOAP.
Support for the use of Perl both for access to the Zebra API
and for building extension "plug-ins" such as input filters.
The code for this has been contributed to the source tree by
<email>pop@technomat.hu</email>,
and is in the process of being integrated and tested.
Improved free-text searching. We're first and foremost octet jockeys and
we're actively looking for organisations or people who'd like
to contribute experience in relevance ranking and text
Programmers thrive on user feedback. If you are interested in a
facility that you don't see mentioned here, or if there's something
you think we could do better, please drop us a mail. Better still,
implement it and send us the patches.
If you think it's all really neat, you're welcome to drop us a line
saying that, too. You can email us at
<email>info@indexdata.dk</email>
or check the contact info at the end of this manual.
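To make the SRU interface listed among the plans above a little more concrete, the sketch below assembles a searchRetrieve URL carrying a CQL query. The parameter names come from the SRU 1.1 specification. The base URL, port and database name are placeholders, and the helper function is hypothetical rather than part of YAZ or Zebra.

```python
from urllib.parse import quote, urlencode


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10):
    """Build an SRU searchRetrieve URL for the given CQL query."""
    params = [
        ("version", "1.1"),
        ("operation", "searchRetrieve"),
        ("query", cql_query),
        ("startRecord", str(start_record)),
        ("maximumRecords", str(maximum_records)),
    ]
    # quote_via=quote percent-encodes spaces instead of turning them into plus signs.
    return base_url + "?" + urlencode(params, quote_via=quote)


# Placeholder server address and database name, chosen only for the example.
url = sru_search_url("http://localhost:9999/Default", 'dc.title = "zebra"')
print(url)
```

Because the request is an ordinary HTTP GET, a URL built this way can be tried from a web browser or any HTTP client, which is the main attraction of SRU for applications that do not want to link against Z39.50 tools.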
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t