<chapter id="introduction">
<!-- $Id: introduction.xml,v 1.41 2007-02-01 20:49:05 marc Exp $ -->
<title>Introduction</title>
<section id="overview">
<title>Overview</title>
<ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
is a high-performance, general-purpose structured text
indexing and retrieval engine. It reads records in a
variety of input formats (e.g. email, XML, MARC) and provides access
to them through a powerful combination of boolean search
expressions and relevance-ranked free-text queries.
Zebra supports large databases (tens of millions of records,
tens of gigabytes of data). It allows safe, incremental
database updates on live systems. Because Zebra supports
the industry-standard information retrieval protocol, Z39.50,
you can search Zebra databases using an enormous variety of
programs and toolkits, both commercial and free, which understand
this protocol. Application libraries are available to allow
bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
Basic, Python, PHP and more - see the
<ulink url="&url.zoom;">ZOOM web site</ulink>
for more information on some of these client toolkits.
This document is an introduction to the Zebra system. It explains
how to compile the software, how to prepare your first database,
and how to configure the server to give you the
functionality that you need.
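<para>
As a brief, hedged illustration of the boolean search expressions
mentioned in this overview, the two queries below are written in
Prefix Query Format (PQF), the textual query notation used by Z39.50
toolkits such as YAZ. They assume a database whose indexes answer the
Bib-1 use attributes 4 (title) and 1003 (author); whether a given
Zebra database does so depends on how it is configured. The search
terms are invented for the example.
</para>
<screen>
@and @attr 1=4 computer @attr 1=1003 knuth
@and @attr 1=4 computer @or @attr 1=1003 knuth @attr 1=1003 ritchie
</screen>
<para>
The first query asks for records with <literal>computer</literal> in
the title and <literal>knuth</literal> as an author. The second nests
an <literal>@or</literal> inside the <literal>@and</literal>, so it
matches either author. Because each operator precedes its two
operands, queries of any depth can be written without parentheses.
</para>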
<section id="features">
<title>Zebra Features Overview</title>
<table id="table-features-overview" frame="top">
<title>Zebra Features Overview</title>
<entry>Feature</entry>
<entry>Availability</entry>
<entry>Reference</entry>
<entry>Boolean query language</entry>
<entry>CQL and RPN/PQF</entry>
<entry>The type-1 Reverse Polish Notation (RPN)
and its textual representation Prefix Query Format (PQF) are
supported. The Common Query Language (CQL) can be configured as
a mapping from CQL to RPN/PQF.</entry>
<entry><xref linkend="querymodel-query-languages-pqf"/>
<xref linkend="querymodel-cql-to-pqf"/></entry>
<entry>Operation types</entry>
<entry>Z39.50/SRU explain, search, and scan</entry>
<entry><xref linkend="querymodel-operation-types"/></entry>
<entry>Recursive boolean query tree</entry>
<entry>CQL and RPN/PQF</entry>
<entry>Both CQL and RPN/PQF allow atomic query parts (APT) to
be combined into complex boolean query trees.</entry>
<entry><xref linkend="querymodel-rpn-tree"/></entry>
<entry>Large databases</entry>
<entry>64-bit file pointers ensure that register files can grow beyond
the 2 GB limit. Logical files can be
automatically partitioned over multiple disks, thus allowing for
large databases.</entry>
<entry><xref linkend=""/></entry>
<entry>Complex semi-structured Documents</entry>
<entry>XML and GRS-1 Documents</entry>
<entry>Both XML and GRS-1 documents exhibit a DOM-like internal
representation allowing for complex indexing and display rules.</entry>
<entry><xref linkend=""/></entry>
<entry>Database updates</entry>
<entry>live, incremental updates</entry>
<entry>Robust updating - records can be added and deleted ``on the fly''
without rebuilding the index from scratch.
Records can be safely updated even while users are accessing
The update procedure is tolerant of crashes or hard interrupts
during database updating - data can be reconstructed following
<entry><xref linkend=""/></entry>
<entry>Input document formats</entry>
<entry>XML, SGML, Text, ISO2709 (MARC)</entry>
A system of input filters driven by
regular expressions allows most ASCII-based
data formats to be easily processed.
SGML, XML, ISO2709 (MARC), and raw text are also
<entry><xref linkend=""/></entry>
<entry>Relevance ranking</entry>
<entry>TF-IDF-like</entry>
<entry>Relevance ranking of free-text queries is supported
using a TF-IDF-like algorithm.</entry>
<entry><xref linkend=""/></entry>
<entry>Document storage</entry>
<entry>Index-only, Key storage, Document storage</entry>
<entry>Data can be, and usually is, imported
into Zebra's own storage, but Zebra can also refer to
external files, building and maintaining indexes of "live"
<entry><xref linkend=""/></entry>
<entry>Regular expression matching</entry>
<entry>Regexp</entry>
<entry>Full regular expression matching and "approximate
matching" (e.g. spelling mistake corrections) are handled.</entry>
<entry><xref linkend=""/></entry>
<entry>Search truncation</entry>
<entry><xref linkend=""/></entry>
<entry>Remote update</entry>
<entry>Z39.50 extended services</entry>
<entry><xref linkend=""/></entry>
<entry>Supported Platforms</entry>
<entry>UNIX, Linux, Windows (NT/2000/2003/XP)</entry>
<entry>Zebra is written in portable C, so it runs on most
Unix-like systems as well as Windows (NT/2000/2003/XP). Binary
available for Debian GNU/Linux and Windows</entry>
<entry><xref linkend=""/></entry>
<entry>Z39.50</entry>
<entry>Z39.50 protocol support</entry>
<entry>Protocol facilities: Init, Search, Present (retrieval),
Segmentation (support for very large records), Delete, Scan
(index browsing), Sort, Close and support for the ``update''
Extended Service to add or replace an existing XML
record. Piggy-backed presents are honored in the search
request. Named result sets are supported.</entry>
<entry><xref linkend=""/></entry>
<entry>Record Syntaxes</entry>
<entry>Multiple record syntaxes
for data retrieval: GRS-1, SUTRS,
XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
and schemas on the fly.</entry>
<entry><xref linkend=""/></entry>
<entry>Web Service support</entry>
<entry>SRU GET/POST/SOAP</entry>
<entry>The protocol operations <literal>explain</literal>,
<literal>searchRetrieve</literal> and <literal>scan</literal>
are supported. <ulink url="&url.cql;">CQL</ulink> to internal
query model RPN conversion is supported. Extended RPN queries
for search/retrieve and scan are supported.</entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<entry><xref linkend=""/></entry>
<section id="introduction-apps">
<title>References and Zebra-based Applications</title>
Zebra has been deployed in numerous applications, in both the
academic and commercial worlds, in application domains as diverse
as bibliographic catalogues, geospatial information, structured
vocabulary browsing, government information locators, civic
information systems, environmental observations, museum information
Notable applications include the following:
<section id="koha-ils">
<title>Koha free open-source ILS</title>
<ulink url="http://www.koha.org/">Koha</ulink> is a full-featured
open-source ILS, initially developed in
New Zealand by Katipo Communications Ltd, and first deployed in
January of 2000 for Horowhenua Library Trust. It is currently
maintained by a team of software providers and library technology
staff from around the globe.
<ulink url="http://liblime.com/">LibLime</ulink>,
a company that markets and supports Koha, has added the Zebra
database server to the new Koha 3.0 release
to drive its bibliographic database.
In early 2005, the Koha project development team began looking at
ways to improve MARC support and overcome scalability limitations
in the Koha 2.x series. After extensive evaluations of the best
of the Open Source textual database engines - including MySQL
full-text searching, PostgreSQL, Lucene and Plucene - the team
"Zebra completely eliminates scalability limitations, because it
can support tens of millions of records," explained Joshua
Ferraro, LibLime's Technology President and Koha's Project
Release Manager. "Our performance tests showed search results in
under a second for databases with over 5 million records on a
modest i386 900Mhz test server."
"Zebra also includes support for true boolean search expressions
and relevance-ranked free-text queries, both of which the Koha
2.x series lack. Zebra also supports incremental and safe
database updates, which allow on-the-fly record
management. Finally, since Zebra has at its heart the Z39.50
protocol, it greatly improves Koha's support for that critical
Although the bibliographic database will be moved to Zebra, Koha
3.0 will continue to use a relational SQL-based database design
for the 'factual' database. "Relational database managers have
their strengths, in spite of their inability to handle large
numbers of bibliographic records efficiently," summed up Ferraro.
"We're taking the best from both worlds in our redesigned Koha
See also LibLime's newsletter article
<ulink url="http://www.liblime.com/newsletter/2006/01/features/koha-earns-its-stripes/">
Koha Earns its Stripes</ulink>.
<section id="emilda-ils">
<title>Emilda open source ILS</title>
<ulink url="http://www.emilda.org/">Emilda</ulink>
is a complete Integrated Library System, released under the
GNU General Public License. It has a
full-featured Web-OPAC, allowing comprehensive system management
from virtually any computer with an Internet connection; has a
template-based layout, allowing anyone to alter the visual
appearance of Emilda; and uses an
XML-based language for fast and easy portability to virtually any
Currently, Emilda is used at three schools in Espoo, Finland.
As a bonus, 100% MARC compatibility has been achieved using the
Zebra Server from Index Data as the backend server.
<section id="reindex-ils">
<title>ReIndex.Net web-based ILS</title>
<ulink url="http://www.reindex.net/index.php?lang=en">Reindex.net</ulink>
is a net-based library service offering all the
traditional functions at a very high level, plus many new
services. Reindex.net is a comprehensive and powerful web system
based on standards such as XML and Z39.50.
updates. Reindex supports MARC21, danMARC or Dublin Core with
Reindex.net runs on Debian GNU/Linux with Zebra and Simpleserver
Data for bibliographic data. The relational database system
Sybase 9 XML is used for
Internally, MARCXML is used for bibliographic records. Updates
use Z39.50 extended services.
<section id="dads-article-database">
<title>DADS - the DTV Article Database
DADS is a huge database of more than ten million records, totalling
over ten gigabytes of data. The records are metadata about academic
journal articles, primarily scientific; about 10% of these
metadata records link to the full text of the articles they
describe, a body of about a terabyte of information (although the
full text is not indexed).
It allows students and researchers at DTU (Danmarks Tekniske
Universitet, the Technical University of Denmark) to find and order
articles from multiple databases in a single query. The database
contains literature on all engineering subjects. It's available
on-line through a web gateway, though currently only to registered
More information can be found at
<ulink url="http://www.dtv.dk/"/> and
<ulink url="http://dads.dtv.dk"/>
<section id="infonet-eprints">
<title>Infonet Eprints</title>
The InfoNet Eprints service from the
<ulink url="http://www.dtv.dk/">
Technical Knowledge Center of Denmark</ulink>
provides access to documents stored in
eprint/preprint servers and institutional research archives around
the world. The service is based on Open Archives Initiative metadata
harvesting of selected scientific archives around the world. These
open archives offer free and unrestricted access to their contents.
Infonet Eprints currently holds 1.4 million records from 16 archives.
The online search facility is found at
<ulink url="http://preprints.cvt.dk"/>.
<section id="alvis-project">
The <ulink url="http://www.alvis.info/alvis/">Alvis</ulink> EU
project, run under the 6th Framework (IST-1-002068-STP),
is building a semantic-based peer-to-peer search engine. A
consortium of eleven partners from six different European
Community countries plus Switzerland and China contributes
expertise in a broad range of specialties including network
topologies, routing algorithms, linguistic analysis and
The Zebra information retrieval indexing machine is used inside
the Alvis framework to
manage huge collections of natural-language-processed and
enhanced XML data, coming from a topic-relevant web crawl.
In this application, Zebra swallows and manages 37 GB of XML data
in about 4 hours, resulting in search times of fractions of
<title>ULS (Union List of Serials)</title>
has created a union catalogue for the periodicals of the
twenty-one constituent libraries of the University of London and
the University of Westminster
(<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
They have achieved this using an
unusual architecture, which they describe as a
``non-distributed virtual union catalogue''.
The member libraries send in data files representing their
periodicals, including both brief bibliographic data and summary
holdings. Then 21 individual Z39.50 targets are created, each
using Zebra, and all mounted on a single hardware server.
The live service provides a web gateway allowing Z39.50 searching
of all of the targets or a selection of them. Zebra's small
footprint allows a relatively modest system to comfortably host
More information can be found at
<ulink url="http://www.m25lib.ac.uk/ULS/"/>
<title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
Fernuniversität Hagen in Germany have developed a natural
language interface for access to library databases.
<!-- <ulink url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/> -->
In order to evaluate this interface for recall and precision, they
chose Zebra as the basis for retrieval effectiveness. The Zebra
server contains a copy of the GIRT database, consisting of more
than 76000 records in SGML format (bibliographic records from
social science), which are mapped to MARC for presentation.
(GIRT is the German Indexing and Retrieval Testdatabase. It is a
standard German-language test database for intelligent indexing
and retrieval systems. See
<ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>.)
Evaluation will take place as part of the TREC/CLEF campaign 2003
<ulink url="http://clef.iei.pi.cnr.it"/>.
<!-- or <ulink url="http://www4.eurospider.ch/CLEF/"/> -->
For more information, contact Johannes Leveling
<email>Johannes.Leveling@FernUni-Hagen.De</email>
<section id="various-web-indexes">
<title>Various web indexes</title>
Zebra has been used by a variety of institutions to construct
indexes of large web sites, typically in the region of tens of
millions of pages. In this role, it functions somewhat similarly
to the engine of Google or AltaVista, but for a selected intranet
or a subset of the whole Web.
For example, Liverpool University's web-search facility (see on
<ulink url="http://www.liv.ac.uk/"/>
and many sub-pages) works by relevance-searching a Zebra database
which is populated by the Harvest-NG web-crawling software.
For more information on Liverpool University's intranet search
architecture, contact John Gilbertson
<email>jgilbert@liverpool.ac.uk</email>
has recently modified the Harvest web indexer to use Zebra as
its native repository engine. His comments on the switch over
from the old engine are revealing:
The first results after some testing with Zebra are very
promising. The tests were done with around 220,000 SOIF files,
which occupies 1.6GB of disk space.
Building the index from scratch takes around one hour with Zebra
where [old-engine] needs around five hours. While [old-engine]
blocks search requests when updating its index, Zebra can still
answer search requests.
Zebra supports incremental indexing which will speed up indexing
While the search time of [old-engine] varies from some seconds
to some minutes depending how expensive the query is, Zebra
usually takes around one to three seconds, even for expensive
Zebra can search more than 100 times faster than [old-engine]
and can process multiple search requests simultaneously
I am very happy to see such nice software available under GPL.
<section id="introduction-support">
<title>Support</title>
You can get support for Zebra from at least three sources.
First, there's the Zebra web site at
<ulink url="&url.idzebra;"/>,
which always has the most recent version available for download.
If you have a problem with Zebra, the first thing to do is see
whether it's fixed in the current release.
Second, there's the Zebra mailing list. Its home page at
<ulink url="&url.idzebra.mailinglist;"/>
includes a complete archive of all messages that have ever been
posted on the list. The Zebra mailing list is used both for
announcements from the authors (new
releases, bug fixes, etc.) and general discussion. You are welcome
to seek support there. Join by filling in the form on the list home page.
Third, it's possible to buy a commercial support contract, with
well-defined service levels and response times, from Index Data.
<ulink url="&url.indexdata.support;"/>
<section id="future">
<title>Future Directions</title>
These are some of the plans that we have for the software in the near
and far future, ordered approximately as we expect to work on them.
Improved support for XML in search and retrieval. Eventually,
the goal is for Zebra to pull double duty as a flexible
information retrieval engine and high-performance XML
repository. The recent addition of XPath searching is one
example of the kind of enhancement we're working on.
There is also the experimental <literal>ALVIS XSLT</literal>
XML input filter, which unleashes the full power of DOM-based
XSLT transformations during indexing and record retrieval. Work
on this filter has been sponsored by the ALVIS EU project
<ulink url="http://www.alvis.info/alvis/"/>. We expect this filter to
mature soon, as it is planned to be included in the version 2.0
Finalisation and documentation of Zebra's C programming
API, allowing updates, database management and other functions
not readily expressed in Z39.50. We will also consider
exposing the API through SOAP.
Improved free-text searching. We're first and foremost octet jockeys and
we're actively looking for organisations or people who'd like
to contribute experience in relevance ranking and text
Programmers thrive on user feedback. If you are interested in a
facility that you don't see mentioned here, or if there's something
you think we could do better, please drop us a mail. Better still,
implement it and send us the patches.
If you think it's all really neat, you're welcome to drop us a line
saying that, too. You can email us at
<email>info@indexdata.dk</email>
or check the contact info at the end of this manual.
650 <!-- Keep this comment at the end of the file
655 sgml-minimize-attributes:nil
656 sgml-always-quote-attributes:t
659 sgml-parent-document: "zebra.xml"
660 sgml-local-catalogs: nil
661 sgml-namecase-general:t