1 <chapter id="introduction">
2 <!-- $Id: introduction.xml,v 1.42 2007-02-02 09:58:39 marc Exp $ -->
3 <title>Introduction</title>
5 <section id="overview">
6 <title>Overview</title>
9 &zebra; is a free, fast, friendly information management system. It can
10 index records in XML/SGML, MARC, e-mail archives and many other
11 formats, and quickly find them using a combination of boolean
12 searching and relevance ranking. Search-and-retrieve applications can
13 be written using APIs in a wide variety of languages, communicating
14 with the &zebra; server using industry-standard information-retrieval
15 protocols or web services.
&zebra; is licensed as Open Source, and can be
deployed by anyone for any purpose without license fees. The C source
code is open to anybody to read and change under the GPL.
&zebra; is a networked component which acts as a reliable &z3950; server
for record/document search, presentation, insert, update and
delete operations. In addition, it understands the &sru; family of
web services, which exist in REST GET/POST and true SOAP flavors.
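As a rough illustration of the REST GET flavor, an &sru; searchRetrieve
request is an ordinary URL carrying the query as parameters. The sketch
below is not taken from the &zebra; distribution: the helper name
sru_search_url, the host and port localhost:9999 and the database name
Default are assumptions made for the example, while operation, version,
query and maximumRecords are standard SRU 1.1 request parameters.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve GET URL for a CQL query.

    base_url is the address of the database on the server; the value
    used below is a placeholder, not a real deployment.
    """
    params = {
        "operation": "searchRetrieve",   # SRU operation name
        "version": "1.1",                # SRU protocol version
        "query": cql_query,              # CQL query, URL-encoded below
        "maximumRecords": max_records,   # upper bound on returned records
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server address and database name.
url = sru_search_url("http://localhost:9999/Default", 'title = "water"')
print(url)
```

The resulting URL can be pasted into a browser or fetched with any HTTP
client; the server answers with an XML searchRetrieve response.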
&zebra; is available as a self-extracting package for MS Windows 2003
Server (32 bit) and as precompiled packages for Debian GNU/Linux
(32 bit and 64 bit). It has been deployed successfully on other Unix systems,
including Sun Sparc, HP Unix, and many variants of Linux and BSD
36 <ulink url="http://www.indexdata.com/zebra/">http://www.indexdata.com/zebra/</ulink>
37 <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/">http://ftp.indexdata.dk/pub/zebra/win32/</ulink>
38 <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/">http://ftp.indexdata.dk/pub/zebra/debian/</ulink>
42 <ulink url="http://indexdata.dk/zebra/">&zebra;</ulink>
43 is a high-performance, general-purpose structured text
44 indexing and retrieval engine. It reads records in a
variety of input formats (e.g. email, XML, MARC) and provides access
46 to them through a powerful combination of boolean search
47 expressions and relevance-ranked free-text queries.
51 &zebra; supports large databases (tens of millions of records,
52 tens of gigabytes of data). It allows safe, incremental
53 database updates on live systems. Because &zebra; supports
54 the industry-standard information retrieval protocol, Z39.50,
55 you can search &zebra; databases using an enormous variety of
56 programs and toolkits, both commercial and free, which understand
57 this protocol. Application libraries are available to allow
58 bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
59 Basic, Python, PHP and more - see the
60 <ulink url="&url.zoom;">ZOOM web site</ulink>
61 for more information on some of these client toolkits.
65 This document is an introduction to the &zebra; system. It explains
66 how to compile the software, how to prepare your first database,
67 and how to configure the server to give you the
68 functionality that you need.
72 <section id="features">
73 <title>&zebra; Features Overview</title>
76 <table id="table-features-overview" frame="top">
77 <title>&zebra; Features Overview</title>
81 <entry>Feature</entry>
82 <entry>Availability</entry>
84 <entry>Reference</entry>
89 <entry>Boolean query language</entry>
90 <entry>CQL and RPN/PQF</entry>
<entry>The type-1 Reverse Polish Notation (RPN)
and its textual representation, Prefix Query Format (PQF), are
supported. The Common Query Language (CQL) can be configured as
a mapping from CQL to RPN/PQF.</entry>
95 <entry><xref linkend="querymodel-query-languages-pqf"/>
96 <xref linkend="querymodel-cql-to-pqf"/></entry>
99 <entry>Operation types</entry>
100 <entry> Z39.50/SRU explain, search, and scan</entry>
102 <entry><xref linkend="querymodel-operation-types"/></entry>
105 <entry>Recursive boolean query tree</entry>
106 <entry>CQL and RPN/PQF</entry>
107 <entry>Both CQL and RPN/PQF allow atomic query parts (APT) to
108 be combined into complex boolean query trees</entry>
109 <entry><xref linkend="querymodel-rpn-tree"/></entry>
112 <entry>Large databases</entry>
<entry>64-bit file pointers ensure that register files can exceed
the 2 GB limit. Logical files can be
automatically partitioned over multiple disks, thus allowing for
large databases.</entry>
118 <entry><xref linkend=""/></entry>
121 <entry>Complex semi-structured Documents</entry>
122 <entry>XML and GRS-1 Documents</entry>
<entry>Both XML and GRS-1 documents have a DOM-like internal
representation, allowing for complex indexing and display rules.</entry>
125 <entry><xref linkend=""/></entry>
128 <entry>Database updates</entry>
129 <entry>live, incremental updates</entry>
130 <entry>Robust updating - records can be added and deleted ``on the fly''
131 without rebuilding the index from scratch.
132 Records can be safely updated even while users are accessing
134 The update procedure is tolerant to crashes or hard interrupts
135 during database updating - data can be reconstructed following
137 <entry><xref linkend=""/></entry>
140 <entry>Input document formats</entry>
141 <entry>XML, SGML, Text, ISO2709 (MARC)</entry>
143 A system of input filters driven by
144 regular expressions allows most ASCII-based
145 data formats to be easily processed.
146 SGML, XML, ISO2709 (MARC), and raw text are also
148 <entry><xref linkend=""/></entry>
151 <entry>Relevance ranking</entry>
152 <entry>TF-IDF like</entry>
153 <entry>Relevance-ranking of free-text queries is supported
154 using a TF-IDF like algorithm.</entry>
155 <entry><xref linkend=""/></entry>
158 <entry>Document storage</entry>
159 <entry>Index-only, Key storage, Document storage</entry>
160 <entry>Data can be, and usually is, imported
161 into &zebra;'s own storage, but &zebra; can also refer to
162 external files, building and maintaining indexes of "live"
164 <entry><xref linkend=""/></entry>
167 <entry>Regular expression matching</entry>
168 <entry>Regexp </entry>
<entry>Full regular expression matching and "approximate
matching" (e.g. spelling mistake corrections) are handled.</entry>
171 <entry><xref linkend=""/></entry>
174 <entry>Search truncation</entry>
177 <entry><xref linkend=""/></entry>
180 <entry>Remote update</entry>
181 <entry>Z39.50 extended services</entry>
183 <entry><xref linkend=""/></entry>
186 <entry>Supported Platforms</entry>
187 <entry>UNIX, Linux, Windows (NT/2000/2003/XP)</entry>
188 <entry>&zebra; is written in portable C, so it runs on most
189 Unix-like systems as well as Windows (NT/2000/2003/XP). Binary
available for Debian GNU/Linux and Windows</entry>
192 <entry><xref linkend=""/></entry>
195 <entry>Z39.50</entry>
196 <entry>Z39.50 protocol support</entry>
197 <entry> Protocol facilities: Init, Search, Present (retrieval),
198 Segmentation (support for very large records), Delete, Scan
199 (index browsing), Sort, Close and support for the ``update''
200 Extended Service to add or replace an existing XML
201 record. Piggy-backed presents are honored in the search
202 request. Named result sets are supported.</entry>
203 <entry><xref linkend=""/></entry>
206 <entry>Record Syntaxes</entry>
208 <entry> Multiple record syntaxes
209 for data retrieval: GRS-1, SUTRS,
210 XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
211 and schemas on the fly.</entry>
212 <entry><xref linkend=""/></entry>
215 <entry>Web Service support</entry>
216 <entry>SRU GET/POST/SOAP</entry>
217 <entry> The protocol operations <literal>explain</literal>,
218 <literal>searchRetrieve</literal> and <literal>scan</literal>
219 are supported. <ulink url="&url.cql;">CQL</ulink> to internal
220 query model RPN conversion is supported. Extended RPN queries
221 for search/retrieve and scan are supported.</entry>
222 <entry><xref linkend=""/></entry>
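To illustrate how atomic query parts combine into a recursive boolean
query tree, the following sketch parses a simplified subset of PQF into
nested tuples. It is an illustration only, not &zebra;'s own parser: the
function name parse_pqf and the tuple layout are hypothetical, only the
@and, @or, @not and @attr keywords are handled, and attributes are
assumed to appear directly before a term. The sample query uses the
Bib-1 use attributes 4 (title) and 1003 (author).

```python
def parse_pqf(tokens):
    """Parse a simplified PQF token list into a nested tuple tree.

    Boolean operators are prefix and binary, so each one recursively
    consumes the two sub-queries that follow it. The token list is
    consumed in place.
    """
    tok = tokens.pop(0)
    if tok in ("@and", "@or", "@not"):
        left = parse_pqf(tokens)
        right = parse_pqf(tokens)
        return (tok[1:], left, right)
    # Collect any "@attr type=value" pairs that precede the term.
    attrs = []
    while tok == "@attr":
        attrs.append(tokens.pop(0))
        tok = tokens.pop(0)
    return ("term", tuple(attrs), tok)


query = "@and @attr 1=4 water @or @attr 1=1003 smith @attr 1=1003 jones"
tree = parse_pqf(query.split())
print(tree)
```

The result is a tree whose root is the and-operator, with the title
term on the left and an or-subtree of two author terms on the right.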
286 <section id="introduction-apps">
287 <title>References and &zebra; based Applications</title>
289 &zebra; has been deployed in numerous applications, in both the
290 academic and commercial worlds, in application domains as diverse
291 as bibliographic catalogues, geospatial information, structured
292 vocabulary browsing, government information locators, civic
293 information systems, environmental observations, museum information
297 Notable applications include the following:
301 <section id="koha-ils">
302 <title>Koha free open-source ILS</title>
304 <ulink url="http://www.koha.org/">Koha</ulink> is a full-featured
305 open-source ILS, initially developed in
306 New Zealand by Katipo Communications Ltd, and first deployed in
307 January of 2000 for Horowhenua Library Trust. It is currently
308 maintained by a team of software providers and library technology
309 staff from around the globe.
<ulink url="http://liblime.com/">LibLime</ulink>,
a company that markets and supports Koha, is adding
the &zebra; database server to the new Koha 3.0 release
to drive its bibliographic database.
318 In early 2005, the Koha project development team began looking at
319 ways to improve MARC support and overcome scalability limitations
320 in the Koha 2.x series. After extensive evaluations of the best
321 of the Open Source textual database engines - including MySQL
322 full-text searching, PostgreSQL, Lucene and Plucene - the team
"&zebra; completely eliminates scalability limitations, because it
can support tens of millions of records," explained Joshua
Ferraro, LibLime's Technology President and Koha's Project
Release Manager. "Our performance tests showed search results in
under a second for databases with over 5 million records on a
modest i386 900 MHz test server."
334 "&zebra; also includes support for true boolean search expressions
335 and relevance-ranked free-text queries, both of which the Koha
336 2.x series lack. &zebra; also supports incremental and safe
337 database updates, which allow on-the-fly record
338 management. Finally, since &zebra; has at its heart the Z39.50
339 protocol, it greatly improves Koha's support for that critical
343 Although the bibliographic database will be moved to &zebra;, Koha
344 3.0 will continue to use a relational SQL-based database design
345 for the 'factual' database. "Relational database managers have
346 their strengths, in spite of their inability to handle large
numbers of bibliographic records efficiently," summed up Ferraro.
348 "We're taking the best from both worlds in our redesigned Koha
352 See also LibLime's newsletter article
353 <ulink url="http://www.liblime.com/newsletter/2006/01/features/koha-earns-its-stripes/">
354 Koha Earns its Stripes</ulink>.
358 <section id="emilda-ils">
359 <title>Emilda open source ILS</title>
361 <ulink url="http://www.emilda.org/">Emilda</ulink>
362 is a complete Integrated Library System, released under the
GNU General Public License. It has a
full-featured Web-OPAC that allows comprehensive system management
from virtually any computer with an Internet connection, a
template-based layout that lets anyone alter the visual
appearance of Emilda, and an
XML-based language for fast and easy portability to virtually any
370 Currently, Emilda is used at three schools in Espoo, Finland.
In addition, 100% MARC compatibility has been achieved using the
&zebra; server from Index Data as the backend server.
378 <section id="reindex-ils">
379 <title>ReIndex.Net web based ILS</title>
381 <ulink url="http://www.reindex.net/index.php?lang=en">Reindex.net</ulink>
is a web-based library service offering all
traditional functions at a very high level, plus many new
services. Reindex.net is a comprehensive and powerful web system
based on standards such as XML and Z39.50.
Reindex supports MARC21, danMARC or Dublin Core with
Reindex.net runs on Debian GNU/Linux with &zebra; and Simpleserver
392 Data for bibliographic data. The relational database system
393 Sybase 9 XML is used for
Internally, MARCXML is used for bibliographic records. Updates
use Z39.50 extended services.
400 <section id="dads-article-database">
401 <title>DADS - the DTV Article Database
404 DADS is a huge database of more than ten million records, totalling
405 over ten gigabytes of data. The records are metadata about academic
406 journal articles, primarily scientific; about 10% of these
407 metadata records link to the full text of the articles they
describe, a body of about a terabyte of information (although the
full text is not indexed).
It allows students and researchers at DTU (Danmarks Tekniske
Universitet, the Technical University of Denmark) to find and order
414 articles from multiple databases in a single query. The database
415 contains literature on all engineering subjects. It's available
416 on-line through a web gateway, though currently only to registered
420 More information can be found at
421 <ulink url="http://www.dtv.dk/"/> and
422 <ulink url="http://dads.dtv.dk"/>
426 <section id="infonet-eprints">
427 <title>Infonet Eprints</title>
429 The InfoNet Eprints service from the
430 <ulink url="http://www.dtv.dk/">
431 Technical Knowledge Center of Denmark</ulink>
432 provides access to documents stored in
433 eprint/preprint servers and institutional research archives around
434 the world. The service is based on Open Archives Initiative metadata
435 harvesting of selected scientific archives around the world. These
436 open archives offer free and unrestricted access to their contents.
439 Infonet Eprints currently holds 1.4 million records from 16 archives.
440 The online search facility is found at
441 <ulink url="http://preprints.cvt.dk"/>.
445 <section id="alvis-project">
448 The <ulink url="http://www.alvis.info/alvis/">Alvis</ulink> EU
449 project run under the 6th Framework (IST-1-002068-STP)
is building a semantic-based peer-to-peer search engine. A
consortium of eleven partners from six different European
Community countries plus Switzerland and China contributes
expertise in a broad range of specialties including network
454 topologies, routing algorithms, linguistic analysis and
458 The &zebra; information retrieval indexing machine is used inside
459 the Alvis framework to
manage huge collections of natural-language-processed and
enhanced XML data coming from a topic-relevant web crawl.
462 In this application, &zebra; swallows and manages 37GB of XML data
463 in about 4 hours, resulting in search times of fractions of
470 <title>ULS (Union List of Serials)</title>
473 has created a union catalogue for the periodicals of the
474 twenty-one constituent libraries of the University of London and
475 the University of Westminster
476 (<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
477 They have achieved this using an
478 unusual architecture, which they describe as a
479 ``non-distributed virtual union catalogue''.
482 The member libraries send in data files representing their
483 periodicals, including both brief bibliographic data and summary
484 holdings. Then 21 individual Z39.50 targets are created, each
using &zebra;, and all mounted on a single hardware server.
486 The live service provides a web gateway allowing Z39.50 searching
487 of all of the targets or a selection of them. &zebra;'s small
488 footprint allows a relatively modest system to comfortably host
492 More information can be found at
493 <ulink url="http://www.m25lib.ac.uk/ULS/"/>
498 <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
500 Fernuniversität Hagen in Germany have developed a natural
501 language interface for access to library databases.
503 url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/> -->
504 In order to evaluate this interface for recall and precision, they
505 chose &zebra; as the basis for retrieval effectiveness. The &zebra;
506 server contains a copy of the GIRT database, consisting of more
than 76,000 records in SGML format (bibliographic records from
508 social science), which are mapped to MARC for presentation.
511 (GIRT is the German Indexing and Retrieval Testdatabase. It is a
512 standard German-language test database for intelligent indexing
513 and retrieval systems. See
514 <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>)
517 Evaluation will take place as part of the TREC/CLEF campaign 2003
518 <ulink url="http://clef.iei.pi.cnr.it"/>.
519 <!-- or <ulink url="http://www4.eurospider.ch/CLEF/"/> -->
522 For more information, contact Johannes Leveling
523 <email>Johannes.Leveling@FernUni-Hagen.De</email>
527 <section id="various-web-indexes">
528 <title>Various web indexes</title>
530 &zebra; has been used by a variety of institutions to construct
531 indexes of large web sites, typically in the region of tens of
532 millions of pages. In this role, it functions somewhat similarly
to the engine of Google or AltaVista, but for a selected intranet
534 or a subset of the whole Web.
537 For example, Liverpool University's web-search facility (see on
539 <ulink url="http://www.liv.ac.uk/"/>
540 and many sub-pages) works by relevance-searching a &zebra; database
541 which is populated by the Harvest-NG web-crawling software.
For more information on Liverpool University's intranet search
545 architecture, contact John Gilbertson
546 <email>jgilbert@liverpool.ac.uk</email>
550 has recently modified the Harvest web indexer to use &zebra; as
551 its native repository engine. His comments on the switch over
552 from the old engine are revealing:
555 The first results after some testing with &zebra; are very
556 promising. The tests were done with around 220,000 SOIF files,
557 which occupies 1.6GB of disk space.
560 Building the index from scratch takes around one hour with &zebra;
561 where [old-engine] needs around five hours. While [old-engine]
562 blocks search requests when updating its index, &zebra; can still
563 answer search requests.
565 &zebra; supports incremental indexing which will speed up indexing
569 While the search time of [old-engine] varies from some seconds
570 to some minutes depending how expensive the query is, &zebra;
571 usually takes around one to three seconds, even for expensive
574 &zebra; can search more than 100 times faster than [old-engine]
575 and can process multiple search requests simultaneously
578 I am very happy to see such nice software available under GPL.
586 <section id="introduction-support">
587 <title>Support</title>
589 You can get support for &zebra; from at least three sources.
592 First, there's the &zebra; web site at
593 <ulink url="&url.idzebra;"/>,
594 which always has the most recent version available for download.
595 If you have a problem with &zebra;, the first thing to do is see
596 whether it's fixed in the current release.
599 Second, there's the &zebra; mailing list. Its home page at
600 <ulink url="&url.idzebra.mailinglist;"/>
601 includes a complete archive of all messages that have ever been
602 posted on the list. The &zebra; mailing list is used both for
603 announcements from the authors (new
604 releases, bug fixes, etc.) and general discussion. You are welcome
to seek support there. Join by filling in the form on the list home page.
Third, it's possible to buy a commercial support contract, with
well-defined service levels and response times, from Index Data.
611 <ulink url="&url.indexdata.support;"/>
617 <section id="future">
618 <title>Future Directions</title>
621 These are some of the plans that we have for the software in the near
622 and far future, ordered approximately as we expect to work on them.
630 Improved support for XML in search and retrieval. Eventually,
631 the goal is for &zebra; to pull double duty as a flexible
632 information retrieval engine and high-performance XML
633 repository. The recent addition of XPath searching is one
634 example of the kind of enhancement we're working on.
637 There is also the experimental <literal>ALVIS XSLT</literal>
638 XML input filter, which unleashes the full power of DOM based
639 XSLT transformations during indexing and record retrieval. Work
640 on this filter has been sponsored by the ALVIS EU project
641 <ulink url="http://www.alvis.info/alvis/"/>. We expect this filter to
642 mature soon, as it is planned to be included in the version 2.0
649 Finalisation and documentation of &zebra;'s C programming
650 API, allowing updates, database management and other functions
651 not readily expressed in Z39.50. We will also consider
652 exposing the API through SOAP.
658 Improved free-text searching. We're first and foremost octet jockeys and
659 we're actively looking for organisations or people who'd like
660 to contribute experience in relevance ranking and text
669 Programmers thrive on user feedback. If you are interested in a
670 facility that you don't see mentioned here, or if there's something
671 you think we could do better, please drop us a mail. Better still,
672 implement it and send us the patches.
675 If you think it's all really neat, you're welcome to drop us a line
saying that, too. You can email us at
677 <email>info@indexdata.dk</email>
678 or check the contact info at the end of this manual.
683 <!-- Keep this comment at the end of the file
688 sgml-minimize-attributes:nil
689 sgml-always-quote-attributes:t
692 sgml-parent-document: "zebra.xml"
693 sgml-local-catalogs: nil
694 sgml-namecase-general:t