1 <chapter id="introduction">
2 <!-- $Id: introduction.xml,v 1.39 2006-09-03 21:37:26 adam Exp $ -->
3 <title>Introduction</title>
5 <section id="overview">
6 <title>Overview</title>
9 <ulink url="http://indexdata.dk/zebra/">Zebra</ulink>
10 is a high-performance, general-purpose structured text
11 indexing and retrieval engine. It reads records in a
12 variety of input formats (e.g. email, XML, MARC) and provides access
13 to them through a powerful combination of boolean search
14 expressions and relevance-ranked free-text queries.
18 Zebra supports large databases (tens of millions of records,
19 tens of gigabytes of data). It allows safe, incremental
20 database updates on live systems. Because Zebra supports
21 the industry-standard information retrieval protocol, Z39.50,
22 you can search Zebra databases using an enormous variety of
23 programs and toolkits, both commercial and free, which understand
24 this protocol. Application libraries are available to allow
25 bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
26 Basic, Python, PHP and more - see
27 <ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
28 for more information on some of these client toolkits.
32 This document is an introduction to the Zebra system. It explains
33 how to compile the software, how to prepare your first database,
34 and how to configure the server to give you the
35 functionality that you need.
39 <section id="features">
40 <title>Features</title>
43 This is an overview of some of Zebra's most important features:
51 Very large databases: logical files can be
52 automatically partitioned over multiple disks.
58 Arbitrarily complex records. The internal data format
59 is a structured format conceptually similar to XML or GRS-1,
60 which allows lists, nested structured data elements and
61 variant forms of data.
67 Robust updating - records can be added and deleted ``on the fly''
68 without rebuilding the index from scratch.
69 Records can be safely updated even while users are accessing
71 The update procedure is tolerant of crashes or hard interrupts
72 during database updating - data can be reconstructed following
79 Configurable to understand many input formats.
80 A system of input filters driven by
81 regular expressions allows most ASCII-based
82 data formats to be easily processed.
83 SGML, XML, ISO2709 (MARC), and raw text are also
90 Searching supports a powerful combination of boolean queries as
91 well as relevance-ranking (free-text) queries. Truncation,
92 masking, full regular expression matching and "approximate
93 matching" (e.g. spelling mistakes) are all handled.
99 Index-only databases: data can be, and usually is, imported
100 into Zebra's own storage, but Zebra can also refer to
101 external files, building and maintaining indexes of "live"
108 Zebra is written in portable C, so it runs on most Unix-like systems
109 as well as Windows NT. A binary distribution for Windows NT is
111 <ulink url="http://ftp.indexdata.dk/pub/zebra/win32/"/>,
112 and pre-built packages are available for
116 <ulink url="http://ftp.indexdata.dk/pub/zebra/RedHat7.X/"/>
117 and Debian packages at
119 <literal>Debian GNU/Linux</literal> at
120 <ulink url="http://ftp.indexdata.dk/pub/zebra/debian/"/>.
129 <ulink url="&url.z39.50;">Z39.50</ulink> protocol support:
136 Protocol facilities: Init, Search, Present (retrieval),
137 Segmentation (support for very large records), Delete, Scan
138 (index browsing), Sort, Close and support for the ``update''
139 Extended Service to add or replace an existing XML record.
145 Piggy-backed presents are honored in the search request - that
146 is, a subset of the found records can be returned directly with
147 a search response, enabling search and retrieval to happen in a
154 Named result sets are supported.
160 Easily configured to support different application profiles, with
161 tables for attribute sets, tag sets, and abstract syntaxes.
162 Additional tables control facilities such as element mappings to
163 different schemas (e.g. GILS-to-USMARC).
169 Complex composition specifications using Espec-1 (partial support).
170 Element sets are defined using the Espec-1 capability,
171 and are specified in configuration files as simple element
172 requests (and, optionally, variant requests).
178 Multiple record syntaxes
179 for data retrieval: GRS-1, SUTRS,
180 XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
181 and schemas on the fly.
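The piggy-back behaviour described above is governed, in the Z39.50 standard, by three numeric parameters of the search request. The following Python sketch illustrates that rule as we read it from the standard. It is not taken from Zebra's source code, and the function and argument names are chosen here purely for illustration.

<programlisting><![CDATA[
```python
def piggyback_count(hits, small_set_upper_bound,
                    large_set_lower_bound, medium_set_present_number):
    """Return how many records accompany the search response.

    Illustrative only: names are hypothetical, the rule follows the
    Z39.50 Search facility description of small, medium and large sets.
    """
    if hits <= small_set_upper_bound:
        # Small set: every record found is returned with the response.
        return hits
    if hits >= large_set_lower_bound:
        # Large set: no records are returned; the client must use Present.
        return 0
    # Medium set: at most medium_set_present_number records are returned.
    return min(hits, medium_set_present_number)


if __name__ == "__main__":
    for hits in (3, 50, 5000):
        print(hits, piggyback_count(hits, 10, 100, 5))
```
]]></programlisting>

A client that wants search and retrieval in one exchange for short result lists would typically raise the small-set bound, while one that only wants hit counts would set both the small-set bound and the medium-set number to zero.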
191 <ulink url="&url.sru;">SRU</ulink> Web Service support:
197 The protocol operations <literal>explain</literal>,
198 <literal>searchRetrieve</literal> and <literal>scan</literal>
204 Conversion from <ulink url="&url.cql;">CQL</ulink> to the internal RPN
205 query model is supported.
210 Multiple XML record formats
211 for data retrieval are supported, modelled on the GRS-1, SUTRS and
212 MARC record formats. Records can be mapped between record
213 schemas on the fly. Arbitrarily complex XSLT transformations
214 can be applied during record retrieval if one uses the
215 <literal>alvis</literal> filter module.
220 Additional PQF query syntax for
221 <literal>searchRetrieve</literal>
222 and <literal>scan</literal> operations is supported.
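As a rough illustration of the SRU operations listed above, the following Python sketch builds <literal>searchRetrieve</literal> and <literal>scan</literal> request URLs from the standard SRU 1.1 parameter names. The base URL, port and the <literal>dc.title</literal> index are assumptions made for this example and will differ for a real Zebra server; no request is actually sent.

<programlisting><![CDATA[
```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical server address, used only for illustration.
BASE_URL = "http://localhost:9999/"


def sru_search_url(base_url, cql, start=1, maximum=10, version="1.1"):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "version": version,
        "operation": "searchRetrieve",
        "query": cql,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return base_url + "?" + urlencode(params)


def sru_scan_url(base_url, scan_clause, version="1.1"):
    """Build an SRU scan URL for browsing index terms."""
    params = {
        "version": version,
        "operation": "scan",
        "scanClause": scan_clause,
    }
    return base_url + "?" + urlencode(params)


def query_params(url):
    """Decode the query string of a URL into a plain dictionary."""
    decoded = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in decoded.items()}


if __name__ == "__main__":
    print(sru_search_url(BASE_URL, "dc.title = zebra", maximum=5))
    print(sru_scan_url(BASE_URL, "dc.title = zebra"))
```
]]></programlisting>

The resulting URLs can be pasted into a web browser or fetched with any HTTP client, which is the main practical attraction of SRU compared with a binary Z39.50 session.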
233 <section id="introduction-apps">
234 <title>References and Zebra-based Applications</title>
236 Zebra has been deployed in numerous applications, in both the
237 academic and commercial worlds, in application domains as diverse
238 as bibliographic catalogues, geospatial information, structured
239 vocabulary browsing, government information locators, civic
240 information systems, environmental observations, museum information
244 Notable applications include the following:
248 <section id="koha-ils">
249 <title>Koha free open-source ILS</title>
251 <ulink url="http://www.koha.org/">Koha</ulink> is a full-featured
252 open-source ILS, initially developed in
253 New Zealand by Katipo Communications Ltd, and first deployed in
254 January of 2000 for Horowhenua Library Trust. It is currently
255 maintained by a team of software providers and library technology
256 staff from around the globe.
259 <ulink url="http://liblime.com/">LibLime</ulink>,
260 a company that markets and supports Koha, is adding the Zebra
261 database server to the new Koha 3.0 release
262 to drive its bibliographic database.
265 In early 2005, the Koha project development team began looking at
266 ways to improve MARC support and overcome scalability limitations
267 in the Koha 2.x series. After extensive evaluations of the best
268 of the Open Source textual database engines - including MySQL
269 full-text searching, PostgreSQL, Lucene and Plucene - the team
273 "Zebra completely eliminates scalability limitations, because it
274 can support tens of millions of records," explained Joshua
275 Ferraro, LibLime's Technology President and Koha's Project
276 Release Manager. "Our performance tests showed search results in
277 under a second for databases with over 5 million records on a
278 modest i386 900MHz test server."
281 "Zebra also includes support for true boolean search expressions
282 and relevance-ranked free-text queries, both of which the Koha
283 2.x series lack. Zebra also supports incremental and safe
284 database updates, which allow on-the-fly record
285 management. Finally, since Zebra has at its heart the Z39.50
286 protocol, it greatly improves Koha's support for that critical
290 Although the bibliographic database will be moved to Zebra, Koha
291 3.0 will continue to use a relational SQL-based database design
292 for the 'factual' database. "Relational database managers have
293 their strengths, in spite of their inability to handle large
294 numbers of bibliographic records efficiently," summed up Ferraro.
295 "We're taking the best from both worlds in our redesigned Koha
299 See also LibLime's newsletter article
300 <ulink url="http://www.liblime.com/newsletter/2006/01/features/koha-earns-its-stripes/">
301 Koha Earns its Stripes</ulink>.
305 <section id="emilda-ils">
306 <title>Emilda open source ILS</title>
308 <ulink url="http://www.emilda.org/">Emilda</ulink>
309 is a complete Integrated Library System, released under the
310 GNU General Public License. It has a
311 full-featured Web-OPAC, allowing comprehensive system management
312 from virtually any computer with an Internet connection, has a
313 template-based layout allowing anyone to alter the visual
314 appearance of Emilda, and uses an
315 XML-based language for fast and easy portability to virtually any
317 Currently, Emilda is used at three schools in Espoo, Finland.
320 In addition, 100% MARC compatibility has been achieved using the
321 Zebra Server from Index Data as the back-end server.
325 <section id="reindex-ils">
326 <title>ReIndex.Net web based ILS</title>
328 <ulink url="http://www.reindex.net/index.php?lang=en">Reindex.net</ulink>
329 is a web-based library service offering all
330 traditional functions at a very high level, plus many new
331 services. Reindex.net is a comprehensive and powerful web system
332 based on standards such as XML and Z39.50.
333 Reindex supports MARC21, danMARC or Dublin Core with
337 Reindex.net runs on Debian GNU/Linux with Zebra and Simpleserver
339 Data for bibliographic data. The relational database system
340 Sybase 9 XML is used for
342 Internally, MARCXML is used for bibliographic records. Updates
343 use Z39.50 Extended Services.
347 <section id="dads-article-database">
348 <title>DADS - the DTV Article Database
351 DADS is a huge database of more than ten million records, totalling
352 over ten gigabytes of data. The records are metadata about academic
353 journal articles, primarily scientific; about 10% of these
354 metadata records link to the full text of the articles they
355 describe, a body of about a terabyte of information (although the
356 full text is not indexed.)
359 It allows students and researchers at DTU (Danmarks Tekniske
360 Universitet, the Technical University of Denmark) to find and order
361 articles from multiple databases in a single query. The database
362 contains literature on all engineering subjects. It's available
363 on-line through a web gateway, though currently only to registered
367 More information can be found at
368 <ulink url="http://www.dtv.dk/"/> and
369 <ulink url="http://dads.dtv.dk"/>
373 <section id="infonet-eprints">
374 <title>Infonet Eprints</title>
376 The InfoNet Eprints service from the
377 <ulink url="http://www.dtv.dk/">
378 Technical Knowledge Center of Denmark</ulink>
379 provides access to documents stored in
380 eprint/preprint servers and institutional research archives around
381 the world. The service is based on Open Archives Initiative metadata
382 harvesting of selected scientific archives around the world. These
383 open archives offer free and unrestricted access to their contents.
386 Infonet Eprints currently holds 1.4 million records from 16 archives.
387 The online search facility is found at
388 <ulink url="http://preprints.cvt.dk"/>.
392 <section id="alvis-project">
395 The <ulink url="http://www.alvis.info/alvis/">Alvis</ulink> EU
396 project run under the 6th Framework (IST-1-002068-STP)
397 is building a semantic-based peer-to-peer search engine. A
398 consortium of eleven partners from six different European
399 Community countries plus Switzerland and China contributes
400 expertise in a broad range of specialties including network
401 topologies, routing algorithms, linguistic analysis and
405 The Zebra information retrieval indexing machine is used inside
406 the Alvis framework to
407 manage huge collections of natural language processed and
408 enhanced XML data, coming from a topic-relevant web crawl.
409 In this application, Zebra ingests and manages 37GB of XML data
410 in about 4 hours, resulting in search times of fractions of
417 <title>ULS (Union List of Serials)</title>
420 has created a union catalogue for the periodicals of the
421 twenty-one constituent libraries of the University of London and
422 the University of Westminster
423 (<ulink url="http://www.m25lib.ac.uk/ULS/"/>).
424 They have achieved this using an
425 unusual architecture, which they describe as a
426 ``non-distributed virtual union catalogue''.
429 The member libraries send in data files representing their
430 periodicals, including both brief bibliographic data and summary
431 holdings. Then 21 individual Z39.50 targets are created, each
432 using Zebra, and all mounted on a single hardware server.
433 The live service provides a web gateway allowing Z39.50 searching
434 of all of the targets or a selection of them. Zebra's small
435 footprint allows a relatively modest system to comfortably host
439 More information can be found at
440 <ulink url="http://www.m25lib.ac.uk/ULS/"/>
445 <title>NLI-Z39.50 - a Natural Language Interface for Libraries</title>
447 Fernuniversität Hagen in Germany have developed a natural
448 language interface for access to library databases.
450 url="http://ki212.fernuni-hagen.de/nli/NLIintro.html"/> -->
451 In order to evaluate this interface for recall and precision, they
452 chose Zebra as the basis for retrieval effectiveness. The Zebra
453 server contains a copy of the GIRT database, consisting of more
454 than 76000 records in SGML format (bibliographic records from
455 social science), which are mapped to MARC for presentation.
458 (GIRT is the German Indexing and Retrieval Testdatabase. It is a
459 standard German-language test database for intelligent indexing
460 and retrieval systems. See
461 <ulink url="http://www.gesis.org/forschung/informationstechnologie/clef-delos.htm"/>)
464 Evaluation will take place as part of the TREC/CLEF campaign 2003
465 <ulink url="http://clef.iei.pi.cnr.it"/>.
466 <!-- or <ulink url="http://www4.eurospider.ch/CLEF/"/> -->
469 For more information, contact Johannes Leveling
470 <email>Johannes.Leveling@FernUni-Hagen.De</email>
474 <section id="various-web-indexes">
475 <title>Various web indexes</title>
477 Zebra has been used by a variety of institutions to construct
478 indexes of large web sites, typically in the region of tens of
479 millions of pages. In this role, it functions somewhat similarly
480 to the engine of Google or AltaVista, but for a selected intranet
481 or a subset of the whole Web.
484 For example, Liverpool University's web-search facility (see on
486 <ulink url="http://www.liv.ac.uk/"/>
487 and many sub-pages) works by relevance-searching a Zebra database
488 which is populated by the Harvest-NG web-crawling software.
491 For more information on Liverpool University's intranet search
492 architecture, contact John Gilbertson
493 <email>jgilbert@liverpool.ac.uk</email>
497 has recently modified the Harvest web indexer to use Zebra as
498 its native repository engine. His comments on the switch over
499 from the old engine are revealing:
502 The first results after some testing with Zebra are very
503 promising. The tests were done with around 220,000 SOIF files,
504 which occupies 1.6GB of disk space.
507 Building the index from scratch takes around one hour with Zebra
508 where [old-engine] needs around five hours. While [old-engine]
509 blocks search requests when updating its index, Zebra can still
510 answer search requests.
512 Zebra supports incremental indexing which will speed up indexing
516 While the search time of [old-engine] varies from some seconds
517 to some minutes depending how expensive the query is, Zebra
518 usually takes around one to three seconds, even for expensive
521 Zebra can search more than 100 times faster than [old-engine]
522 and can process multiple search requests simultaneously
525 I am very happy to see such nice software available under GPL.
533 <section id="introduction-support">
534 <title>Support</title>
536 You can get support for Zebra from at least three sources.
539 First, there's the Zebra web site at
540 <ulink url="http://indexdata.dk/zebra/"/>,
541 which always has the most recent version available for download.
542 If you have a problem with Zebra, the first thing to do is see
543 whether it's fixed in the current release.
546 Second, there's the Zebra mailing list. Its home page at
547 <ulink url="http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist"/>
548 includes a complete archive of all messages that have ever been
549 posted on the list. The Zebra mailing list is used both for
550 announcements from the authors (new
551 releases, bug fixes, etc.) and general discussion. You are welcome
552 to seek support there. Join by filling in the form on the list home page.
555 Third, it's possible to buy a commercial support contract, with
556 well-defined service levels and response times, from Index Data.
558 <ulink url="http://indexdata.dk/support/"/>
564 <section id="future">
565 <title>Future Directions</title>
568 These are some of the plans that we have for the software in the near
569 and far future, ordered approximately as we expect to work on them.
577 Improved support for XML in search and retrieval. Eventually,
578 the goal is for Zebra to pull double duty as a flexible
579 information retrieval engine and high-performance XML
580 repository. The recent addition of XPath searching is one
581 example of the kind of enhancement we're working on.
584 There is also the experimental <literal>ALVIS XSLT</literal>
585 XML input filter, which unleashes the full power of DOM based
586 XSLT transformations during indexing and record retrieval. Work
587 on this filter has been sponsored by the ALVIS EU project
588 <ulink url="http://www.alvis.info/alvis/"/>. We expect this filter to
589 mature soon, as it is planned to be included in the version 1.4
596 Access to the search engine through a SOAP/RPC API to allow the
597 construction of applications without requiring Z39.50 tools.
599 This will shortly be available by means of Index Data's
600 <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>-to-Z39.50 gateway, currently in beta test.
602 Experimental support for the
603 Search/Retrieve via URL (<ulink url="&url.sru;">SRU</ulink>)
604 <ulink url="&url.sru;"/>
605 REST web service and the
606 Search/Retrieve Web Service (<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>)
607 <ulink url="http://www.loc.gov/standards/sru/srw/"/>
608 SOAP web service has recently been added to the YAZ/Zebra
609 combination, including server-side Common Query Language (<ulink url="&url.cql;">CQL</ulink>)
610 <ulink url="&url.cql;"/> parsing
611 and configuration. A sponsor is still needed for further testing,
612 documentation and packaging of this exciting component.
618 Finalisation and documentation of Zebra's C programming
619 API, allowing updates, database management and other functions
620 not readily expressed in Z39.50. We will also consider
621 exposing the API through SOAP.
627 Support for the use of Perl both for access to the Zebra API
628 and for building extension ``plug-ins'' such as input filters.
629 The code for this has been contributed to the source tree by
631 <email>pop@technomat.hu</email>,
632 and is in the process of being integrated and tested.
638 Improved free-text searching. We're first and foremost octet jockeys and
639 we're actively looking for organisations or people who'd like
640 to contribute experience in relevance ranking and text
649 Programmers thrive on user feedback. If you are interested in a
650 facility that you don't see mentioned here, or if there's something
651 you think we could do better, please drop us a mail. Better still,
652 implement it and send us the patches.
655 If you think it's all really neat, you're welcome to drop us a line
656 saying that, too. You can email us at
657 <email>info@indexdata.dk</email>
658 or check the contact info at the end of this manual.
663 <!-- Keep this comment at the end of the file
668 sgml-minimize-attributes:nil
669 sgml-always-quote-attributes:t
672 sgml-parent-document: "zebra.xml"
673 sgml-local-catalogs: nil
674 sgml-namecase-general:t