<chapter id="introduction">
<!-- $Id: introduction.xml,v 1.12 2002-08-30 01:17:10 mike Exp $ -->
<title>Introduction</title>
<title>Overview</title>
<ulink url="http://www.indexdata.dk/zebra/">Zebra</ulink>
is a high-performance, general-purpose structured text
indexing and retrieval engine. It reads structured records in a
variety of input formats (e.g. email, XML, MARC) and provides access
to them through a powerful combination of boolean search
expressions and relevance-ranked free-text queries.

Zebra supports large databases (tens of millions of records,
tens of gigabytes of data). It allows safe, incremental
database updates on live systems. Because Zebra supports
the industry-standard information retrieval protocol, Z39.50,
you can search Zebra databases using an enormous variety of
programs and toolkits, both commercial and free, that understand
this protocol. Application libraries are available to allow
bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
Basic, Python, PHP and more; see
<ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
for more information on some of these client toolkits.
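
Boolean queries sent over Z39.50 are often written in PQF, the prefix query notation used by the YAZ toolkit, with fields selected by Bib-1 attribute numbers. The following is a minimal sketch of how a client might assemble such a query string. The helper names (term, combine, BIB1_USE) are hypothetical and exist only for this illustration; only the PQF notation and the Bib-1 attribute numbers are taken from the Z39.50 and YAZ documentation, and whether a given server honours each attribute depends on its configuration.

```python
# Hypothetical helpers for building PQF (prefix query format) strings.
# The function and table names are illustrative, not part of any library.

# A few Bib-1 "use" attributes (attribute type 1).
BIB1_USE = {"title": 4, "subject": 21, "author": 1003, "any": 1016}


def term(field, text, right_truncate=False, ranked=False):
    """Return one PQF operand such as '@attr 1=4 "zebra"'."""
    attrs = ["@attr 1=%d" % BIB1_USE[field]]
    if right_truncate:
        attrs.append("@attr 5=1")    # Bib-1 truncation attribute: right truncation
    if ranked:
        attrs.append("@attr 2=102")  # Bib-1 relation attribute: relevance
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '%s "%s"' % (" ".join(attrs), escaped)


def combine(op, left, right):
    """Join two operands with a prefix boolean operator (and, or, not)."""
    if op not in ("and", "or", "not"):
        raise ValueError("unknown operator: %r" % op)
    return "@%s %s %s" % (op, left, right)


# Title words starting with "index", written by an author named "taylor".
query = combine("and",
                term("title", "index", right_truncate=True),
                term("author", "taylor"))
print(query)  # @and @attr 1=4 @attr 5=1 "index" @attr 1=1003 "taylor"
```

The resulting string can be handed to any Z39.50 client that accepts PQF; the operator comes first and applies to the two operands that follow it.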

This document is an introduction to the Zebra system. It explains
how to compile the software, how to prepare your first database,
and how to configure the server to give you the
functionality that you need.

If you use Zebra, you should visit its
<ulink url="http://www.indexdata.dk/zebra/">web site</ulink>,
where you can join the
<ulink url="http://www.indexdata.dk/mailman/listinfo/zebralist">mailing list</ulink>:
<email>### zebra-subscribe@mailman.indexdata.dk</email>
<title>Features</title>
This is an overview of some of Zebra's most important features:

Very large databases: files for indexes, etc. can be
automatically partitioned over multiple disks.

Arbitrarily complex records: the internal data format
is a structured format conceptually similar to XML or GRS-1,
which allows nested structured data elements and
variant forms of data.

Robust updating: records can be added and deleted "on the fly"
without rebuilding the index from scratch.
Records can be safely updated even while users are accessing
the database.
The update procedure is tolerant of crashes or hard interrupts
during database updating: data can be reconstructed following
a crash.

Configurable to understand many input formats.
A system of input filters driven by
regular expressions allows you to easily process most ASCII-based
data formats. SGML, XML, ISO2709 (MARC), and raw text are also
supported.

Searching supports a powerful combination of boolean queries and
relevance-ranked (free-text) queries. Truncation,
masking, full regular expression matching and "approximate
matching" (e.g. to tolerate spelling mistakes) are all supported.

Index-only databases: data can be, and usually is, imported
into Zebra's own storage, but Zebra can also refer to
external files, building and maintaining indexes of "live"
collections.

Zebra is written in portable C, so it runs on most Unix-like systems
as well as Windows NT. A binary distribution for Windows NT is
available.

Z39.50 protocol support:

Protocol facilities: Init, Search, Present (retrieval), Delete,
Scan (index browsing) and Sort.

Piggy-backed presents are honored in the search-request.

Named result sets are supported.

Easily configured to support different application profiles, with
tables for attribute sets, tag sets, and abstract syntaxes.
Additional tables control facilities such as element mappings to
different schemas (e.g. GILS-to-USMARC).

Complex composition specifications using Espec-1 (partial support).
Element sets are defined using the Espec-1 capability,
and are specified in configuration files as simple element
requests (and, optionally, variant requests).

Multiple record syntaxes
for data retrieval: GRS-1, SUTRS,
XML, ISO2709 (MARC), etc. Records can be mapped between record syntaxes
and schemas on the fly.
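
To make the updating and searching features above more concrete, here is a toy, in-memory sketch of an inverted index that supports incremental adds and deletes, boolean AND queries, and a simple relevance-ranked free-text query. It is an illustration of the general technique only and makes no claim about Zebra's actual on-disk structures or ranking formula; the class name ToyIndex, its methods, and the scoring weight are all assumptions made for this example.

```python
import math
from collections import defaultdict


class ToyIndex:
    """In-memory inverted index; an illustration, not Zebra's storage format."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {record id: term count}
        self.records = {}                  # record id -> list of terms

    def add(self, rec_id, text):
        """Index one record without touching other records' postings."""
        if rec_id in self.records:
            self.delete(rec_id)            # an update is a delete plus an add
        terms = text.lower().split()
        self.records[rec_id] = terms
        for t in terms:
            self.postings[t][rec_id] = self.postings[t].get(rec_id, 0) + 1

    def delete(self, rec_id):
        """Remove one record's postings; the rest of the index is untouched."""
        for t in set(self.records.pop(rec_id)):
            del self.postings[t][rec_id]
            if not self.postings[t]:
                del self.postings[t]

    def search_and(self, *terms):
        """Boolean AND: ids of records that contain every term."""
        sets = [set(self.postings.get(t, {})) for t in terms]
        return set.intersection(*sets) if sets else set()

    def search_ranked(self, *terms):
        """Free-text query: records containing any term, best score first."""
        n = len(self.records)
        scores = defaultdict(float)
        for t in terms:
            hits = self.postings.get(t, {})
            for rec_id, count in hits.items():
                # Term count weighted by how rare the term is (assumed formula).
                scores[rec_id] += count * math.log(1 + n / len(hits))
        return sorted(scores, key=lambda r: (-scores[r], r))


idx = ToyIndex()
idx.add(1, "zebra indexes structured text")
idx.add(2, "zebra zebra zebra retrieval engine")
idx.add(3, "structured retrieval")

both = sorted(idx.search_and("zebra", "retrieval"))  # boolean query
ranked = idx.search_ranked("zebra", "structured")    # free-text query
idx.delete(2)                                        # incremental delete
remaining = sorted(idx.search_and("zebra"))
print(both, ranked, remaining)  # [2] [2, 1, 3] [1]
```

Because each add or delete touches only the postings of the record concerned, the index never has to be rebuilt from scratch; a production engine additionally has to make such updates safe against crashes and concurrent readers, which this sketch does not attempt.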
<title>Applications</title>
Zebra has been deployed in numerous applications, in both the
academic and commercial worlds, in application domains as diverse
as bibliographic catalogues, geospatial information, structured
vocabulary browsing, government information locators, civic
information systems, environmental observations, museum information

Notable applications include the following:
<title>DADS - the DTV Article Database Service</title>
DADS is a huge database of more than ten million records, totalling
over ten gigabytes of data. The records are metadata about academic
journal articles, primarily scientific; about 10% of these
metadata records link to the full text of the articles they
describe, a body of about a terabyte of information (although the
full text is not indexed).

DADS allows students and researchers at DTU (Danmarks Tekniske
Universitet, the Technical University of Denmark) to find and order
articles from multiple databases in a single query. The database
contains literature on all engineering subjects. It is available
on-line through a web gateway, though currently only to registered
users.

More information can be found at
<ulink url="http://www.dtv.dk/help/dads/index_e.htm"/>

We have developed a natural language interface (NLI-Z39.50) for access
to library databases at the FernUniversität Hagen, Germany
(http://ki212.fernuni-hagen.de/nli/NLI.html).
To prepare a formal information retrieval evaluation,
we chose the Zebra server as the basis for
evaluating retrieval effectiveness (measuring recall
and precision for the GIRT database). The Zebra database
consists of more than 76,000 records in SGML format (bibliographic
records from the social sciences), which are mapped to MARC for presentation.
Evaluation will take place as part of the TREC/CLEF campaign 2003
(see http://clef.iei.pi.cnr.it or http://www4.eurospider.ch/CLEF/).

Johannes Leveling, Praktische Informatik VII/KI,
FernUniversität Hagen
Email: Johannes.Leveling@FernUni-Hagen.De
Tel.: +49 2331 987-4525
<title>Various web indexes</title>
Zebra has been used by a variety of institutions to construct
indexes of large web sites, typically in the region of tens of
millions of pages. In this role, it functions somewhat like
the engine of Google or AltaVista, but for a selected intranet
or subset of the whole Web.

### examples, details and numbers, please!
<title>Future Directions</title>
These are some of the plans that we have for the software in the near
and far future, ordered approximately as we expect to work on them.

Improved support for XML in search and retrieval. Eventually,
the goal is for Zebra to pull double duty as a flexible
information retrieval engine and high-performance XML

Access to the search engine through a SOAP/RPC API, to allow the
construction of applications without requiring Z39.50 tools.

Finalisation and documentation of Zebra's C programming
API, allowing updates, database management and other functions
not readily expressed in Z39.50. We will also consider
exposing the API through SOAP.

Improved free-text searching. We're first and foremost octet jockeys and
we're actively looking for organisations or people who'd like
to contribute experience in relevance ranking and text

Programmers thrive on user feedback. If you are interested in a
facility that you don't see mentioned here, or if there's something
you think we could do better, please send us an email. Better still,
implement it and send us the patches.

If you think it's all really neat, you're welcome to drop us a line
saying that, too. You'll find contact info at the end of this file.
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t