<chapter id="introduction">
<!-- $Id: introduction.xml,v 1.9 2002-08-28 08:14:47 mike Exp $ -->
<title>Introduction</title>
<title>Overview</title>
<ulink url="http://www.indexdata.dk/zebra/">Zebra</ulink>
is a high-performance, general-purpose structured text
indexing and retrieval engine. It reads structured records in a
variety of input formats (e.g. email, XML, MARC) and provides access
to them through a powerful combination of boolean search
expressions and relevance-ranked free-text queries.
Zebra supports large databases (tens of millions of records,
tens of gigabytes of data). It allows safe, incremental
database updates on live systems. Because Zebra supports
the industry-standard information retrieval protocol, Z39.50,
you can search Zebra databases using an enormous variety of
programs and toolkits, both commercial and free, which understand
this protocol. Application libraries are available to allow
bespoke clients to be written in Perl, C, C++, Java, Tcl, Visual
Basic, Python, PHP and more - see
<ulink url="http://zoom.z3950.org/">the ZOOM web site</ulink>
for more information on some of these client toolkits.
This document is an introduction to the Zebra system. It explains
how to compile the software, how to prepare your first database,
and how to configure the server to give you the
functionality that you need.
If you use Zebra, you should visit its
<ulink url="http://www.indexdata.dk/zebra/">web site</ulink>,
where you can join the
<ulink url="http://www.indexdata.dk/mailman/listinfo/zebralist">
<email>### zebra-subscribe@mailman.indexdata.dk</email>
<title>Features</title>
This is an overview of some of the most important features of the
Supports large databases - files for indexes, etc. can be
automatically partitioned over multiple disks.
Supports arbitrarily complex records - base input format is an
SGML-like syntax which allows nested (structured) data elements, as
well as variant forms of data.
Robust updating - records can be added and deleted without
rebuilding the index from scratch.
The update procedure is tolerant of crashes or hard interrupts
during register updating - registers can be reconstructed following
a crash.
Registers can be safely updated even while users are accessing
Supports arbitrary storage formats. A system of input filters driven by
regular expressions allows you to easily process most ASCII-based
data formats. SGML, XML, ISO2709 (MARC), and raw text are also
supported.
Supports boolean queries as well as relevance-ranking (free-text)
searching. Right truncation and masking in terms are supported, as
well as full regular expressions.
Can import the data into Zebra's own storage, or just refer to
external files (good for building indexes of "live"
Supports multiple concrete syntaxes
for record exchange (depending on the configuration): GRS-1, SUTRS,
XML, ISO2709 (*MARC). Records can be mapped between record syntaxes
and schemas on the fly.
Supports approximate matching in registers (i.e. spelling mistakes,
Zebra is written in portable C, so it runs on most Unix-like systems
as well as Windows NT - a binary distribution for Windows NT is available.
Z39.50 protocol support:
Protocol facilities: Init, Search, Retrieve, Delete, Browse and Sort.
Piggy-backed presents are honored in the search-request.
Named result sets are supported.
Easily configured to support different application profiles, with
tables for attribute sets, tag sets, and abstract syntaxes.
Additional tables control facilities such as element mappings to
different schemas (e.g. GILS-to-USMARC).
Complex composition specifications using Espec-1 are partially
supported (simple element requests only).
Element Set Names are defined using the Espec-1 capability of the
system, and are given in configuration files as simple element
requests (and possibly variant requests).
<title>Applications</title>
Zebra has been deployed in numerous applications, in both the
academic and commercial worlds, in application domains as diverse
as bibliographic and geospatial information.
Notable applications include the following:
<title>DADS - the DTV Article Database Service</title>
DADS is a huge database of more than ten million records, totalling
over ten gigabytes of data. The records are metadata about academic
journal articles, primarily scientific; about 10% of these
metadata records link to the full text of the articles they
describe, a body of about a terabyte of information (although the
full text is not indexed).
It allows students and researchers at DTU (the Technical University
of Denmark) to find and order
articles from multiple databases in a single query. The database
contains literature on all engineering subjects. It's available
on-line through a web gateway at
<ulink url="http://www.dtv.dk/search/index_e.htm">http://www.dtv.dk/search/index_e.htm</ulink>,
though currently only to registered users.
<title>Various web indexes</title>
Zebra has been used by a variety of institutions to construct
indexes of large web sites, typically in the region of tens of
millions of pages. In this role, it functions somewhat similarly
to the engine of Google or AltaVista, but for a selected intranet
or subset of the whole Web.
<title>Future Work</title>
These are some of the plans that we have for the software in the near
and far future, listed roughly in order of importance.
Improved support for XML in search and retrieval. Eventually,
the goal is for Zebra to pull double duty as a flexible
information retrieval engine and high-performance XML
Access to the search engine through a SOAP/RPC API to allow the
construction of applications without requiring Z39.50 tools.
Finalisation and documentation of the Zebra API. Consider
exposing the API through SOAP as well (allowing updates,
database management).
Improved free-text searching. We're first and foremost octet jockeys and
we're actively looking for organisations or people who'd like
to contribute experience in relevance ranking and text
Programmers thrive on user feedback. If you are interested in a
facility that you don't see mentioned here, or if there's something
you think we could do better, please drop us an email.
If you think it's all really neat, you're welcome to drop us a line
saying that, too. You'll find contact info at the end of this file.
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t