<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"
<!ENTITY % local SYSTEM "local.ent">
<!ENTITY % entities SYSTEM "entities.ent">
<!ENTITY % idcommon SYSTEM "common/common.ent">
<refentry id="yaz-icu">
<productname>YAZ</productname>
<productnumber>&version;</productnumber>
<info><orgname>Index Data</orgname></info>
<refentrytitle>yaz-icu</refentrytitle>
<manvolnum>1</manvolnum>
<refmiscinfo class="manual">Commands</refmiscinfo>
<refname>yaz-icu</refname>
<refpurpose>YAZ ICU utility</refpurpose>
<command>yaz-icu</command>
<arg>-c <replaceable>config</replaceable></arg>
<arg>-p <replaceable>type</replaceable></arg>
<arg choice="opt">infile</arg>
<refsect1><title>DESCRIPTION</title>
<command>yaz-icu</command> is a utility which demonstrates
the ICU chain module of YAZ (<filename>yaz/icu.h</filename>).
The utility can be used in two ways. It may read some text,
using an XML configuration to configure ICU, and show the resulting text analysis.
This mode is triggered by option <literal>-c</literal>, which specifies
the configuration to be used. The input text is read from standard
input, or from a file if <literal>infile</literal> is specified.
The utility may also show ICU information. This is triggered by
option <literal>-p</literal>.
<refsect1><title>OPTIONS</title>
<term>-c <replaceable>config</replaceable></term>
Specifies the file containing ICU chain configuration
<term>-p <replaceable>type</replaceable></term>
Specifies extra information to be printed about the ICU system.
If <replaceable>type</replaceable> is <literal>c</literal>
then ICU converters are printed.
If <replaceable>type</replaceable> is <literal>l</literal>
available locales are printed.
If <replaceable>type</replaceable> is <literal>t</literal>
available transliterators are printed.
Specifies that the output should include the sort key as well. Note that
sort keys differ between ICU versions.
Specifies that output should be XML based rather than
<refsect1><title>ICU chain configuration</title>
The ICU chain configuration specifies one or more rules to convert
text data into tokens. The configuration format is XML based.
The top-level element must be named <literal>icu_chain</literal>.
The <literal>icu_chain</literal> element has one required attribute,
<literal>locale</literal>, which specifies the ICU locale to be used
in the conversion steps.
The <literal>icu_chain</literal> element must include elements where
each element specifies a conversion step. The conversion is performed
in the order in which the conversion steps are specified.
Each conversion element takes one attribute, <literal>rule</literal>,
which serves as the argument to the conversion step.
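As a minimal sketch of this structure, the following chain has a
<literal>locale</literal> attribute and two conversion steps that run from
top to bottom. It uses the <literal>transform</literal> element described
below. The ID <literal>Any-Lower</literal> is an assumption here; check that
it is available in the installed ICU version (see option
<literal>-p</literal>).
<screen><![CDATA[
<icu_chain locale="en">
  <transform rule="[:Control:] Any-Remove"/>
  <transform rule="Any-Lower"/>
</icu_chain>
]]></screen>
Because the steps run in order, control characters are removed first and
the remaining text is then lowercased.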
The following conversion elements are available:
Converts case; the rule attribute specifies how:
<para>Lowercase using ICU function u_strToLower.</para>
<para>Uppercase using ICU function u_strToUpper.</para>
<para>Title case using ICU function u_strToTitle.</para>
<para>Fold case using ICU function u_strFoldCase.</para>
This is a meta step which specifies that a term/token is to
be displayed. An application retrieves this term
using the function icu_chain_token_display (<filename>yaz/icu.h</filename>).
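For illustration only, the sketch below places the display step before a
lowercasing step, so that the displayed term would keep its original case.
It assumes that the display step is written as an empty
<literal>display</literal> element, and that the case conversion element is
named <literal>casemap</literal> with rule <literal>l</literal> for
lowercase. Neither name is spelled out in the text above.
<screen><![CDATA[
<icu_chain locale="en">
  <display/>
  <casemap rule="l"/>
</icu_chain>
]]></screen>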
<term>transform</term>
Specifies an ICU transform rule using a transliterator
The rule attribute is the transliterator Identifier.
See <ulink url="&url.icu.transform;">ICU Transforms</ulink> for
<term>transliterate</term>
Specifies a rule-based transliterator.
The rule attribute is the custom transformation rule to be used.
See <ulink url="&url.icu.transform;">ICU Transforms</ulink> for
<term>tokenize</term>
Breaks (tokenizes) a string into components using
ICU functions such as ubrk_open and ubrk_setText. The rule is
<para>Line. ICU: UBRK_LINE.</para>
<para>Sentence. ICU: UBRK_SENTENCE.</para>
<para>Word. ICU: UBRK_WORD.</para>
<para>Character. ICU: UBRK_CHARACTER.</para>
<para>Title. ICU: UBRK_TITLE.</para>
Joins tokens into one string. The rule attribute is the joining
string, which may be empty. The join conversion element was added
<refsect1><title>EXAMPLES</title>
The following command analyzes text in file <filename>text</filename>
using ICU chain configuration <filename>chain.xml</filename>:
cat text | yaz-icu -c chain.xml
The file <filename>chain.xml</filename> might look as follows:
<icu_chain locale="en">
<transform rule="[:Control:] Any-Remove"/>
<transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
<transliterate rule="xy > z;"/>
<refsect1><title>SEE ALSO</title>
<refentrytitle>yaz</refentrytitle>
<manvolnum>7</manvolnum>
<ulink url="&url.icu;">ICU Home</ulink>
<ulink url="&url.icu.transform;">ICU Transforms</ulink>
<!-- Keep this comment at the end of the file