Sisyphus repository :: Text tools :: RPM: unicode (sources)

==> unicode-0.9.4/COPYING <==
GPL v3

==> unicode-0.9.4/README <==
This file is in UTF-8 encoding.

To use the unicode utility, you need:
- python >=2.2 (generators are needed), preferably a wide
unicode build,
- the python optparse library (part of python2.3),
- the UnicodeData.txt file (http://www.unicode.org/Public), which
you should put into /usr/share/unicode/, ~/.unicode/ or the current working directory,
- if you want to see UniHan properties, you also need the Unihan.txt file,
which should be put into /usr/share/unicode/, ~/.unicode/ or the current working directory.
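
If your distribution does not package the data files, a minimal Python 2
sketch along these lines can fetch UnicodeData.txt into ~/.unicode/ (the
exact UNIDATA download path is an assumption here, check
http://www.unicode.org/Public first):

import os, urllib

# hypothetical download URL - verify against http://www.unicode.org/Public
url = 'http://www.unicode.org/Public/UNIDATA/UnicodeData.txt'
dest = os.path.expanduser('~/.unicode')
if not os.path.isdir(dest):
    os.makedirs(dest)
urllib.urlretrieve(url, os.path.join(dest, 'UnicodeData.txt'))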


Enter a regular expression or a hexadecimal number as an argument.
There is not much documentation at the moment; see the manpage. Here are
just some examples of how to use this script:

$ unicode.py euro
U+20A0 EURO-CURRENCY SIGN
UTF-8: e2 82 a0 UTF-16BE: 20a0 Decimal: &#8352;
₠

Category: Sc (Symbol, Currency)
Bidi: ET (European Number Terminator)

U+20AC EURO SIGN
UTF-8: e2 82 ac UTF-16BE: 20ac Decimal: &#8364;
€

Category: Sc (Symbol, Currency)
Bidi: ET (European Number Terminator)

$ unicode.py 00c0
U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
UTF-8: c3 80 UTF-16BE: 00c0 Decimal: &#192;
À (à)
Lowercase: U+00E0
Category: Lu (Letter, Uppercase)
Bidi: L (Left-to-Right)
Decomposition: 0041 0300
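
The same lookups can also be done directly from Python's bundled
unicodedata module, which this script uses as a fallback when
UnicodeData.txt is not available; a minimal Python 2 sketch reproducing
part of the output above:

import unicodedata

ch = unichr(0x00C0)
print 'U+%04X' % ord(ch), unicodedata.name(ch)
print 'UTF-8:', ' '.join('%02x' % ord(b) for b in ch.encode('utf-8'))
print 'Category:', unicodedata.category(ch)            # Lu
print 'Bidi:', unicodedata.bidirectional(ch)           # L
print 'Decomposition:', unicodedata.decomposition(ch)  # 0041 0300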


You can specify a range of characters as arguments; unicode will show
these characters in a nice tabular format, aligned to 256-codepoint boundaries.
Use two dots ".." to indicate the range, e.g.

unicode 0450..0520

will display the whole cyrillic, armenian and hebrew blocks (characters from U+0400 to U+05FF)

unicode 0400..

will display just characters from U+0400 up to U+04FF
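
The alignment simply truncates both endpoints to 256-codepoint blocks; a
minimal Python 2 sketch of the idea, mirroring the script's own is_range
logic (blocks_for_range is a hypothetical helper name):

def blocks_for_range(low, high):
    # each displayed block holds 256 codepoints (16 rows of 16 characters)
    return range(low // 256, high // 256 + 1)

print blocks_for_range(0x0450, 0x0520)   # [4, 5], i.e. U+0400..U+05FF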

==> unicode-0.9.4/README-paracode <==
Written by Radovan Garabík <garabik @ kassiopeia.juls.savba.sk>.
For new versions, look at http://kassiopeia.juls.savba.sk/~garabik/software/unicode/

-------------------

paracode exploits the full power of the Unicode standard to convert
text into a visually similar stream of glyphs, while using completely
different codepoints. It is an excellent didactic tool demonstrating the
principles and advanced use of the Unicode standard. paracode is a
command line tool working as a filter, reading standard input in UTF-8
encoding and writing to standard output.

Use the optional -t switch to select which tables to use.

The special name 'all' selects all the tables.

Note that the 'other', 'cyrillic_plus' and 'cherokee' tables (and
'all') make use of rather esoteric characters, and not all fonts
contain them.

The special table 'mirror' uses a quite different character substitution,
is not selected automatically with 'all' and does not work well
with anything except plain ASCII alphabetical characters.

Example:

paracode -t cyrillic+greek+cherokee

paracode -t cherokee <input >output

paracode -r -t mirror <input >output


Possible tables are:

cyrillic

cyrillic_plus

greek

other

cherokee

all
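
The underlying mechanism is just a per-character table lookup over
NFKD-normalized input; a minimal Python 2 sketch of the principle, with
the table abbreviated to three entries:

import unicodedata

table = {
    'a': u'\N{CYRILLIC SMALL LETTER A}',
    'e': u'\N{CYRILLIC SMALL LETTER IE}',
    'o': u'\N{CYRILLIC SMALL LETTER O}',
}

def convert(s):
    s = unicodedata.normalize('NFKD', s)         # split off combining marks
    out = u''.join(table.get(c, c) for c in s)   # substitute look-alikes
    return unicodedata.normalize('NFKC', out)    # recompose the result

print convert(u'paracode').encode('utf-8')       # same look, different codepoints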

==> unicode-0.9.4/changelog <== (symlink to debian/changelog)

==> unicode-0.9.4/debian/README.Debian <==
unicode for Debian
------------------

packaged as native package, the source resides at
http://kassiopeia.juls.savba.sk/~garabik/software/unicode/

-- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>, Fri, 7 Feb 2003 15:09:19 +0100

==> unicode-0.9.4/debian/changelog <==
unicode (0.9.4) unstable; urgency=low

  * recognise split unihan files (closes: #551789)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Sun, 07 Feb 2010 18:36:29 +0100

unicode (0.9.3) unstable; urgency=low

  * run pylint & pychecker – fix some previously unnoticed bugs

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Mon, 04 May 2009 22:40:51 +0200

unicode (0.9.2) unstable; urgency=low

  * giving "latin alpha" as an argument will now search
    for all the character names containing the "latin.*alpha"
    regular expression, not _either_ "latin" or "alpha" strings
    (closes: #439146), idea from martin f. krafft.
  * added forgotten README-paracode to the docfiles

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Thu, 30 Oct 2008 18:58:48 +0100

unicode (0.9.1) unstable; urgency=low

  * add package URL to debian/copyright and
    debian/README.Debian (closes: #495555)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Sat, 23 Aug 2008 10:28:02 +0200

unicode (0.9) unstable; urgency=low

  * include paracode utility
  * clarify GPL version (v3)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Wed, 19 Sep 2007 19:01:55 +0100

unicode (0.8) unstable; urgency=low

  * fix traceback when letter has no uppercase or lowercase forms

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Sun, 1 Oct 2006 21:42:33 +0200

unicode (0.7) unstable; urgency=low

  * updated to use unicode-data (closes: #386853)
  * data files can be bzip2'ed now
  * use data from unicode data files, not from python unicodedata
    module (the latter tends to be obsolete)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Sat, 16 Sep 2006 21:44:34 +0200

unicode (0.6) unstable; urgency=low

  * fix stupid undeclared options bug (thanks to Tim Hatch)
  * remove absolute path from z?grep, rely on OS's default PATH
    to execute the command(s)
  * add default path to UnicodeData.txt for MacOSX systems

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Wed, 4 Jan 2006 19:57:54 +0100

unicode (0.5) unstable; urgency=low

  * work around browser invocations that cannot handle UTF-8 in URLs

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Sun, 1 Jan 2006 00:59:60 +0100

unicode (0.4.9) unstable; urgency=low

  * better directional overriding for RTL characters
  * query wikipedia with -w switch
  * better heuristics guessing argument type

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Sun, 11 Sep 2005 18:30:59 +0200

unicode (0.4.8) unstable; urgency=low

  * catch an exception if locale.nl_langinfo is not present (thanks to
    Michael Weir)
  * default to no colour if the system is MS Windows
  * put back accidentally disabled left-to-right mark - as a result,
    tabular display of arabic, hebrew and other RTL scripts is
    much better (the bug manifested itself only on powerful i18n terminals,
    such as mlterm)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Fri, 26 Aug 2005 14:25:58 +0200

unicode (0.4.7) unstable; urgency=low

  * some UniHan support (closes: #187214)
  * --color as a synonym for --colour added (closes: #273503)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Thu, 4 Aug 2005 16:36:07 +0200

unicode (0.4.6) unstable; urgency=low

  * change charset guessing (closes: #241889), thanks to Євгeнiй Meщepяĸoв
    (Eugeniy Meshcheryakov) for the patch
  * closes: #229857 - it has been closed together with 215267

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Tue, 20 Apr 2004 15:39:34 +0200

unicode (0.4.5) unstable; urgency=low

  * catch exception if input sequence is invalid in given encoding
    (closes: #188438)
  * automatically find and symlink UnicodeData.txt from perl, if installed
    (thanks to LarstiQ <larstiq @ larstiq.dyndns.org> for the patch)
    (closes: #215267)
  * change architecture to 'all' (closes: #215264)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Wed, 21 Jan 2004 10:30:38 +0100

unicode (0.4) unstable; urgency=low

  * added option to choose colour output (closes: #187215)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Wed, 9 Apr 2003 16:37:39 +0200

unicode (0.3.1) unstable; urgency=low

  * added python to Build-depends (closes: #183662)
  * properly quote hyphens in manpage (closes: #186151)
  * do not use UTF-8 in manpage (closes: #186193)
  * added versioned dependency for python2.3 (closes: #186444)

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Mon, 24 Mar 2003 14:39:31 +0100

unicode (0.3) unstable; urgency=low

  * Initial Release.

 -- Radovan Garabík <garabik@kassiopeia.juls.savba.sk>  Fri, 7 Feb 2003 15:09:19 +0100

==> unicode-0.9.4/debian/compat <==
4

==> unicode-0.9.4/debian/control <==
Source: unicode
Section: utils
Priority: optional
Maintainer: Radovan Garabík <garabik@kassiopeia.juls.savba.sk>
Build-Depends: debhelper (>= 4)
Standards-Version: 3.8.0

Package: unicode
Architecture: all
Depends: python (>= 2.3)
Suggests: perl-modules | console-data (<< 2:1.0-1) | unicode-data
Description: display unicode character properties
 unicode is a simple command line utility that displays
 properties for a given unicode character, or searches
 unicode database for a given name.

==> unicode-0.9.4/debian/copyright <==
This program was written by Radovan Garabík
<garabik @ kassiopeia.juls.savba.sk> on
Fri, 7 Feb 2003 15:09:19 +0100, and
packaged for Debian as a native package.

The sources and package can be downloaded from:
http://kassiopeia.juls.savba.sk/~garabik/software/unicode/


Copyright:
© Radovan Garabík,
released under GPL v3, see /usr/share/common-licenses/GPL

==> unicode-0.9.4/debian/dirs <==
usr/bin

==> unicode-0.9.4/debian/docs <==
README
README-paracode

==> unicode-0.9.4/debian/rules <==
#!/usr/bin/make -f
# Sample debian/rules that uses debhelper.
# GNU copyright 1997 to 1999 by Joey Hess.

# Uncomment this to turn on verbose mode.
#export DH_VERBOSE=1

# This is the debhelper compatibility version to use.
#export DH_COMPAT=4



CFLAGS = -Wall -g

ifneq (,$(findstring noopt,$(DEB_BUILD_OPTIONS)))
CFLAGS += -O0
else
CFLAGS += -O2
endif
ifeq (,$(findstring nostrip,$(DEB_BUILD_OPTIONS)))
INSTALL_PROGRAM += -s
endif

configure: configure-stamp
configure-stamp:
	dh_testdir
	# Add here commands to configure the package.

	touch configure-stamp


build: build-stamp

build-stamp: configure-stamp
	dh_testdir

	# Add here commands to compile the package.
	#$(MAKE)
	#/usr/bin/docbook-to-man debian/unicode.sgml > unicode.1

	touch build-stamp

clean:
	dh_testdir
	dh_testroot
	rm -f build-stamp configure-stamp

	# Add here commands to clean up after the build process.
	#-$(MAKE) clean

	dh_clean

install: build
	dh_testdir
	dh_testroot
	dh_clean -k
	dh_installdirs

	# Add here commands to install the package into debian/unicode.
	#$(MAKE) install DESTDIR=$(CURDIR)/debian/unicode
	cp unicode paracode $(CURDIR)/debian/unicode/usr/bin

# Build architecture-dependent files here.
binary-arch: build install
# We have nothing to do by default.

# Build architecture-independent files here.
binary-indep: build install
	dh_testdir
	dh_testroot
#	dh_installdebconf
	dh_installdocs
#	dh_installexamples
#	dh_installmenu
#	dh_installlogrotate
#	dh_installemacsen
#	dh_installpam
#	dh_installmime
#	dh_installinit
#	dh_installcron
	dh_installman unicode.1 paracode.1
#	dh_installinfo
#	dh_undocumented
	dh_installchangelogs
#	dh_link
	dh_strip
	dh_compress
	dh_fixperms
#	dh_makeshlibs
	dh_installdeb
#	dh_perl
#	dh_shlibdeps
#	dh_python
	dh_gencontrol
	dh_md5sums
	dh_builddeb

binary: binary-indep binary-arch
.PHONY: build clean binary-indep binary-arch binary install configure

==> unicode-0.9.4/paracode <==
#!/usr/bin/python

import sys, unicodedata
from optparse import OptionParser



table_cyrillic = {

'A' : u'\N{CYRILLIC CAPITAL LETTER A}',
'B' : u'\N{CYRILLIC CAPITAL LETTER VE}',
'C' : u'\N{CYRILLIC CAPITAL LETTER ES}',
'E' : u'\N{CYRILLIC CAPITAL LETTER IE}',
'H' : u'\N{CYRILLIC CAPITAL LETTER EN}',
'I' : u'\N{CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I}',
'J' : u'\N{CYRILLIC CAPITAL LETTER JE}',
'K' : u'\N{CYRILLIC CAPITAL LETTER KA}',
'M' : u'\N{CYRILLIC CAPITAL LETTER EM}',
'O' : u'\N{CYRILLIC CAPITAL LETTER O}',
'P' : u'\N{CYRILLIC CAPITAL LETTER ER}',
'S' : u'\N{CYRILLIC CAPITAL LETTER DZE}',
'T' : u'\N{CYRILLIC CAPITAL LETTER TE}',
'X' : u'\N{CYRILLIC CAPITAL LETTER HA}',
'Y' : u'\N{CYRILLIC CAPITAL LETTER U}',

'a' : u'\N{CYRILLIC SMALL LETTER A}',
'c' : u'\N{CYRILLIC SMALL LETTER ES}',
'e' : u'\N{CYRILLIC SMALL LETTER IE}',
'i' : u'\N{CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I}',
'j' : u'\N{CYRILLIC SMALL LETTER JE}',
'o' : u'\N{CYRILLIC SMALL LETTER O}',
'p' : u'\N{CYRILLIC SMALL LETTER ER}',
's' : u'\N{CYRILLIC SMALL LETTER DZE}',
'x' : u'\N{CYRILLIC SMALL LETTER HA}',
'y' : u'\N{CYRILLIC SMALL LETTER U}',

}

table_cyrillic_plus = {
'Y' : u'\N{CYRILLIC CAPITAL LETTER STRAIGHT U}',
'h' : u'\N{CYRILLIC SMALL LETTER SHHA}',
}

table_greek = {
'A' : u'\N{GREEK CAPITAL LETTER ALPHA}',
'B' : u'\N{GREEK CAPITAL LETTER BETA}',
'E' : u'\N{GREEK CAPITAL LETTER EPSILON}',
'H' : u'\N{GREEK CAPITAL LETTER ETA}',
'I' : u'\N{GREEK CAPITAL LETTER IOTA}',
'K' : u'\N{GREEK CAPITAL LETTER KAPPA}',
'M' : u'\N{GREEK CAPITAL LETTER MU}',
'N' : u'\N{GREEK CAPITAL LETTER NU}',
'O' : u'\N{GREEK CAPITAL LETTER OMICRON}',
'P' : u'\N{GREEK CAPITAL LETTER RHO}',
'T' : u'\N{GREEK CAPITAL LETTER TAU}',
'X' : u'\N{GREEK CAPITAL LETTER CHI}',
'Y' : u'\N{GREEK CAPITAL LETTER UPSILON}',
'Z' : u'\N{GREEK CAPITAL LETTER ZETA}',

'o' : u'\N{GREEK SMALL LETTER OMICRON}',
}

table_other = {
'!' : u'\N{LATIN LETTER RETROFLEX CLICK}',

'O' : u'\N{ARMENIAN CAPITAL LETTER OH}',
'S' : u'\N{ARMENIAN CAPITAL LETTER TIWN}',
'o' : u'\N{ARMENIAN SMALL LETTER OH}',
'n' : u'\N{ARMENIAN SMALL LETTER VO}',
}

table_cherokee = {
'A' : u'\N{CHEROKEE LETTER GO}',
'B' : u'\N{CHEROKEE LETTER YV}',
'C' : u'\N{CHEROKEE LETTER TLI}',
'D' : u'\N{CHEROKEE LETTER A}',
'E' : u'\N{CHEROKEE LETTER GV}',
'G' : u'\N{CHEROKEE LETTER NAH}',
'H' : u'\N{CHEROKEE LETTER MI}',
'J' : u'\N{CHEROKEE LETTER GU}',
'K' : u'\N{CHEROKEE LETTER TSO}',
'L' : u'\N{CHEROKEE LETTER TLE}',
'M' : u'\N{CHEROKEE LETTER LU}',
'P' : u'\N{CHEROKEE LETTER TLV}',
'R' : u'\N{CHEROKEE LETTER SV}',
'S' : u'\N{CHEROKEE LETTER DU}',
'T' : u'\N{CHEROKEE LETTER I}',
'V' : u'\N{CHEROKEE LETTER DO}',
'W' : u'\N{CHEROKEE LETTER LA}',
'Y' : u'\N{CHEROKEE LETTER GI}',
'Z' : u'\N{CHEROKEE LETTER NO}',

}

table_mirror = {

'A' : u'\N{FOR ALL}',
'B' : u'\N{CANADIAN SYLLABICS CARRIER KHA}',
'C' : u'\N{LATIN CAPITAL LETTER OPEN O}',
'D' : u'\N{CANADIAN SYLLABICS CARRIER PA}',
'E' : u'\N{LATIN CAPITAL LETTER REVERSED E}',
'F' : u'\N{TURNED CAPITAL F}',
'G' : u'\N{TURNED SANS-SERIF CAPITAL G}',
'H' : u'H',
'I' : u'I',
'J' : u'\N{LATIN SMALL LETTER LONG S}',
'K' : u'\N{LATIN SMALL LETTER TURNED K}', # fixme
'L' : u'\N{TURNED SANS-SERIF CAPITAL L}',
'M' : u'W',
'N' : u'N',
'O' : u'O',
'P' : u'\N{CYRILLIC CAPITAL LETTER KOMI DE}',
'R' : u'\N{CANADIAN SYLLABICS TLHO}',
'S' : u'S',
'T' : u'\N{UP TACK}',
'U' : u'\N{ARMENIAN CAPITAL LETTER VO}',
'V' : u'\N{N-ARY LOGICAL AND}',
'W' : u'M',
'X' : u'X',
'Y' : u'\N{TURNED SANS-SERIF CAPITAL Y}',
'Z' : u'Z',

'a' : u'\N{LATIN SMALL LETTER TURNED A}',
'b' : u'q',
'c' : u'\N{LATIN SMALL LETTER OPEN O}',
'd' : u'p',
'e' : u'\N{LATIN SMALL LETTER SCHWA}',
'f' : u'\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}',
'g' : u'\N{LATIN SMALL LETTER B WITH HOOK}',
'h' : u'\N{LATIN SMALL LETTER TURNED H}',
'i' : u'\N{LATIN SMALL LETTER DOTLESS I}' + u'\N{COMBINING DOT BELOW}',
'j' : u'\N{LATIN SMALL LETTER LONG S}' + u'\N{COMBINING DOT BELOW}',
'k' : u'\N{LATIN SMALL LETTER TURNED K}',
'l' : u'l',
'm' : u'\N{LATIN SMALL LETTER TURNED M}',
'n' : u'u',
'o' : u'o',
'p' : u'd',
'q' : u'b',
'r' : u'\N{LATIN SMALL LETTER TURNED R}',
's' : u's',
't' : u'\N{LATIN SMALL LETTER TURNED T}',
'u' : u'n',
'v' : u'\N{LATIN SMALL LETTER TURNED V}',
'w' : u'\N{LATIN SMALL LETTER TURNED W}',
'x' : u'x',
'y' : u'\N{LATIN SMALL LETTER TURNED Y}',
'z' : u'z',

'0' : '0',
'1' : u'I',
'2' : u'\N{INVERTED QUESTION MARK}\N{COMBINING MACRON}',
'3' : u'\N{LATIN CAPITAL LETTER OPEN E}',
'4' : u'\N{LATIN SMALL LETTER LZ DIGRAPH}',
'6' : '9',
'7' : u'\N{LATIN CAPITAL LETTER L WITH STROKE}',
'8' : '8',
'9' : '6',
',' : "'",
"'" : ',',
'.' : u'\N{DOT ABOVE}',
'?' : u'\N{INVERTED QUESTION MARK}',
'!' : u'\N{INVERTED EXCLAMATION MARK}',


}


tables_names = ['cyrillic', 'cyrillic_plus', 'greek',
'other', 'cherokee']

table_default = table_cyrillic
table_default.update(table_greek)

table_all = {}
for t in tables_names:
    table_all.update(globals()['table_'+t])

parser = OptionParser(usage="usage: %prog [options]")

parser.add_option("-t", "--tables",
action="store", default='default', dest="tables", type="string",
help="""list of tables to use, separated by a plus sign.
Possible tables are: """+'+'.join(tables_names)+""" and a special name 'all' to specify
all these tables joined together.
There is another table, 'mirror', that is not selected in 'all'.""")

parser.add_option("-r", "--reverse",
action="count", dest="reverse",
default=0,
help="Reverse the text after conversion. Best used with the 'mirror' table.")

(options, args) = parser.parse_args()

if args:
    to_convert = ' '.join(args).decode('utf-8')
else:
    to_convert = None

tables = options.tables.split('+')
tables = ['table_'+x for x in tables]

tables = [globals()[x] for x in tables]

table = {}
for t in tables:
    table.update(t)

def reverse_string(s):
    l = list(s)
    l.reverse()
    r = ''.join(l)
    return r

def do_convert(s, reverse=0):
    if reverse:
        s = reverse_string(s)
    # decompose first, so that accented letters are substituted via their base form
    l = unicodedata.normalize('NFKD', s)
    out = []
    for c in l:
        out.append(table.get(c, c))
    out = ''.join(out)
    out = unicodedata.normalize('NFKC', out)
    return out

if not to_convert:
    if options.reverse:
        lines = sys.stdin.readlines()
        lines.reverse()
    else:
        lines = sys.stdin

    for line in lines:
        l = line.decode('utf-8')
        out = do_convert(l, options.reverse)
        sys.stdout.write(out.encode('utf-8'))

else:
    out = do_convert(to_convert, options.reverse)
    sys.stdout.write(out.encode('utf-8'))
    sys.stdout.write('\n')

==> unicode-0.9.4/paracode.1 <==
.\" Hey, EMACS: -*- nroff -*-
.TH PARACODE 1 "2005-04-16"
.SH NAME
paracode \- command line Unicode conversion tool
.SH SYNOPSIS
.B paracode
.RI [ -t tables ]
string
.SH DESCRIPTION
This manual page documents the
.B paracode
command.
.PP
\fBparacode\fP exploits the full power of the Unicode standard to convert text
into a visually similar stream of glyphs, while using completely different codepoints.
It is an excellent didactic tool demonstrating the principles and advanced use of
the Unicode standard.
.PP
\fBparacode\fP is a command line tool working as
a filter, reading standard input in UTF-8 encoding and writing to
standard output.

.SH OPTIONS
.TP
.BI \-t tables
.BI \-\-tables

Use the given list of conversion tables, separated by a plus sign.

The special name 'all' selects all the tables.

Note that the 'other', 'cyrillic_plus' and 'cherokee' tables (and 'all')
make use of rather esoteric characters, and not all fonts contain them.


The special table 'mirror' uses a quite different character substitution,
is not selected automatically with 'all' and does not work well
with anything except plain ASCII alphabetical characters.

Example:

paracode -t cyrillic+greek+cherokee

paracode -t cherokee <input >output

paracode -r -t mirror <input >output



Possible tables are:

cyrillic

cyrillic_plus

greek

other

cherokee

all

.TP
.BI \-r

Display text in reverse order after conversion, best used together with -t mirror.

.SH SEE ALSO
iconv(1)


.SH AUTHOR
Radovan Garab\('ik <garabik @ kassiopeia.juls.savba.sk>


==> unicode-0.9.4/unicode <==
#!/usr/bin/python

#from __future__ import generators

import os, glob, sys, unicodedata, locale, gzip, re, traceback, string, commands
import urllib, webbrowser

# bz2 was introduced in 2.3, we want this to work also with earlier versions
try:
    import bz2
except ImportError:
    bz2 = None

from optparse import OptionParser

VERSION='0.9.4'


# list of terminals that support bidi
biditerms = ['mlterm']

locale.setlocale(locale.LC_ALL, '')

# guess terminal charset
try:
    iocharsetguess = locale.nl_langinfo(locale.CODESET) or "ascii"
except:
    iocharsetguess = "ascii"

if os.environ.get('TERM') in biditerms and iocharsetguess.lower().startswith('utf'):
    LTR = u'\u202d' # left to right override
else:
    LTR = ''


def out(*args):
    "print args, converting them to the output charset"
    for i in args:
        sys.stdout.write(i.encode(options.iocharset, 'replace'))

colours = {
'none' : "",
'default' : "\033[0m",
'bold' : "\033[1m",
'underline' : "\033[4m",
'blink' : "\033[5m",
'reverse' : "\033[7m",
'concealed' : "\033[8m",

'black' : "\033[30m",
'red' : "\033[31m",
'green' : "\033[32m",
'yellow' : "\033[33m",
'blue' : "\033[34m",
'magenta' : "\033[35m",
'cyan' : "\033[36m",
'white' : "\033[37m",

'on_black' : "\033[40m",
'on_red' : "\033[41m",
'on_green' : "\033[42m",
'on_yellow' : "\033[43m",
'on_blue' : "\033[44m",
'on_magenta' : "\033[45m",
'on_cyan' : "\033[46m",
'on_white' : "\033[47m",

'beep' : "\007",
}


general_category = {
'Lu': 'Letter, Uppercase',
'Ll': 'Letter, Lowercase',
'Lt': 'Letter, Titlecase',
'Lm': 'Letter, Modifier',
'Lo': 'Letter, Other',
'Mn': 'Mark, Non-Spacing',
'Mc': 'Mark, Spacing Combining',
'Me': 'Mark, Enclosing',
'Nd': 'Number, Decimal Digit',
'Nl': 'Number, Letter',
'No': 'Number, Other',
'Pc': 'Punctuation, Connector',
'Pd': 'Punctuation, Dash',
'Ps': 'Punctuation, Open',
'Pe': 'Punctuation, Close',
'Pi': 'Punctuation, Initial quote',
'Pf': 'Punctuation, Final quote',
'Po': 'Punctuation, Other',
'Sm': 'Symbol, Math',
'Sc': 'Symbol, Currency',
'Sk': 'Symbol, Modifier',
'So': 'Symbol, Other',
'Zs': 'Separator, Space',
'Zl': 'Separator, Line',
'Zp': 'Separator, Paragraph',
'Cc': 'Other, Control',
'Cf': 'Other, Format',
'Cs': 'Other, Surrogate',
'Co': 'Other, Private Use',
'Cn': 'Other, Not Assigned',
}

bidi_category = {
'L' : 'Left-to-Right',
'LRE' : 'Left-to-Right Embedding',
'LRO' : 'Left-to-Right Override',
'R' : 'Right-to-Left',
'AL' : 'Right-to-Left Arabic',
'RLE' : 'Right-to-Left Embedding',
'RLO' : 'Right-to-Left Override',
'PDF' : 'Pop Directional Format',
'EN' : 'European Number',
'ES' : 'European Number Separator',
'ET' : 'European Number Terminator',
'AN' : 'Arabic Number',
'CS' : 'Common Number Separator',
'NSM' : 'Non-Spacing Mark',
'BN' : 'Boundary Neutral',
'B' : 'Paragraph Separator',
'S' : 'Segment Separator',
'WS' : 'Whitespace',
'ON' : 'Other Neutrals',
}

comb_classes = {
0: 'Spacing, split, enclosing, reordrant, and Tibetan subjoined',
1: 'Overlays and interior',
7: 'Nuktas',
8: 'Hiragana/Katakana voicing marks',
9: 'Viramas',
10: 'Start of fixed position classes',
199: 'End of fixed position classes',
200: 'Below left attached',
202: 'Below attached',
204: 'Below right attached',
208: 'Left attached (reordrant around single base character)',
210: 'Right attached',
212: 'Above left attached',
214: 'Above attached',
216: 'Above right attached',
218: 'Below left',
220: 'Below',
222: 'Below right',
224: 'Left (reordrant around single base character)',
226: 'Right',
228: 'Above left',
230: 'Above',
232: 'Above right',
233: 'Double below',
234: 'Double above',
240: 'Below (iota subscript)',
}



def get_unicode_properties(ch):
    properties = {}
    if ch in linecache:
        fields = linecache[ch].strip().split(';')
        proplist = ['codepoint', 'name', 'category', 'combining', 'bidi', 'decomposition', 'dummy', 'digit_value', 'numeric_value', 'mirrored', 'unicode1name', 'iso_comment', 'uppercase', 'lowercase', 'titlecase']
        for i, prop in enumerate(proplist):
            if prop!='dummy':
                properties[prop] = fields[i]

        if properties['lowercase']:
            properties['lowercase'] = unichr(int(properties['lowercase'], 16))
        if properties['uppercase']:
            properties['uppercase'] = unichr(int(properties['uppercase'], 16))
        if properties['titlecase']:
            properties['titlecase'] = unichr(int(properties['titlecase'], 16))

        properties['combining'] = int(properties['combining'])
        properties['mirrored'] = properties['mirrored']=='Y'
    else:
        properties['codepoint'] = '%04X' % ord(ch)
        properties['name'] = unicodedata.name(ch, '')
        properties['category'] = unicodedata.category(ch)
        properties['combining'] = unicodedata.combining(ch)
        properties['bidi'] = unicodedata.bidirectional(ch)
        properties['decomposition'] = unicodedata.decomposition(ch)
        properties['digit_value'] = unicodedata.digit(ch, '')
        properties['numeric_value'] = unicodedata.numeric(ch, '')
        properties['mirrored'] = unicodedata.mirrored(ch)
        properties['unicode1name'] = ''
        properties['iso_comment'] = ''
        properties['uppercase'] = ch.upper()
        properties['lowercase'] = ch.lower()
        properties['titlecase'] = ''
    return properties


def do_init():
    HomeDir = os.path.expanduser('~/.unicode')
    HomeUnicodeData = os.path.join(HomeDir, "UnicodeData.txt")
    global UnicodeDataFileNames
    UnicodeDataFileNames = [HomeUnicodeData, '/usr/share/unidata/UnicodeData.txt', '/usr/share/unicode/UnicodeData.txt', './UnicodeData.txt'] + \
        glob.glob('/usr/share/unidata/UnicodeData*.txt') + \
        glob.glob('/usr/share/perl/*/unicore/UnicodeData.txt') + \
        glob.glob('/System/Library/Perl/*/unicore/UnicodeData.txt') # for MacOSX

    HomeUnihanData = os.path.join(HomeDir, "Unihan*")
    global UnihanDataGlobs
    UnihanDataGlobs = [HomeUnihanData, '/usr/share/unidata/Unihan*', '/usr/share/unicode/Unihan*', './Unihan*']


def get_unihan_files():
    fos = [] # list of file names for Unihan data file(s)
    for gl in UnihanDataGlobs:
        fnames = glob.glob(gl)
        fos += fnames
    return fos

def get_unihan_properties_internal(ch):
    properties = {}
    ch = ord(ch)
    global unihan_fs
    for f in unihan_fs:
        fo = OpenGzip(f)
        for l in fo:
            if l.startswith('#'):
                continue
            line = l.strip()
            if not line:
                continue
            char, key, value = line.strip().split('\t')
            if int(char[2:], 16) == ch:
                properties[key] = unicode(value, 'utf-8')
            elif int(char[2:], 16)>ch:
                break
    return properties

def get_unihan_properties_zgrep(ch):
    properties = {}
    global unihan_fs
    ch = ord(ch)
    chs = 'U+%X' % ch
    for f in unihan_fs:
        if f.endswith('.gz'):
            grepcmd = 'zgrep'
        elif f.endswith('.bz2'):
            grepcmd = 'bzgrep'
        else:
            grepcmd = 'grep'
        cmd = grepcmd+' ^'+chs+r'\\b '+f
        status, output = commands.getstatusoutput(cmd)
        output = output.split('\n')
        for l in output:
            if not l:
                continue
            char, key, value = l.strip().split('\t')
            if int(char[2:], 16) == ch:
                properties[key] = unicode(value, 'utf-8')
            elif int(char[2:], 16)>ch:
                break
    return properties

# basic sanity check, if e.g. you run this on MS Windows...
if os.path.exists('/bin/grep'):
    get_unihan_properties = get_unihan_properties_zgrep
else:
    get_unihan_properties = get_unihan_properties_internal



def error(txt):
    out(txt)
    out('\n')
    sys.exit()

def get_gzip_filename(fname):
    "return fname if it exists; otherwise fname+.gz, then fname+.bz2, then None"
    if os.path.exists(fname):
        return fname
    if os.path.exists(fname+'.gz'):
        return fname+'.gz'
    if os.path.exists(fname+'.bz2') and bz2 is not None:
        return fname+'.bz2'
    return None


def OpenGzip(fname):
    "open fname; if it does not exist, try fname.gz or fname.bz2; return a plain file, GzipFile or BZ2File object"
    if os.path.exists(fname) and not (fname.endswith('.gz') or fname.endswith('.bz2')):
        return file(fname)
    if os.path.exists(fname+'.gz'):
        fname = fname+'.gz'
    elif os.path.exists(fname+'.bz2') and bz2 is not None:
        fname = fname+'.bz2'
    if fname.endswith('.gz'):
        return gzip.GzipFile(fname)
    elif fname.endswith('.bz2'):
        return bz2.BZ2File(fname)
    return None
    #raise IOError

def GrepInNames(pattern, fillcache=False):
    p = re.compile(pattern, re.I)
    f = None
    for name in UnicodeDataFileNames:
        f = OpenGzip(name)
        if f != None:
            break
    if not fillcache:
        if not f:
            out("""
Cannot find UnicodeData.txt, please place it into
/usr/share/unidata/UnicodeData.txt,
/usr/share/unicode/UnicodeData.txt, ~/.unicode/ or current
working directory (optionally you can gzip it).
Without the file, searching will be much slower.

""")
            for i in xrange(sys.maxunicode):
                try:
                    name = unicodedata.name(unichr(i))
                    if re.search(p, name):
                        yield myunichr(i)
                except ValueError:
                    pass
        else:
            for l in f:
                if re.search(p, l):
                    r = myunichr(int(l.split(';')[0], 16))
                    linecache[r] = l
                    yield r
            f.close()
    else:
        if f:
            for l in f:
                if re.search(p, l):
                    r = myunichr(int(l.split(';')[0], 16))
                    linecache[r] = l
            f.close()


def myunichr(n):
    try:
        r = unichr(n)
        return r
    except ValueError:
        traceback.print_exc()
        error("Consider recompiling your python interpreter with wide unicode characters")



def is_ascii(s):
    "test whether string s consists entirely of ascii characters"
    try:
        unicode(s, 'ascii')
    except UnicodeDecodeError:
        return False
    return True

def guesstype(arg):
    if not is_ascii(arg):
        return 'string', arg
    elif arg[:2]=='U+' or arg[:2]=='u+': # it is a hexadecimal number
        try:
            val = int(arg[2:], 16)
            if val>sys.maxunicode:
                return 'regexp', arg
            else:
                return 'hexadecimal', arg[2:]
        except ValueError:
            return 'regexp', arg
    elif arg[0] in "Uu" and len(arg)>4:
        try:
            val = int(arg[1:], 16)
            if val>sys.maxunicode:
                return 'regexp', arg
            else:
                return 'hexadecimal', arg[1:] # strip the leading U/u
        except ValueError:
            return 'regexp', arg
    elif len(arg)>=4:
        try:
            val = int(arg, 16)
            if val>sys.maxunicode:
                return 'regexp', arg
            else:
                return 'hexadecimal', arg
        except ValueError:
            return 'regexp', arg
    else:
        return 'string', arg


def process(arglist, t):
    # build a list of values, so that we can combine queries like
    # LATIN ALPHA and search for LATIN.*ALPHA and not names that
    # contain either LATIN or ALPHA
    result = []
    names_query = [] # reserved for queries in names - i.e. -r
    for arg_i in arglist:
        if t==None:
            tp, arg = guesstype(arg_i)
            if tp == 'regexp':
                # if the first argument is guessed to be a regexp, add
                # all the following arguments to the regular expression -
                # this is probably what you wanted, e.g.
                # 'unicode cyrillic be' will now search for the 'cyrillic.*be' regular expression
                t = 'regexp'
        else:
            tp, arg = t, arg_i
        if tp=='hexadecimal':
            val = int(arg, 16)
            r = myunichr(val)
            list(GrepInNames('%04X'%val, fillcache=True)) # fill the table with character properties
            result.append(r)
        elif tp=='decimal':
            val = int(arg, 10)
            r = myunichr(val)
            list(GrepInNames('%04X'%val, fillcache=True))
            result.append(r)
        elif tp=='regexp':
            names_query.append(arg)
        elif tp=='string':
            try:
                unirepr = unicode(arg, options.iocharset)
            except UnicodeDecodeError:
                error("Sequence %s is not valid in charset '%s'." % (repr(arg), options.iocharset))
            unilist = ['%04X'%ord(x) for x in unirepr]
            unireg = '|'.join(unilist)
            list(GrepInNames(unireg, fillcache=True))
            for r in unirepr:
                result.append(r)
    if names_query:
        query = '.*'.join(names_query)
        for r in GrepInNames(query):
            result.append(r)
    return result

def maybe_colours(colour):
    if use_colour:
        return colours[colour]
    else:
        return ""

# format key and value
def printkv(*l):
    for i in range(0, len(l), 2):
        if i<len(l)-2:
            sep = " "
        else:
            sep = "\n"
        k, v = l[i], l[i+1]
        out(maybe_colours('green'))
        out(k)
        out(": ")
        out(maybe_colours('default'))
        out(unicode(v))
        out(sep)


def print_characters(list, maxcount, query_wiki=0):
    """query_wiki - 0 - don't
       1 - spawn browser
    """
    counter = 0
    for c in list:

        if query_wiki:
            ch = urllib.quote(c.encode('utf-8')) # wikipedia uses UTF-8 in names
            wiki_url = 'http://en.wikipedia.org/wiki/'+ch
            webbrowser.open(wiki_url)
            query_wiki = 0 # query only the very first character

        if maxcount:
            counter += 1
            if counter > options.maxcount:
                out("\nToo many characters to display, more than %s, use --max option to change it\n" % options.maxcount)
                return
        properties = get_unicode_properties(c)
        out(maybe_colours('bold'))
        out('U+%04X ' % ord(c))
        if properties['name']:
            out(properties['name'])
        else:
            out(maybe_colours('default'))
            out(" - No such unicode character name in database")
        out(maybe_colours('default'))
        out('\n')

        ar = ["UTF-8", string.join([("%02x" % ord(x)) for x in c.encode('utf-8')]),
              "UTF-16BE", string.join([("%02x" % ord(x)) for x in c.encode('utf-16be')], ''),
              "Decimal", "&#%s;" % ord(c)]
        if options.addcharset:
            try:
                rep = string.join([("%02x" % ord(x)) for x in c.encode(options.addcharset)])
            except UnicodeError:
                rep = "NONE"
            ar.extend([options.addcharset, rep])
        printkv(*ar)

        if properties['combining']:
            pc = " "+c
        else:
            pc = c
        out(pc)
        uppercase = properties['uppercase']
        lowercase = properties['lowercase']
        if uppercase:
            out(" (%s)" % uppercase)
            out('\n')
            printkv("Uppercase", 'U+%04X' % ord(properties['uppercase']))
        elif lowercase:
            out(" (%s)" % properties['lowercase'])
            out('\n')
            printkv("Lowercase", 'U+%04X' % ord(properties['lowercase']))
        else:
            out('\n')
        printkv('Category', properties['category'] + " (%s)" % general_category[properties['category']])

        if properties['numeric_value']:
            printkv('Numeric value', properties['numeric_value'])
        if properties['digit_value']:
            printkv('Digit value', properties['digit_value'])

        bidi = properties['bidi']
        if bidi:
            printkv('Bidi', bidi + " (%s)" % bidi_category[bidi])
        mirrored = properties['mirrored']
        if mirrored:
            out('Character is mirrored\n')
        comb = properties['combining']
        if comb:
            printkv('Combining', str(comb) + " (%s)" % (comb_classes.get(comb, '?')))
        decomp = properties['decomposition']
        if decomp:
            printkv('Decomposition', decomp)
        if options.verbosity>0:
            uhp = get_unihan_properties(c)
            for key in uhp:
                printkv(key, uhp[key])
        out('\n')



def print_block(block):
    # header
    out(" "*10)
    for i in range(16):
        out(".%X " % i)
    out('\n')
    # body
    for i in range(block*16, block*16+16):
        hexi = "%X" % i
        if len(hexi)>3:
            hexi = "%07X" % i
            hexi = hexi[:4]+" "+hexi[4:]
        else:
            hexi = " %03X" % i
        out(LTR+hexi+". ")
        for j in range(16):
            c = unichr(i*16+j)
            if unicodedata.combining(c):
                c = " "+c
            out(c)
            out(' ')
        out('\n')
    out('\n')

def print_blocks(blocks):
    for block in blocks:
        print_block(block)


def is_range(s, typ):
    sp = s.split('..')
    if len(sp)<>2:
        return False
    if not sp[1]:
        sp[1] = sp[0]
    elif not sp[0]:
        sp[0] = sp[1]
    if not sp[0]:
        return False
    low = list(process([sp[0]], typ))
    high = list(process([sp[1]], typ))
    if len(low)<>1 or len(high)<>1:
        return False
    low = ord(low[0])
    high = ord(high[0])
    low = low // 256
    high = high // 256 + 1
    return range(low, high)




parser = OptionParser(usage="usage: %prog [options] arg")
parser.add_option("-x", "--hexadecimal",
action="store_const", const='hexadecimal', dest="type",
help="Assume arg to be hexadecimal number")
parser.add_option("-d", "--decimal",
action="store_const", const='decimal', dest="type",
help="Assume arg to be decimal number")
parser.add_option("-r", "--regexp",
action="store_const", const='regexp', dest="type",
help="Assume arg to be regular expression")
parser.add_option("-s", "--string",
action="store_const", const='string', dest="type",
help="Assume arg to be a sequence of characters")
parser.add_option("-a", "--auto",
action="store_const", const=None, dest="type",
help="Try to guess arg type (default)")
parser.add_option("-m", "--max",
action="store", default=10, dest="maxcount", type="int",
help="Maximal number of codepoints to display, default: 10; 0=unlimited")
parser.add_option("-i", "--io",
action="store", default=iocharsetguess, dest="iocharset", type="string",
help="I/O character set, I am guessing %s" % iocharsetguess)
parser.add_option("-c", "--charset-add",
action="store", dest="addcharset", type="string",
help="Show hexadecimal reprezentation in this additional charset")
parser.add_option("-C", "--colour",
action="store", dest="use_colour", type="string",
default="auto",
help="Use colours, on, off or auto")
parser.add_option('', "--color",
action="store", dest="use_colour", type="string",
default="auto",
help="synonym for --colour")
parser.add_option("-v", "--verbose",
action="count", dest="verbosity",
default=0,
help="Increase verbosity (reads Unihan properties - slow!)")
parser.add_option("-w", "--wikipedia",
action="count", dest="query_wiki",
default=0,
help="Query wikipedia for the character")



(options, arguments) = parser.parse_args()

linecache = {}
do_init()

if len(arguments)==0:
    parser.print_help()
    sys.exit()

if options.use_colour.lower() in ("on", "1", "true", "yes"):
    use_colour = True
elif options.use_colour.lower() in ("off", "0", "false", "no"):
    use_colour = False
else:
    use_colour = sys.stdout.isatty()
    if sys.platform == 'win32':
        use_colour = False



l_args = [] # list of non-range arguments to process
for argum in arguments:
    is_r = is_range(argum, options.type)
    if is_r:
        print_blocks(is_r)
    else:
        l_args.append(argum)

if l_args:
    unihan_fs = []
    if options.verbosity>0:
        unihan_fs = get_unihan_files() # list of file names for Unihan data file(s), empty if not available
        if not unihan_fs:
            out("""
Unihan_*.txt files not found. In order to view Unihan properties,
please place the file into /usr/share/unidata/,
/usr/share/unicode/, ~/.unicode/
or current working directory (optionally you can gzip or bzip2 them).
You can get the files by unpacking ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip
Warning, listing UniHan Properties is rather slow.

""")
            options.verbosity = 0
    try:
        print_characters(process(l_args, options.type), options.maxcount, options.query_wiki)
    except IOError: # e.g. broken pipe
        pass


==> unicode-0.9.4/unicode.1 <==
.\" Hey, EMACS: -*- nroff -*-
.TH UNICODE 1 "2003-01-31"
.SH NAME
unicode \- command line unicode database query tool
.SH SYNOPSIS
.B unicode
.RI [ options ]
string
.SH DESCRIPTION
This manual page documents the
.B unicode
command.
.PP
\fBunicode\fP is a command line unicode database query tool.

.SH OPTIONS
.TP
.BI \-h
.BI \-\-help

Show help and exit.

.TP
.BI \-x
.BI \-\-hexadecimal

Assume
.I string
to be a hexadecimal number

.TP
.BI \-d
.BI \-\-decimal

Assume
.I string
to be a decimal number

.TP
.BI \-r
.BI \-\-regexp

Assume
.I string
to be a regular expression

.TP
.BI \-s
.BI \-\-string

Assume
.I string
to be a sequence of characters

.TP
.BI \-a
.BI \-\-auto

Try to guess type of
.I string
from one of the above (default)

.TP
.BI \-mMAXCOUNT
.BI \-\-max=MAXCOUNT

Maximal number of codepoints to display, default: 10; use 0 for unlimited

.TP
.BI \-iCHARSET
.BI \-\-io=IOCHARSET

I/O character set. For maximal pleasure, run \fBunicode\fP on UTF-8
capable terminal and specify IOCHARSET to be UTF-8. \fBunicode\fP
tries to guess this value from your locale, so with properly set up
locale, you should not need to specify it.

.TP
.BI \-cADDCHARSET
.BI \-\-charset\-add=ADDCHARSET

Show hexadecimal representation of displayed characters in this additional charset.

.TP
.BI \-CUSE_COLOUR
.BI \-\-colour=USE_COLOUR

USE_COLOUR is one of
.I on
.I off
.I auto

.B \-\-colour=on
will use ANSI colour codes to colourise the output

.B \-\-colour=off
won't use colours.

.B \-\-colour=auto
will test if standard output is a tty, and use colours only when it is.

.BI \-\-color
is a synonym of
.BI \-\-colour

.TP
.BI \-v
.BI \-\-verbose

Be more verbose about displayed characters, e.g. display Unihan information, if available.

.TP
.BI \-w
.BI \-\-wikipedia

Spawn browser pointing to Wikipedia entry about the character.

.SH USAGE

\fBunicode\fP tries to guess the type of an argument. For example,
you can use any of the following to display information about
U+00E1 LATIN SMALL LETTER A WITH ACUTE (\('a):

\fBunicode\fP 00E1

\fBunicode\fP U+00E1

\fBunicode\fP \('a

\fBunicode\fP 'latin small letter a with acute'


You can specify a range of characters as arguments, \fBunicode\fP will
show these characters in a nice tabular format, aligned to 256-codepoint boundaries.
Use two dots ".." to indicate the range, e.g.

\fBunicode\fP 0450..0520

will display the whole cyrillic, armenian and hebrew blocks (characters from U+0400 to U+05FF)

\fBunicode\fP 0400..

will display just characters from U+0400 up to U+04FF

.SH BUGS
Tabular format does not deal well with full-width, combining, control
and RTL characters.

.SH SEE ALSO
ascii(1)


.SH AUTHOR
Radovan Garab\('ik <garabik@melkor.dnp.fmph.uniba.sk>


 
design & coding: Vladimir Lettiev aka crux © 2004-2005, Andrew Avramenko aka liks © 2007-2008
current maintainer: Michael Shigorin