---- unicode-0.9.4/COPYING ----

GPL v3

---- unicode-0.9.4/README ----

This file is in UTF-8 encoding.

To use the unicode utility, you need:
 - python >= 2.2 (generators are needed), preferably a wide unicode build,
 - the python optparse library (part of python 2.3),
 - the UnicodeData.txt file (http://www.unicode.org/Public), which you should
   put into /usr/share/unicode/, ~/.unicode/ or the current working directory,
 - if you want to see UniHan properties, you also need the Unihan.txt file,
   which should be put into /usr/share/unicode/, ~/.unicode/ or the current
   working directory.

Enter a regular expression or a hexadecimal number as an argument. There is
not much documentation at the moment (see the manpage); here are just some
examples of how to use this script:

$ unicode.py euro
U+20A0 EURO-CURRENCY SIGN
UTF-8: e2 82 a0  UTF-16BE: 20a0  Decimal: &#8352;
₠
Category: Sc (Symbol, Currency)
Bidi: ET (European Number Terminator)

U+20AC EURO SIGN
UTF-8: e2 82 ac  UTF-16BE: 20ac  Decimal: &#8364;
€
Category: Sc (Symbol, Currency)
Bidi: ET (European Number Terminator)

$ unicode.py 00c0
U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
UTF-8: c3 80  UTF-16BE: 00c0  Decimal: &#192;
À (à)
Lowercase: U+00E0
Category: Lu (Letter, Uppercase)
Bidi: L (Left-to-Right)
Decomposition: 0041 0300

You can specify a range of characters as arguments; unicode will show these
characters in a nice tabular format, aligned to 256-codepoint boundaries.
Use two dots ".." to indicate the range, e.g.

unicode 0450..0520

will display the whole cyrillic, armenian and hebrew blocks (characters from
U+0400 to U+05FF), while

unicode 0400..

will display just the characters from U+0400 up to U+04FF.

---- unicode-0.9.4/README-paracode ----

Written by Radovan Garabík.
For new versions, look at
http://kassiopeia.juls.savba.sk/~garabik/software/unicode/

-------------------

paracode exploits the full power of the Unicode standard to convert text
into a visually similar stream of glyphs, while using completely different
codepoints. It is an excellent didactic tool demonstrating the principles
and advanced use of the Unicode standard.

paracode is a command line tool working as a filter, reading standard input
in UTF-8 encoding and writing to standard output.

Use the optional -t switch to select which tables to use. The special name
'all' selects all the tables. Note that selecting the 'other',
'cyrillic_plus' and 'cherokee' tables (and 'all') makes use of rather
esoteric characters, and not all fonts contain them.

The special table 'mirror' uses a quite different character substitution;
it is not selected automatically with 'all' and does not work well with
anything except plain ascii alphabetical characters.

Example:

paracode -t cyrillic+greek+cherokee <input >output
paracode -t cherokee <input >output
paracode -r -t mirror <input >output

Possible tables are: cyrillic cyrillic_plus greek other cherokee all
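A quick illustration of what a single table does (a sketch; the output
glyphs below render identically to the input, but follow from the cyrillic
table shipped in paracode, so the codepoints differ):

$ echo Paradox | paracode -t cyrillic
Раrаdох

Here "P", both "a"s, "o" and "x" have become the Cyrillic codepoints
U+0420, U+0430, U+043E and U+0445; only "r" and "d" remain ASCII.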
---- unicode-0.9.4/changelog -> debian/changelog (symlink) ----

---- unicode-0.9.4/debian/README.Debian ----

unicode for Debian
------------------

Packaged as a native package; the source resides at
http://kassiopeia.juls.savba.sk/~garabik/software/unicode/

 -- Radovan Garabík, Fri,  7 Feb 2003 15:09:19 +0100

---- unicode-0.9.4/debian/changelog ----

unicode (0.9.4) unstable; urgency=low

  * recognise split unihan files (closes: #551789)

 -- Radovan Garabík  Sun, 07 Feb 2010 18:36:29 +0100

unicode (0.9.3) unstable; urgency=low

  * run pylint & pychecker – fix some previously unnoticed bugs

 -- Radovan Garabík  Mon, 04 May 2009 22:40:51 +0200

unicode (0.9.2) unstable; urgency=low

  * giving "latin alpha" as an argument will now search for all the
    character names containing the "latin.*alpha" regular expression,
    not _either_ the "latin" or "alpha" strings (closes: #439146),
    idea from martin f. krafft.
  * added forgotten README-paracode to the docfiles

 -- Radovan Garabík  Thu, 30 Oct 2008 18:58:48 +0100

unicode (0.9.1) unstable; urgency=low

  * add package URL to debian/copyright and debian/README.Debian
    (closes: #495555)

 -- Radovan Garabík  Sat, 23 Aug 2008 10:28:02 +0200

unicode (0.9) unstable; urgency=low

  * include paracode utility
  * clarify GPL version (v3)

 -- Radovan Garabík  Wed, 19 Sep 2007 19:01:55 +0100

unicode (0.8) unstable; urgency=low

  * fix traceback when a letter has no uppercase or lowercase form

 -- Radovan Garabík  Sun, 1 Oct 2006 21:42:33 +0200

unicode (0.7) unstable; urgency=low

  * updated to use unicode-data (closes: #386853)
  * data files can be bzip2'ed now
  * use data from the unicode data files, not from the python unicodedata
    module (the latter tends to be obsolete)

 -- Radovan Garabík  Sat, 16 Sep 2006 21:44:34 +0200

unicode (0.6) unstable; urgency=low

  * fix stupid undeclared options bug (thanks to Tim Hatch)
  * remove absolute path from z?grep, rely on the OS's default PATH to
    execute the command(s)
  * add default path to UnicodeData.txt for MacOSX systems

 -- Radovan Garabík  Wed, 4 Jan 2006 19:57:54 +0100

unicode (0.5) unstable; urgency=low

  * work around browser invocations that cannot handle UTF-8 in URLs

 -- Radovan Garabík  Sun, 1 Jan 2006 00:59:60 +0100

unicode (0.4.9) unstable; urgency=low

  * better directional overriding for RTL characters
  * query wikipedia with the -w switch
  * better heuristics guessing argument type

 -- Radovan Garabík  Sun, 11 Sep 2005 18:30:59 +0200

unicode (0.4.8) unstable; urgency=low

  * catch an exception if locale.nl_langinfo is not present
    (thanks to Michael Weir)
  * default to no colour if the system is MS Windows
  * put back accidentally disabled left-to-right mark - as a result,
    tabular display of arabic, hebrew and other RTL scripts is much
    better (the bug manifested itself only on powerful i18n terminals,
    such as mlterm)

 -- Radovan Garabík  Fri, 26 Aug 2005 14:25:58 +0200

unicode (0.4.7) unstable; urgency=low

  * some UniHan support (closes: #187214)
  * --color added as a synonym for --colour (closes: #273503)

 -- Radovan Garabík  Thu, 4 Aug 2005 16:36:07 +0200
unicode (0.4.6) unstable; urgency=low

  * change charset guessing (closes: #241889), thanks to
    Євгеній Мещеряков (Eugeniy Meshcheryakov) for the patch
  * closes: #229857 - it has been closed together with #215267

 -- Radovan Garabík  Tue, 20 Apr 2004 15:39:34 +0200

unicode (0.4.5) unstable; urgency=low

  * catch the exception if the input sequence is invalid in the given
    encoding (closes: #188438)
  * automatically find and symlink UnicodeData.txt from perl, if
    installed (thanks to LarstiQ for the patch) (closes: #215267)
  * change architecture to 'all' (closes: #215264)

 -- Radovan Garabík  Wed, 21 Jan 2004 10:30:38 +0100

unicode (0.4) unstable; urgency=low

  * added option to choose colour output (closes: #187215)

 -- Radovan Garabík  Wed, 9 Apr 2003 16:37:39 +0200

unicode (0.3.1) unstable; urgency=low

  * added python to Build-Depends (closes: #183662)
  * properly quote hyphens in manpage (closes: #186151)
  * do not use UTF-8 in manpage (closes: #186193)
  * added versioned dependency for python2.3 (closes: #186444)

 -- Radovan Garabík  Mon, 24 Mar 2003 14:39:31 +0100

unicode (0.3) unstable; urgency=low

  * Initial Release.

 -- Radovan Garabík  Fri, 7 Feb 2003 15:09:19 +0100

---- unicode-0.9.4/debian/compat ----

4

---- unicode-0.9.4/debian/control ----

Source: unicode
Section: utils
Priority: optional
Maintainer: Radovan Garabík
Build-Depends: debhelper (>= 4)
Standards-Version: 3.8.0

Package: unicode
Architecture: all
Depends: python (>= 2.3)
Suggests: perl-modules | console-data (<< 2:1.0-1) | unicode-data
Description: display unicode character properties
 unicode is a simple command line utility that displays properties for a
 given unicode character, or searches the unicode database for a given name.

---- unicode-0.9.4/debian/copyright ----

This program was written by Radovan Garabík on
Fri,  7 Feb 2003 15:09:19 +0100, and packaged for Debian as a native package.

The sources and package can be downloaded from:
http://kassiopeia.juls.savba.sk/~garabik/software/unicode/

Copyright: © Radovan Garabík, released under GPL v3, see
/usr/share/common-licenses/GPL

---- unicode-0.9.4/debian/dirs ----

usr/bin

---- unicode-0.9.4/debian/docs ----

README
README-paracode

---- unicode-0.9.4/debian/rules ----

#!/usr/bin/make -f
# Sample debian/rules that uses debhelper.
# GNU copyright 1997 to 1999 by Joey Hess.

# Uncomment this to turn on verbose mode.
#export DH_VERBOSE=1

# This is the debhelper compatibility version to use.
#export DH_COMPAT=4

CFLAGS = -Wall -g

ifneq (,$(findstring noopt,$(DEB_BUILD_OPTIONS)))
CFLAGS += -O0
else
CFLAGS += -O2
endif
ifeq (,$(findstring nostrip,$(DEB_BUILD_OPTIONS)))
INSTALL_PROGRAM += -s
endif

configure: configure-stamp
configure-stamp:
	dh_testdir
	# Add here commands to configure the package.
	touch configure-stamp

build: build-stamp
build-stamp: configure-stamp
	dh_testdir
	# Add here commands to compile the package.
	#$(MAKE)
	#/usr/bin/docbook-to-man debian/unicode.sgml > unicode.1
	touch build-stamp

clean:
	dh_testdir
	dh_testroot
	rm -f build-stamp configure-stamp
	# Add here commands to clean up after the build process.
	#-$(MAKE) clean
	dh_clean

install: build
	dh_testdir
	dh_testroot
	dh_clean -k
	dh_installdirs
	# Add here commands to install the package into debian/unicode.
	#$(MAKE) install DESTDIR=$(CURDIR)/debian/unicode
	cp unicode paracode $(CURDIR)/debian/unicode/usr/bin

# Build architecture-dependent files here.
binary-arch: build install
# We have nothing to do by default.

# Build architecture-independent files here.
binary-indep: build install
	dh_testdir
	dh_testroot
#	dh_installdebconf
	dh_installdocs
#	dh_installexamples
#	dh_installmenu
#	dh_installlogrotate
#	dh_installemacsen
#	dh_installpam
#	dh_installmime
#	dh_installinit
#	dh_installcron
	dh_installman unicode.1 paracode.1
#	dh_installinfo
#	dh_undocumented
	dh_installchangelogs
#	dh_link
	dh_strip
	dh_compress
	dh_fixperms
#	dh_makeshlibs
	dh_installdeb
#	dh_perl
#	dh_shlibdeps
#	dh_python
	dh_gencontrol
	dh_md5sums
	dh_builddeb

binary: binary-indep binary-arch
.PHONY: build clean binary-indep binary-arch binary install configure

---- unicode-0.9.4/paracode ----

#!/usr/bin/python

import sys, unicodedata
from optparse import OptionParser

table_cyrillic = {
    'A' : u'\N{CYRILLIC CAPITAL LETTER A}',
    'B' : u'\N{CYRILLIC CAPITAL LETTER VE}',
    'C' : u'\N{CYRILLIC CAPITAL LETTER ES}',
    'E' : u'\N{CYRILLIC CAPITAL LETTER IE}',
    'H' : u'\N{CYRILLIC CAPITAL LETTER EN}',
    'I' : u'\N{CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I}',
    'J' : u'\N{CYRILLIC CAPITAL LETTER JE}',
    'K' : u'\N{CYRILLIC CAPITAL LETTER KA}',
    'M' : u'\N{CYRILLIC CAPITAL LETTER EM}',
    'O' : u'\N{CYRILLIC CAPITAL LETTER O}',
    'P' : u'\N{CYRILLIC CAPITAL LETTER ER}',
    'S' : u'\N{CYRILLIC CAPITAL LETTER DZE}',
    'T' : u'\N{CYRILLIC CAPITAL LETTER TE}',
    'X' : u'\N{CYRILLIC CAPITAL LETTER HA}',
    'Y' : u'\N{CYRILLIC CAPITAL LETTER U}',
    'a' : u'\N{CYRILLIC SMALL LETTER A}',
    'c' : u'\N{CYRILLIC SMALL LETTER ES}',
    'e' : u'\N{CYRILLIC SMALL LETTER IE}',
    'i' : u'\N{CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I}',
    'j' : u'\N{CYRILLIC SMALL LETTER JE}',
    'o' : u'\N{CYRILLIC SMALL LETTER O}',
    'p' : u'\N{CYRILLIC SMALL LETTER ER}',
    's' : u'\N{CYRILLIC SMALL LETTER DZE}',
    'x' : u'\N{CYRILLIC SMALL LETTER HA}',
    'y' : u'\N{CYRILLIC SMALL LETTER U}',
}

table_cyrillic_plus = {
    'Y' : u'\N{CYRILLIC CAPITAL LETTER STRAIGHT U}',
    'h' : u'\N{CYRILLIC SMALL LETTER SHHA}',
}

table_greek = {
    'A' : u'\N{GREEK CAPITAL LETTER ALPHA}',
    'B' : u'\N{GREEK CAPITAL LETTER BETA}',
    'E' : u'\N{GREEK CAPITAL LETTER EPSILON}',
    'H' : u'\N{GREEK CAPITAL LETTER ETA}',
    'I' : u'\N{GREEK CAPITAL LETTER IOTA}',
    'K' : u'\N{GREEK CAPITAL LETTER KAPPA}',
    'M' : u'\N{GREEK CAPITAL LETTER MU}',
    'N' : u'\N{GREEK CAPITAL LETTER NU}',
    'O' : u'\N{GREEK CAPITAL LETTER OMICRON}',
    'P' : u'\N{GREEK CAPITAL LETTER RHO}',
    'T' : u'\N{GREEK CAPITAL LETTER TAU}',
    'X' : u'\N{GREEK CAPITAL LETTER CHI}',
    'Y' : u'\N{GREEK CAPITAL LETTER UPSILON}',
    'Z' : u'\N{GREEK CAPITAL LETTER ZETA}',
    'o' : u'\N{GREEK SMALL LETTER OMICRON}',
}
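# Each table maps an ASCII character to a visually similar lookalike from
# another script.  A quick way to check any single mapping interactively
# (an illustrative sketch, not part of the program):
#
#   >>> import unicodedata
#   >>> unicodedata.name(table_cyrillic['a'])
#   'CYRILLIC SMALL LETTER A'
#   >>> '%04X' % ord(table_cyrillic['a'])
#   '0430'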
table_other = {
    '!' : u'\N{LATIN LETTER RETROFLEX CLICK}',
    'O' : u'\N{ARMENIAN CAPITAL LETTER OH}',
    'S' : u'\N{ARMENIAN CAPITAL LETTER TIWN}',
    'o' : u'\N{ARMENIAN SMALL LETTER OH}',
    'n' : u'\N{ARMENIAN SMALL LETTER VO}',
}

table_cherokee = {
    'A' : u'\N{CHEROKEE LETTER GO}',
    'B' : u'\N{CHEROKEE LETTER YV}',
    'C' : u'\N{CHEROKEE LETTER TLI}',
    'D' : u'\N{CHEROKEE LETTER A}',
    'E' : u'\N{CHEROKEE LETTER GV}',
    'G' : u'\N{CHEROKEE LETTER NAH}',
    'H' : u'\N{CHEROKEE LETTER MI}',
    'J' : u'\N{CHEROKEE LETTER GU}',
    'K' : u'\N{CHEROKEE LETTER TSO}',
    'L' : u'\N{CHEROKEE LETTER TLE}',
    'M' : u'\N{CHEROKEE LETTER LU}',
    'P' : u'\N{CHEROKEE LETTER TLV}',
    'R' : u'\N{CHEROKEE LETTER SV}',
    'S' : u'\N{CHEROKEE LETTER DU}',
    'T' : u'\N{CHEROKEE LETTER I}',
    'V' : u'\N{CHEROKEE LETTER DO}',
    'W' : u'\N{CHEROKEE LETTER LA}',
    'Y' : u'\N{CHEROKEE LETTER GI}',
    'Z' : u'\N{CHEROKEE LETTER NO}',
}

table_mirror = {
    'A' : u'\N{FOR ALL}',
    'B' : u'\N{CANADIAN SYLLABICS CARRIER KHA}',
    'C' : u'\N{LATIN CAPITAL LETTER OPEN O}',
    'D' : u'\N{CANADIAN SYLLABICS CARRIER PA}',
    'E' : u'\N{LATIN CAPITAL LETTER REVERSED E}',
    'F' : u'\N{TURNED CAPITAL F}',
    'G' : u'\N{TURNED SANS-SERIF CAPITAL G}',
    'H' : u'H',
    'I' : u'I',
    'J' : u'\N{LATIN SMALL LETTER LONG S}',
    'K' : u'\N{LATIN SMALL LETTER TURNED K}', # fixme
    'L' : u'\N{TURNED SANS-SERIF CAPITAL L}',
    'M' : u'W',
    'N' : u'N',
    'O' : u'O',
    'P' : u'\N{CYRILLIC CAPITAL LETTER KOMI DE}',
    'R' : u'\N{CANADIAN SYLLABICS TLHO}',
    'S' : u'S',
    'T' : u'\N{UP TACK}',
    'U' : u'\N{ARMENIAN CAPITAL LETTER VO}',
    'V' : u'\N{N-ARY LOGICAL AND}',
    'W' : u'M',
    'X' : u'X',
    'Y' : u'\N{TURNED SANS-SERIF CAPITAL Y}',
    'Z' : u'Z',
    'a' : u'\N{LATIN SMALL LETTER TURNED A}',
    'b' : u'q',
    'c' : u'\N{LATIN SMALL LETTER OPEN O}',
    'd' : u'p',
    'e' : u'\N{LATIN SMALL LETTER SCHWA}',
    'f' : u'\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}',
    'g' : u'\N{LATIN SMALL LETTER B WITH HOOK}',
    'h' : u'\N{LATIN SMALL LETTER TURNED H}',
    'i' : u'\N{LATIN SMALL LETTER DOTLESS I}' + u'\N{COMBINING DOT BELOW}',
    'j' : u'\N{LATIN SMALL LETTER LONG S}' + u'\N{COMBINING DOT BELOW}',
    'k' : u'\N{LATIN SMALL LETTER TURNED K}',
    'l' : u'l',
    'm' : u'\N{LATIN SMALL LETTER TURNED M}',
    'n' : u'u',
    'o' : u'o',
    'p' : u'd',
    'q' : u'b',
    'r' : u'\N{LATIN SMALL LETTER TURNED R}',
    's' : u's',
    't' : u'\N{LATIN SMALL LETTER TURNED T}',
    'u' : u'n',
    'v' : u'\N{LATIN SMALL LETTER TURNED V}',
    'w' : u'\N{LATIN SMALL LETTER TURNED W}',
    'x' : u'x',
    'y' : u'\N{LATIN SMALL LETTER TURNED Y}',
    'z' : u'z',
    '0' : '0',
    '1' : u'I',
    '2' : u'\N{INVERTED QUESTION MARK}\N{COMBINING MACRON}',
    '3' : u'\N{LATIN CAPITAL LETTER OPEN E}',
    '4' : u'\N{LATIN SMALL LETTER LZ DIGRAPH}',
    '6' : '9',
    '7' : u'\N{LATIN CAPITAL LETTER L WITH STROKE}',
    '8' : '8',
    '9' : '6',
    ',' : "'",
    "'" : ',',
    '.' : u'\N{DOT ABOVE}',
    '?' : u'\N{INVERTED QUESTION MARK}',
    '!' : u'\N{INVERTED EXCLAMATION MARK}',
}

tables_names = ['cyrillic', 'cyrillic_plus', 'greek', 'other', 'cherokee']

table_default = table_cyrillic
table_default.update(table_greek)

table_all = {}
for t in tables_names:
    table_all.update(globals()['table_' + t])

parser = OptionParser(usage="usage: %prog [options]")
parser.add_option("-t", "--tables",
                  action="store", default='default', dest="tables", type="string",
                  help="""list of tables to use, separated by a plus sign.
Possible tables are: """ + '+'.join(tables_names) + """ and a special name
'all' to specify all these tables joined together. There is another table,
'mirror', that is not selected in 'all'.""")
parser.add_option("-r", "--reverse",
                  action="count", dest="reverse", default=0,
                  help="Reverse the text after conversion. Best used with the 'mirror' table.")

(options, args) = parser.parse_args()

if args:
    to_convert = ' '.join(args).decode('utf-8')
else:
    to_convert = None

tables = options.tables.split('+')
tables = ['table_' + x for x in tables]
tables = [globals()[x] for x in tables]
table = {}
for t in tables:
    table.update(t)

def reverse_string(s):
    l = list(s)
    l.reverse()
    r = ''.join(l)
    return r
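# Note on the technique used by do_convert() below (an explanatory sketch):
# the input is first decomposed (NFKD), so an accented character splits into
# a base character plus combining marks; only the base character is then
# looked up in the substitution table, and the final NFKC pass recomposes
# the marks with the new base wherever a precomposed character exists.
# For example, with the cyrillic table:
#   u'\u00eb' ('ë')  --NFKD-->  u'e\u0308'  --table-->  u'\u0435\u0308'
#                    --NFKC-->  u'\u0451'  (CYRILLIC SMALL LETTER IO, 'ё')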
def do_convert(s, reverse=0):
    if reverse:
        s = reverse_string(s)
    l = unicodedata.normalize('NFKD', s)
    out = []
    for c in l:
        out.append(table.get(c, c))
    out = ''.join(out)
    out = unicodedata.normalize('NFKC', out)
    return out

if not to_convert:
    if options.reverse:
        lines = sys.stdin.readlines()
        lines.reverse()
    else:
        lines = sys.stdin
    for line in lines:
        l = line.decode('utf-8')
        out = do_convert(l, options.reverse)
        sys.stdout.write(out.encode('utf-8'))
else:
    out = do_convert(to_convert, options.reverse)
    sys.stdout.write(out.encode('utf-8'))
    sys.stdout.write('\n')
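A quick end-to-end check of the script above (a sketch; the output is
computed from table_mirror, and assumes a font that has the IPA glyphs):

$ paracode -r -t mirror hello
olləɥ

The argument is reversed first ('olleh') and then substituted character by
character, so the result reads as "hello" turned upside down.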
---- unicode-0.9.4/paracode.1 ----

.\" Hey, EMACS: -*- nroff -*-
.TH PARACODE 1 "2005-04-16"
.SH NAME
paracode \- command line Unicode conversion tool
.SH SYNOPSIS
.B paracode
.RI [ \-t " tables" ]
.SH DESCRIPTION
This manual page documents the
.B paracode
command.
.PP
\fBparacode\fP exploits the full power of the Unicode standard to convert
text into a visually similar stream of glyphs, while using completely
different codepoints. It is an excellent didactic tool demonstrating the
principles and advanced use of the Unicode standard.
.PP
\fBparacode\fP is a command line tool working as a filter, reading standard
input in UTF-8 encoding and writing to standard output.
.SH OPTIONS
.TP
.BI \-t " tables"
.BI \-\-tables
Use the given list of conversion tables, separated by a plus sign. The
special name 'all' selects all the tables. Note that selecting the 'other',
'cyrillic_plus' and 'cherokee' tables (and 'all') makes use of rather
esoteric characters, and not all fonts contain them. The special table
'mirror' uses a quite different character substitution; it is not selected
automatically with 'all' and does not work well with anything except plain
ascii alphabetical characters.

Example:

paracode \-t cyrillic+greek+cherokee <input >output

paracode \-t cherokee <input >output

paracode \-r \-t mirror <input >output

Possible tables are: cyrillic cyrillic_plus greek other cherokee all
.TP
.BI \-r
Display text in reverse order after conversion; best used together with
\-t mirror.
.SH SEE ALSO
iconv(1)
.SH AUTHOR
Radovan Garab\('ik

---- unicode-0.9.4/unicode ----

#!/usr/bin/python
#from __future__ import generators

import os, glob, sys, unicodedata, locale, gzip, re, traceback, string, commands
import urllib, webbrowser

# bz2 was introduced in python 2.3; we want this to work with earlier versions too
try:
    import bz2
except ImportError:
    bz2 = None

from optparse import OptionParser

VERSION = '0.9.4'

# list of terminals that support bidi
biditerms = ['mlterm']

locale.setlocale(locale.LC_ALL, '')

# guess the terminal charset
try:
    iocharsetguess = locale.nl_langinfo(locale.CODESET) or "ascii"
except:
    iocharsetguess = "ascii"

if os.environ.get('TERM') in biditerms and iocharsetguess.lower().startswith('utf'):
    LTR = u'\u202d'   # left-to-right override
else:
    LTR = ''

def out(*args):
    "print args, converting them to the output charset"
    for i in args:
        sys.stdout.write(i.encode(options.iocharset, 'replace'))

colours = {
    'none'       : "",
    'default'    : "\033[0m",
    'bold'       : "\033[1m",
    'underline'  : "\033[4m",
    'blink'      : "\033[5m",
    'reverse'    : "\033[7m",
    'concealed'  : "\033[8m",

    'black'      : "\033[30m",
    'red'        : "\033[31m",
    'green'      : "\033[32m",
    'yellow'     : "\033[33m",
    'blue'       : "\033[34m",
    'magenta'    : "\033[35m",
    'cyan'       : "\033[36m",
    'white'      : "\033[37m",

    'on_black'   : "\033[40m",
    'on_red'     : "\033[41m",
    'on_green'   : "\033[42m",
    'on_yellow'  : "\033[43m",
    'on_blue'    : "\033[44m",
    'on_magenta' : "\033[45m",
    'on_cyan'    : "\033[46m",
    'on_white'   : "\033[47m",

    'beep'       : "\007",
}

general_category = {
    'Lu': 'Letter, Uppercase',
    'Ll': 'Letter, Lowercase',
    'Lt': 'Letter, Titlecase',
    'Lm': 'Letter, Modifier',
    'Lo': 'Letter, Other',
    'Mn': 'Mark, Non-Spacing',
    'Mc': 'Mark, Spacing Combining',
    'Me': 'Mark, Enclosing',
    'Nd': 'Number, Decimal Digit',
    'Nl': 'Number, Letter',
    'No': 'Number, Other',
    'Pc': 'Punctuation, Connector',
    'Pd': 'Punctuation, Dash',
    'Ps': 'Punctuation, Open',
    'Pe': 'Punctuation, Close',
    'Pi': 'Punctuation, Initial quote',
    'Pf': 'Punctuation, Final quote',
    'Po': 'Punctuation, Other',
    'Sm': 'Symbol, Math',
    'Sc': 'Symbol, Currency',
    'Sk': 'Symbol, Modifier',
    'So': 'Symbol, Other',
    'Zs': 'Separator, Space',
    'Zl': 'Separator, Line',
    'Zp': 'Separator, Paragraph',
    'Cc': 'Other, Control',
    'Cf': 'Other, Format',
    'Cs': 'Other, Surrogate',
    'Co': 'Other, Private Use',
    'Cn': 'Other, Not Assigned',
}

bidi_category = {
    'L'   : 'Left-to-Right',
    'LRE' : 'Left-to-Right Embedding',
    'LRO' : 'Left-to-Right Override',
    'R'   : 'Right-to-Left',
    'AL'  : 'Right-to-Left Arabic',
    'RLE' : 'Right-to-Left Embedding',
    'RLO' : 'Right-to-Left Override',
    'PDF' : 'Pop Directional Format',
    'EN'  : 'European Number',
    'ES'  : 'European Number Separator',
    'ET'  : 'European Number Terminator',
    'AN'  : 'Arabic Number',
    'CS'  : 'Common Number Separator',
    'NSM' : 'Non-Spacing Mark',
    'BN'  : 'Boundary Neutral',
    'B'   : 'Paragraph Separator',
    'S'   : 'Segment Separator',
    'WS'  : 'Whitespace',
    'ON'  : 'Other Neutrals',
}

comb_classes = {
    0:   'Spacing, split, enclosing, reordrant, and Tibetan subjoined',
    1:   'Overlays and interior',
    7:   'Nuktas',
    8:   'Hiragana/Katakana voicing marks',
    9:   'Viramas',
    10:  'Start of fixed position classes',
    199: 'End of fixed position classes',
    200: 'Below left attached',
    202: 'Below attached',
    204: 'Below right attached',
    208: 'Left attached (reordrant around single base character)',
    210: 'Right attached',
    212: 'Above left attached',
    214: 'Above attached',
    216: 'Above right attached',
    218: 'Below left',
    220: 'Below',
    222: 'Below right',
    224: 'Left (reordrant around single base character)',
    226: 'Right',
    228: 'Above left',
    230: 'Above',
    232: 'Above right',
    233: 'Double below',
    234: 'Double above',
    240: 'Below (iota subscript)',
}
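# For orientation, one record of UnicodeData.txt is a single
# semicolon-separated line; e.g. the record for U+00E1 looks roughly like
# (quoted here for illustration):
#   00E1;LATIN SMALL LETTER A WITH ACUTE;Ll;0;L;0061 0301;;;;N;;;00C1;;00C1
# get_unicode_properties() below assigns these fields by position, skipping
# the decimal-digit field (named 'dummy' in proplist).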
def get_unicode_properties(ch):
    properties = {}
    if ch in linecache:
        fields = linecache[ch].strip().split(';')
        proplist = ['codepoint', 'name', 'category', 'combining', 'bidi',
                    'decomposition', 'dummy', 'digit_value', 'numeric_value',
                    'mirrored', 'unicode1name', 'iso_comment', 'uppercase',
                    'lowercase', 'titlecase']
        for i, prop in enumerate(proplist):
            if prop != 'dummy':
                properties[prop] = fields[i]
        if properties['lowercase']:
            properties['lowercase'] = unichr(int(properties['lowercase'], 16))
        if properties['uppercase']:
            properties['uppercase'] = unichr(int(properties['uppercase'], 16))
        if properties['titlecase']:
            properties['titlecase'] = unichr(int(properties['titlecase'], 16))
        properties['combining'] = int(properties['combining'])
        properties['mirrored'] = properties['mirrored'] == 'Y'
    else:
        properties['codepoint'] = '%04X' % ord(ch)
        properties['name'] = unicodedata.name(ch, '')
        properties['category'] = unicodedata.category(ch)
        properties['combining'] = unicodedata.combining(ch)
        properties['bidi'] = unicodedata.bidirectional(ch)
        properties['decomposition'] = unicodedata.decomposition(ch)
        properties['digit_value'] = unicodedata.digit(ch, '')
        properties['numeric_value'] = unicodedata.numeric(ch, '')
        properties['mirrored'] = unicodedata.mirrored(ch)
        properties['unicode1name'] = ''
        properties['iso_comment'] = ''
        properties['uppercase'] = ch.upper()
        properties['lowercase'] = ch.lower()
        properties['titlecase'] = ''
    return properties

def do_init():
    HomeDir = os.path.expanduser('~/.unicode')
    HomeUnicodeData = os.path.join(HomeDir, "UnicodeData.txt")
    global UnicodeDataFileNames
    UnicodeDataFileNames = [HomeUnicodeData,
                            '/usr/share/unidata/UnicodeData.txt',
                            '/usr/share/unicode/UnicodeData.txt',
                            './UnicodeData.txt'] + \
        glob.glob('/usr/share/unidata/UnicodeData*.txt') + \
        glob.glob('/usr/share/perl/*/unicore/UnicodeData.txt') + \
        glob.glob('/System/Library/Perl/*/unicore/UnicodeData.txt') # for MacOSX
    HomeUnihanData = os.path.join(HomeDir, "Unihan*")
    global UnihanDataGlobs
    UnihanDataGlobs = [HomeUnihanData,
                       '/usr/share/unidata/Unihan*',
                       '/usr/share/unicode/Unihan*',
                       './Unihan*']

def get_unihan_files():
    fos = [] # list of file names for Unihan data file(s)
    for gl in UnihanDataGlobs:
        fnames = glob.glob(gl)
        fos += fnames
    return fos

def get_unihan_properties_internal(ch):
    properties = {}
    ch = ord(ch)
    global unihan_fs
    for f in unihan_fs:
        fo = OpenGzip(f)
        for l in fo:
            if l.startswith('#'):
                continue
            line = l.strip()
            if not line:
                continue
            char, key, value = line.strip().split('\t')
            if int(char[2:], 16) == ch:
                properties[key] = unicode(value, 'utf-8')
            elif int(char[2:], 16) > ch:
                break
    return properties

def get_unihan_properties_zgrep(ch):
    properties = {}
    global unihan_fs
    ch = ord(ch)
    chs = 'U+%X' % ch
    for f in unihan_fs:
        if f.endswith('.gz'):
            grepcmd = 'zgrep'
        elif f.endswith('.bz2'):
            grepcmd = 'bzgrep'
        else:
            grepcmd = 'grep'
        cmd = grepcmd + ' ^' + chs + r'\\b ' + f
        status, output = commands.getstatusoutput(cmd)
        output = output.split('\n')
        for l in output:
            if not l:
                continue
            char, key, value = l.strip().split('\t')
            if int(char[2:], 16) == ch:
                properties[key] = unicode(value, 'utf-8')
            elif int(char[2:], 16) > ch:
                break
    return properties
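# For reference, the zgrep variant above shells out with a command of the
# form (an illustrative example for U+4E2D, path assumed):
#   zgrep ^U+4E2D\\b /usr/share/unidata/Unihan.txt.gz
# The doubled backslash survives the shell as \b, so grep sees the pattern
# ^U+4E2D\b: the codepoint anchored at the start of a line, followed by a
# word boundary.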
# basic sanity check, if e.g. you run this on MS Windows...
if os.path.exists('/bin/grep'):
    get_unihan_properties = get_unihan_properties_zgrep
else:
    get_unihan_properties = get_unihan_properties_internal

def error(txt):
    out(txt)
    out('\n')
    sys.exit()

def get_gzip_filename(fname):
    "return fname if it exists; otherwise try fname+.gz, then fname+.bz2; return None if none of them exist"
    if os.path.exists(fname):
        return fname
    if os.path.exists(fname + '.gz'):
        return fname + '.gz'
    if os.path.exists(fname + '.bz2') and bz2 is not None:
        return fname + '.bz2'
    return None

def OpenGzip(fname):
    "open fname, trying fname.gz or fname.bz2 if fname does not exist; return a file, GzipFile or BZ2File object"
    if os.path.exists(fname) and not (fname.endswith('.gz') or fname.endswith('.bz2')):
        return file(fname)
    if os.path.exists(fname + '.gz'):
        fname = fname + '.gz'
    elif os.path.exists(fname + '.bz2') and bz2 is not None:
        fname = fname + '.bz2'
    if fname.endswith('.gz'):
        return gzip.GzipFile(fname)
    elif fname.endswith('.bz2'):
        return bz2.BZ2File(fname)
    return None # raise IOError

def GrepInNames(pattern, fillcache=False):
    p = re.compile(pattern, re.I)
    f = None
    for name in UnicodeDataFileNames:
        f = OpenGzip(name)
        if f != None:
            break
    if not fillcache:
        if not f:
            out("""
Cannot find UnicodeData.txt, please place it into
/usr/share/unidata/UnicodeData.txt,
/usr/share/unicode/UnicodeData.txt, ~/.unicode/
or the current working directory (optionally you can gzip it).
Without the file, searching will be much slower.

""")
            for i in xrange(sys.maxunicode):
                try:
                    name = unicodedata.name(unichr(i))
                    if re.search(p, name):
                        yield myunichr(i)
                except ValueError:
                    pass
        else:
            for l in f:
                if re.search(p, l):
                    r = myunichr(int(l.split(';')[0], 16))
                    linecache[r] = l
                    yield r
            f.close()
    else:
        if f:
            for l in f:
                if re.search(p, l):
                    r = myunichr(int(l.split(';')[0], 16))
                    linecache[r] = l
            f.close()

def myunichr(n):
    try:
        r = unichr(n)
        return r
    except ValueError:
        traceback.print_exc()
        error("Consider recompiling your python interpreter with wide unicode characters")

def is_ascii(s):
    "test whether string s consists entirely of ascii characters"
    try:
        unicode(s, 'ascii')
    except UnicodeDecodeError:
        return False
    return True

def guesstype(arg):
    if not is_ascii(arg):
        return 'string', arg
    elif arg[:2] == 'U+' or arg[:2] == 'u+':
        # it is a hexadecimal number
        try:
            val = int(arg[2:], 16)
            if val > sys.maxunicode:
                return 'regexp', arg
            else:
                return 'hexadecimal', arg[2:]
        except ValueError:
            return 'regexp', arg
    elif arg[0] in "Uu" and len(arg) > 4:
        try:
            val = int(arg[1:], 16)
            if val > sys.maxunicode:
                return 'regexp', arg
            else:
                return 'hexadecimal', arg[1:]
        except ValueError:
            return 'regexp', arg
    elif len(arg) >= 4:
        try:
            val = int(arg, 16)
            if val > sys.maxunicode:
                return 'regexp', arg
            else:
                return 'hexadecimal', arg
        except ValueError:
            return 'regexp', arg
    else:
        return 'string', arg
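# A few illustrative examples of the heuristic above (a sketch):
#   guesstype('00e1')   -> ('hexadecimal', '00e1')
#   guesstype('U+20AC') -> ('hexadecimal', '20AC')
#   guesstype('é')      -> ('string', 'é')      (non-ascii input)
#   guesstype('euro')   -> ('regexp', 'euro')   (not parseable as hex)
# Beware that a name like 'face' does parse as a hexadecimal number, so an
# explicit -r/--regexp or -s/--string switch is needed in such cases.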
def process(arglist, t):
    # build a list of values, so that we can combine queries like
    # LATIN ALPHA and search for LATIN.*ALPHA, and not for names that
    # contain either LATIN or ALPHA
    result = []
    names_query = [] # reserved for queries in names - i.e. -r
    for arg_i in arglist:
        if t == None:
            tp, arg = guesstype(arg_i)
            if tp == 'regexp':
                # if the first argument is guessed to be a regexp, add
                # all the following arguments to the regular expression -
                # this is probably what you wanted, e.g.
                # 'unicode cyrillic be' will now search for the
                # 'cyrillic.*be' regular expression
                t = 'regexp'
        else:
            tp, arg = t, arg_i
        if tp == 'hexadecimal':
            val = int(arg, 16)
            r = myunichr(val)
            list(GrepInNames('%04X' % val, fillcache=True)) # fill the table with character properties
            result.append(r)
        elif tp == 'decimal':
            val = int(arg, 10)
            r = myunichr(val)
            list(GrepInNames('%04X' % val, fillcache=True))
            result.append(r)
        elif tp == 'regexp':
            names_query.append(arg)
        elif tp == 'string':
            try:
                unirepr = unicode(arg, options.iocharset)
            except UnicodeDecodeError:
                error("Sequence %s is not valid in charset '%s'."
                      % (repr(arg), options.iocharset))
            unilist = ['%04X' % ord(x) for x in unirepr]
            unireg = '|'.join(unilist)
            list(GrepInNames(unireg, fillcache=True))
            for r in unirepr:
                result.append(r)
    if names_query:
        query = '.*'.join(names_query)
        for r in GrepInNames(query):
            result.append(r)
    return result

def maybe_colours(colour):
    if use_colour:
        return colours[colour]
    else:
        return ""

# format keys and values; the body of printkv and the head of
# print_characters were lost in extraction - reconstructed here from the
# output format shown in the README (an assumption, not verbatim source)
def printkv(*l):
    for i in range(0, len(l), 2):
        if i < len(l) - 2:
            sep = '  '
        else:
            sep = ''
        out('%s: %s%s' % (l[i], l[i+1], sep))
    out('\n')

def print_characters(clist, maxcount, query_wiki=0):
    counter = 0
    for c in clist:
        if query_wiki:
            # quote the UTF-8 bytes, since some browser invocations cannot
            # handle raw UTF-8 in URLs (URL scheme assumed here)
            webbrowser.open('http://en.wikipedia.org/wiki/'
                            + urllib.quote(c.encode('utf-8')))
        if maxcount:
            counter += 1
        if counter > options.maxcount:
            out("\nToo many characters to display, more than %s, use --max option to change it\n" % options.maxcount)
            return
        properties = get_unicode_properties(c)
        out(maybe_colours('bold'))
        out('U+%04X ' % ord(c))
        if properties['name']:
            out(properties['name'])
        else:
            out(maybe_colours('default'))
            out(" - No such unicode character name in database")
        out(maybe_colours('default'))
        out('\n')
        ar = ["UTF-8", string.join([("%02x" % ord(x)) for x in c.encode('utf-8')]),
              "UTF-16BE", string.join([("%02x" % ord(x)) for x in c.encode('utf-16be')], ''),
              "Decimal", "&#%s;" % ord(c)]
        if options.addcharset:
            try:
                rep = string.join([("%02x" % ord(x)) for x in c.encode(options.addcharset)])
            except UnicodeError:
                rep = "NONE"
            ar.extend([options.addcharset, rep])
        printkv(*ar)
        if properties['combining']:
            pc = " " + c
        else:
            pc = c
        out(pc)
        uppercase = properties['uppercase']
        lowercase = properties['lowercase']
        if uppercase:
            out(" (%s)" % uppercase)
            out('\n')
            printkv("Uppercase", 'U+%04X' % ord(properties['uppercase']))
        elif lowercase:
            out(" (%s)" % properties['lowercase'])
            out('\n')
            printkv("Lowercase", 'U+%04X' % ord(properties['lowercase']))
        else:
            out('\n')
        printkv('Category', properties['category']
                + " (%s)" % general_category[properties['category']])
        if properties['numeric_value']:
            printkv('Numeric value', properties['numeric_value'])
        if properties['digit_value']:
            printkv('Digit value', properties['digit_value'])
        bidi = properties['bidi']
        if bidi:
            printkv('Bidi', bidi + " (%s)" % bidi_category[bidi])
        mirrored = properties['mirrored']
        if mirrored:
            out('Character is mirrored\n')
        comb = properties['combining']
        if comb:
            printkv('Combining', str(comb) + " (%s)" % (comb_classes.get(comb, '?')))
        decomp = properties['decomposition']
        if decomp:
            printkv('Decomposition', decomp)
        if options.verbosity > 0:
            uhp = get_unihan_properties(c)
            for key in uhp:
                printkv(key, uhp[key])
        out('\n')
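# Rough illustration of the table produced by print_block() below, for
# block 0 (rows 004x/005x shown; spacing approximate):
#
#              .0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .A .B .C .D .E .F
#       004.   @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
#       005.   P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _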
") for j in range(16): c = unichr(i*16+j) if unicodedata.combining(c): c = " "+c out(c) out(' ') out('\n') out('\n') def print_blocks(blocks): for block in blocks: print_block(block) def is_range(s, typ): sp = s.split('..') if len(sp)<>2: return False if not sp[1]: sp[1] = sp[0] elif not sp[0]: sp[0] = sp[1] if not sp[0]: return False low = list(process([sp[0]], typ)) high = list(process([sp[1]], typ)) if len(low)<>1 or len(high)<>1: return False low = ord(low[0]) high = ord(high[0]) low = low // 256 high = high // 256 + 1 return range(low, high) parser = OptionParser(usage="usage: %prog [options] arg") parser.add_option("-x", "--hexadecimal", action="store_const", const='hexadecimal', dest="type", help="Assume arg to be hexadecimal number") parser.add_option("-d", "--decimal", action="store_const", const='decimal', dest="type", help="Assume arg to be decimal number") parser.add_option("-r", "--regexp", action="store_const", const='regexp', dest="type", help="Assume arg to be regular expression") parser.add_option("-s", "--string", action="store_const", const='string', dest="type", help="Assume arg to be a sequence of characters") parser.add_option("-a", "--auto", action="store_const", const=None, dest="type", help="Try to guess arg type (default)") parser.add_option("-m", "--max", action="store", default=10, dest="maxcount", type="int", help="Maximal number of codepoints to display, default: 10; 0=unlimited") parser.add_option("-i", "--io", action="store", default=iocharsetguess, dest="iocharset", type="string", help="I/O character set, I am guessing %s" % iocharsetguess) parser.add_option("-c", "--charset-add", action="store", dest="addcharset", type="string", help="Show hexadecimal reprezentation in this additional charset") parser.add_option("-C", "--colour", action="store", dest="use_colour", type="string", default="auto", help="Use colours, on, off or auto") parser.add_option('', "--color", action="store", dest="use_colour", type="string", default="auto", help="synonym for --colour") parser.add_option("-v", "--verbose", action="count", dest="verbosity", default=0, help="Increase verbosity (reads Unihan properties - slow!)") parser.add_option("-w", "--wikipedia", action="count", dest="query_wiki", default=0, help="Query wikipedia for the character") (options, arguments) = parser.parse_args() linecache = {} do_init() if len(arguments)==0: parser.print_help() sys.exit() if options.use_colour.lower() in ("on", "1", "true", "yes"): use_colour = True elif options.use_colour.lower() in ("off", "0", "false", "no"): use_colour = False else: use_colour = sys.stdout.isatty() if sys.platform == 'win32': use_colour = False l_args = [] # list of non range arguments to process for argum in arguments: is_r = is_range(argum, options.type) if is_r: print_blocks(is_r) else: l_args.append(argum) if l_args: unihan_fs = [] if options.verbosity>0: unihan_fs = get_unihan_files() # list of file names for Unihan data file(s), empty if not available if not unihan_fs: out( """ Unihan_*.txt files not found. In order to view Unihan properties, please place the file into /usr/share/unidata/, /usr/share/unicode/, ~/.unicode/ or current working directory (optionally you can gzip or bzip2 them). You can get the files by unpacking ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip Warning, listing UniHan Properties is rather slow. """) options.verbosity = 0 try: print_characters(process(l_args, options.type), options.maxcount, options.query_wiki) except IOError: # e.g. 
---- unicode-0.9.4/unicode.1 ----

.\" Hey, EMACS: -*- nroff -*-
.TH UNICODE 1 "2003-01-31"
.SH NAME
unicode \- command line unicode database query tool
.SH SYNOPSIS
.B unicode
.RI [ options ] string
.SH DESCRIPTION
This manual page documents the
.B unicode
command.
.PP
\fBunicode\fP is a command line unicode database query tool.
.SH OPTIONS
.TP
.BI \-h
.BI \-\-help
Show help and exit.
.TP
.BI \-x
.BI \-\-hexadecimal
Assume
.I string
to be a hexadecimal number
.TP
.BI \-d
.BI \-\-decimal
Assume
.I string
to be a decimal number
.TP
.BI \-r
.BI \-\-regexp
Assume
.I string
to be a regular expression
.TP
.BI \-s
.BI \-\-string
Assume
.I string
to be a sequence of characters
.TP
.BI \-a
.BI \-\-auto
Try to guess the type of
.I string
from one of the above (default)
.TP
.BI \-mMAXCOUNT
.BI \-\-max=MAXCOUNT
Maximal number of codepoints to display, default: 10; use 0 for unlimited
.TP
.BI \-iCHARSET
.BI \-\-io=IOCHARSET
I/O character set. For maximal pleasure, run \fBunicode\fP on a UTF-8
capable terminal and specify IOCHARSET to be UTF-8. \fBunicode\fP tries to
guess this value from your locale, so with a properly set up locale you
should not need to specify it.
.TP
.BI \-cADDCHARSET
.BI \-\-charset\-add=ADDCHARSET
Show the hexadecimal representation of displayed characters in this
additional charset.
.TP
.BI \-CUSE_COLOUR
.BI \-\-colour=USE_COLOUR
USE_COLOUR is one of
.I on
.I off
.I auto

.B \-\-colour=on
will use ANSI colour codes to colourise the output

.B \-\-colour=off
won't use colours.

.B \-\-colour=auto
will test if standard output is a tty, and use colours only when it is.

.BI \-\-color
is a synonym of
.BI \-\-colour
.TP
.BI \-v
.BI \-\-verbose
Be more verbose about displayed characters, e.g. display Unihan
information, if available.
.TP
.BI \-w
.BI \-\-wikipedia
Spawn a browser pointing to the Wikipedia entry about the character.
.SH USAGE
\fBunicode\fP tries to guess the type of an argument. For example, you can
use any of the following to display information about U+00E1 LATIN SMALL
LETTER A WITH ACUTE (\('a):

\fBunicode\fP 00E1

\fBunicode\fP U+00E1

\fBunicode\fP \('a

\fBunicode\fP 'latin small letter a with acute'

You can specify a range of characters as arguments; \fBunicode\fP will show
these characters in a nice tabular format, aligned to 256-codepoint
boundaries. Use two dots ".." to indicate the range, e.g.

\fBunicode\fP 0450..0520

will display the whole cyrillic, armenian and hebrew blocks (characters
from U+0400 to U+05FF), while

\fBunicode\fP 0400..

will display just the characters from U+0400 up to U+04FF.
.SH BUGS
The tabular format does not deal well with full-width, combining, control
and RTL characters.
.SH SEE ALSO
ascii(1)
.SH AUTHOR
Radovan Garab\('ik