[sword-devel] Character Frequency

Greg Hellings greg.hellings at gmail.com
Sun Jul 3 10:43:16 MST 2011


In fact,

http://dl.thehellings.com/count.py

churns through kjv.xml in 11 seconds on my machine and gives the
desired output of character counts.  It can be invoked with the
name of a file (python count.py kjv.xml), as part of a pipe (cat
kjv.xml | ./count.py), or with a whole list of files (./count.py
kjv.xml kjvfull.xml kjvlite.xml).
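
For anyone who can't fetch the script, here is a minimal sketch of the
same idea. It is not the contents of count.py, just one way to get
per-character counts from named files or from stdin. The function names
(iter_lines, count_chars, format_counts, main) are made up for
illustration, and it assumes the input files are UTF-8.

```python
import sys
from collections import Counter


def iter_lines(paths):
    """Yield lines from each named file, or from stdin if no files are given."""
    if not paths:
        # stdin is decoded with the locale's encoding (an assumption here)
        yield from sys.stdin
        return
    for path in paths:
        # Assumes UTF-8 input; change the encoding if your files differ
        with open(path, encoding="utf-8") as handle:
            yield from handle


def count_chars(lines):
    """Return a Counter mapping each character to how often it occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line)  # a str is iterated one character at a time
    return counts


def format_counts(counts):
    """Yield 'count char' rows, most frequent first (like sort -nr)."""
    for char, n in counts.most_common():
        # repr() keeps spaces, tabs and newlines visible in the output
        yield f"{n:7d} {char!r}"


def main(paths):
    """Print the character frequency table for the given files (or stdin)."""
    for row in format_counts(count_chars(iter_lines(paths))):
        print(row)
```

To use it as a script, call main(sys.argv[1:]) at the bottom of the
file. Unlike the sed pipeline, this counts newline characters too and
prints whitespace via repr(), so those rows stay readable.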

--Greg

On Sun, Jul 3, 2011 at 12:30 PM, Greg Hellings <greg.hellings at gmail.com> wrote:
> A few simple pipes in Unix can do the same thing with relative ease.
>
> cat kjv.xml | sed -e 's/./&\n/g' | sort | uniq -c | sort -nr
> 1669596
> 1661832 "
> 1330866 o
> 1307266 r
> 1172801 s
> 1156121 e
> 1092384 n
> 1029125 m
>  901465 t
>  864037 >
>  864037 <
>  830916 =
>  776214 a
>  772641 w
>  625029 h
>  609087 :
>  560652 g
>  497519 l
>  469056 /
>  406801 i
>  393184 0
>  370919 p
>  350731 1
>  312386 H
>  290358 2
>  283469 8
>  263960 3
>  257239 d
>  220707 .
>  209066 5
>  204056 b
>  197713 4
>  197400 c
>  193701 7
>  183464 6
>  175932 G
>  172006 9
>  152074 -
>  133127 I
>  126782 M
>  121721 D
>  115182 N
>  114636 v
>  113384 T
>  111775 u
>  109108 y
>  107290 P
>  94242 A
>  85226 S
>  84923 f
>  74768 ,
>  73229 C
>  39531 J
>  36203 V
>  35707 k
>  34899
>  25991 E
>  24737 R
>  23948 F
>  20676 O
>  18179 x
>  16367 L
>  10159 ;
>   6930 z
>   5389 K
>   5047 B
>   4036 …
>   3421 ?
>   3283 X
>   2970 ¶
>   2596 j
>   2489 W
>   2334 q
>   2040 '
>   1776 Z
>    797 U
>    551 Y
>    313 !
>    240 )
>    240 (
>    199 Q
>     93 æ
>      5 }
>      5 {
>      3 Æ
>      1 ת
>      1 ש
>      1 ר
>      1 ק
>      1 צ
>      1 פ
>      1 ע
>      1 ס
>      1 נ
>      1 מ
>      1 ל
>      1 כ
>      1 י
>      1 ט
>      1 ח
>      1 ז
>      1 ו
>      1 ה
>      1 ד
>      1 ג
>      1 ב
>      1 א
>
> The format looks a bit nicer on the terminal.  It takes about 75 seconds
> to run on the file. A few simple lines in Python or the like take only
> about 10s and are equally simple to whip up.
>
> --Greg
>
> On Sun, Jul 3, 2011 at 11:53 AM, David Haslam <dfhmch at googlemail.com> wrote:
>> A useful tool for analysing or editing source text files is BabelPad, the
>> Unicode Text Editor (for Windows).
>> http://www.babelstone.co.uk/Software/BabelPad.html
>>
>> One of the Menu Tool Options is Character Frequency.
>>
>> This can be very helpful for detecting unexpected code points, such as
>> those left behind when translators were inconsistent in their editing.
>>
>> David


