<p dir="ltr"><br>
On May 21, 2014 8:00 AM, &quot;Jaak Ristioja&quot; &lt;<a href="mailto:jaak@ristioja.ee">jaak@ristioja.ee</a>&gt; wrote:<br>
&gt;<br>
&gt; So this means that actually we want non-standard RTF (someone should<br>
&gt; update the wiki). Should we assume UTF-8? Are you sure we don&#39;t have any<br>
&gt; modules with ISO-8859-something encoded values?<br>
&gt;</p>
<p dir="ltr">The wiki states that the literal Unicode character is preferred, at least for conf files, over the RTF-escaped value. Specifically, it must be encoded as UTF-8 or CP1252.</p>
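<p dir="ltr">As a rough illustration only (I have not checked how the engine actually reads conf values), a reader could try strict UTF-8 first and fall back to CP1252. The name <code>decode_conf_value</code> and the fallback order are my assumptions, not anything taken from the wiki or the SWORD code:</p>

```python
def decode_conf_value(raw: bytes) -> str:
    """Decode a conf value, assuming it is either UTF-8 or CP1252.

    Hypothetical helper: strict UTF-8 is tried first because most CP1252
    text containing non-ASCII bytes is not valid UTF-8, so a failed
    UTF-8 decode is a reasonable hint that the value is CP1252.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as CP1252.
        # Bytes that CP1252 leaves undefined become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace")


if __name__ == "__main__":
    print(decode_conf_value("€".encode("utf-8")))  # UTF-8 encoded Euro sign
    print(decode_conf_value(b"\x80"))              # CP1252 encoded Euro sign
```

<p dir="ltr">Both calls should yield the same Euro sign, which is the point of accepting either encoding.</p>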
<p dir="ltr">&gt; If we choose any ASCII superset encoding we have to consider at least<br>
&gt; the two points:<br>
&gt;<br>
&gt;   * Since the RTF control words and delimiters are specified in ASCII<br>
&gt; only, we need to decide how the bytes of the superset act as<br>
&gt; delimiters and parts of &quot;RTF&quot; control words. For example, whether the<br>
&gt; Unicode letter, number, spacing, punctuation, control etc. characters<br>
&gt; constitute parts of RTF control words or act as delimiters.<br>
&gt;<br>
&gt;   * In case of encodings where characters may consist of multiple bytes<br>
&gt; (e.g. the variable-length UTF-8) we must consider the character<br>
&gt; boundaries. We can&#39;t just pass through any non-ASCII byte values. For<br>
&gt; example, the following bit sequence wouldn&#39;t make sense:<br>
&gt;<br>
&gt;   11100010 01011100 10000010 01110001 10101100 01100011<br>
&gt;</p>
<p dir="ltr">Did you literally split the individual bytes of the euro character around the other bytes? What valid encoding could possibly permit that? Is that a valid UTF-8 sequence? If not, then the file is not UTF-8 encoded, and the engine will either report an error or otherwise behave in undefined ways because the input is invalid.</p>
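<p dir="ltr">A quick way to answer the validity question is to feed the quoted bit patterns to a strict UTF-8 decoder. This is only a sketch: the helper names <code>bits_to_bytes</code> and <code>is_valid_utf8</code> are made up for illustration, and it uses Python's built-in decoder rather than anything from the engine. The first pattern is the interleaved one quoted above; the other two are the orderings quoted further down in the same message.</p>

```python
def bits_to_bytes(bit_string: str) -> bytes:
    """Convert space-separated 8-bit groups (e.g. '11100010 ...') to bytes."""
    return bytes(int(group, 2) for group in bit_string.split())


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Bit patterns copied from the quoted message.
interleaved = bits_to_bytes("11100010 01011100 10000010 01110001 10101100 01100011")
euro_first = bits_to_bytes("11100010 10000010 10101100 01011100 01110001 01100011")
euro_last = bits_to_bytes("01011100 01110001 01100011 11100010 10000010 10101100")

if __name__ == "__main__":
    for name, data in [("interleaved", interleaved),
                       ("euro_first", euro_first),
                       ("euro_last", euro_last)]:
        print(name, data.hex(), is_valid_utf8(data))
```

<p dir="ltr">The interleaved pattern is rejected because the lead byte 0xE2 announces a three-byte character but is followed by 0x5C (the backslash), which is not a continuation byte. The other two patterns decode cleanly.</p>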

<p dir="ltr">--Greg</p>
<p dir="ltr">&gt; which is a UTF-8 encoded Euro sign, €, interleaved with bytes of the<br>
&gt; ASCII string &quot;\qc&quot;. It just doesn&#39;t make sense, whereas the following<br>
&gt; sequences would be correct:<br>
&gt;<br>
&gt;   11100010 10000010 10101100 01011100 01110001 01100011 (€\qc)<br>
&gt;   01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)<br>
&gt;<br>
&gt; So, depending on the encoding, it would be correct to detect such cases;<br>
&gt; otherwise we end up with invalid Unicode output.<br>
&gt;<br>
&gt; Blessings,<br>
&gt; Jaak<br>
&gt;<br>
&gt; On 21.05.2014 15:19, Chris Burrell wrote:<br>
&gt; &gt; I believe some conf files have direct Unicode (rather than escaped<br>
&gt; &gt; sequences) in them and that is preferred.<br>
&gt; &gt;<br>
&gt; &gt; On 20 May 2014 23:28, &quot;Jaak Ristioja&quot; &lt;<a href="mailto:jaak@ristioja.ee">jaak@ristioja.ee</a>&gt; wrote:<br>
&gt; &gt;<br>
&gt; &gt;     I&#39;ve never done BiDi, but I&#39;m not sure I need to take that into account<br>
&gt; &gt;     while fixing the RTF parsing. As I currently understand it, this<br>
&gt; &gt;     particular piece of code does not support any part of the RTF spec<br>
&gt; &gt;     dealing with bidirectional text handling. Hence all BiDi information<br>
&gt; &gt;     contained in the configuration file strings (e.g. About=) is contained<br>
&gt; &gt;     either in the plain ASCII text or the \u&lt;num&gt; Unicode escapes which this<br>
&gt; &gt;     algorithm should pass through unmodified.<br>
&gt; &gt;<br>
&gt; &gt;     ...except for HTML entities which should actually be escaped. This bug<br>
&gt; &gt;     in the algorithm I previously failed to notice. Additionally I forgot<br>
&gt; &gt;     that non-ASCII characters in the input string should also lead to<br>
&gt; &gt;     parsing failure.<br>
&gt; &gt;<br>
&gt; &gt;     Jaak<br>
&gt; &gt;<br>
&gt; &gt;<br>
&gt; &gt;     On 20.05.2014 21:01, David Haslam wrote:<br>
&gt; &gt;     &gt; Take care with Right to Left languages such as Hebrew.<br>
&gt; &gt;     &gt;<br>
&gt; &gt;     &gt; i.e. After any patches to the filter, please include some testing for BiDi<br>
&gt; &gt;     &gt; text in the About= field and others.<br>
&gt; &gt;     &gt;<br>
&gt; &gt;     &gt; David<br>
&gt; &gt;     &gt;<br>
&gt; &gt;     &gt;<br>
&gt; &gt;     &gt;<br>
&gt; &gt;     &gt; --<br>
&gt; &gt;     &gt; View this message in context:<br>
&gt; &gt;     <a href="http://sword-dev.350566.n4.nabble.com/RTFHTML-filter-bugs-tp4653969p4653970.html">http://sword-dev.350566.n4.nabble.com/RTFHTML-filter-bugs-tp4653969p4653970.html</a><br>
&gt; &gt;     &gt; Sent from the SWORD Dev mailing list archive at Nabble.com.<br>
&gt; &gt;     &gt;<br>
&gt; &gt;     &gt; _______________________________________________<br>
&gt; &gt;     &gt; sword-devel mailing list: <a href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a><br>
&gt; &gt;     &lt;mailto:<a href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a>&gt;<br>
&gt; &gt;     &gt; <a href="http://www.crosswire.org/mailman/listinfo/sword-devel">http://www.crosswire.org/mailman/listinfo/sword-devel</a><br>
&gt; &gt;     &gt; Instructions to unsubscribe/change your settings at above page<br>
&gt; &gt;     &gt;<br>
&gt; &gt;<br>
&gt;<br>
</p>