Monday, October 19, 2009

Adventures In Java – The Perils Of Encoding & Decoding

— Matt Harris @ 8:23 pm

A little while ago, I decided to get off my duff (figuratively speaking) and finally learn how to program in Java. I installed the latest version of the JDK and Netbeans three weeks ago, did the usual “Hello, World” type programs and then began working in earnest.

I decided that I wanted to do a fanfiction downloader, similar to this program. I considered using Netbeans’ GUI builder, but decided to do the GUI coding by hand for my first project, for educational purposes. It was educational…and a pain.

Anyway, I had the following after a couple of hours of work:

Screenshot of Alpha Fiction Downloader

Once that was done, this weekend I began writing the backend code to download stories from fanfiction.net. After another couple of hours, I had something that would download a story – buggy and no error trapping to speak of, but a pretty decent alpha. As part of my testing, I decided to run it outside of the Netbeans IDE. It didn’t work. The stories would download, but they had gibberish in the place of quotes and some other punctuation.

Having been around the web a few times, I figured this was an encoding issue. What I didn’t understand was why it would work perfectly when I ran the program from inside Netbeans and screw up the encoding when I didn’t. I spent hours playing around with the URL and HttpURLConnection classes, without any success. I then stumbled upon a webpage that mentioned that the Scanner class (which I used to download the webpage) would default to the system’s charset. I added some code to show the current system charset via the Charset.defaultCharset() method. Turns out that inside Netbeans, the charset is “UTF-8”. When I ran the program outside the IDE, the charset was “windows-1252”.
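For anyone who wants to see the effect for themselves, here is a minimal sketch of the check described above. It is not the downloader's actual code: the class name CharsetDemo and the readAll helper are made up for illustration, and an in-memory byte array stands in for the downloaded webpage. It prints the JVM's default charset, then decodes the same UTF-8 bytes twice, once as UTF-8 and once as windows-1252.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class CharsetDemo {

    // Hypothetical helper: reads an entire stream as text, decoding it with
    // the named charset instead of whatever the system default happens to be.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\A"); // "\\A" matches start of input, so next() returns everything
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // The value that differed between the IDE and a plain launch.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Curly quotes written as Unicode escapes, so this source file itself
        // cannot be damaged by an encoding mismatch.
        String original = "\u201CHello\u201D";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String decodedAsUtf8 = readAll(new ByteArrayInputStream(utf8Bytes), "UTF-8");
        String decodedAs1252 = readAll(new ByteArrayInputStream(utf8Bytes), "windows-1252");

        System.out.println("UTF-8 round trip intact:        " + original.equals(decodedAsUtf8));
        System.out.println("windows-1252 round trip intact: " + original.equals(decodedAs1252));
    }
}
```

Passing the charset name to the Scanner constructor is what takes the system default out of the picture, so the result no longer depends on how the program was launched.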

Thinking I had solved the problem, I changed the Scanner constructor to use “UTF-8” encoding. I ran the program outside the IDE and the output html was still messed up, but in a different way. I switched the encoding inside my browser from UTF-8 to windows-1252 and everything looked perfect. Turns out the PrintWriter class also uses the system charset as the default. Added UTF-8 encoding to the PrintWriter constructor and things looked good. I decided to test the output file in Firefox (my default browser is Opera) and it rendered just fine. When I tried it in IE 6.0, more garbage. Opera and Firefox recognized the file as being UTF-8, but IE defaulted to my system settings. Added a couple of lines to the header of the file being generated, specifying that it was UTF-8, and even IE 6 read it fine.
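The output side of the fix can be sketched roughly like this. Again, this is an illustration rather than the downloader's real code: the Utf8HtmlWriter class, the writeHtml method and the page layout are all assumptions, and the meta tag shown is just one common way of declaring the charset (the post does not say which header lines were actually added).

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

public class Utf8HtmlWriter {

    // Hypothetical helper: writes a tiny HTML page as UTF-8 and declares
    // that encoding in the page itself, so a browser does not have to guess.
    static void writeHtml(File file, String title, String body) throws IOException {
        // The second constructor argument names the charset explicitly;
        // without it PrintWriter falls back to the system default.
        try (PrintWriter out = new PrintWriter(file, "UTF-8")) {
            out.println("<html>");
            out.println("<head>");
            out.println("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">");
            out.println("<title>" + title + "</title>");
            out.println("</head>");
            out.println("<body>");
            out.println("<p>" + body + "</p>");
            out.println("</body>");
            out.println("</html>");
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("story", ".html");
        // Curly quotes as Unicode escapes, the kind of punctuation that was getting mangled.
        writeHtml(file, "Test Story", "\u201CQuoted text\u201D");
        System.out.println("Wrote " + file.getAbsolutePath());
    }
}
```

Opening the generated file in a few different browsers is a quick way to confirm that the declared encoding is being honored.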

I have to admit, I never expected the IDE to use a different encoding internally than the system. Despite the several hours of frustration this caused, I had a fun and educational weekend.
