Pocket PC: A Story
When I first got my Pocket PC, one of my first questions was, What good freeware is available for it? There is a great site called FreewarePPC listing over 1000 freeware programs, with user-supplied ratings out of five stars. The problem is that the site provides no way to sort by rating, so there is no easy way to tell which programs are highly regarded in the Pocket PC community.
Well, the first step was to download the entire website to my PC so I could process the data. I did this by viewing the source of the Contents page in Firefox, pasting it into XEmacs, creating a macro to extract the URLs for each program page, creating a macro to turn those URLs into a batch file of lynx calls, then executing the batch file. To get an idea of the progress of the download, I created an XEmacs macro to auto-generate echo statements for each lynx command.
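The same batch file could also be generated with a short script instead of editor macros. Here is a minimal sketch in Ruby, not what I actually ran: the sample markup, the example.com URLs, the output file names, and the choice of the lynx -source option are all assumptions made for illustration.

```ruby
# Hypothetical sample of the Contents page source; the real markup
# of the site is not reproduced here.
CONTENTS_HTML = <<~HTML
  <a href="http://example.com/programs/alpha.html">Alpha</a>
  <a href="http://example.com/programs/beta.html">Beta</a>
HTML

# Pull every href value out of the page source, dropping duplicates.
def extract_urls(html)
  html.scan(/href="([^"]+)"/i).flatten.uniq
end

# Turn the URLs into batch-file lines: an echo statement to show
# progress, followed by the lynx call that saves the page.
def batch_lines(urls)
  urls.each_with_index.flat_map do |url, i|
    file = format("page%04d.html", i + 1) # assumed naming scheme
    [
      "echo Downloading #{i + 1} of #{urls.size}: #{url}",
      "lynx -source \"#{url}\" > #{file}"
    ]
  end
end

puts batch_lines(extract_urls(CONTENTS_HTML))
```

Redirecting the script's output to a file gives the batch file, with one echo line ahead of each lynx command.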
A couple of hours later, downloading was complete. I chose to use Ruby as my parsing language because of its native regex syntax. At this point I realized that the regex parsing would be much easier if the HTML were pre-parsed to an intermediate format using XEmacs macros. This I did, after using cat to concatenate all of the HTML files and grep to weed out lines I didn't need.
Now it was easy for me to write a Ruby script to parse the intermediate data into little Program objects containing @name and @ratings. It was now trivial to sort the programs, but the question became: by what criteria? There are two factors -- average rating and number of ratings -- and I didn't want to get into Bayesian statistics to combine them. It then struck me that programs with many ratings may generally have many high ratings. This turned out to be true. So I simply sorted by number of ratings, and got good results: Frequently rated (and highly rated, it turns out) programs appeared at the top; programs with a single rating, at the bottom.
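To make the sorting idea concrete, here is a minimal sketch of a Program class and the sort by number of ratings. It is a reconstruction rather than the original script: the "name|r1,r2,r3" intermediate format, the sample data, and the helper names parse_line, rating_count, average_rating, and rank are all assumptions.

```ruby
# A small object holding @name and @ratings, as described above.
class Program
  attr_reader :name, :ratings

  def initialize(name, ratings)
    @name = name
    @ratings = ratings
  end

  def rating_count
    @ratings.size
  end

  def average_rating
    return 0.0 if @ratings.empty?
    @ratings.sum.to_f / @ratings.size
  end
end

# Assumed intermediate format: one program per line, "name|r1,r2,r3".
def parse_line(line)
  m = line.match(/\A(.+?)\|([\d,]*)\s*\z/)
  return nil unless m
  Program.new(m[1], m[2].split(",").map(&:to_i))
end

# Most frequently rated programs first.
def rank(programs)
  programs.sort_by { |prog| -prog.rating_count }
end

# Hypothetical sample data, not taken from the site.
SAMPLE = <<~DATA
  Alpha Notes|5,4,5,5
  Beta Calc|3
  Gamma Reader|4,5
DATA

programs = SAMPLE.lines.map { |line| parse_line(line) }.compact
rank(programs).each do |prog|
  puts format("%-14s %d ratings, avg %.2f", prog.name, prog.rating_count, prog.average_rating)
end
```

The average is printed only as a sanity check on the hunch that frequently rated programs are also highly rated; the ordering itself uses the count alone.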
Next came some refinements. Eventually wishing to publish the results on the web, I created two versions: one with screenshots, and one without. I also hyperlinked the program title (and screenshot) to the original FreewarePPC page, from which the user could then download the program.
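Producing the two versions from one script might look roughly like the sketch below. The Entry fields, the list-item markup, and the example.com URLs are assumptions for illustration; the markup of the published pages may well have differed.

```ruby
# Hypothetical record for one ranked program.
Entry = Struct.new(:name, :page_url, :screenshot_url, :rating_count)

HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape text so it is safe inside HTML content and attribute values.
def h(text)
  text.to_s.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Render one list item; the title (and the screenshot, when shown)
# links back to the program's original page.
def render_row(entry, with_screenshots:)
  html = "<li><a href=\"#{h(entry.page_url)}\">#{h(entry.name)}</a> (#{entry.rating_count} ratings)"
  if with_screenshots && entry.screenshot_url
    html << " <a href=\"#{h(entry.page_url)}\"><img src=\"#{h(entry.screenshot_url)}\" alt=\"#{h(entry.name)}\"></a>"
  end
  html << "</li>"
end

entry = Entry.new("Alpha & Co", "http://example.com/alpha.html", "http://example.com/alpha.png", 4)
puts render_row(entry, with_screenshots: false)
puts render_row(entry, with_screenshots: true)
```

Keeping the screenshot behind a single flag means both pages come from the same sorted list, so they cannot drift out of step.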
Next problem: How to get the word out on this great information I had compiled? I created a blog entry for it, with a description of why I did it and some basic statistics. I sent emails to a few prominent tech bloggers and PDA sites, including Jim Karpen, JK On The Run, Cool Tools, BoardGameGeek, alt.comp.freeware, and Aximsite. Evidently the word did get out, as my daily blog hits jumped from 10 to 700 at the peak. I also found a couple dozen excellent freeware titles for my own use.