2718.us blog » statistics
http://2718.us/blog
Miscellaneous Technological Geekery
Tue, 18 May 2010 02:42:55 +0000

Randomizing by Random-Comparison Sorting (Revisited)
http://2718.us/blog/2010/02/24/randomizing-by-random-comparison-sorting-revisited/
Wed, 24 Feb 2010 21:09:28 +0000, 2718.us

Yesterday, I posted the results of my quick exploration of whether sorting the list {0,1,2,3,4} using a comparison function that randomly returns < or > (with equal probability) produces a uniformly distributed random ordering.  My exploration was prompted by a report on the non-uniformity of the distribution of the random orderings of the browsers in Microsoft’s EU browser ballot.  I had said that it seemed likely that the distribution would vary based on the sorting algorithm used.

Today, I have data (and code) that confirms the distribution is sorting-algorithm-dependent.  For each sorting algorithm, 1,000,000 instances of the list {0,1,2,3,4} were sorted with a random comparison function and the relative frequencies (rounded to the nearest whole percent) of each number in each position were computed.

Mathematica’s Sort[]
position/number   0     1     2     3     4
first             18%   12%   12%   12%   46%
second            18%   24%   18%   18%   24%
third             20%   20%   26%   20%   12%
fourth            22%   22%   22%   28%   6%
fifth             22%   22%   22%   22%   12%

BubbleSort
position/number   0     1     2     3     4
first             36%   28%   20%   10%   6%
second            28%   32%   22%   12%   6%
third             20%   22%   32%   18%   10%
fourth            12%   12%   18%   38%   20%
fifth             6%    6%    10%   20%   60%

QuickSort (random pivot)
position/number   0     1     2     3     4
first             20%   20%   20%   20%   20%
second            20%   20%   20%   20%   20%
third             20%   20%   20%   20%   20%
fourth            20%   20%   20%   20%   20%
fifth             20%   20%   20%   20%   20%

MergeSort
position/number   0     1     2     3     4
first             24%   24%   26%   12%   12%
second            26%   24%   18%   16%   16%
third             18%   18%   22%   20%   20%
fourth            16%   16%   18%   26%   26%
fifth             16%   16%   18%   26%   26%

SelectionSort
position/number   0     1     2     3     4
first             6%    6%    12%   26%   50%
second            12%   12%   20%   32%   24%
third             20%   20%   26%   20%   12%
fourth            30%   30%   20%   12%   6%
fifth             30%   30%   20%   12%   6%

The distributions are significantly different among these sorts.  QuickSort appears to provide a uniform distribution.  I believe that this is because QuickSort will only compare a particular pair of elements once, whereas each of the other sorting algorithms may compare a given pair of elements more than once (and with a random comparison function, receive a different result from one time to the next).
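To make the "compared once versus compared repeatedly" point concrete, here is a minimal JavaScript sketch. It is not the code from the Mathematica notebook: the function names, the particular bubble sort variant, and the trial count are all assumptions made for illustration, so the bubble sort percentages it produces need not match the table above.

```javascript
// Illustrative sketch; names and structure are assumptions, not the notebook's code.

// Comparison function that ignores its arguments and flips a fair coin.
const coinFlip = () => (Math.random() < 0.5 ? -1 : 1);

// Bubble sort: adjacent values can meet again on a later pass,
// so the same pair may be compared (and answered differently) more than once.
function bubbleSort(list, cmp) {
  const a = list.slice();
  for (let end = a.length - 1; end > 0; end--) {
    for (let i = 0; i < end; i++) {
      if (cmp(a[i], a[i + 1]) > 0) [a[i], a[i + 1]] = [a[i + 1], a[i]];
    }
  }
  return a;
}

// Quicksort with a uniformly random pivot: each value is compared only
// against pivots, so no pair is ever compared twice.
function quickSort(list, cmp) {
  if (list.length <= 1) return list.slice();
  const p = Math.floor(Math.random() * list.length);
  const pivot = list[p];
  const left = [];
  const right = [];
  list.forEach((x, i) => {
    if (i === p) return;
    (cmp(x, pivot) < 0 ? left : right).push(x);
  });
  return [...quickSort(left, cmp), pivot, ...quickSort(right, cmp)];
}

// freq[position][value] = fraction of trials in which `value` landed at `position`.
function positionFrequencies(sortFn, trials) {
  const n = 5;
  const freq = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let t = 0; t < trials; t++) {
    sortFn([0, 1, 2, 3, 4], coinFlip).forEach((value, pos) => {
      freq[pos][value] += 1 / trials;
    });
  }
  return freq;
}
```

Calling `positionFrequencies(quickSort, 100000)` and `positionFrequencies(bubbleSort, 100000)` and printing the results (for example with `console.table`) should show every quicksort entry near 0.2 and clearly uneven bubble sort entries.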

Here is the Mathematica notebook I used to generate this data: Randomize by Sorting.nb.  As noted in the file, some of the code for the sorting algorithms was taken from other sources and may be subject to their copyrights and/or license terms (I reasonably believe that this use complies with their licenses and/or constitutes fair use).  Also, as noted in the file, some algorithms misbehaved when sorting lists with duplicate entries using a normal comparison function, though this should have no effect on the data above.

The EU Browser Ballot and Random Sorting
http://2718.us/blog/2010/02/23/the-eu-browser-ballot-and-random-sorting/
Wed, 24 Feb 2010 02:09:44 +0000, 2718.us

An Ars Technica “etc” post linked to a TechCrunch article (apparently based on a Slovakian article, but I didn’t look into the Slovakian article to be sure) that talks about the ordering of the browsers in Microsoft’s EU Browser Ballot not being uniformly distributed.  At a glance at the Javascript that does the randomizing of the browsers (randomly orders the top 5, and randomly orders the rest), it appears to randomize by calling the Javascript array sort with a comparison function that returns < half the time and > the other half of the time.  I believe that this is likely the underlying cause of the non-uniformity of the orderings.

The second result in a Google search for “javascript sort” says:

To randomize the order of the elements within an array, what we need is the body of our sortfunction to return a number that is randomly <0, 0, or >0, irrespective to the relationship between “a” and “b”. The below will do the trick:

//Randomize the order of the array:
var myarray=[25, 8, "George", "John"]
myarray.sort(function() {return 0.5 - Math.random()}) //Array elements now scrambled

This is almost exactly the method of randomization used in the browser ballot javascript.

To test the results of this randomization technique, I applied it 1,000,000 times to the list {0,1,2,3,4} in Mathematica and tabulated the relative frequencies of each number in each position (rounded to the nearest whole percent).

position/number   0     1     2     3     4
first             18%   12%   12%   12%   47%
second            18%   24%   18%   18%   24%
third             20%   21%   27%   20%   12%
fourth            22%   22%   22%   28%   6%
fifth             22%   22%   22%   22%   12%

At a glance, it appears that the distribution is far from uniform.  My quick attempt at re-learning how to use the χ² test gave a probability less than 1×10^-100000 that this data matched a uniform distribution (if someone can confirm/fix that, please comment).
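As a rough sanity check on that figure, the following sketch computes a χ² goodness-of-fit statistic for the "first" row of the table alone. It is an approximation under stated assumptions: the observed counts are rebuilt from the rounded percentages (which sum to 101%), only one of the five positions is tested, and the function names are made up for this example.

```javascript
// Sketch under stated assumptions: counts are rebuilt from the rounded
// percentages in the table, so the result is only approximate.
const TRIALS = 1000000;
const firstPositionPercent = [18, 12, 12, 12, 47]; // "first" row of the table

// Pearson chi-squared statistic against a uniform expectation.
function chiSquaredUniform(percents, trials) {
  const expected = trials / percents.length;
  return percents.reduce((sum, pct) => {
    const observed = (pct * trials) / 100;
    return sum + (observed - expected) ** 2 / expected;
  }, 0);
}

// For 4 degrees of freedom the upper-tail probability has the closed form
// exp(-x/2) * (1 + x/2); return log10 of it to avoid floating-point underflow.
function log10PValue4df(x) {
  return (-x / 2 + Math.log(1 + x / 2)) / Math.LN10;
}

const statistic = chiSquaredUniform(firstPositionPercent, TRIALS); // 462500
const log10P = log10PValue4df(statistic); // on the order of -1.0e5
```

For comparison, the usual critical value for 4 degrees of freedom at the 0.001 level is about 18.47, so a statistic of this size from even a single row is consistent with a probability below 1×10^-100000.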

I used the Mathematica Sort[] command to do the sorting.  I don’t know what algorithm that uses.  It appears that the algorithm used by Javascript’s sort() varies from browser to browser, though the browser ballot would be displayed in IE8 by default.  I suspect that the distribution is highly dependent on the sorting algorithm used, though I cannot readily verify it [edit: I verified it].  Regardless, this seems to be a very poor way to generate a random ordering.
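For contrast, a standard way to get a uniformly random ordering is the Fisher-Yates shuffle. This is not something the browser ballot or the quoted tutorial uses; it is included here only as a sketch of the usual alternative, and the function name is my own.

```javascript
// Fisher-Yates shuffle: a standard alternative, not the method discussed above.
// Returns a shuffled copy and leaves the input array untouched.
function shuffle(list) {
  const a = list.slice();
  for (let i = a.length - 1; i > 0; i--) {
    // Pick j uniformly from 0..i (inclusive) and swap it into place i.
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}
```

Because every position is filled by a uniform draw from the values not yet placed, the result does not depend on how any particular browser implements `sort()`.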

Statistics on LiveJournal-based Sites v2.0
http://2718.us/blog/2008/10/22/statistics-on-livejournal-based-sites-v20/
Wed, 22 Oct 2008 18:05:39 +0000, 2718.us

The reworking of my site that shows comparative statistics on every site based on the code from LiveJournal is now up and live and at a new URL:  http://lj-stat.2718.us/.  Moreover, there are now graphs of the data over time.  The data is updated at noon and midnight central time (U.S.).

One of the things that took the most work to get right was the thickness of the graph lines.  Because of the nature of the graphs, it was an absolute necessity that the lines be drawn with antialiasing enabled.  PHP’s interface to GD (or perhaps it’s GD itself?) ignores the line thickness setting when antialiasing is enabled.  The solution I eventually settled on is to, more or less, draw several one-pixel-wide lines next to and on top of one another to get the appearance of a thicker line.
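The geometry behind that workaround can be sketched as follows. This is an assumption about one way to do it, not the actual LJ-Stat code (which is PHP): the helper name and the choice to offset along the segment's normal are mine. Each returned segment would then be drawn as an ordinary one-pixel antialiased line.

```javascript
// Hypothetical helper: returns the endpoints of `thickness` parallel
// one-pixel lines centred on the segment (x1, y1)-(x2, y2).
function thickLineSegments(x1, y1, x2, y2, thickness) {
  const length = Math.hypot(x2 - x1, y2 - y1);
  if (length === 0) return [];
  // Unit normal to the segment.
  const nx = -(y2 - y1) / length;
  const ny = (x2 - x1) / length;
  const segments = [];
  for (let k = 0; k < thickness; k++) {
    const offset = k - (thickness - 1) / 2; // e.g. -1, 0, 1 for thickness 3
    segments.push([
      x1 + nx * offset,
      y1 + ny * offset,
      x2 + nx * offset,
      y2 + ny * offset,
    ]);
  }
  return segments;
}
```

For a horizontal segment with thickness 3 this yields three lines stacked one pixel apart; for diagonal segments the offsets are fractional, which is where antialiasing does the visual blending.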

As an aside, I’m using the technique mentioned here for permanently redirecting the old URL to the new URL:

… if you actually moved something to a new location (forever) use:

<?php
 header("HTTP/1.1 301 Moved Permanently");
 header("Location: http://example.org/foo");
?>
An Overhaul of LJ-Stat
http://2718.us/blog/2008/10/12/an-overhaul-of-lj-stat/
Sun, 12 Oct 2008 12:03:44 +0000, 2718.us

I’m currently working on an overhaul of LJ-Stat.

It looks like there’s some issue in using curl_multi_exec() in PHP with too many requests at once causing some requests to fail strangely, potentially accounting for the lack of data from several sites that are clearly not down and clearly provide stats.txt.  My current workaround is to do the requests in smaller blocks.
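The "smaller blocks" workaround looks roughly like this. LJ-Stat itself uses PHP's curl_multi functions; the sketch below is JavaScript and the names `toBlocks`, `fetchInBlocks`, and `fetchOne` are hypothetical, so treat it as an illustration of the batching pattern rather than the real code.

```javascript
// Split a list into consecutive blocks of at most `blockSize` items.
function toBlocks(items, blockSize) {
  const blocks = [];
  for (let start = 0; start < items.length; start += blockSize) {
    blocks.push(items.slice(start, start + blockSize));
  }
  return blocks;
}

// `fetchOne` is a placeholder for whatever performs a single request
// and returns a promise. Requests within a block run concurrently;
// the blocks themselves run one after another.
async function fetchInBlocks(urls, blockSize, fetchOne) {
  const results = [];
  for (const block of toBlocks(urls, blockSize)) {
    const settled = await Promise.allSettled(block.map((url) => fetchOne(url)));
    results.push(...settled);
  }
  return results;
}
```

Using `Promise.allSettled` means one failed request does not discard the rest of its block, which also makes it easier to record why a given site has no stats.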

I’m also trying to provide more detail as to why there aren’t stats for the sites that don’t have stats.

But the biggest development is that there will probably be graphs of the data over time.  I say “probably” because while the code is pretty much written, I’ve only been storing historical data for about a day so far (in the past, only the most recent data was kept), so it’s hard to tell whether the graphs will look okay with a lot of data and whether producing the graphs will put a significant load on the server.  The data will probably update more regularly and more frequently–likely noon and midnight CT.

Also, if anyone knows for sure if Bloty, IziBlog, and/or LiveLogCity are still alive or definitively dead, I’d like to know.  Oh, and CommieJournal seems to be looking at the possibility of moving to a different codebase, though I can’t for the life of me see why anyone would want to try to move thousands of accounts from the LJ codebase to something incompatible and with a different working paradigm.

Ads on the LJ-Stat page?
http://2718.us/blog/2008/04/30/ads-on-the-lj-stat-page/
Thu, 01 May 2008 02:53:20 +0000, 2718.us

I’m wondering if I’d gain anything from putting a small Google AdSense unit or maybe an AdSense link unit on the LJ-Stat page.  And by “gain anything” I mean get a few cents to help pay my hosting bills.  It could be relatively unobtrusive…  It’s just that, thus far, I’ve avoided putting any ads on 2718.us.  Well, that, and that my best experience with AdSense is monetizing the visits of people who mistakenly ended up on my site by providing them with ads for what they really wanted, rather than what’s actually on my site.

Limitations of lj-stat
http://2718.us/blog/2008/04/13/limitations-of-lj-stat/
Sun, 13 Apr 2008 20:23:47 +0000, 2718.us

To the best of my knowledge and research, my LJ-code-base Site Statistics page (lj-stat) has the most comprehensive list of sites running off of LiveJournal’s codebase (if you know of any that I’ve missed, please let me know).  The main point, though, is the comparative statistics.  This is where things get strange.  LJ and most of the sites provide a pretty statistics page at /stats.bml and in most (or all?) instances, stats.bml says at the top (this is from LJ itself)

Raw data can be picked up here.

where “here” links to /stats/stats.txt.  On at least one site, stats.bml has this text, but stats/stats.txt returns a 404.  On at least one site, both stats.bml and stats/stats.txt return 404.  Since it looks to me like the whole point of providing stats.txt was to provide a more machine-readable set of stats that didn’t require loading a full web page and screen-scraping, I have no intention of trying to screen-scrape the info I want.

Now, to make things even stranger, some sites are missing what I’d call “key” stats from their stats.txt files.  In particular, the one I care most about is the “active in some way in the past 30 days” measure since I think that’s the best measure of the vitality of a site (well, either that, or what portion of the total userbase it represents).  Stranger still is that some sites report numbers in stats.txt that not only don’t match stats.bml, but make no sense whatsoever (DeadJournal perpetually reports only 10 accounts updating in the past 30 days, even though stats.bml has more sensible numbers).

Unrelated to the content of stats.txt is the “Speed Index” column–based on the rate of transfer reported by libcurl when retrieving stats.txt, where the speed index of a site is given as the percentage of the fastest transfer rate.  What I don’t quite understand is how InsaneJournal is always at least twice as speedy as any other site, often at least 4x or 6x the speed.  It actually made me wonder if my server and theirs were somehow in the same datacenter or something, but there are at least a dozen hops between us (which is more than from my server to some other LJ-based sites), so maybe it does have something to do with the servers themselves and not just network conditions.
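The Speed Index calculation described above amounts to the following. The function name, the input shape (site name mapped to a transfer rate), and the rounding to whole percents are assumptions for illustration; the real page takes its rates from libcurl.

```javascript
// Hypothetical sketch: `rates` maps site name -> transfer rate (e.g. bytes/second).
// Each site's index is its rate as a percentage of the fastest rate seen.
function speedIndex(rates) {
  const fastest = Math.max(...Object.values(rates));
  const index = {};
  for (const [site, rate] of Object.entries(rates)) {
    index[site] = Math.round((rate / fastest) * 100); // rounding is an assumption
  }
  return index;
}
```

With this definition the fastest site always scores 100, so a site that is "at least twice as speedy as any other" pushes every other site to 50 or below.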

Please let me know if you have any suggestions about enhancements to lj-stat.  Also, feel free to try to convince the sites that don’t provide stats.txt to start providing it, and to get the sites where the numbers are clearly wrong to fix them.

Statistics on LJ and LJ-Clone Sites
http://2718.us/blog/2008/04/06/statistics-on-lj-and-lj-clone-sites/
Mon, 07 Apr 2008 00:13:55 +0000, 2718.us

http://2718.us/lj-stat/ is a page giving some comparative statistics on various LJ-code-based sites.

The underlying data is updated approximately daily. All numbers are based on what is supposed to be the “raw” data at /stats/stats.txt, even though on some or all sites there are significant discrepancies between the numbers reported in /stats/stats.txt and those shown at /stats.bml.

If you know of other LJ-code-based sites that you’d like to see added, please comment with the name/url. Also comment if you have any suggestions as to design, features, or other statistics you’d like to see.
