Wednesday, June 23, 2010

Downloads of The Machinery of Freedom

A few days ago, my son Patri asked me how many times the file had been downloaded, suggesting that he expected it was in the thousands—a guess that struck me as implausibly large. My ISP provides summary statistics on traffic, generated by Webalizer, which I link to at the bottom of my web page. But when I looked at them, I discovered that they had last been updated on June 9th. Since I didn't web the book until June 17th, that wasn't very useful. I called my ISP; they said there must be something wrong, since the statistics were supposed to update daily, and that they would look into it.

I was still curious, so I went to the logs folder on my site and downloaded access_log, which was current. It's a text file, so I loaded it into Word, found the text representing a download of the Machinery pdf, and used Word's search and replace feature to count how many times the text appeared. To my astonishment, the result was over fourteen thousand.
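The same count can be had from the command line; a minimal sketch, with invented sample lines in Apache's common log format and an assumed filename:

```shell
# Invented three-line sample in Apache common log format
cat > access_log <<'EOF'
1.2.3.4 - - [23/Jun/2010:10:00:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 200 529408
5.6.7.8 - - [23/Jun/2010:10:01:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 200 529408
5.6.7.8 - - [23/Jun/2010:10:02:00 -0700] "GET /index.html HTTP/1.1" 200 2048
EOF

# Count lines mentioning the PDF -- the command-line equivalent of
# Word's search-and-replace trick
grep -c 'The_Machinery_of_Freedom_.pdf' access_log
```

On the sample above this prints 2: one line per request, whether or not the request actually delivered the whole file.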

I wasn't sure I believed it, so I located a Mac program for analyzing weblogs ("Traffic Report"), and downloaded it—the program permits a free trial for a month. It agreed with my previous result, showing 14,526 hits for the file. Since it's a single pdf, hits ought to be downloads. I then sent enthusiastic emails to Patri, my agent, and my contact at Open Court, which published Machinery—and until recently wouldn't let me web it.

This morning, I took a look at the summary provided by my ISP. It had been updated. But its figure for downloads, although still more than I would have expected a few days back, was much lower—2867. On the other hand, its figures for daily traffic showed it roughly doubling the day after I webbed the Machinery pdf, an increase of almost 2000 hits a day, and staying up thereafter. Assuming those are all Machinery downloads, that would be a total of about ten thousand—much larger than their figure, lower than the Traffic Report figure.

My current (optimistic) guess is that the Traffic Report figure is correct and there is still something wrong with the Webalizer figure. An alternative possibility is that the two sources are reporting on different information. Perhaps, for example, Webalizer has some way of filtering out hits from web spiders—although it's hard to believe that that could represent so large a difference, and I am not seeing a similar difference on the figures for other files.
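One way to probe the spider hypothesis against the raw log: Apache's combined log format records each client's User-Agent string in the last quoted field, and most well-behaved crawlers identify themselves there. A rough sketch, with invented log lines and an assumed combined-format log (the actual log format may differ):

```shell
# Invented sample: one browser hit and one self-identified crawler hit
cat > access_log <<'EOF'
1.2.3.4 - - [23/Jun/2010:10:00:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 200 529408 "-" "Mozilla/5.0"
66.249.66.1 - - [23/Jun/2010:10:01:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 200 529408 "-" "Googlebot/2.1"
EOF

# Hits on the PDF from self-identified crawlers...
grep 'The_Machinery_of_Freedom_.pdf' access_log | grep -Eci 'bot|spider|crawl'
# ...and from everything else
grep 'The_Machinery_of_Freedom_.pdf' access_log | grep -Ecvi 'bot|spider|crawl'
```

If the two counts differ wildly, spider traffic could explain part of the gap; crawlers that fake a browser User-Agent would still slip through, of course.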

It occurred to me that some readers of this blog probably know a lot more than I do about analyzing web traffic, hence this post. I'm hoping that one of you can suggest a plausible explanation for the discrepancy between the number produced by Webalizer and the number produced by Traffic Report, and some way of testing whether the explanation is correct.

---

P.S. (6/27/10) Update on downloads:

According to my ISP's software, Machinery has had 4357 hits and a total download of 1437810K, for an average of 330K per hit. The file shows on the web site as 517K, which suggests that on average I'm getting about two downloads for every three hits. This may, as some have suggested, reflect the use of a download accelerator or something similar, with some people downloading the file in several pieces. The corresponding figures for Salamander are 255 hits, 142264K downloaded, for an average of 558K per hit; the file is 776K, so a similar ratio.
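The implied downloads-per-hit ratios can be checked mechanically from the averages quoted above:

```shell
# Ratio of average bytes delivered per hit to the full file size
awk 'BEGIN {
    printf "Machinery:  %.2f\n", 330/517   # about two downloads per three hits
    printf "Salamander: %.2f\n", 558/776
}'
```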

For the moment I'm assuming those numbers are correct; I don't yet have an explanation of why my earlier analysis of the access log produced figures so much larger. Both the software I used and the software my ISP provides give statistics in terms of hits, not as an estimate of number of downloads, so I would think both would have been affected in the same way by anything that made the number of hits substantially larger than the number of downloads.

It looks as though downloads are continuing at a rate of hundreds, but not thousands, a day.

As before, suggested explanations of my data from those more familiar with the subject are invited.

11 Comments:

At 1:22 PM, June 23, 2010, Blogger Max said...

Two things that come to mind are (1) Webalizer may be looking at stale data, or (2) Webalizer may be filtering out some requests.

It's common to set up log files to automatically roll over after a set number of requests or a particular amount of time passes. The benefit of doing this is that old log files can be compressed or even stored on a different machine to save space on your web server (nobody wants their webserver to crash because the disk is full, and they definitely don't want their disk to be full solely because of out-of-control log files). So it's possible that your log files roll over daily and that Webalizer is set to look at the rolled-over log files. Please note: this is a guess; I'm not sure how people usually set up Webalizer. But it seems possible that Webalizer is simply looking at a stale file, and potentially will catch up to today.

I'm assuming you're using Apache. According to http://httpd.apache.org/docs/1.3/logs.html#accesslog (well, http://httpd.apache.org/docs/1.3/logs.html#common ), Apache's access_log logs both successful and unsuccessful requests ("This is the status code that the server sends back to the client. This information is very valuable, because it reveals whether the request resulted in a successful response (codes beginning in 2), a redirection (codes beginning in 3), an error caused by the client (codes beginning in 4), or an error in the server (codes beginning in 5). ... The last entry indicates the size of the object returned to the client, not including the response headers. If no content was returned to the client, this value will be '-'. To log '0' for no content, use %B instead."). It's possible that Webalizer is smart enough to filter out unsuccessful requests, or simply requests that returned no data. However, it seems odd that 6 out of 7 requests would fail, and thus get filtered out by Webalizer.

However, if somebody loads the Machinery of Freedom in their browser today, but does not save the file on their machine, the browser will probably save the file to its cache (I say "probably" because each browser has different caching rules). If that same person comes back while the PDF is still in their cache, the browser will ask the webserver whether the file has changed (using a conditional request, which gets a 304 Not Modified response), and if the file has not changed, use the PDF from the cache. This may mess up your statistics. If 6 out of every 7 requests return no content other than the headers (i.e., the last field is a dash or a 0), this may be the culprit. Webalizer may be filtering out requests where nothing is downloaded, most likely because the PDF is in the user's cache.
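In log terms, a cache revalidation shows up as a 304 with no body, so counting only 200 responses with a nonzero size field separates real downloads from revalidations. A quick sketch (log lines invented):

```shell
# Invented sample: one full download, one cache revalidation
cat > access_log <<'EOF'
1.2.3.4 - - [23/Jun/2010:10:00:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 200 529408
1.2.3.4 - - [23/Jun/2010:11:00:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 304 -
EOF

# Full downloads only: status 200 (second-to-last field) and a
# nonzero byte count (last field)
awk '/The_Machinery_of_Freedom_.pdf/ && $(NF-1) == 200 && $NF ~ /^[1-9]/' access_log | wc -l
```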

 
At 3:27 PM, June 23, 2010, Blogger Seth said...

If somebody gets it using a download accelerator (e.g. DownloadStudio), they might get it in multiple parts, each of which looks like one hit.

 
At 5:19 PM, June 23, 2010, Blogger Vadim Iaralov said...

Another thing: whatever the final estimate, it would be slightly too high compared to the number of readers (I think that's what publishers care about; downloads are just a proxy estimate!). If the book is free, I might read a chapter at home, a chapter at work, another chapter on a laptop or phone, etc. Instead of carrying the file around with me or printing the whole thing, it may be easier to re-download it every time for those uses. That is, one reader causes multiple downloads in the same way one reader causes multiple website hits (these multiple downloads by one person would probably show up as separate IPs and be indistinguishable, though).

On the other hand, if I were to buy a paperback and carry it around, because of transaction costs, I wouldn't go buy a new copy after work just because I left the old one at home in the morning!

 
At 6:45 PM, June 23, 2010, Blogger Adriaan said...

14 000 seems reasonable. I know that the word spread on facebook like a fire in a dry haystack.

 
At 8:25 PM, June 23, 2010, Blogger Joel Davis said...

Personally, I don't put much stock in webalizer at all. I've never used it in anything nearing "production" but my sense of the information it presents me with is that it's either way too high or way too low.

The access_log is probably the more reliable of the two, but (like Seth said) there are plenty of situations where multiple access_log entries can be expected (download programs, web spiders, etc.). I'm not sure if access_log stores HEAD entries, but if it does, then any RSS reader that noticed there was a link to a PDF in the post may have created an entry. At any rate, the true count is probably at the lower end (to compensate for the double or triple entries) but still in the neighborhood of the access_log's report.

 
At 9:21 PM, June 23, 2010, Anonymous Anonymous said...

Seth is right. Download "accelerators" will skew the count upwards. These appear in the log as a series of hits at nearly the same time, from the same IP address, getting the same file.
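You can check this in the log: a request for one piece of the file is answered with status 206 (Partial Content) rather than 200, so tallying the two status codes shows how much of the count is partial fetches. A sketch with invented log lines:

```shell
# Invented sample: one whole-file response, two range (partial) responses
cat > access_log <<'EOF'
1.2.3.4 - - [23/Jun/2010:10:00:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 200 529408
5.6.7.8 - - [23/Jun/2010:10:00:01 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 206 131072
5.6.7.8 - - [23/Jun/2010:10:00:01 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 206 131072
EOF

# Tally whole-file (200) vs. partial (206) responses for the PDF
awk '/The_Machinery_of_Freedom_.pdf/ { n[$(NF-1)]++ }
     END { print "200:", n[200]+0, "206:", n[206]+0 }' access_log
```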

 
At 9:57 PM, June 23, 2010, Anonymous Kid said...

I think you first want to filter for 2xx return codes and then you probably want to filter on unique ip addresses.

You can do that in one line of perl but I'm not sure if the easiest way is also the fastest way to count it.

 
At 10:20 PM, June 23, 2010, Anonymous Kid said...

cat access_log | sed -ne "/.*\"GET.*The_Machinery_of_Freedom_.pdf HTTP[^ ]*\"/p" | sed -ne "/[^ ]* [^ ]* [^ ]* \[.*\] \".*\" 2.*/p" | sed -ne "s/\([^ ]*\).*/\1/p" | sort -u | wc -l

If you have access to anything similar to a Linux shell: the above works for me, but I didn't test performance on thousands of entries.

 
At 8:02 PM, June 24, 2010, Blogger M@ said...

Kid, David said "I located a Mac program for analyzing weblogs", so he has access to "something like a Linux shell" (a BSD shell).

David, "Terminal.app" is the "something" (/Applications/Utilities/Terminal.app). When you open Terminal.app, your current directory will be your home directory, so putting the access_log there (in your home directory) will make Kid's counter work.
For your interest, the various parts of that line of shell are separated by "|" characters, and are: stream out the file | filter the stream for lines that GET the pdf | filter for lines that record a 2xx status code | from those lines show only the originating IP address | strip out duplicates | count lines.
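Here is the same pipeline spread over several lines, one stage per line with a comment on each; behavior is unchanged (the sample log lines are invented, including one 404 to exercise the status-code filter):

```shell
# Invented sample log in Apache common format
cat > access_log <<'EOF'
1.2.3.4 - - [23/Jun/2010:10:00:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 200 529408
5.6.7.8 - - [23/Jun/2010:10:00:05 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 404 -
5.6.7.8 - - [23/Jun/2010:10:01:00 -0700] "GET /The_Machinery_of_Freedom_.pdf HTTP/1.1" 200 529408
EOF

cat access_log |
  # keep only lines that GET the pdf
  sed -ne "/.*\"GET.*The_Machinery_of_Freedom_.pdf HTTP[^ ]*\"/p" |
  # keep only lines whose status code begins with 2
  sed -ne "/[^ ]* [^ ]* [^ ]* \[.*\] \".*\" 2.*/p" |
  # reduce each line to its first field, the client IP
  sed -ne "s/\([^ ]*\).*/\1/p" |
  # strip duplicates and count what remains
  sort -u | wc -l
```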

 
At 9:54 PM, June 25, 2010, Anonymous Anonymous said...

Sorry, I just used it as a download speed test.

 
At 11:57 PM, June 26, 2010, Anonymous Henry said...

The above comments are like a foreign language to me, so on to the important part: have we worked out how many downloads there have been?

 
