Thursday, July 01, 2010

A World of Search

"having written a book it is always tempting to explain it, a project usually more interesting to the author than to his audience."

After writing that sentence in my previous post, it occurred to me that there probably is an audience, although not a large one, that would be interested in the explanations. While the book was not a success, some readers clearly liked it a lot. Unfortunately, most of them don't read my blog, so if I had let myself go on at greater length it would have been to the wrong readers. That is unfortunate, since it prevents a conversation that both sides would have enjoyed, an exchange to our mutual advantage.

There is a solution, and it is one that may be very gradually taking form. I routinely do searches of both Usenet and the Web, using search strings designed to spot references to me while filtering out at least some of the references to other people with the same name. The current term for the practice is egosurfing; the older term, named after an early Usenet practitioner, is "kibozing." With the search already set up it takes only a single click, although I then have to spend a minute or so looking through the results to find the ones that are actually worth looking at.

There is no reason why the practice could not be extended. I am a fan of various writers, including Heinlein, Kipling, G.K. Chesterton, Tolkien, Orwell, and others. I could have a search string that spotted any online reference to any of them. Extending it further, one could imagine a (very long and complicated) search string designed to spot any reference online to anything of interest to me—for example, an author going on at length about a book I read and liked.

There is, however, a practical problem. The string "economist OR anarchist OR harald OR libertarian "David Friedman" -rec-arts-sf-* -concerned-scientists" turns up some hits irrelevant to me, but filtering them out takes only a minute or so. A string designed to locate everything of possible interest to me would produce an enormous volume of hits, and looking through them each day for the tiny fraction I actually wanted to read would be a more-than-full-time occupation.

Which means that we need smarter searches, procedures that will do almost all of the filtering in advance, providing me each day with links to the ten or twenty new online items that it is actually likely I will want to read.

Google, are you listening?


Mike Gogulski said...

When Kibo says "You are allowed," he really means "You are allowed."

I remember thinking, oh, back around 1995, that the big search engines were going to start supporting proper regular expressions oh, any day now. I've got my own problems trying to aggregate Bradley Manning/Collateral Murder stories and blog posts without being exposed to high school soccer stars, the latest Celtics draft picks and someone else who does something with balls. Fortunately my own surname is rare enough to make my own egosurfing quite efficient. Clearly, your mileage varies!

Glen Whitman said...

The term I've heard most often for this practice is "vanity Googling."

Anonymous said...

Have you seen Google Alerts? It at least notifies you daily or weekly on new search results for a given search. It's especially useful to use for your own name.

It can also track e.g. blogs or news, letting you know if your name comes up in those.

- Finnish libertarian

dWj said...

"Dean Jens" is also easier than "David Friedman". I have a google alert set on myself; it doesn't pop things up very often.

The alerts may be a better place for this than the regular search box.

I think you'll need some sort of feedback mechanism here; start with your own estimation of what interests you, have google give you a way to mark false positives, and maybe even let it flag hits in which it has less confidence so you can offer reassurances. If you stumble across something it didn't find, there would be a way to go back and say, "it would have been nice if you had picked this up", with an optional attempt to suggest keywords that might have cued it.

I think the technology for this, as of today, would function passably well; the problem may be convincing potential suppliers of the existence of sufficient potential demanders.

Jonathan said...

You might care to have a look at Jasper Fforde's Web site. It's slightly amateurish because he does it himself, but it's extensive in scale and gives plenty of background information and 'extras' about his novels, including what he calls 'book upgrades' in which he publishes corrections to each book.

I also recommend his novels, especially the first (The Eyre Affair) and the latest (Shades of Grey). But they're very British, and may not be your kind of thing.

Bruce said...

I was thinking what DWJ was thinking: some sort of individualized filter. It could do an analysis of the webpages you like and don't like and then try to figure out some factors --- perhaps even unknown to you --- that make it more likely that you will choose certain pages and not choose others.

William B Swift said...

If you do this as regularly as your post suggests, do one fairly thorough (and time consuming) search, then just search for new (since last check) posts to reduce the number you need to sift through.

David Friedman said...

To William and Anonymous:

1. I have a google alert set, and have for quite a while.

2. My standard search was for items new within the past 24 hours. That works less well than one might expect--in practice the same web page shows up multiple days, I presume because something on the page has been updated.

3. My biggest problem, to which there is probably some simple solution I haven't thought of, is this blog. Quite a lot of the hits I get are to pages that link to this blog but don't say anything about it or me. I don't want to simply add "NOT Ideas", since some posts I want to see might also mention the blog.
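
One way to attack points 2 and 3 is sketched below in Python, under stated assumptions: each hit is available as a (URL, HTML) pair, a page that has already been shown should never be shown again, and a page only counts as a mention if the name appears somewhere outside link text. The names `TextOutsideLinks`, `mentions_outside_links`, and `new_relevant_hits` are made up for this illustration and do not correspond to any search engine's API.

```python
from html.parser import HTMLParser


class TextOutsideLinks(HTMLParser):
    """Collect page text, skipping anything inside <a>...</a>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <a> elements we are currently inside
        self.chunks = []  # text fragments found outside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def mentions_outside_links(html, name):
    """True if `name` appears in the page text other than as link text."""
    parser = TextOutsideLinks()
    parser.feed(html)
    parser.close()
    # Normalize whitespace so line breaks inside the name do not hide it.
    text = " ".join("".join(parser.chunks).split())
    return name.lower() in text.lower()


def new_relevant_hits(hits, seen_urls, name):
    """Return URLs not seen before whose text mentions `name` outside links.

    `hits` is an iterable of (url, html) pairs (hypothetical input format).
    `seen_urls` is a set that is updated in place, so a page that merely
    changed since yesterday is not reported a second time.
    """
    kept = []
    for url, html in hits:
        if url in seen_urls:
            continue
        seen_urls.add(url)
        if mentions_outside_links(html, name):
            kept.append(url)
    return kept
```

Persisting `seen_urls` between runs (in a file, say) would replace the "new within the past 24 hours" restriction. A page that only carries a blogroll link is dropped, while a post that links to the blog and also discusses it in its own text is kept.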

Matt said...

David and Mike:

The combination of a Google Alerts feed and Yahoo Pipes may give you what you want, though Pipes are pretty geeky. This is the setup for a Google Alerts feed (a link for which you get from your alerts manage page) passed through a Pipes filter block using a couple of regular expressions. Pipes supports much more complexity than just that, but it's probably an improvement over what you're currently doing (David), since Google will handle removing duplicates for you, and you can leave the alert permissive and leave detailed filtering for Pipes.
… that may have been nearly unintelligibly geeky. Let me know if you want more detail than that.
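
For readers who would rather see the idea in code than in Pipes, here is a rough Python equivalent of a permissive feed followed by a regular-expression filter. It is only a sketch: it assumes the alert feed is Atom XML already downloaded into a string, and the function name `filter_feed` and both patterns are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Namespace prefix used by Atom feeds.
ATOM = "{http://www.w3.org/2005/Atom}"


def filter_feed(atom_xml, keep_pattern, drop_pattern):
    """Return (title, link) pairs for entries that match `keep_pattern`
    and do not match `drop_pattern`. Both are case-insensitive regexes
    applied to the entry's title plus content."""
    keep = re.compile(keep_pattern, re.IGNORECASE)
    drop = re.compile(drop_pattern, re.IGNORECASE)
    root = ET.fromstring(atom_xml)
    results = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        content = entry.findtext(ATOM + "content", default="")
        text = title + " " + content
        if keep.search(text) and not drop.search(text):
            link = entry.find(ATOM + "link")
            href = link.get("href", "") if link is not None else ""
            results.append((title, href))
    return results
```

For Mike's case, something like `filter_feed(feed_text, r"bradley manning", r"\b(soccer|celtics|draft)\b")` would keep the permissive name match and leave the sports stories on the floor.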

Giles said...

I have a bunch of Google alerts set up for different topics -- an "ego" one for myself, a few for different work-related keywords, others for hobbies -- and have them all feeding through (via RSS -- you can set up GA to use RSS instead of email nowadays) into Google Reader (though of course any other news reader would work). It works pretty well, I don't think I miss much.

However, while my name isn't particularly common, there are a number of others who share it whose news I don't need to hear about. I've been thinking that it would be possible to set up some kind of Bayesian filtering, similar to spam traps. A server somewhere would pick up the Google Alerts RSS feed, run it through the filter, and then generate a new feed, which only included the items that the filter OKed. The filtered feed would be the one that you would add to your RSS reader. Added to each post in the feed would be "like" and "don't like" links, which you would click on to train the filter.

Does that sound like the kind of thing that would work for you (Dr Friedman and fellow readers)? Technically it sounds like something a competent coder could knock together in Python in a day or so if it was just for one person -- I speak as a fairly competent Python coder. The important question is, of course, whether Bayesian filtering would be as good at separating interesting from non-interesting hits as it has been at filtering out spam. And I imagine it might be harder to scale than spam filtering, though I'm sure the guys at Google could handle that if they chose to...
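
To make the proposal concrete, here is a very small sketch of the filtering core in Python: a naive Bayes classifier over words, trained by "like" and "dislike" clicks. It is an illustration only, not a finished design. The class name `LikeFilter` and its methods are hypothetical, and fetching the feed, serving the filtered feed, and wiring up the click links are all left out.

```python
import math
import re
from collections import Counter


class LikeFilter:
    """Tiny naive Bayes text classifier trained by like/dislike feedback."""

    LABELS = ("like", "dislike")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokens(text):
        # Lowercase words only; a real filter might also use the URL or site.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        """Record one click: `label` is "like" or "dislike"."""
        self.word_counts[label].update(self.tokens(text))
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["like"]) | set(self.word_counts["dislike"]))
        # Add-one smoothing on the class prior and on each word, with one
        # extra slot in the denominator for words never seen in training.
        prior = (self.doc_counts[label] + 1) / (sum(self.doc_counts.values()) + 2)
        score = math.log(prior)
        for word in self.tokens(text):
            score += math.log((counts[word] + 1) / (total + vocab + 1))
        return score

    def wanted(self, text):
        """True if the item looks more like past likes than past dislikes."""
        return self.log_score(text, "like") > self.log_score(text, "dislike")
```

Each item in the incoming feed would be passed to `wanted()`, and only the items it accepts would go into the filtered feed; each click on a "like" or "don't like" link would call `train()` with that item's text. Whether word counts alone separate interesting hits from uninteresting ones as well as they separate spam from mail is exactly the open question.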