10 Jan 2012
Dick Berk is using his statistical superpowers to fight crime. Seriously. Here is my favorite paragraph.
Drawing from criminal databases dating to the 1960s, Berk initially modeled the Philadelphia algorithm on more than 100,000 old cases, relying on three dozen predictors, including the perpetrator’s age, gender, neighborhood, and number of prior crimes. To develop an algorithm that forecasts a particular outcome—someone committing murder, for example—Berk applied a subset of the data to “train” the computer on which qualities are associated with that outcome. “If I could use sun spots or shoe size or the size of the wristband on their wrist, I would,” Berk said. “If I give the algorithm enough predictors to get it started, it finds things that you wouldn’t anticipate.” Philadelphia’s parole officers were surprised to learn, for example, that the crime for which an offender was sentenced—whether it was murder or simple drug possession—does not predict whether he or she will commit a violent crime in the future. Far more predictive is the age at which he (yes, gender matters) committed his first crime, and the amount of time between other offenses and the latest one—the earlier the first crime and the more recent the last, the greater the chance for another offense.
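If you’re curious what this kind of “training” looks like mechanically, here is a toy sketch in Python using scikit-learn. To be clear, this is not Berk’s actual model; the variable names, simulated data, and effect sizes below are invented purely for illustration.

```python
# A toy sketch of the general recipe: fit a classifier to historical cases
# with many predictors, then ask which predictors carry the signal.
# Everything here is simulated for illustration; it is NOT Berk's model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 5000

# Hypothetical predictors loosely mirroring the ones mentioned in the quote
age_first_offense = rng.integers(12, 40, size=n)
months_since_last = rng.integers(1, 120, size=n)
n_priors = rng.poisson(3, size=n)
shoe_size = rng.normal(10, 1.5, size=n)  # the deliberately silly predictor

# Simulated outcome: earlier first offense and more recent last offense
# raise the (simulated) probability of a future violent offense
logit = -2 + 0.08 * (25 - age_first_offense) + 0.02 * (60 - months_since_last)
p = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, p)

X = np.column_stack([age_first_offense, months_since_last, n_priors, shoe_size])
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Which predictors did the algorithm lean on?
for name, importance in zip(
    ["age_first_offense", "months_since_last", "n_priors", "shoe_size"],
    model.feature_importances_,
):
    print(f"{name}: {importance:.2f}")
```

On data simulated this way, the importance scores should recover the pattern described above: age at first offense and time since the last offense dominate, while the noise predictors contribute comparatively little.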
Hat tip to Alex Nones.
10 Jan 2012
When it comes to computing, history has gone back and forth between what I would call the “owner model” and the “renter model”. The question is what’s the best approach and how do you determine that?
Back in the day when people like John von Neumann were busy inventing the computer to work out H-bomb calculations, there was more or less a renter model in place. Computers were obviously quite expensive and so not everyone could have one. If you wanted to do your calculation, you’d walk down to the computer room, give them your punch cards with your program written out, and they’d run it for you. Sometime later you’d get a printout with the results of your program.
A little later, with time-sharing machines, you could use a dumb terminal to log in to a central server and run your calculations that way. I guess that saved you the walk to the computer room (and all the punch cards). I still remember some of these green-screen dumb terminals from my grad school days (yes, UCLA still had these monstrosities in 1999).
With personal computers in the 80s, you could own your own computer, so there was no need to depend on some central computer (and a connection to it) to do the work for you. As computing components got cheaper, these personal computers got more and more powerful and rivaled the servers of yore. It was difficult for me to imagine ever needing things like mainframes again except for some esoteric applications. Especially with the development of Linux, you could have all the power of a Unix mainframe on your desk or lap (or now your palm).
But here we are, with Jeff buying a Chromebook. Have we just taken a step back in time? Is cloud computing and the renter model the way to go? I have to say that I was a big fan of “cloud computing” back in the day. But once Linux came around, I really didn’t think there was a need for the thin client/fat server model.
But it seems we are going back that way and the reason seems to be because of mobile devices. Mobile devices are now just small computers, so many people own at least two computers (a “real” computer and a phone). With multiple computers, it’s a pain to have to synchronize both the data and the applications on them. If they’re made by different manufacturers then you can’t even have the same operating system/applications on the devices. Also, no one cares about the operating system anymore, so why should it have to be managed? The cloud helps solve some of these problems, as does owning devices from the same company (as I do, Apple fanboy that I am).
I think the all-renter model of the Chromebook is attractive, but I don’t think it’s ready for prime time just yet. Two reasons I can think of are (1) Microsoft Office and (2) slow network connections. If you want to make Jeff very unhappy, you can either (1) send him a Word document that needs to be edited in Track Changes; or (2) invite him to an international conference on some remote island. The need for a strong network connection is problematic because I’ve yet to encounter a hotel that had a fast enough connection for me to work remotely on our computing cluster. For that reason I’m sticking with my current laptop.
09 Jan 2012
I don’t mean to brag, but I was an early Apple Fanboy - not sure that is something to brag about now that I write it down. I convinced my advisor to go to all Macs in our lab in 2004. Since then I have been pretty dedicated to the brand, dutifully shelling out almost 2g’s every time I need a new laptop. I love the way Macs just work (until they don’t and you need a new laptop).
But I hate the way Apple seems to be dedicated to bleeding every last cent out of me. So I saved up my Christmas gift money (thanks Grandmas!) and bought a Chromebook. It cost me $350 and I was at least in part inspired by these clever ads.
So far I’m super pumped about the performance of the Chromebook. Things I love:
- About 10 seconds to boot from shutdown, instant wake from sleep
- Super long battery life - 8 hours a charge might be an underestimate
- Size - it’s a 12-inch laptop and just right for sitting on my lap and typing
- Since everything is cloud-based, nothing to install/optimize
It took me a while to get used to the Browser being the operating system. When I close the last browser window, I expect to see the Desktop. Instead, a new browser window pops up. But that discomfort only lasted a short time.
It turns out I can do pretty much everything I do on my MacBook on the Chromebook. I can access our department’s computing cluster by turning on developer mode and opening a shell (thanks Caffo!). I can do all my word processing on Google Docs. Email is just Gmail as usual. ScribTeX for LaTeX (Caffo again). Google Music is so awesome I wish I had started my account before I got my Chromebook. The only thing I’m really trying to settle on is a cloud-based code editor with syntax highlighting. I’m open to suggestions (Caffo?).
I’m starting to think I could bail on Apple….
08 Jan 2012
A few data/statistics related links of interest:
- Eric Lander Profile
- The math of lego (should be “The statistics of lego”)
- Where people are looking for homes.
- Hans Rosling’s Ted Talk on the Developing world (an oldie but a goodie)
- Elsevier is trying to make open-access illegal (not strictly statistics related, but a hugely important issue for academics who believe government funded research should be freely accessible), more here.
08 Jan 2012
Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”
My contention is that if someone asks you these questions, start looking for the exits.
There are of course legitimate reasons why someone might ask you this question. For example, they might be interested in the source of the data to verify its quality. But too often, they are interested in getting the data because they believe it would be a good fit to a method that they have recently developed. Even if that is in fact true, there are some problems.
Before I go on, I need to clarify that I don’t have a problem with data sharing per se, but I usually get nervous when a person’s opening line is “Where do you get your data?” This question presumes a number of things that are usually signs of a bad collaborator:
- The data are just numbers. My method works on numbers, and these data are numbers, so my method should work here. If it doesn’t work, then I’ll find some other numbers where it does work.
- The data are all that are important. I’m not that interested in working with an actual scientist on an important problem that people care about, because that would be an awful lot of work and time (see here). I just care about getting the data from whomever will give it to me. I don’t care about the substantive context.
- Once I have the data, I’m good, thank you. In other words, the scientific process is modular. Scientists generate the data and once I have it I’ll apply my method until I get something that I think makes sense. There’s no need for us to communicate. That is unless I need you to help make the data pretty and nice for me.
The real question that I think people should be asking is “Where do you find such great scientific collaborators?” Because it’s those great collaborators that generated the data and worked hand-in-hand with you to get intelligible results.
Niels Keiding wrote a provocative commentary about the tendency for statisticians to ignore the substantive context of data and to use illustrative/toy examples over and over again. He argued that because of this tendency, we should not be so excited about reproducible research, because as more data become available, we will see more examples of people ignoring the science.
I disagree that this is an argument against reproducible research, but I agree that statisticians (and others) do have a tendency to overuse datasets simply because they are “out there” (stackloss data, anyone?). However, it’s probably impossible to stop people from conducting poor science in any field, and we shouldn’t use the possibility that this might happen in statistics to prevent research from being more reproducible in general.
But I digress…. My main point is that people who simply ask for “the data” are probably not interested in digging down and understanding the really interesting questions.