In 2005 I was interviewing for a job as Risk Manager with Genworth Financial. I was working a gig up in Armonk, NY so I hopped a car to the GNW office and met with Mark Griffin, at that point the Chief Risk Officer (CRO) for GNW. After some small talk, Mark asked me the single most interesting interview question I’ve ever been asked. I don’t recall the exact wording, but the gist was:
In a previous post I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup AND fast two-way file syncing. This is the detail of how I set up two machines, both running Ubuntu 10.10, to perform two-way sync, where a file change on either machine is replicated on the other. I initially tried running Unison on BOTH my laptop and the server and had the server Unison set to sync with my laptop back through an SSH reverse proxy.
I love the portability of a laptop. I have a 45-minute train ride twice a day and I fly a little too, so having my work with me on my laptop is very important. But I hate doing long-running analytics on my laptop when I’m in the office because it bogs down my laptop and all those videos on The Superficial get all jerky and stuff. I get around this conundrum by running much of my analytics on either my work server or on an EC2 machine (I’m going to call these collectively “my servers” for the rest of this post).
It’s been pointed out to me that I haven’t had any blog posts in a while. It’s true. I’m fairly slack. But in the last few months I’ve changed jobs (same firm, new role), written an R abstraction on top of Hadoop, been to China, and managed to stay married. While that sounds pretty awesome, I’m nothing compared to Hideaki Akaiwa. And you may have heard that the R Cookbook by Chicago’s own Paul Teetor has been printed!
I’ve been messing around with using Amazon Web Services for a while. I’ve had some projects where I wanted to upload files to S3 or fire off EMR jobs. I’ve been controlling AWS services using a hodgepodge of command line tools and the R system() function to call the tools from the command line. This has some real disadvantages, however. Using the command line tools means each tool has to be configured individually which is painful on a new machine.
I’m a huge O’Reilly Media fan boy. I can’t hide it. I hear Tim O’Reilly speak at conferences and I think to myself, “Screw being president, I want to be Tim O’Reilly.” I’ve been a subscriber to their online book service, Safari Books Online, for years. Every month I see the bill for $43 come through and I think to myself, “Self, that’s the best $43 you spent all month.
A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days and never was able to connect from R on Ubuntu to my corp SQL Server.
Over at stats.stackexchange.com recently, a really interesting question was raised about principal component analysis (PCA). The gist was “Thanks to my college class I can do the math, but what does it MEAN?” I’ve felt like this a number of times in my life. Many of my classes were so focused on the technical implementations that they kinda missed the section titled “Why I give a shit.” A perfect example was my Mathematics Principles of Economics class, which taught me how to manually calculate a bordered Hessian but, for the life of me, I have no idea why I would ever want to calculate such a monster.
When I did a brief post three days ago I had no plans on writing two more posts on correlated random number generation. But I’ve gotten a couple of emails, a few comments, and some Twitter feedback. In response to my first post, Gappy calls me out and says, “the way mensches do multivariate (log)normal variates is via Cholesky. It’s simple, instructive, and fast.
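For readers who want to see the Cholesky idea in action, here is a minimal sketch in plain Python (standard library only). It is not the code from the original post: the function names `cholesky`, `correlated_normals`, and `sample_correlation` are made up for illustration, and the 0.7 correlation is an arbitrary example value. The idea is to factor the covariance matrix as L times its transpose, then multiply independent standard normal draws by L so they pick up the desired correlation.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L such that L * L^T equals the symmetric
    positive-definite matrix `a` (given as a list of lists)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def correlated_normals(cov, n_draws, rng):
    """Draw zero-mean multivariate normal vectors with covariance `cov`
    by multiplying independent standard normals by the Cholesky factor."""
    L = cholesky(cov)
    dim = len(cov)
    draws = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        draws.append([sum(L[i][k] * z[k] for k in range(i + 1))
                      for i in range(dim)])
    return draws


def sample_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(42)           # fixed seed so runs are repeatable
    cov = [[1.0, 0.7], [0.7, 1.0]]    # example target: correlation 0.7
    draws = correlated_normals(cov, 20000, rng)
    xs = [d[0] for d in draws]
    ys = [d[1] for d in draws]
    print(round(sample_correlation(xs, ys), 2))
```

Exponentiating each component of a draw would give the lognormal variates Gappy mentions. In R, the same factorization is available through `chol()`, which returns the upper-triangular factor, so the multiplication is transposed relative to this sketch.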
So after yesterday’s post on Simple Simulation using Copulas I got a very nice email that basically begged the question, “Dude, why are you making this so hard?” The author pointed out that if what I really want is a Gaussian correlation structure for Gaussian distributions then I could simply use the mvrnorm() function from the MASS package. Well I did a quick ?mvrnorm and, I’ll be damned, he’s right! The advantage of using a copula is the ability to simulate correlation structures where the correlation is different for different levels of values.