I recently wanted to run Apache Airflow on my laptop as an upgrade to cron. I don’t really want all the full power and fury that Airflow brings, but I’m not one to shy away from killing a gnat with a sledge hammer. I use conda to manage environments on my mac and I wanted to keep airflow in a conda environment. So I ran the following which sets up a conda environment called airflow then installs airflow in that environment.
There’s a charming little brain teaser that’s going around the Interwebs. It’s got various forms, but they all look something like this: This problem can be solved by pre-school children in 5-10 minutes, by programer – in 1 hour, by people with higher education … well, check it yourself! ![:)](http://girlsaregeeks.com/WPApp/wp-includes/images/smilies/icon_smile.gif) 8809=6 7111=0 2172=0 6666=4 1111=0 3213=0 7662=2 9313=1 0000=4 2222=0 3333=0 5555=0 8193=3 8096=5 7777=0 9999=4 7756=1 6855=3 9881=5 5531=0 2581=?
I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I’m not a “closed form” kinda guy. I’m more of a “numerical simulation” type of fellow. So I whipped up a little R code to illustrate the process then we changed the parameters of the gamma distribution to see how it impacted fit. An exercise like this is what I call building a “toy model” and I think this is invaluable as a method for building intuition and a visceral understanding of data.
A bad analogy can frame an entire conversation improperly. This is one of those “anecdotes from a middle-aged man posts.” So take it with a grain of salt. A number of years ago I worked in the risk management team for an insurance company that sold long term care (LTC) insurance. LTC insurance is a private product that covers home health care and nursing home care if the policyholder is unable to take care of themselves on their own.
In a previous post I discussed my frustrations with trying to get Dropbox or Spideroak to perform BOTH encrypted remote backup and AND fast two way file syncing. This is the detail of how I set up for two machines, both Ubuntu 10.10, to perform two way sync where a file change on either machine will result in that change being replicated on the other machine. I initially tried running Unison on BOTH my laptop and the server and had the server Unison set to sync with my laptop back through an SSH reverse proxy.
I love the portability of a laptop. I have a 45 min train ride twice a day and I fly a little too, so having my work with me on my laptop is very important. But I hate doing long running analytics on my laptop when I’m in the office because it bogs down my laptop and all those videos on The Superficial get all jerky and stuff. I get around this conundrum by running much of my analytics on either my work server or on an EC2 machine (I’m going to call these collectively “my servers” for the rest of this post).
It’s been pointed out to me that I haven’t had any blog posts in a while. It’s true. I’m fairly slack. But in the last few months I’ve changed jobs (same firm, new role), written an R abstraction on top of Hadoop, been to China, and managed to stay married. While that sounds pretty awesome, I’m nothing compared to Hideaki Akaiwa. And you may have heard that the R Cookbook by Chicago’s own Paul Teeter has been printed!
I’ve been messing around with using Amazon Web Services for a while. I’ve had some projects where I wanted to upload files to S3 or fire off EMR jobs. I’ve been controlling AWS services using a hodgepodge of command line tools and the R system() function to call the tools from the command line. This has some real disadvantages, however. Using the command line tools means each tool has to be configured individually which is painful on a new machine.
A few months ago I switched my laptop from Windows to Ubuntu Linux. I had been connecting to my corporate SQL Server database using RODBC on Windows so I attempted to get ODBC connectivity up and running on Ubuntu. ODBC on Ubuntu turned into an exercise in futility. I spent many hours over many days and never was able to connect from R on Ubuntu to my corp SQL Server.
Over at stats.stackexchange.com recently, a really interesting question was raised about principal component analysis (PCA). The gist was “Thanks to my college class I can do the math, but what does it MEAN?” I felt like this a number of times in my life. Many of my classes were focused on the technical implementations they kinda missed the section titled “Why I give a shit.” A perfect example was my Mathematics Principles of Economics class which taught me how to manually calculate a bordered Hessian but, for the life of me, I have no idea why I would ever want to calculate such a monster.
[caption id=“attachment_825” align=“alignleft” width=“250” caption=“André-Louis Cholesky is my homeboy”][/caption] When I did a brief post three days ago I had no plans on writing two more posts on correlated random number generation. But I’ve gotten a couple of emails, a few comments, and some Twitter feedback. In response to my first post, Gappy, calls me out and says, “the way mensches do multivariate (log)normal variates is via Cholesky. It’s simple, instructive, and fast.
So after yesterday’s post on Simple Simulation using Copulas I got a very nice email that basically begged the question, “Dude, why are you making this so hard?” The author pointed out that if what I really want is a Gaussian correlation structure for Gaussian distributions then I could simply use the mvrnorm() function from the MASS package. Well I did a quick ?mvrnorm and, I’ll be damned, he’s right! The advantage of using a copula is the ability to simulate correlation structures where the correlation is different for different levels of values.
A friend of mine gave me a call last week and was wondering if I had a little R code that could illustrate how to do a Cholesky decomposition. He ultimately wanted to build a Monte Carlo model with correlated variables. I pointed him to a number of packages that do Cholesky decomp but then I recommended he consider just using a Gaussian Copula and R for the whole simulation.
I do some work from home, some work from an office in Chicago and some work on the road. It’s not uncommon for me to want to tunnel all my web traffic through a VPN tunnel. In one of my previous blog posts I alluded to using Amazon EC2 as a way to get around your corporate IT mind control voyeurs service providers. This tunneling method is one of the 5 or so ways I have used EC2 to set up a tunnel.
I’ve been continuing to muck around with using R inside of Amazon Elastic Map reduce jobs. I’ve been working on abstracting the lapply() logic so that R will farm the pieces out to Amazon EMR. This is coming along really well, thanks in no small part to the Stack Overflow [r] community. I have no idea how crappy coders like me got anything at all done before the Interwebs. One of the immediate hurdles faced when trying to use AMZN EMR in anger is that the default version of R on EMR is 2.
I’m kinda blown away by the number of folks who have joined the Chicago R User Group (RUG) in the last few weeks. As of this morning we have 65 people signed up for the group and 25 who have said that they are planning on attending the meetup this Thursday (yes, only 3 days away!) I’m very pleased that this many people in Chicago find the R language interesting and/or valuable.
On Tuesday May 4th at 9:30 PM central, 10:30 eastern, I’ll be giving a live online presentation as part of the Vconf.org open conference series. I’ll be speaking about R and why I started using R a couple years ago. This is NOT going to be a technical presentation but rather an illustration of how an R convert was created and why R became part of my daily tool set.
Back in November 2009 Wired wrote an article about some grad students who decided to try to stochastically model throwing darts. Because I don’t actually read printed material I didn’t see the article until a couple of months ago. My immediate thought was, “hey, I drink beer. I throw darts. I build stochastic models. Why haven’t I done this?” Well we all know why I haven’t done this. I have a job and a 2 year old daughter and I like my wife.
[caption id=“attachment_673” align=“alignleft” width=“169” caption=“Morris Day, y’all! “][/caption] I think we all know that Morris Day was talking about when he wrote the lyrics to “The Bird”: Yes! Hold on now, this dance ain't for everybody. Just the sexy people. White folks, you're much too tight. You gotta shake your head like the black folks. You might get some tonight. Look out! That’s right, he was talking about the new R User Group in Chicago!
The future of math is statistics… and the language of that future is R: I’ve often thought there was way too little “statistical intuition” in the workplace. I think Author Benjamin would agree.
Rumor has it that Joe Adler, author of the O’Reilly Book R in a Nutshell, has joined Linked In as a data scientist. But that does not keep him from still pumping out some interesting content over at OReilly.com. His latest article is about lookup performance in R. He does a great job giving code samples and explaining what he is doing. Worth reading, for sure.
Stop wasting time reading my drivel. You need to head over the the DataWrangling.com blog and read Peter Skomoroch’s interview with Bradford Cross of FlightCaster. Peter wrote up this interview back in August 2009, so I’m a little late to this party. There’s some really great quotes in this interview. Here’s a few of my fav quotes from Cross: At Google, the research scientists prototype in python and R, and then port to C++ for the real scalable map reduce runs.
[caption id=“attachment_594” align=“alignleft” width=“261” caption=“This blog’s name in Chinese! “][/caption] I just came back from the future and let me be the first to tell you this: Learn some Chinese. And more than just cào nǐ niáng (肏你娘) which your friend in grad school told you means “Live happy with many blessings”. Trust me, I’ve been hanging with Madam Wu and she told me it doesn’t mean that. So how did I travel to the future to visit with Madam Wu, you ask?
One of my primary uses for R is to build stochastic simulations of insurance portfolios and reinsurance treaties. It’s not uncommon for each of my simulations to take 20 seconds or more to complete (if you’re doing the math, that’s 55 hours for 10K sims or, approximately 453 games of solitaire) . Initially I ran my sims in R running on an Oracle VirtualBox (Oracle now owns Virtualbox! gasp ) running Ubuntu.
It’s common knowledge that I struggle wrapping my head around the apply functions in R. That is illustrated very clearly in the following discussion on Stack Overflow: Dirk’s comment is actually spot on. I’ve asked the same damn question at least 4-5 times. Only I didn’t really understand it was the same question. That’s one of the problems of not really being good at something; it’s hard to think abstractly about it.
So for the rest of this conversation big data == 2 Gigs. Done. Don’t give me any of this ‘that’s not big, THIS is big’ shit. There now, on with the cool stuff: This week on twitter Vince Buffalo asked about loading a 2 gig comma separated file (csv) into R (OK, he asked about tab delimited data, but I ignored that because I use mostly comma data and I wanted to test CSV.
Tonight (October 29, 2009) at 5:30 PM is the Chicago R meetup at Jaks tap. Here’s more info. I’ll be making a presentation based on my earlier blog post about plyr. The presentation will only be 8 minutes long so I’ve had to pick and choose my info carefully. OK, who am I kidding? I had a couple of Schlitz (in a bottle!) for lunch over at Boni Vinos and slammed some slides together rather haphazardly.
I’m not dead yet! Although it has been rumored that I am. The new job is going great and I’m thrilled to be with a new firm doing interesting work alongside smart people. It makes me seem smarter by simple association. There’s been a lot going on recently in the R user community. There was anR flash mob of Stack Overflow which resulted in a noticeable increase in the number of R questions and answers in SO.
So one glace at my user logs shows the truth: no one gives a rat’s rump thatI just quit my job; you just love you some Twitter R code. And I’m nothing but an attention whore, so come get some! So in mylast ‘Twitter with R’ post I gave you some code I’d written ripped off that allowed you to update your status from R. That’s kinda cool, but really just for annoying your friends, tweeting when your code is finished running or, as Eva pointed out in the comments, maybe Tweeting the outcome of a routine.
[caption id=“” align=“alignleft” width=“188” caption=“Pretty Normal”][/caption] Dave, over at The Revolutions Blog,posted about the big ‘ol list of graphs created with R that are over at Wikimedia Commons. As I was scrolling through the list I recognized the standard normal distribution from the Wikipedia article on the same topic. Below is the fairly simple source code with lots of comments. Here’s the source. Run it at home… for fun and profit.
So I have started following the #RStats tag in twitter. Prior to a week ago I had never Twitterbated so I thought I would give it a go since I am not one to shy away from new technology… much. I think of Twitter like a call in radio show where I get to cut off callers when they annoy me. Well one of the interesting things I ran across was this tweet that pointed to this page about posting tweets from R.
I’ve been struggling for a while on which database to use for my working data. I used to use MS Access quite a lot. The problems with MS Access include but are not limited to: * 2 GB file size limit, at least historically * Versions change with each edition of MS Office * Sort of tough to write SQL scripts * Very little automation, ie compression, backup, etc. * Windows only I used Oracle for a few years as a result of my previous employer being an Oracle shop.
In honor of me moving to Chicago, the powers who abide have decided to hold the first annual “R/Finance conference for applied finance using R” conference in Chicago this year. The dates are April 24-25, 2009. R/Finance 2009: Applied Finance with R To those who made the decision on location, I’m pleased but slightly embarrassed that you let my relocation decision have such a profound impact on your venue choice.
One of the many things that I sit around pondering when I should be doing productive things is the idea of analytical workflow. I have only worked with one analytical guru who I felt really gave thought and structure to workflow and its impact on analyist productivity. When I talk about workflow I mean the whole process from the time the analytical guy thinks, “Hey, I need to understand the velocity of new purchases between different types of sales campaigns.
So Andrew Gelman hates box plots. Not that you should give a buck what Gelman thinks. I’m just setting this blog post up, OK. So stick with me. Gelman also thought this XKCD cartoon was NOT funny : There’s some correlation as well as causation. I could be wrong, but I suspect that the reason Gelman does not like the XKCD cartoon is because he’s very literal, as geeks can be.
Recently when reading the Revolutions R Blog they talked about getting to the actual source of routines in R. They linked to an R News PDF from a couple years back. The actual text of the article is buried 43 pages into the PDF. To increase awareness and for my own future reference I’m going to reprint Uwe Ligges’ article here. # **Accessing the Sources** by Uwe Ligges ## Introduction One of the major advantages of open source software such as R is implied by its name: the sources are open (accessible) for everybody.
[caption id=“attachment_92” align=“alignleft” width=“165” caption=“I love ya, SAS! “][/caption] Tibco acquired Insightful last year. Many folks have reported that S-Plus (the closed source implementation of the S language) was dang near financially killed by the success of R (the open source implementation of the S language). I may be slow on the update, but today I learned that a company named World Programming is making a product, WPS, which is an alternative implementation of the SAS base language.
After writing the previous entry about Taleb and Mandelbrot I was thinking, “hey, I should be able to create the Mandelbrot set using R.” I’ve never actually tried to code the Mandelbrot set, but it seems easy enough. Well I messed with it for 2 maybe 3 seconds and then Googled [Mandelbrot set r] and, no surprise, I found a great implementation that not only produces the Mandelbrot set, it produces it with animation!
Robert Gentleman, left, and Ross Ihaka, the creators of R. Picture courtesy of the New York Times For those of us who use R on a regular basis it was pretty neat to see a mainstream media piece on the R language. Ashlee Vance of the New York Times did a good job on the piece. It is hard to explain to your average kitchen table reader of the Grey Lady why a computer programing language is important.