Recently, I felt I had to give my colleagues an overview of what is possible with R. There are lots of presentations out there, but I wanted something better suited to this audience. Here is what I talked about.
Why R?
R is a great language for data analysis and statistics
- Open-source and multi-platform
- Very expressive with high-level constructs
- Excellent graphics
- Widely used in academia and business
- Very active community
- Documentation and FAQs on Stack Overflow
- Great integration with other tools
Why is R a pain for computer scientists?
- R is not really a programming language:
  - You should always avoid coding in it
  - The semantics and syntax are ugly
  - It is interpreted, hence slow
- Documentation is for statisticians
- Default plots are cumbersome (meaningful)
- Summaries are cryptic (precise)
- The learning curve is steep even for us computer scientists, although we generally switch seamlessly from one language to another! That's frustrating! ;)
Dos and don'ts
R is high level, I'll do everything myself
- CTAN comprises 4,334 TeX, LaTeX, and related packages and tools. Most of you do not use plain TeX.
- Currently, the CRAN package repository features 4,030 available packages.
- How do you know which one to use? Many of them are highly exotic (not to say useless to you). http://www.r-bloggers.com/
- There are lots of introductions, but not necessarily what you're looking for, so I'll give you a short tour.
You should quickly realize, though, that you need proper training in statistics and data analysis if you do not want to talk nonsense.
Install and run R on Debian
apt-cache search r
Err, that is not very useful :) It's the same when searching on Google, but once the filter bubble is set up, it gets better…
sudo apt-get install r-base
R
Install a few cool packages
- plyr (which provides ddply), reshape and ggplot2 (Hadley Wickham) http://had.co.nz/
- knitr (Yihui Xie) http://yihui.name/knitr/
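From the R prompt, installing these packages could look like the sketch below. The helper name `install_if_missing` and the mirror URL are my own assumptions for illustration, not something the packages provide; the only real calls are `installed.packages()` and `install.packages()`.

```r
# Hypothetical helper (an assumption, not part of any package):
# install only the packages that are not already in the local library.
install_if_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  # Names of the requested packages that are not installed yet
  to_install <- setdiff(pkgs, rownames(installed.packages()))
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  # Return (invisibly) what had to be installed
  invisible(to_install)
}

# Usage (needs network access, so it is left commented out here):
# install_if_missing(c("plyr", "reshape", "ggplot2", "knitr"))
```

Calling `install.packages(c("plyr", "reshape", "ggplot2", "knitr"))` directly works just as well; the helper only avoids reinstalling what is already there.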
IDE
Interactive use is nice but quickly becomes painful, so at some point you'll want an IDE.
Emacs is great, but you'll need Emacs Speaks Statistics (ESS).
sudo apt-get install ess
Soon you may fall in love with org-mode. Although part of it comes with the base Emacs installation, you will need the full package to get Babel working.
sudo apt-get install org-mode
You may also want to give RStudio or R Commander a try.
sudo apt-get install rstudio r-cran-rcmdr
Roadmap
- Read data into a dataframe
- Explore the content:
  - names
  - ?names to get the help on a function
  - head, tail, df[2:5,] (do not forget the ","), dim, length, df$toto, df[df$toto=="t",] (again, do not forget the final ",")
  - str, summary
- Reshape it a little
- Plot a bunch of things
- Aggregate and replot
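To make the roadmap concrete, here is a minimal sketch of the five steps with plyr, reshape and ggplot2. The sample data, the column names (`toto`, `time`, `cpu`, `mem`) and the choice of plots are all made up for illustration; with a real file you would call `read.csv` on its path instead of on inline text.

```r
library(plyr)     # ddply, summarise
library(reshape)  # melt
library(ggplot2)  # ggplot

# 1. Read data into a data frame (made-up inline sample data).
csv_text <- "toto,time,cpu,mem
t,1,10,200
t,2,20,210
t,3,30,220
u,1,40,300
u,2,50,310
u,3,60,320"
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# 2. Explore the content.
names(df)
head(df)
df[2:5, ]              # rows 2 to 5, all columns (note the ",")
dim(df)
df[df$toto == "t", ]   # rows where toto is "t" (again, note the ",")
str(df)
summary(df)

# 3. Reshape it a little: one row per (toto, time, variable) measurement.
long <- melt(df, id.vars = c("toto", "time"))

# 4. Plot a bunch of things: one panel per measured variable.
p <- ggplot(long, aes(x = time, y = value, colour = toto)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free_y")

# 5. Aggregate and replot: mean of each variable for each value of toto.
avg <- ddply(long, c("toto", "variable"), summarise,
             mean_value = mean(value))
p_avg <- ggplot(avg, aes(x = toto, y = mean_value)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ variable, scales = "free_y")
```

At the interactive prompt, typing `p` or `p_avg` draws the plot; in a script, use `print(p)`.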

Take away
- R is a great tool, but it is only a tool. There is no magic. You need to understand what you are doing and get at least minimal training in statistics.
- It is one of the building blocks of reproducible research (the reproducible analysis block) and will save you a lot of time.
- Read at least Jain's book: The Art of Computer Systems Performance Analysis. There are many other references on Jean-Marc's shelves or on mine.
Entered on [2012-09-12 Wed 13:00]