R data.table in 5 minutes


Hi. I’m Sharon Machlis at IDG, here with
episode 36 of Do More With R: data.table in 5 minutes. Data.table is a package known for speed and
power for data wrangling and analysis. Fans say its syntax is both concise and consistent.
That syntax is also a bit different than either base R or the tidyverse. Let’s dive in and see how it works. A data table object is a type of data frame,
but with special features. There are a couple of ways to create one. Here I’ll load the
package and import a CSV file with about 645K rows and 20+ columns using data.table’s
fread() function. You can see fread() is super fast. Looking
at the object’s class, mydt is both a data frame and a data table. You can also turn an existing data frame into
a data.table with the as.data.table() function. I’ll create a data frame with base R’s
read.csv and then convert it. OK let me get rid of
the copies. Let’s take a look at the structure of mydt:
Now I’ve got a data.table here, too. We’ve got one row per flight, with info
like flight date, origin info, destination info, and some time and delay info. What if
I want to look at flight delays only from Boston to San Francisco by airline? Here’s the basic syntax for a LOT of things
you’ll want to do with data in data.table: Mydt open bracket I, j, by close bracket Which means: Start with mydt, subset or reorder
using I, calculate using j, and order by by. So. If I want to subset this data for the
origin being Logan Airport, which is BOS, all I have to do is put ORIGIN==BOS in that
I section. What if I want to look at flights from Boston
and to San Francisco? I just add a second condition I’ll run that code and save it in a new
variable called myresult. To calculate the average (or mean) delay in
minutes, I put the calculation in the j part. What I want is the mean of the ARR_DELAY_NEW
column, which gives delay in minutes. I need na.rm=TRUE to remove all the not
availables. Let me run that. You see now I have a single average for all
the delays. Next, I want the average delay by airline.
For that, I use the by part. Let me run that code. Hmmm. Those Airline carrier names aren’t
very intuitive. I don’t have much time to go into joining tables, but quickly so you
can see how easy it is . . . I’ve got a separate lookup table CSV with
airline codes and the airline names that I’ll import. Now here’s one way to do the join.
I’ll first set joining (and indexing) keys for each table – OP_UNIQUE_CARRIER for the
original data and Code for the lookup table. Then it’s just my lookup table with my original
data table in brackets. [ There are other ways to join data — which
I’ll explore in a future video since I’m out of time for this episode. Thanks for watching!
For more R tips, head to the Do More With R page at go dot infoworld dot com slash more
with R, all lowercase except for the R You can also find the Do More With R playlist
on the YouTube IDG Tech Talk channel — where you can subscribe so you never miss an episode.
Hope to see you next time!

, , , , , , , , , , , , ,

Post navigation

Leave a Reply

Your email address will not be published. Required fields are marked *