Python and R packages you're probably not using, but should.


Community Admin • Strategy


Learn more about the world of Data Science with this blog from one of our MicroStrategy experts. Review different Python and R packages, such as Keras and R6, that will help you with your next data science project. Start driving innovation at your company with these tips and tricks.

Read Time | Approximately 4 min

Between Python and R, Data Scientists have access to over 190,000 [1, 2] open source packages that can be freely integrated into their projects. Chances are, you’re already using some of the must-haves in your projects today. How about SciKit-Learn, NumPy, randomForest, dplyr, and ggplot2? Those should sound familiar.
 
Hidden among those 190,000 packages are countless useful ones. Here are some awesome Python and R packages that you might not have come across but should take a look at before you start your next data science project. Ready? Let's get going! 
 
Explore different Python packages  
 
 



 
Seaborn
 
Python doesn’t have a built-in graphics library that’s designed for data visualization. Matplotlib has become the go-to visualization library for many users, but it leaves much to be desired in terms of aesthetics. Enter Seaborn. Just like ggplot2 did for R, Seaborn lets users create more appealing visualizations using a simpler syntax. Take a look at the gallery for proof.
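As a taste of that simpler syntax, here’s a minimal sketch; the data values are made up for illustration, and the Agg backend is set only so the script runs on machines without a display:

```python
# Render off-screen so this works headlessly.
import matplotlib
matplotlib.use("Agg")
import seaborn as sns

# Seaborn accepts plain Python lists (or pandas columns) directly.
ax = sns.scatterplot(x=[1, 2, 3, 4], y=[1.2, 3.9, 9.1, 15.8])
ax.set(xlabel="x", ylabel="y", title="Seaborn in three lines")
ax.figure.savefig("scatter.png")
```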
 
NetworkX
 
A must-have if you’re working with network graph data (social networks, computer networks, citation networks, etc.). NetworkX is comprehensive, well-documented, and fast, and it includes a large collection of graph-traversal and analysis algorithms. It also has its own plotting support so you can visualize your networks (just don’t try to plot the entire internet – your laptop might catch fire).
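Here’s a minimal sketch of what working with NetworkX looks like; the toy graph is invented for illustration:

```python
# A tiny undirected graph: four nodes, five edges.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")])

# Graph algorithms come built in.
print(nx.shortest_path(G, "A", "C"))   # a length-3 path, e.g. ['A', 'B', 'C']
print(nx.degree_centrality(G))         # normalized degree per node
print(nx.has_path(G, "A", "D"))        # True
```

From there, `nx.draw(G)` will render the graph through Matplotlib if you want a picture.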



 
Keras
 
TensorFlow is extremely popular for deep learning, but it has a very dense API (and some odd terminology) that makes getting started harder than it needs to be. Keras abstracts the TensorFlow API into a simpler one, making it easier to iteratively build and train deep learning networks. It might seem strange, but to Keras, TensorFlow is just the computational backend, which means you can swap out the deep learning framework for another one if your requirements change. Currently, Keras supports TensorFlow, Theano, and CNTK.
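A minimal sketch of that simpler API: a tiny binary classifier built with `Sequential`. The layer sizes here are arbitrary, and depending on your install the same API may be importable as `tensorflow.keras` instead of `keras`:

```python
import keras  # with recent TensorFlow: from tensorflow import keras

# Stack layers in order; shapes are inferred after the Input layer.
model = keras.Sequential([
    keras.Input(shape=(4,)),                      # 4 input features
    keras.layers.Dense(8, activation="relu"),     # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training is then a single call: model.fit(X_train, y_train, epochs=10)
```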
 
Flask
 
With Flask and its extensions (namely Flask-RESTful), you can build a complete web service in Python. But when would a Data Scientist need to do that? Imagine you’ve trained a SciKit-Learn model and the business requirements state that you need to provide on-demand predictions to a third-party application. Flask lets you deploy the model as a web service with its own API for scoring; the third-party application can then call that API and get predictions in return.
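A minimal sketch of that pattern; the `score` function below is a hypothetical stand-in for a real model’s `predict` call (in practice you’d load a pickled SciKit-Learn model at startup):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def score(features):
    # Hypothetical stand-in for model.predict(...); a real service
    # would call a trained SciKit-Learn model here.
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    return jsonify({"prediction": score(features)})

# Exercise the endpoint with Flask's built-in test client.
client = app.test_client()
resp = client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
print(resp.get_json())  # {'prediction': 6.0}
```

In production you’d run this behind a WSGI server such as gunicorn rather than Flask’s development server.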
 
NLTK
 
A common rule of thumb is that 80% of valuable business data is unstructured text [3]. That means you might be able to get more value from your data with some basic text processing. NLTK, short for “Natural Language Toolkit,” is a must-have for anyone working with text data in Python. Some of the things NLTK can do: tokenization (breaking text up into lists of words), stemming (reducing words to their roots, e.g. trimming “running” down to “run”), and part-of-speech tagging (labeling each word as a verb, noun, adjective, and so on). It also comes with large corpora, like WordNet, that can be used in machine learning tasks like sentiment analysis.
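A minimal sketch using pieces of NLTK that don’t require any downloaded data (the more powerful `nltk.word_tokenize` and `nltk.pos_tag` need a one-time `nltk.download(...)` first):

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer

# Regex-based tokenization: splits on word vs. punctuation boundaries.
tokens = WordPunctTokenizer().tokenize("NLTK makes text processing easy.")
print(tokens)  # ['NLTK', 'makes', 'text', 'processing', 'easy', '.']

# Rule-based stemming: strips suffixes to approximate word roots.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in ["running", "flies", "congratulatory"]])
```

Note that stems are truncated forms, not dictionary words; use a lemmatizer if you need real words back.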
 
 
Get what you need for R


 
data.table
 
An R data frame is a 2-dimensional list of equal-length vectors. Translation: it’s a table of rows and columns, and it’s the most common way of storing data in R. Native R data frames are very resource-intensive on large datasets, and it’s difficult to perform SQL-like aggregations or tabulations on them without the assistance of other packages. The data.table package alleviates all of that and, because it’s written in C, is blazing fast even on huge datasets. It does have a bit of an odd syntax, but once you get the hang of it, you probably won’t go back to regular data frames. Try this cheat sheet to get started.
 
caret (and caretEnsemble)
 
The implementation of machine learning algorithms in R varies widely. Packages implementing near-identical algorithms might use entirely different parameter names or different defaults for equivalent operations, so you have to be very methodical when using a new or unfamiliar R package to train a model. Luckily, the caret package does a great job of reining in all of that inconsistency through a common modeling syntax, and it supports over 200 algorithms. Add in caretEnsemble and you’ve got something that’s pretty much on par with SciKit-Learn (Python) in terms of capability.
 
R6
 
What does every traditional application developer ask when trying R for the first time? “Where’s the OOP?” Because R has its roots in statistics academia, some programming principles – like object-oriented programming – aren’t first-class citizens. If you’ve written S3 or S4 objects for packages in R, you should give R6 a look. R6 classes are much closer to classical OOP, are easy to use, and execute faster than R’s built-in Reference Classes. Finally, R6 methods are easy to understand because they’re essentially just functions arranged in an R list.
 
prophet
 
If you’re doing time-series analysis, you’re probably using the forecast package; before the prophet package (developed by Facebook) came along, there wasn’t much debate about which package to use for time-series work. Where forecast centers on classical methods like ARIMA, prophet fits additive models with discrete seasonality effects. Whatever you do, don’t forget to test your forecasts on out-of-sample data!
 



 
profvis
 
Optimizing R code for performance isn’t usually one of the most urgent things on a Data Scientist’s to-do list. But if it were, how would you do it? The profvis package – developed by RStudio – provides a very easy-to-use performance profiler. Assuming you’re using RStudio, all you need to do is wrap your function, snippet, or script in a profvis call, profvis({…}), and hit run. You’ll get an interactive analysis of your code’s CPU and memory utilization, which you can also save as an HTML document to share with teammates. From there, it’s up to you to make your code faster.
 
This article presented 10 different packages for R and Python that can help with model training, data visualization, code profiling, and more. While you’re here, we’d enjoy hearing about some of your favorite packages or tips. Feel free to leave your thoughts in the comments below.
 
 
 
Sources for Industry Insights
 
A special thanks to the following sources.
1 - https://pypi.org/
2 - https://cran.r-project.org/web/packages/available_packages_by_name.html
3 - https://en.wikipedia.org/wiki/Unstructured_data
4 - https://www.southwales.ac.uk/courses/msc-data-science/ 


Details

Knowledge Article
Published: May 22, 2019
Last Updated: September 22, 2022