Data science relies heavily on the predictive capabilities of machine learning (ML) algorithms. Python provides a convenient environment for experimenting with these algorithms because of its readability and efficiency, and its abundance of libraries makes it even more attractive. A library is a collection of code that can be imported into your application. A framework is an interface or tool, often a set of libraries, that lets developers build machine learning models without implementing the underlying algorithms themselves. Even so, developers need to know how these algorithms work in order to interpret the results correctly.
#10 Matplotlib
Matplotlib is an interactive cross-platform library for creating two-dimensional diagrams. It can be used to create high-quality graphs and charts in several formats. Advantages:
- Flexibility. Supports Python and IPython, Python scripts, Jupyter Notebook, web application servers, and many interface tools (GTK+, Tkinter, Qt, and wxPython).
- Provides a MATLAB-style interface for creating diagrams
- Object-oriented interface gives full control over axis properties, fonts, line styles, and so on.
- Compatible with various graphics engines and operating systems.
- Often used in other libraries, such as Pandas.
Disadvantages:
- Having two different interfaces (object-oriented and MATLAB-style) can be confusing to the novice developer.
- Matplotlib is a library for visualization, not data analysis. For the latter, it needs to be combined with others, such as Pandas.
Official documentation: https://matplotlib.org/stable/index.html. Tutorials on matplotlib in Russian: Installing matplotlib and graph architecture / plt 1.
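To make the difference between the two interfaces concrete, here is a minimal sketch that draws the same data both ways. The sample data and titles are made up for illustration, and the `Agg` backend is chosen only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# MATLAB-style (pyplot) interface: calls act on an implicit "current" figure.
plt.figure()
plt.plot(x, y)
plt.title("pyplot style")
pyplot_fig = plt.gcf()

# Object-oriented interface: you hold explicit Figure and Axes objects.
fig, ax = plt.subplots()
ax.plot(x, y, linestyle="--")
ax.set_title("object-oriented style")
ax.set_xlabel("x")
```

The pyplot style is shorter for quick plots, while the object-oriented style tends to be easier to manage once a figure has several axes.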
#9 Natural Language Toolkit (NLTK)
NLTK is a framework and a set of libraries for symbolic and statistical natural language processing (NLP). It is the standard toolkit for NLP in Python. Benefits:
- The library contains graphical tools as well as data examples.
- Includes a book and a set of examples for beginners.
- Provides support for various ML operations such as classification, parsing, tokenization, and so on.
- Works as a platform for prototyping and building research systems.
- Supports several natural languages.
Disadvantages:
- To work with NLTK you need to understand how to work with strings. However, documentation can help with this.
- Word tokenization works by first splitting the text into sentences, which has a negative impact on performance.
Official documentation: https://www.nltk.org/.
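As a small illustration of tokenization, the sketch below uses two NLTK components that should work without downloading any extra corpora: the regular-expression based `wordpunct_tokenize` and the `PorterStemmer`. The example sentence is made up.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "NLTK splits text into tokens."

# Split on word characters and punctuation; no trained model is needed here.
tokens = wordpunct_tokenize(text)

# Reduce a word to its stem with the classic Porter algorithm.
stemmer = PorterStemmer()
stem = stemmer.stem("running")

print(tokens)
print(stem)
```

Other tokenizers, such as `word_tokenize`, rely on data packages that have to be downloaded separately.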
#8 Pandas
Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. Benefits:
- Expressive, fast and flexible data structures.
- Supports aggregation, concatenation, iteration, reindexing, and visualization operations.
- Flexible and compatible with other Python libraries.
- Intuitive data management with a minimal set of commands.
- Supports a wide range of commercial and academic domains.
- Performance.
Disadvantages:
- Its plotting is built on matplotlib, which means a beginner should be familiar with both to understand which is best for a particular problem.
- Less suitable for n-dimensional arrays and statistical modeling. Better to use NumPy, SciPy, or scikit-learn for that.
Official documentation: https://pandas.pydata.org/pandas-docs/stable/index.html. Brief documentation with examples: Introduction to the pandas library: installation and first steps / pd 1. Lessons on Pandas in Russian: Fundamentals of Pandas №1 // Reading files, DataFrame, data selection.
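The operations listed above (concatenation, aggregation, reindexing) can be sketched in a few lines. The table names and numbers here are invented purely for illustration.

```python
import pandas as pd

# Two small, made-up tables of quarterly revenue.
sales_q1 = pd.DataFrame({"region": ["north", "south"], "revenue": [100, 150]})
sales_q2 = pd.DataFrame({"region": ["north", "south"], "revenue": [120, 130]})

# Concatenation: stack the two tables into one.
sales = pd.concat([sales_q1, sales_q2], ignore_index=True)

# Aggregation: total revenue per region.
totals = sales.groupby("region")["revenue"].sum()

# Reindexing: impose a new index; labels with no data become NaN.
reindexed = totals.reindex(["north", "south", "west"])

print(totals)
print(reindexed)
```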
#7 Scikit-Learn
This library is based on matplotlib, NumPy and SciPy. It provides several tools for data analysis and mining. Advantages:
- Simple and efficient.
- Actively developed and frequently updated.
- Variety of algorithms, including cluster analysis, factor analysis, and principal component analysis (PCA).
- Can extract data from images and text.
- Can be used for NLP.
Disadvantages:
- Designed for classical machine learning and not well suited to deep learning.
Official documentation: https://scikit-learn.org/stable/.
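Here is a minimal sketch of two of the algorithms mentioned above, principal component analysis followed by cluster analysis, on the iris dataset that ships with scikit-learn. The choice of two components and three clusters is an assumption made for the example, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# 150 flower samples with 4 numeric features each.
X, _ = load_iris(return_X_y=True)

# Principal component analysis: project the 4 features onto 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

# Cluster analysis: group the projected samples into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

print(X_2d.shape)
print(labels[:10])
```

Most scikit-learn estimators follow the same `fit` / `transform` / `predict` pattern, which is a large part of why the library feels simple to use.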
#6 Seaborn
A library for creating statistical graphs in Python. It is based on matplotlib and integrates with pandas data structures. Benefits:
- Offers more visually appealing graphs compared to matplotlib.
- Offers built-in graphs that matplotlib does not.
- Uses less code for visualization.
- Excellent integration with Pandas: a combination of data visualization and analysis.
Disadvantages:
- Builds on matplotlib, so you need to understand which library to use in which case.
- Relies on default themes, so the result is not as customizable as matplotlib.
Official documentation: https://seaborn.pydata.org/.
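To show how seaborn works directly with pandas data, here is a short sketch that plots a made-up DataFrame. The column names and values are hypothetical, and the `Agg` backend is set only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is required
import pandas as pd
import seaborn as sns

# Invented measurements, for illustration only.
df = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "weight": [50, 58, 69, 80],
    "group": ["a", "a", "b", "b"],
})

sns.set_theme()  # apply seaborn's default theme

# Columns are referenced by name; seaborn labels the axes and builds the legend.
ax = sns.scatterplot(data=df, x="height", y="weight", hue="group")
```

The returned object is a regular matplotlib `Axes`, so matplotlib calls can still be used for further customization.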
#5 NumPy
NumPy adds multidimensional array and matrix processing to Python, along with a large collection of high-level mathematical functions for working with them. It is commonly used for scientific calculations. Consequently, it is one of the most used Python packages for machine learning. Advantages:
- Intuitive and interactive.
- Offers Fourier transforms, random number generation, and tools for integrating code written in languages like C/C++ and Fortran.
- Versatility – other machine learning libraries, such as scikit-learn and TensorFlow, accept NumPy arrays as input, and Pandas has NumPy under the hood.
- Strong community involvement in development.
- Simplifies complex mathematical implementations.
Disadvantages:
- Can be overly complex – not worth using if you’re happy with regular Python lists.
Official documentation: https://numpy.org/. Tutorials on NumPy in Russian: Introduction and installation of the NumPy / np 1.
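A brief sketch of what the array and Fourier-transform features look like in practice. The matrix values and the test signal are made up for the example.

```python
import numpy as np

matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([1.0, 1.0])

# Matrix-vector product and a column-wise mean, with no explicit loops.
product = matrix @ vector
col_means = matrix.mean(axis=0)

# Fourier transform: a sine wave with 5 cycles over 64 samples
# should have its strongest component in frequency bin 5.
t = np.arange(64) / 64
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))

print(product, col_means, peak_bin)
```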
#4 Keras
A very popular machine learning library in Python, providing a high-level neural network API that runs on top of TensorFlow, CNTK or Theano. Benefits:
- Great solution for experimentation and rapid prototyping.
- Portable.
- Offers a lightweight representation of neural networks.
- Easy to use for modeling and visualization.
Disadvantages:
- Slow because it requires creating a computational graph before performing operations.
Official documentation: https://keras.io/. Lessons on Keras in Russian: Advantages and limitations of Keras / keras 1.
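The following sketch shows roughly what rapid prototyping with Keras looks like. It assumes a Keras installation with a working backend such as TensorFlow; the layer sizes and the all-zero input are arbitrary choices made for illustration.

```python
import numpy as np
import keras

# A tiny fully connected network: 4 inputs -> 8 hidden units -> 1 output.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Run the untrained model on two dummy samples just to check the shapes.
predictions = model.predict(np.zeros((2, 4), dtype="float32"), verbose=0)
print(predictions.shape)
```

Training would follow with `model.fit(...)` on real data, which is omitted here.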
#3 SciPy
Popular library with different modules for optimization, linear algebra, integration and statistics. Benefits:
- Suitable for image processing.
- Provides simple processing of mathematical operations.
- Offers efficient mathematical operations including integration and optimization.
- Supports signal processing.
Disadvantages:
- The name SciPy hides both a stack and a library. However, the library is part of the stack. This can be confusing.
Official documentation: https://www.scipy.org/. Introduction to SciPy in Russian: Guide to SciPy: what it is, and how to use it.
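As a small illustration of the integration and optimization modules, the sketch below integrates and minimizes two simple functions. Both functions are chosen only because their answers are easy to check by hand.

```python
from scipy import integrate, optimize

# Numerical integration: the integral of x^2 from 0 to 3 is 9.
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)

# Optimization: (x - 2)^2 has its minimum at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)

print(area)
print(result.x)
```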
#2 PyTorch
A popular library based on Torch, which, in turn, is written in C and wrapped in Lua. Originally created by Facebook, but now used by Twitter, Salesforce, and many other organizations. Benefits:
- Contains tools and libraries for computer vision, natural language processing, deep learning, and more.
- Developers can perform calculations on tensors using GPU acceleration.
- Helps create computational graphs.
- Modeling process is simple and transparent.
- The standard define-by-run mode is more like classic programming.
- Uses familiar debugging tools such as pdb, ipdb, or the PyCharm debugger.
- Offers many pre-built models and modules that can be combined with each other.
Disadvantages:
- Because PyTorch is relatively new, there aren’t many online resources. This makes it difficult to learn from scratch, though it’s still fairly intuitive.
- Less production-ready than TensorFlow.
Official documentation: https://pytorch.org/.
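The define-by-run mode mentioned above can be sketched in a few lines: the computational graph is recorded while ordinary Python code executes, so plain `if` statements and debuggers work as usual. The function being differentiated is an arbitrary example.

```python
import torch

# A scalar tensor that tracks gradients.
x = torch.tensor(3.0, requires_grad=True)

# The graph is built as this code runs, so normal control flow is allowed.
y = x ** 2 if x > 0 else -x

# Backpropagate: for y = x^2 the gradient dy/dx is 2x.
y.backward()
grad = x.grad.item()

# Tensors can be moved to a GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(grad, device)
```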
#1 TensorFlow
Originally developed by Google, TensorFlow is a high-performance library for data flow graph computing. Under the hood, it is more of a framework for creating and running calculations that use tensors. TensorFlow is most often used in neural networks and deep learning. This makes it one of the most popular libraries. Benefits:
- Supports reinforcement learning and other algorithms.
- Provides computational graph abstraction.
- Huge community.
- Provides TensorBoard, a tool for visualizing models right in your browser.
- Production-ready.
- Can be deployed on multiple CPUs and GPUs.
Disadvantages:
- Much slower than other CPU/GPU frameworks.
- Steep learning curve compared to PyTorch.
- Computational graphs can be slow.
- Not commercially supported.
- Not a great toolkit.
Official documentation: https://www.tensorflow.org/.
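Here is a minimal sketch of a tensor computation compiled into a data flow graph. It assumes TensorFlow 2.x, where `tf.function` traces a Python function into a graph; the function name `affine` and the values are made up for the example.

```python
import tensorflow as tf

@tf.function  # traces the Python function into a data flow graph
def affine(x, w, b):
    return tf.matmul(x, w) + b

x = tf.constant([[1.0, 2.0]])    # shape (1, 2)
w = tf.constant([[3.0], [4.0]])  # shape (2, 1)
b = tf.constant([0.5])

# 1*3 + 2*4 + 0.5 = 11.5
result = affine(x, w, b)
print(result.numpy())
```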
Conclusions
Now you know the difference between libraries and frameworks in Python, and you can weigh the advantages and disadvantages of the most popular machine learning libraries.