Landscape of High-performance Python to Develop Data Science and Machine Learning Applications

Python has become the prime language for application development in the Data Science and Machine Learning domains. However, data scientists are not necessarily experienced programmers. While Python lets them quickly implement their algorithms, computational efficiency becomes a pressing concern when moving to scale. Moreover, harnessing high-performance devices such as multicore processors and Graphics Processing Units (GPUs) to their full potential is generally not trivial. The present narrative survey was conceived as a reference document for such practitioners, to help them find their way through the wealth of tools and techniques available for the Python language. Our document revolves around user scenarios, which are meant to cover most situations they may face. We believe that this document may also be of practical use to tool developers, who may use our work to identify potential gaps in existing tools and to motivate their contributions.


INTRODUCTION
Python is one of the most widely used programming languages today: in 2022, it ranked first on both the PYPL (PopularitY of Programming Language) index [65] and the TIOBE index [83].
It is intensively used in the growing domains of Data Science (DS), scientific computation, data analytics, and Machine Learning (ML). It often serves as the successor to data-centric and scientific computation programming languages such as R, Fortran, and Matlab. One of the main reasons behind this success in data science lies in its many DS- and ML-focused libraries such as NumPy, Pandas, TensorFlow, Scikit-learn, SciPy, and Matplotlib. Given the amount of data being collected and processed within the DS and ML contexts, most Python high-performance libraries have been developed outside Python, using statically typed languages such as C++, Fortran, and/or CUDA. This review was initially motivated by our practice of Python programming in the domains of DS and ML, where improving performance is critical to obtain timely results [17]. This initial work led us to realize that a systematic review focused on high-performance Python with the practice of DS and ML in mind is lacking from the current literature. As our work is driven by pragmatic concerns, we believe that a narrative review is the most suitable format.

Approach and context
As with any narrative review, we carry out a qualitative evaluation of the diverse extant approaches to improving the performance of Python programs. To conduct this survey, the authors relied on their expertise in the related fields (ML, DS, and software engineering), including industrial experience and involvement in research communities. Their different backgrounds and institutions, as well as their years of experience in the aforementioned fields, helped in considering both academic and practical points of view, in a way that is as broad and relevant as possible.
However, the landscape of tools and techniques in this scope is very diverse, and may be distinguished according to a large number of facets (e.g., level of automation, close ties with a particular DS task, expected amount of effort to put into use). There is no obvious hierarchy among these facets that would drive the structure of the taxonomy presented in this paper. Instead, we focus on three common usage scenarios, which are meant to be mostly mutually exclusive while covering most situations met in practice by data scientists.
• The developed algorithm may involve some non-standard data structure, such as a special kind of knowledge graph. To avoid the effort of searching for an existing library that can be adapted to fit the requirements, a data scientist may have developed a prototype algorithm in pure Python to solve a theoretical or practical problem. However, once validated, the algorithm's performance degrades significantly when it is applied to larger datasets. As a result, there is a need to explore alternative approaches for more efficient computations.
• Most commonly, DS practitioners face situations close to a canonical problem involving standard data structures such as numerical matrices or graphs. Practitioners then design an algorithm to solve the problem and implement it using popular numerical Python libraries such as Numpy or Pandas. After validating the algorithm, they need to apply it to larger datasets, but the runtime becomes excessive.
• Finally, the algorithm may still be only on paper. Instead of boldly starting to implement it in vanilla Python, the data scientist may look for the right library or framework to maximize computational efficiency directly at implementation time, even if it involves learning to master a Domain Specific Language (DSL) or non-standard constructs.
In these three scenarios performance is sought, but from differing starting points and with varying willingness to invest in mastering sophisticated tools. For instance, in the first two scenarios, implementations are already developed, and the goal is to look for cost-effective solutions to scale them up. In contrast, the third scenario takes a longer-term view, where the practitioner is willing to invest in the most appropriate tool from the start.
Given this context, the objective of this survey is to answer the following question: which approaches can be used to improve Python execution performance in the context of one of these three scenarios?

Search Strategy
Our search strategy started with Google Scholar, a search engine that indexes the metadata, or even the whole text, of scientific publications in multiple scientific repositories. In order to reflect our inclusion and exclusion criteria (see Table 1), we used the keyword Python combined with Data Science or Machine Learning, as well as other keywords associated with high-performance computation and code acceleration (see also Table 1). The first seed of relevant papers thus collected pointed us to other important works through inspection of the cited references. We also looked at implementations and repositories referred to in this set of selected papers. We also covered social networks and forums commonly used by data science practitioners and software engineers for exchanging and sharing, such as Reddit, Stack Overflow, Kaggle, and Data Science Central. We applied our search strategy to reference code repositories and Python package indexes as well, mostly GitHub and PyPI. Scientific publications were primarily filtered by date, between 2018 and 2022. Nonetheless, we also review some tools dated before that time frame if they are highly relevant to our scope or if they have been actively maintained. The summary of the search strategy is presented in Table 1.
As stated beforehand, we focus on performance improvement in the context of three identified profiles associated with the scenarios defined above. For the second scenario, we motivate in the associated section the subset of libraries under our focus and describe how the search strategy was specifically amended in that case.
Table 1. Search plan summary

Inclusion and Exclusion Criteria
The selected publications and tools must propose some performance improvement of the execution of Python programs that directly impacts Data Science or Machine Learning tasks. Some tools with a wider scope may be included, provided that they clearly make it possible to accelerate the execution of DS or ML tasks, or of relevant functions in tier DS or ML packages.
We restricted the resulting scope to the Python realm. We are aware that many high-performance libraries and research efforts exist independently of Python, but our choice is motivated by the relative monopoly of Python in the domains of DS and ML. Some approaches that are mainly developed in C/C++ but accessible through Python wrappers are naturally considered. While we focus on recent tools and contributions, we also report on seminal work if its lineage can be directly related to recent practice. Many of the works date from 2015 or later, owing to the recent acceleration of contributions in ML and DS. The diversity implied by our scenarios led us to consider all levels of granularity in terms of enhancing the performance of Python code: from very general code transformation approaches to numerical libraries widely used in the DS domain.
We focused on the default CPython interpreter, and thus intentionally did not consider alternative Python interpreters (e.g., Pyston [48]). We address this point in a specific paragraph in the discussion (Section 6.3). Besides pure code optimization, we also consider approaches exploiting multiple CPUs (and CPU cores), as well as GPUs, the latter being widely used in ML. Nevertheless, we did not dig into application-specific or dedicated processors like Tensor Processing Units (TPUs) and Field-Programmable Gate Arrays (FPGAs).

PURE PYTHON PERFORMANCE IMPROVEMENT
In this section, we focus on tools and approaches that support the acceleration of code in which the computationally intensive parts rely only on the default Python distribution (also called vanilla Python). In the context of our scenarios, this can be because the modelling of the problem at hand is not standard and thus not necessarily compatible with existing numerical computation libraries (e.g., a custom knowledge graph). It can also be because the practitioner is more comfortable with vanilla Python when working on an implementation that sticks to some algorithmic formalism from the literature.
Acceleration approaches in this section generally target Python at large, beyond traditional DS use cases. In this scenario, we thus assume that a working version of the code addressing the practitioner's problem has already been developed. It works on small samples of data, but now needs to be accelerated to scale up to larger amounts of data. DS and ML tasks often involve loops performing the same operation on many chunks of data (e.g., when independently processing images or database records). The seminal way to accelerate such code is to implement the Single Instruction, Multiple Threads (SIMT) principle using tools derived from Message Passing Interface (MPI) or Open Multi-Processing (OpenMP) libraries. Tools in Python related to this approach are presented in Section 3.1.

Distributed memory and shared Memory approaches
MPI works with a distributed memory model, potentially exploiting a distributed network of machines. MPI is a message passing standard that defines the syntax and semantics of library routines to develop parallel applications. With MPI, computers running a parallel program can exchange messages. MPI was originally designed to develop programs in C, C++, and Fortran. Nonetheless, some Python libraries offer equivalent bindings for MPI, e.g., MPI4Py [25] and PyPar [70].
Conversely, OpenMP works on a shared memory model for multi-core CPUs using program directives. The OpenMP standard [24] provides a set of code annotations and instructions for the compiler, together with a runtime library, that extend the Fortran and C/C++ languages to express shared memory parallelism. OpenMP is based on compiler directives and is thus less intrusive in the code than MPI, i.e., it does not require a strong refactoring of the existing code base. Based on those directives, the compiler can parallelize chunks of code whose instructions can be shared among the processors.
OpenMP is supported by the most common compilers such as Clang, LLVM, and GCC. It supports loop-level, nested, and task parallelism. Commonly, annotations or directives of the OpenMP API are used in loops. OpenMP has two main related implementations in Python. One of the most famous is Pymp [45], a library that proposes a special language construct to behave like OpenMP. It relies on the system fork mechanism instead of threads to perform parallel computation. It tries to reduce its footprint by referencing memory rather than copying everything into the forked process. The second one is PyOMP [46], which is based on Numba and offers a set of constructs similar to the OpenMP API. Nevertheless, the compilation pipeline for Python is a bit more complex: PyOMP uses Numba to generate LLVM code, and then machine code that can be run.
Manuscript submitted to ACM

Task-based approaches
Task-based frameworks are more recent tools to parallelize and distribute heavy computation. While they are generally powerful, their usage may imply heavy code refactoring. More complex frameworks that impose a specific development approach are detailed in Section 5.2. In this section, we cover libraries which rely on such frameworks, but aim at parallelizing an existing piece of code with minimal impact using decorators. A decorator is an instruction placed before the definition of a function. It indicates that a function (associated with the decorator) must transform a user function (the decorated function) and extend the behaviour of the latter without explicitly modifying it.
Decorators are used to express parallelism, by indicating that the decorated functions are to be treated as tasks. The decorated code is analyzed and converted (if applicable) into a version suitable for parallelization. Falling into this category we found PyCOMPSs [82], Pygion [76], and PyKokkos [3], which are wrappers for COMPSs [81], Legion [9], and Kokkos [84], respectively. PyCOMPSs and Pygion share some similarities. Both libraries build a task dependency graph and analyze it to define the order of task execution and the parallelism that can be achieved. Their decorators are also similar, as PyCOMPSs and Pygion both use @task. On the other hand, PyKokkos translates Python code into the Kokkos API written in C++ and has more decorators to implement its programming model. In PyKokkos, for example, functions can be decorated with @pk.workunit. These functions can run in parallel by passing them as an argument to the function parallel_for. PyKokkos also supports GPUs through CUDA.
In Jug [20], a task is defined as a Python function whose arguments take values or the outputs of other tasks. Using the @TaskGenerator decorator, Jug analyzes a task dependency graph to define the execution order and parallelization of the tasks. Parallelization is achieved by running more than one Jug process to distribute the tasks and by using synchronization to obtain a result. As it is developed in Python, libraries such as Numpy and Scikit-learn are compatible with Jug.
Pydron is a library to parallelize sequential Python code through decorators [52]. Pydron targets multi-core machines, clusters, or cloud platforms. First, it translates the decorated Python functions into an intermediate representation with a data-flow graph structure. The graph is analyzed by a scheduler, which defines the order in which tasks are going to run by putting them in a queue, some tasks being scheduled to run in parallel. When a task is finished, the scheduler must be informed, and it updates the execution graph based on the available information. The tasks are distributed to be executed on worker nodes. A distribution system is in charge of managing the hardware resources: commonly, one Python interpreter is launched per CPU core, each in charge of executing a given task.
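To make the decorator mechanism concrete, the following minimal sketch defines a hypothetical task decorator built only on the Python standard library. It is not the API of any library reviewed here: the names task and _pool are assumptions chosen for illustration. Each call to a decorated function is submitted to a thread pool and returns a future, while the body of the function stays untouched.

```python
import functools
from concurrent.futures import ThreadPoolExecutor

# Shared pool used by every decorated function (illustrative choice).
_pool = ThreadPoolExecutor(max_workers=4)


def task(func):
    """Hypothetical decorator: calls are submitted to the pool, not run inline."""
    @functools.wraps(func)
    def submit(*args, **kwargs):
        return _pool.submit(func, *args, **kwargs)  # returns a Future
    return submit


@task
def square(x):
    return x * x


# The call sites look sequential, but each call becomes an independent task.
futures = [square(i) for i in range(5)]
results = [f.result() for f in futures]
print(results)
```

Note that threads in CPython do not speed up CPU-bound pure Python code; the sketch only shows how a decorator can change the execution model of a function without editing it.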
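The scheduling principle described above (a dependency graph, a queue of ready tasks, and a notification when a task finishes) can be sketched with the standard library alone. This is an illustrative toy under our own assumptions, not Pydron's implementation: the function run_graph and the task names are hypothetical, and threads stand in for worker nodes.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+


def run_graph(tasks, deps, max_workers=4):
    """Run `tasks` (name -> callable receiving the results of its dependencies)
    in an order compatible with `deps` (name -> list of prerequisite names)."""
    sorter = TopologicalSorter({name: deps.get(name, []) for name in tasks})
    sorter.prepare()
    results, running = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every task whose inputs are available is queued at once.
            for name in sorter.get_ready():
                args = [results[d] for d in deps.get(name, [])]
                running[pool.submit(tasks[name], *args)] = name
            # Wait for at least one task, then inform the scheduler.
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)
    return results


tasks = {
    "load": lambda: list(range(10)),
    "evens": lambda data: [x for x in data if x % 2 == 0],
    "odds": lambda data: [x for x in data if x % 2 == 1],
    "merge": lambda a, b: sum(a) + sum(b),
}
deps = {"evens": ["load"], "odds": ["load"], "merge": ["evens", "odds"]}
results = run_graph(tasks, deps)
print(results["merge"])
```

Here "evens" and "odds" only depend on "load", so they become ready together and may run concurrently, while "merge" waits for both.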

Program transformation and compilation
Besides the annotations, directives, and decorators for parallelization mentioned in previous sections, program transformation and compilation are another straightforward way to obtain better execution performance for an existing codebase with minimal work overhead for the practitioner. These approaches rely on code analysis, which can be either static (based on source code) or dynamic (based on traced execution), before transforming the code into a target language that executes faster. As such, they can provide performance improvement in a general programming context.
The prominent approach we have found is to guide or give hints to the transformation tool, often a compiler, regarding specific sections of the code that should be optimized. These hints are expressed by the user by typing variables. Cython also provides parallelism mechanisms through the module cython.parallel, using OpenMP as back-end [24].
To use the parallel module, the Global Interpreter Lock (GIL) must be released. When the GIL is released, Python objects cannot be manipulated.
Therefore, a function that deals with Python objects cannot be directly invoked with parallel attributes: the data must be converted into Cython typed variables or memory views. Good candidates for a Cython implementation are general mathematical operations, array operations, and loops. Simply by using static typing and replacing Python math operations, obtaining a speed-up with Cython is highly probable, even if maximal gains require fairly good development skills. If Cython is not well exploited, the performance gain will only be marginal. Moreover, the code parts that could really benefit from Cython must be detected manually, which depends on the programmer's ability to use profilers to find the bottlenecks in the execution of a program.
JIT stands for Just-in-Time compilation, a technique employed by some programming languages and runtime environments to enhance code execution performance. Python is an interpreted language, which means that code is executed directly by the interpreter without prior compilation. While this approach offers flexibility and dynamic features, it can be slower than compiled languages. JIT compilation, however, allows the interpreter to dynamically compile specific code sections into machine code just before execution. This runtime process optimizes the code for improved performance. Some alternative interpreters to CPython use JIT compilation to optimize their performance.
As we left alternative interpreters out of the main scope of our paper, please refer to Section 6.3 for more information in this area.
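As an illustration of the profiling step mentioned above, the following sketch uses the standard cProfile and pstats modules to locate the most expensive function of a small program. The functions slow_part, fast_part, and pipeline are hypothetical placeholders for the practitioner's own code.

```python
import cProfile
import io
import pstats


def slow_part(n):
    # Deliberately naive double loop: the hotspot we expect to find.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total


def fast_part(n):
    return sum(range(n))


def pipeline():
    return slow_part(300) + fast_part(300)


profiler = cProfile.Profile()
profiler.runcall(pipeline)

# Sort by cumulative time and keep the few most expensive entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In such a report, a function like slow_part (typed loops over numbers) would be a natural first candidate for a Cython rewrite, whereas fast_part already delegates its work to a built-in.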
A highly popular JIT compiler for Python is Numba [44]. Numba compiles Python code for faster execution. The user must use decorators to indicate the code parts that should be improved by the compiler. A common function decorator in Numba is @jit, which has the following parameters: nopython, parallel, and fastmath. If nopython is set to true, the JIT compiler compiles the decorated function so that it runs without the involvement of the Python interpreter. The parallel flag enables a transformation pass that attempts to automatically parallelize and/or perform other optimizations on the function or some parts of it. The fastmath flag relaxes some numerical rigor to gain additional performance and enables possible fast-math optimizations. When the code is executed, the Numba JIT attempts to apply the improvements indicated by the decorators and their parameters.
The Numba compiler translates Python code into an intermediate representation, which is then translated to LLVM to finally emit machine code. The generated machine code is close in terms of performance to that of a traditional compiled language. Similarly to Cython, Numba only supports a subset of the Python language and some specific libraries like Numpy. Numba can convert sequential code to be executed in parallel on multiple cores and, in very limited cases, on a GPU. Numba can also be used as a bridge to develop programs in Python that run on the GPU. It offers support for CUDA (Nvidia hardware), ROCm (AMD), and HSA (AMD and ARM). A big difference in that case is that there is no automatic attempt to parallelize the code. Instead, the user must refactor the code into a style similar to C with CUDA. Numba can compile a restricted subset of Python code into CUDA kernels and device functions, HSA kernels, and ROCm device functions. In GPU programming, a kernel is a GPU function launched by the host (CPU and its memory) and executed in parallel on the device (GPU and its memory). A device function is a GPU function executed on the device which can only be called from the device.
Designed within the context of astrophysical applications, Hope specializes in numerical computations. Hope is a JIT compiler that uses the decorator @hope.jit with the function to be translated. The decorated functions are parsed into a Python Abstract Syntax Tree (AST). The Python AST is converted into a Hope AST. Several optimizations may be applied to the Hope AST, such as simplifying expressions, factorizing out subexpressions, and replacing the pow function for integer exponents. From the Hope AST, C++ code is generated and compiled into a shared library (.so file on Linux systems). The shared library is added to the cache, loaded, and executed. Hope checks the names of the functions and the types of the passed arguments and tries to match them to what it has in the cache; if no match is found, the whole compilation process starts over. The data types used in the functions are inferred by static analysis of the code, the AST, and runtime analysis. The simplification of expressions and common sub-expression elimination are performed with the SymPy library [79].
Autoparallel [67] is a compiler for Python code that transforms nested loops from sequential to parallel execution in a distributed computing infrastructure. It requires the user to add a decorator to the identified functions that contain nested loops. Autoparallel relies on PyCOMPSs [82] and PLUTO [12]; PyCOMPSs is a task-based programming model to develop applications with Python decorators (reviewed in Section 3.2), whereas PLUTO is a parallelization tool that automatically transforms affine loops using the polyhedral model [8]. Autoparallel analyses code decorated with @parallel and, for each affine nested loop that it finds, creates a Scop object. The Scop object is then parallelized by adding OpenMP-like decorators to the loops. Then, it converts the code into task format through PyCOMPSs by adding task configurations and data synchronizations. Finally, each nested loop is replaced by the generated code, to be executed by PyCOMPSs on a distributed computing platform.
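The three parameters discussed above can be combined as in the following minimal sketch. The function sum_of_squares is a hypothetical example, and the fallback branch is an assumption added so that the snippet still runs (without any speed-up) when Numba is not installed.

```python
try:
    from numba import jit, prange
except ImportError:
    # Fallback so the sketch still runs (slowly) without Numba installed.
    def jit(**options):
        def decorate(func):
            return func
        return decorate
    prange = range


@jit(nopython=True, parallel=True, fastmath=True)
def sum_of_squares(n):
    total = 0.0
    # prange marks the loop as safe to split across cores;
    # Numba treats `total +=` as a reduction over the iterations.
    for i in prange(n):
        total += i * i
    return total


# With Numba, the first call triggers compilation; later calls reuse the result.
print(sum_of_squares(1000))
```

Timing the second call rather than the first gives a fairer picture of the gain, since the first call includes the compilation time.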
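To give an idea of what an AST-level optimization looks like, the sketch below uses the standard ast module to replace a power with a small integer exponent by repeated multiplication. It only illustrates the kind of rewrite described above and is not Hope's code: the class name PowToMul and the limit on the exponent are assumptions.

```python
import ast


class PowToMul(ast.NodeTransformer):
    """Rewrite `name ** k` (small positive integer k) as repeated multiplication."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # transform nested expressions first
        is_small_int_pow = (
            isinstance(node.op, ast.Pow)
            and isinstance(node.left, ast.Name)
            and isinstance(node.right, ast.Constant)
            and type(node.right.value) is int
            and 2 <= node.right.value <= 4
        )
        if not is_small_int_pow:
            return node
        product = ast.Name(id=node.left.id, ctx=ast.Load())
        for _ in range(node.right.value - 1):
            product = ast.BinOp(
                left=product,
                op=ast.Mult(),
                right=ast.Name(id=node.left.id, ctx=ast.Load()),
            )
        return ast.copy_location(product, node)


source = "def f(x):\n    return x ** 3 + 2 * x ** 2\n"
tree = ast.fix_missing_locations(PowToMul().visit(ast.parse(source)))

# Compile and run the transformed tree to check that it still computes f.
namespace = {}
exec(compile(tree, "<optimized>", "exec"), namespace)
print(ast.unparse(tree))  # requires Python 3.9+
print(namespace["f"](3))
```

The rewrite is restricted to plain variable names so that an expression with side effects is never duplicated. For floating-point inputs, repeated multiplication may differ from the power operator in the last bits.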

Automatic approaches.
Transforming software with a view to maximizing performance is difficult to perform fully automatically. Code translation and transpilation focus on analyzing the structure of the code and apply transformation patterns as a means to circumvent the absence of supervision.
Due to the nature of Python as an interpreted language, an increase in performance can be obtained just by porting a Python program to a compiled language. Nonetheless, doing it manually is a cumbersome task. Therefore, some specialized libraries perform transpilation by translating Python code into a compiled language (most often C++).
Shed Skin uses static analysis by checking the implicit types of variables. Therefore, Shed Skin requires that all variables are implicitly typed. In other words, each variable must keep a single type: assigning values of different types to the same variable is not supported. To use Shed Skin, a command must be run in a terminal, with the file containing the Python code passed as an argument. The Shed Skin compiler generates the translated C++ code, a header file, and a make script to compile it. Moreover, a module can be compiled and invoked from another Python script.
Nuitka translates CPython instructions into a C++ program. The compiled code generated by Nuitka is executed along with the Python interpreter, which handles the parts that cannot be compiled. This means that compatibility with other libraries is preserved while using Nuitka. No code modification is required. To use Nuitka, the code must be compiled from the console through Nuitka commands along with the Python code filename. The code and executable files are generated and can be invoked directly or as stand-alone libraries.
Pythran converts Python code into C++ code. However, it goes beyond pure translation and performs code analysis and optimizations. Pythran receives as input a Python module meant to be converted into a shared library.
On the front-end of Pythran, the Python module is converted into a Python AST. Then, the Python AST is converted into a Pythran internal representation (IR), which is a subset of the Python AST. During this conversion, code analysis steps and different transformations and optimizations are performed, aimed at generating a faster version of the code. Additionally, variable types may be inferred by static analysis. The back-end of Pythran turns the Pythran IR into parametrized C++ code. Then, Pythran instantiates and compiles the generated code to build a native module. Pythran is compatible with Numpy expressions and applies optimizations such as expression templates, loop vectorization, and loop parallelization through OpenMP.
Transpyle [16] relies on transpilation to accelerate Python performance. The originality of this approach is that it also supports multiple languages as input, e.g., reusing a legacy optimized loop written in Fortran and integrating it in the transpiled Python code. Moreover, by using Python as the intermediate representation for compiling code from and into target languages (e.g., Fortran), it helps the Python developer understand the complete process. It also works in a semi-automated mode with Python annotations, possibly guiding the compiler towards better improvements (e.g., loop unrolling and vectorization).
ALPyNA [40] is a program transformation tool for Python which uses static and dynamic analysis of nested loops and generates CUDA kernels for GPU execution. The input code must contain vanilla Python code and, optionally, Numpy instructions. Currently, only basic subscripting of single- or multi-dimensional arrays is supported, i.e., no slicing or sequence indexing. ALPyNA performs its analysis mostly on nested loops, where a performance bottleneck is more likely to occur. Other Python instructions are ignored and are executed by the Python interpreter. After static analysis, if loop bounds and data dependencies can be determined, ALPyNA generates untyped GPU kernels. Otherwise, loops are marked for analysis at runtime. For runtime analysis (and execution), the ALPyNA execution object (obtained from the function that performs static analysis) must be used to invoke the original functions. If possible, loop bounds and data dependencies are determined at runtime and GPU kernels are generated on the fly. ALPyNA relies on Numba to finalize and compile the GPU kernels.
Pyjion is a JIT compiler designed to improve the performance of Python by converting CPython bytecode into machine code [63]. In the absence of Pyjion, CPython relies on a master evaluation loop, known as the frame evaluation loop, to sequentially execute opcodes (individual instructions within the bytecode). Pyjion's compiler, in contrast, consists of three key stages. First, it constructs a stack table that maps abstract types to each opcode position. Then, it compiles CPython opcodes into Common Intermediate Language (CIL) opcodes. Finally, it emits these CIL opcodes to the .NET Execution Engine (EE) compiler, which converts them into native machine code or assembly. Overall, Pyjion enhances the execution speed of CPython by leveraging various optimizations in a JIT compilation context.
Pyston [48] is an alternative Python interpreter that includes a JIT step as well as many performance optimizations for the execution of Python programs (discussed specifically in Section 6.3). Pyston-lite is a derivative of Pyston that retains only the JIT step, with a focus on easier installation and setup. While Pyston aims to achieve the highest performance possible, Pyston-lite may not match the level of performance of the full implementation of Pyston. However, it still offers improved speed compared to using CPython as the interpreter: the authors claim a 10% acceleration on macrobenchmarks [48]. It is important to note that Pyston is available for Python version 3.8.12, while Pyston-lite supports a wider range of Python versions: 3.7 to 3.10.
Let us note that the performance of most tools in this section has generally been estimated in contrast to vanilla Python. While this suits the use case scenario motivating the section, performance does not, in general, increase monotonically when combining multiple tools. For example, while Pyston-lite significantly improves the performance of vanilla Python, we observed that using it in combination with Numba parallelization tended to degrade the performance obtained by Numba with CPython (e.g., the execution time of matrix multiplication code more than doubled in some of our experiments).
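As a small illustration of the kind of input module Pythran expects, the sketch below follows the export-comment convention from Pythran's documentation, in which a comment declares the function to expose and its argument types. The function dprod is a toy example; since the annotation is an ordinary comment, the module also runs unchanged under CPython.

```python
# Export annotation read by Pythran; for CPython it is an ordinary comment.
#pythran export dprod(int list, int list)
def dprod(l0, l1):
    """Dot product of two lists of integers."""
    return sum(x * y for x, y in zip(l0, l1))


print(dprod([1, 2, 3], [4, 5, 6]))
```

Assuming this code is saved in a file of its own, running the pythran command on that file should produce a native module that can then be imported in place of the original one.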
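The opcodes that the CPython evaluation loop executes, and that a bytecode-level JIT takes as input, can be inspected with the standard dis module. The function add below is a hypothetical example, and the exact opcode names depend on the CPython version.

```python
import dis


def add(a, b):
    return a + b


# Each instruction is one opcode that the evaluation loop dispatches in turn.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```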

ACCELERATING NUMERICAL LIBRARIES USAGE
In this scenario, an algorithm has already been implemented by the data scientist but, in contrast with the previous section, it does not rely only on vanilla Python and also uses Python numerical libraries. Indeed, the problem has been recognized as depending mostly on standard data structures such as float matrices, and the implementation thus aims at benefiting from the associated out-of-the-box primitives (e.g., matrix decomposition algorithms). In this section, we focus on means to provide faster execution of such numerical libraries or APIs.
For the sake of clarity and legibility, we focus on the three main libraries used in DS to facilitate and accelerate the development of single-threaded numerical computation code: Numpy [34], Pandas [56], and Scikit-learn [60]. These libraries are always among the top results when looking for cornerstone libraries to carry out the statistical and numerical computations needed for the practice of DS and ML. They were ranked 20th, 27th, and 103rd, respectively, in terms of monthly downloads on PyPI (Python Package Index) in March 2023. It is worth mentioning that other libraries are widely used in DS and ML. However, they are tied to secondary tasks such as preprocessing (e.g., NLTK, ranked 334th) or visualization and plotting (e.g., matplotlib, ranked 114th). As this survey focuses on accelerating DS code, we do not directly cover these libraries in this section. Statsmodels is also a frequently mentioned library, but we did not retain it, as it is a bit further down in the PyPI download ranking (327th).
Besides approaches covered in other sections (e.g., compilation, transformation), in the context of these libraries we mainly found solutions implementing an API with the same signature (same inputs and same outputs) as the original, but offering better performance. We refer to these as drop-in libraries. These drop-ins can be executed using multiple CPUs, GPUs, and/or a more efficient implementation. They may occasionally require minor modifications such as data copies and changes to function parameters. Ideally, they bear minimal cost to the practitioner in terms of development overhead. In this section, we review the three identified libraries and their performance-enhanced counterparts.

Numpy
It and MPI for setting up a cluster.Closely related is DistNumPy [43] which implements parallel Numpy operations by also using MPI underneath.DistNumPy was deprecated and moved to Bohrium which is in active development.
Bohrium [42] is a runtime that maps Numpy array operations (universal functions, also known as ufuncs) onto different hardware platforms such as multi-core CPUs, GPUs, and clusters.To use Bohrium the user must either replace the Numpy library import with the bohrium library or launch a script with the command python -m bohrium myscript.py.Bohrium uses different techniques to speed up computations.For example, Bohrium supports lazy evaluation, this means that Numpy operations are regrouped for evaluation until a non-Numpy operation is found.Bohrium fully supports Numpy views, therefore no data copies are done when slicing arrays.When certain conditions are met, array operations are fused into a single kernel that is compiled and executed.Data copies between main memory and GPU memory are done only when the data is accessed through Python or a Python C extension.
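The combination of lazy evaluation and operation fusion can be illustrated with a toy example in pure Python. It is a conceptual sketch only, unrelated to Bohrium's actual implementation: the class LazyArray and its methods are hypothetical names. Operations are recorded when they are written, and executed together in a single pass when the result is requested.

```python
class LazyArray:
    """Toy array that records element-wise operations instead of running them."""

    def __init__(self, data, ops=()):
        self._data = data  # original values, shared and never copied
        self._ops = ops    # pending element-wise functions, in order

    def _defer(self, fn):
        return LazyArray(self._data, self._ops + (fn,))

    def __add__(self, scalar):
        return self._defer(lambda v: v + scalar)

    def __mul__(self, scalar):
        return self._defer(lambda v: v * scalar)

    def evaluate(self):
        # Fusion: all pending operations run in one pass over the data,
        # without materializing intermediate arrays.
        out = []
        for v in self._data:
            for fn in self._ops:
                v = fn(v)
            out.append(v)
        return out


a = LazyArray([1.0, 2.0, 3.0])
b = (a * 2.0 + 1.0) * 3.0  # nothing is computed yet
print(b.evaluate())
```

An eager implementation would allocate one temporary array per operation; deferring the work lets a runtime decide where and how to execute the whole chain.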
Bohrium is built with components that communicate by exchanging a vector bytecode (an intermediate representation corresponding to the Numpy array operations). The instruction (original code) is passed to a Bridge component, which generates the vector bytecode. This bytecode is passed to a Vector Engine Manager component, which manages data location, ownership of arrays, and the distribution of jobs between vector engines. The Vector Engine component is an architecture-specific implementation that executes the bytecode. Non-Numpy or unsupported operations fall back to the regular CPython interpreter. While setting up Bohrium is fairly straightforward, we found experimentally that its value as a drop-in for Numpy remains limited: for example, the Singular Value Decomposition, commonly used in the practice of DS, causes a crash instead of gracefully falling back to Numpy †. Also, in every case we tested, using the GPU as a backend led to strong performance degradation.
D2O [78] is a middleware between Numpy arrays and a distribution logic. In that sense it is not a drop-in library, but an interface that provides parallel execution of Numpy array operations through the use of a distributed data object format.
The user can pass a Numpy array as an argument to create a distributed data object, along with options regarding the distribution strategy. The distributed data object supports many Numpy instructions such as arithmetic operations, indexing, and slicing. D2O relies on MPI4Py to distribute the work (see Section 3). Therefore, to exploit parallelism with D2O, the user must create an MPI job. The number of nodes can be specified in the command that runs the Python program. For lower-level instructions, the MPI library is accessible for code refactoring.

GPU acceleration.
Many Numpy functions boil down to the same operation applied to large vectors or matrices, and can thus exploit GPU acceleration. CuPy [54] was designed to cover the API of Numpy as widely and transparently as possible. CuPy uses the Nvidia CUDA framework and other CUDA libraries for optimization, such as cuBLAS, cuDNN, and cuSPARSE. Given the differences in memory management between main memory and GPU memory, to harness the library at its best the user must manually indicate data copies, so that data is available in GPU memory when CuPy functions are called. However, the process remains straightforward compared to CUDA programming.
For example, using CuPy in the context of a semi-supervised learning task made it possible to divide the baseline computation time by 6 with reasonable implementation effort [17]. For cases where the available functions are not enough, CuPy supports creating user-defined CUDA kernels for two types of operations. The first is element-wise kernels, where the same operation is applied to all the data. The second is reduction kernels, which fold all elements using a binary operator.

In the line of Numpy drop-in libraries for GPUs, there are also PyPacho [5] and DelayRepay [51]. PyPacho is based on PyCUDA and PyOpenCL. Although it is a promising tool, it is not as mature as CuPy and offers less compatibility. DelayRepay, on the other hand, is a drop-in library that applies code optimization to accelerate execution. DelayRepay delays the execution of Numpy operations in order to analyze them and try to fuse them before execution. When a Numpy operation is found, it checks whether its output is the input of another Numpy operation. If so, the operations are fused and the AST is modified. Numpy operations are fused until a non-Numpy operation is found. At that point, the fused AST node is compiled into a GPU kernel and executed on the GPU. This is a main difference compared to CuPy, which executes each operation individually.

† Tested with the latest version on PyPI on 03/2023
Although not a drop-in library for Numpy, PyViennaCL [73] provides a set of equivalent operations to be executed on multi-core CPUs and GPUs. PyViennaCL is a wrapper for ViennaCL, a linear algebra and numerical computation library written in C++ that targets heterogeneous devices. To use PyViennaCL, the user must import the library and use the constructs it provides. Similarly to DelayRepay, PyViennaCL uses delayed execution: arithmetic operations are represented by a binary tree and are computed only when the result of the computation is needed.
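The delayed-execution idea shared by the tools above can be illustrated with a minimal sketch in plain Python. The class and function names (Lazy, Leaf, BinOp, evaluate) are hypothetical and do not correspond to the API of Bohrium, DelayRepay, or PyViennaCL; the tree is evaluated by the interpreter here, whereas the real tools compile the fused expression into a kernel.

```python
import operator


class Lazy:
    """Base class for nodes of a lazily evaluated expression tree (hypothetical)."""

    def __add__(self, other):
        return BinOp(operator.add, self, other)

    def __mul__(self, other):
        return BinOp(operator.mul, self, other)


class Leaf(Lazy):
    """Leaf node wrapping concrete data (a list of numbers)."""

    def __init__(self, values):
        self.values = list(values)

    def at(self, i):
        return self.values[i]

    def __len__(self):
        return len(self.values)


class BinOp(Lazy):
    """Inner node: records an element-wise operation without computing it."""

    def __init__(self, op, left, right):
        if len(left) != len(right):
            raise ValueError("operands must have the same length")
        self.op, self.left, self.right = op, left, right

    def at(self, i):
        return self.op(self.left.at(i), self.right.at(i))

    def __len__(self):
        return len(self.left)


def evaluate(node):
    """Walk the whole tree once per element: no intermediate list is built."""
    return [node.at(i) for i in range(len(node))]


a = Leaf([1, 2, 3])
b = Leaf([10, 20, 30])
expr = (a + b) * a        # only builds the binary tree, nothing is computed
result = evaluate(expr)   # the computation happens here, in a single pass
```

Because the addition and the multiplication are evaluated together for each element, the temporary array that an eager a + b would allocate never exists, which is the main benefit of operation fusion.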

Compilation-based.
The JAX [13] library provides composable transformations of Python programs based on Numpy. All JAX operations are implemented using the Accelerated Linear Algebra compiler (XLA) [74]. JAX provides a set of functions equivalent to those of Numpy. Therefore, it can be used as a drop-in library for Numpy. Besides performance improvement based on vectorization and parallelization, JAX provides JIT compilation for GPU or TPU execution using the jit function. Another functionality is the evaluation of numerical expressions and the generation of derivatives (e.g., automatic differentiation by passing functions to the function grad), as commonly used by gradient methods for training neural networks. Another important functionality in JAX is vmap, a mapping function that vectorizes operations. The jit function can be applied to grad and vmap to obtain better performance.

An option specialized in speeding up numerical expressions written in Numpy is NumExpr [23]. This library is compatible with a subset of Numpy operations. To use it, expressions are passed as a string to the library function evaluate. The expression is compiled into an object that contains the representation of the expression and the types of the arrays. To validate the expression, it is first compiled with the Python compile function; the expression is then evaluated and the parse tree is built. The parse tree is compiled into bytecode, which is run by a virtual machine that uses vector registers, each with the same fixed size. Arrays are handled as chunks, which are distributed among the CPUs to parallelize Numpy operations. This approach makes better use of cache memory and can reduce memory accesses, especially with large arrays.
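The chunking strategy described for NumExpr can be sketched with the standard library only. This is not NumExpr code: the chunk size, the hard-coded expression 2*a + 3*b, and the helper names are assumptions made for illustration. In pure Python the threads do not speed up the arithmetic because of the GIL; the sketch only shows how the work is split and reassembled.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # illustrative value; a real implementation picks a cache-friendly size


def chunk_bounds(n, size):
    """Yield (start, stop) index pairs that cover range(n)."""
    for start in range(0, n, size):
        yield start, min(start + size, n)


def eval_chunk(a, b, start, stop):
    """Evaluate 2*a + 3*b on one chunk in a single pass, without temporaries."""
    return [2 * x + 3 * y for x, y in zip(a[start:stop], b[start:stop])]


def chunked_evaluate(a, b, workers=2):
    """Distribute the chunks among worker threads and concatenate the results."""
    bounds = list(chunk_bounds(len(a), CHUNK_SIZE))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda se: eval_chunk(a, b, *se), bounds)
        # Executor.map preserves input order, so chunks are joined in order.
        return [value for part in parts for value in part]


a = list(range(10))
b = list(range(10, 20))
out = chunked_evaluate(a, b)
```

Each worker touches only a small slice of both inputs at a time, which is what allows the real library to keep its operands in cache.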
In this inventory, we may also mention tools surveyed in Section 3.3, such as AlPyNa, Pythran, and Numba.
These tools have general applicability for Python performance improvement, but also provide performance improvements specific to Numpy.

Pandas
Pandas is a highly popular Python library for data analysis and manipulation. Its data frame format is widely used in DS, as it notably supports heterogeneous data, time series, and query-based manipulation, to name a few features. A data frame is a two-dimensional data structure with labelled axes: rows and columns. It is the primary data structure used in data analysis tools. Nonetheless, Pandas operations usually use only one core at a time when doing computations. Thus, multi-core CPU and GPU oriented drop-in libraries have emerged to accelerate Pandas-like operations.
Vaex is a library that contains a set of packages meant to optimize memory usage when managing large datasets [14].
Vaex-core is a drop-in library for Pandas-like operations on data frames. Similarly to Bohrium [42] with Numpy, most operations in Vaex are lazily evaluated: they are computed only when needed. This reduces the amount of memory required compared to other similar libraries. Vaex also works with small chunks of data in RAM, therefore, it can

In local mode, the number of partitions is by default equal to the number of available CPU cores. The Modin dataframe subsystem passes the data to the execution layer, where different execution engines can be used, such as Dask [71] or Ray [49] (see Section 5.2 for an introduction of the latter), which are in charge of the actual execution of computations on partitioned data in a task-based approach. The installation with the Ray backend is straightforward, yielding a 40-fold acceleration on a column concatenation benchmark, but mixed results otherwise. Also, moderate code adaptation will generally be needed, as Modin return formats often differ slightly from their Pandas counterparts.
cuDF [68] is a Pandas drop-in library that runs on the GPU. It is used for manipulating data on the GPU in data science pipelines. cuDF is a building block of RAPIDS, a platform to execute ML and DS tasks on GPUs (see Section 4.3).
Dataframes can be created, read from files, or converted from Pandas dataframes and CuPy arrays.

Some tools, though not drop-in libraries for Pandas as such, bear such a high similarity with Pandas that they can be used for the same purpose after minor refactoring of the code. Following this approach, we found Datatable [27] and Polars [62]. Datatable is implemented in C++ and uses multithreading to speed up certain operations. Polars lazily evaluates queries to generate a query plan, and optimizes it so that it runs faster and uses less memory, possibly exploiting parallelism. Both libraries can also easily convert to and from the Numpy and Pandas formats.

Scikit-learn
Scipy [85] reuses the array format defined by Numpy, but aims at a more comprehensive coverage of general-purpose mathematical and statistical concepts, such as linear algebra, statistical tests, and signal and image processing. Scikit-learn builds upon Numpy and Scipy by implementing many models from the ML literature, such as regression, classification, and clustering models. Most models implement fit and predict functions, providing a unified API for the library.

The estimator API provides a set of ML models with a syntax similar to that of Scikit-learn. An estimator is an abstraction of an ML model and typically implements the two methods characteristic of Scikit-learn: fit and predict. To summarize, data is loaded into the Dataset format. An instance of an estimator object (representing the ML model) is created, and the fit function is invoked with its parameters. The estimator object is then used to retrieve information about the trained model and to generate predictions. The Dislib authors report that they outperform MLlib [47] on a k-means clustering task as the sample size and the number of computation nodes grow very large.
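The fit and predict convention described above can be illustrated with a toy estimator written in plain Python. The class CentroidClassifier is hypothetical and is not part of Scikit-learn or of any library cited here; it only mimics the shape of the API (fit returns the estimator, learned attributes carry a trailing underscore, predict maps samples to labels).

```python
class CentroidClassifier:
    """Toy estimator following the fit/predict convention (hypothetical class)."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for j, v in enumerate(row):
                acc[j] += v
            counts[label] = counts.get(label, 0) + 1
        # Learned state is stored on the estimator, one centroid per class.
        self.centroids_ = {
            label: [s / counts[label] for s in acc]
            for label, acc in sums.items()
        }
        return self  # returning self allows Model().fit(X, y) chaining

    def predict(self, X):
        def sq_dist(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q))

        # Each sample receives the label of its nearest class centroid.
        return [
            min(self.centroids_, key=lambda lab: sq_dist(row, self.centroids_[lab]))
            for row in X
        ]


X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y_train = ["a", "a", "b", "b"]
model = CentroidClassifier().fit(X_train, y_train)
pred = model.predict([[0.5, 0.5], [5.5, 4.0]])
```

Because every estimator exposes the same two methods, a library that reimplements the models on other hardware can keep the calling code unchanged, which is what makes the drop-in approach viable for ML models.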

cuML.
RAPIDS [55] is a set of libraries for data manipulation and machine learning developed on top of the CUDA language, and thus aimed at the execution of DS pipelines on GPUs. In this set of libraries, cuML is strongly related to Scikit-learn. Just as CuPy aims at covering most of the Numpy API, cuML was created with the goal of covering as much of the Scikit-learn API as possible, as transparently as possible. Similarly, just as Scikit-learn is built on top of the Numpy and Pandas formats, cuML exploits the CuPy array and cuDF dataframe formats, respectively. Most of its API can also be executed in a distributed environment using Dask, a task-based framework relevant to the scenario presented in Section 5. The cuML developers report that GPU implementations can run up to 50 times faster than their Scikit-learn CPU-based counterparts ‡.

MLlib.
MLlib [47] is an ML library that is part of the Spark system. It is similar to Scikit-learn, with a set of ML models and data processing instructions. Being built on top of Spark, it comes with the Spark installation and offers a Python API. The implementation of its algorithms is parallelized so that large data processing jobs can exploit data distributed on Hadoop clusters.

STRUCTURING FRAMEWORKS
In this section, we consider high-performance libraries and frameworks which impose a specific way of thinking and programming to the practitioner and are thus preferably used right when implementation starts.

Deep Learning frameworks
Many models used in DS can be formalized as Directed Acyclic Graphs (DAG), e.g., Bayesian networks, probabilistic mixture models, and most notably, neural networks. A range of Python libraries, commonly referred to as deep learning frameworks, come with specialized support and useful abstractions for practitioners needing to put this kind of model into action. Computations underlying DAGs are typically embarrassingly parallel: benefiting from high-performance computation devices such as multi-core CPUs or GPUs is therefore an implicit requirement of these libraries. Technically, they are symbolic mathematical libraries which allow the definition of arbitrary computational DAGs along which data is transformed. However, their deep learning label is often well deserved, as they provide many facilities specifically oriented towards neural networks, such as automatic gradients and back-propagation at DAG nodes, which enable fitting model parameters to input data. At runtime, the computational graphs and all functions which operate on them (e.g., custom loss functions and gradient optimizers) are compiled and loaded onto the GPU. The training procedure then triggers kernel execution on the GPU. While these frameworks can also be run on multiple cores of a CPU when no compatible GPU is available, taking effective advantage of the GPU generally yields a very significant speedup. For example, the computation time of a piece of Tensorflow code heavily relying on the CPU can be divided by 10 if the GPU API is used correctly [17].
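To make the notion of automatic gradients and back-propagation at DAG nodes more concrete, the following minimal sketch implements reverse-mode differentiation on scalar nodes in plain Python. The Node class and the backward function are hypothetical and far simpler than what Tensorflow or PyTorch do (no tensors, no GPU, only addition and multiplication); they are only meant to show how gradients flow backwards along the graph.

```python
class Node:
    """Scalar node of a computational DAG (hypothetical minimal design)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # sequence of (parent_node, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) in node.grad for every node of the DAG."""
    order, seen = [], set()

    def visit(node):
        # Depth-first traversal producing a topological order of the DAG.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Reverse topological order: a node's gradient is complete before
    # it is propagated to its parents (chain rule).
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


w, x, b = Node(3.0), Node(2.0), Node(1.0)
y = w * x + b      # forward pass builds the DAG, y.value is 7.0
loss = y * y       # loss.value is 49.0
backward(loss)     # fills w.grad, x.grad, and b.grad
```

A gradient-based optimizer would then update w and b from their grad attributes; deep learning frameworks automate exactly this loop and run it on the GPU.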
Tensorflow [1] is the most prominent in this range of tools. Besides offering a wide range of ready-to-use model architectures (sometimes even along with pre-trained model weights), Tensorflow defines a comprehensive API to program custom components that are then compiled and loaded on the GPU, such as model structures, loss functions, or optimizers.
As this code is meant to be loaded on the GPU, it cannot be mixed with regular Python instructions although it uses the Python syntax, which causes additional implementation effort. In Tensorflow, the computational DAG is defined statically, so that its compilation and execution yield maximum performance at runtime. The explicit definition of the computational graph and its asynchronous execution on the GPU yield constructs which tend to diverge from Python standards. Mastering Tensorflow therefore takes some time and practice.
Torch is another deep learning framework, developed by Meta with the similar aim of supporting neural network model training. However, it is based on the Lua language, which limits its popularity. PyTorch [58] is the port of Torch to Python, motivated by the will to keep its API and basic principles. PyTorch came to the market after Tensorflow, but has gained momentum and is catching up in terms of popularity (8M monthly downloads vs 15M for Tensorflow according to PyPI statistics §). Good documentation facilitates its adoption by newcomers, and it offers many ready-to-use model architectures and pre-trained parameters. PyTorch has built-in high-level APIs, whereas Tensorflow delegates them to Keras. In comparison, pure Tensorflow requires the development of significant non-standard boilerplate code.
In Tensorflow, the computational graph is defined and compiled statically, and placeholder data is replaced at runtime. PyTorch offers more control at runtime, e.g., allowing execution nodes to be modified in ways forbidden by Tensorflow, which facilitates the implementation of sophisticated training loops. Language constructs are closer to Python standards, with object-oriented constructs meant to be familiar to experienced programmers. Overall, its APIs are less rigid, but this comes at the cost of more code to write, and generally slightly longer execution time for equivalent tasks.
This distinction between static and dynamic computational graphs has other consequences, first in the way Tensorflow and PyTorch handle variable-sized input data. Due to the static computation graph approach, doing so is difficult with Tensorflow. The third-party Tensorflow Fold library offered limited support, but it is no longer maintained. In contrast, this capability is built into PyTorch.
Debugging PyTorch is also straightforward, while it is more difficult with Tensorflow due to the static graph definition. In the latter case, it requires mastering a specific debugging tool, tfdbg. To compensate, Tensorflow comes with Tensorboard, which packages visualization and monitoring tools. In PyTorch, obtaining equivalent features requires building custom graphs using, e.g., matplotlib or an interactive plotting library such as Dash. More facilities exist in Tensorflow for distributed training, as well as for deployment to production servers and for embedding in resource-limited devices such as mobile phones and the Raspberry Pi using Tensorflow Lite. Finally, Tensorflow supports several languages beyond Python (including C++ and Java), while PyTorch focuses on Python.
Theano [4]

MXNet [18] claims high flexibility and scalability, and is notably supported and used internally by Amazon. Like Tensorflow, MXNet supports several languages beyond Python (C++, R, Scala, Matlab), whereas PyTorch focuses on Python. It offers a flexible front-end, with an imperative API meant to be familiar to newcomers, and a symbolic API aimed at maximizing performance. However, it lacks high-level IO primitives compared to PyTorch and Keras, which is detrimental to quick adoption.

Distributed computation frameworks
An approach used by multiple Python libraries for intensive computation is task-based parallelization, especially when large sets of data are involved. The task-based approach refers to a strategy where the work is divided into multiple tasks; these tasks are handled by a task manager, which assigns them to the threads that execute them. The execution of a program is a sequence of tasks, and in some cases independent tasks can be executed in parallel. Usually, the task-based approach is implemented with a queue of tasks, a thread pool where threads wait for a task assignment, and some message protocol (e.g., MPI) to communicate data and instructions between tasks and the task manager. Though of general applicability, most libraries in this section impose in-depth modifications to an existing codebase and require heavy software setup. This contrasts with the task-based approaches reported in Section 3.2, whose complexity is hidden behind decorators. The libraries in this section are thus a more suitable choice if algorithm implementation has not started yet.
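The queue, thread pool, and task manager described above can be sketched within a single process using the built-in concurrent.futures package. This is only an illustration of the pattern: the task function is an arbitrary example, no message protocol such as MPI is involved, and CPU-bound Python threads remain limited by the GIL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def task(n):
    """An independent unit of work: the sum of squares of integers below n."""
    return n, sum(i * i for i in range(n))


inputs = [10, 100, 1000]
results = {}

# The executor plays the role of the task manager: it owns the task queue
# and the pool of worker threads waiting for an assignment.
with ThreadPoolExecutor(max_workers=3) as manager:
    futures = [manager.submit(task, n) for n in inputs]
    for future in as_completed(futures):   # results arrive in completion order
        n, value = future.result()
        results[n] = value
```

The distributed frameworks reviewed in this section follow the same submit-then-collect logic, with workers spread over several processes or nodes instead of threads.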
Directly relating to the deep learning frameworks presented in the previous section, Horovod [75] aims at facilitating the usage of distributed resources (i.e., multiple computation nodes, potentially each holding multiple GPUs) by these frameworks. Indeed, deep learning frameworks are sometimes packaged with modules dedicated to distributed training but, in the case of Tensorflow for example, they are rigid and difficult to set up. Horovod addresses this problem, while supporting multiple frameworks (including TensorFlow, Keras, PyTorch, and MXNet). Behind the scenes, Horovod relies on a message passing layer, which can be OpenMPI for example (presented in Section 3). The default is to use Gloo [37], a communication library developed by Meta. The authors claim up to 90% scaling efficiency, depending on the neural architecture at hand ∥.

Some task-based parallel Python libraries we found are wrappers of an already existing library in a different language. This is the case of torcpy [32] and Charm4py [30]. Both libraries are wrappers of their C/C++ counterparts: TORC [33] for torcpy and Charm++ [41] for Charm4py. In both libraries, parallelism is expressed using the library instructions, and an API lets the programmer orchestrate asynchronous tasks and distributed objects. In torcpy, tasks are executed by launching multiple MPI processes, each using one or multiple worker threads. It builds upon MPI4Py (see Section 3.1), and implements an API meant to upscale the logic underlying the multiprocessing and concurrent.futures packages in order to benefit from high-performance clusters. Multiprocessing and concurrent.futures are Python built-in packages aiming at overcoming limitations imposed by the GIL, by implementing a parallel map function and by allowing the asynchronous execution of functions in multiple parallel Python processes, respectively. Torcpy claims 90% computation efficiency on a Monte Carlo molecular simulation task with 1,024 compute nodes. While torcpy is based on Python dictionaries for its data management, in Charm4py multiple distributed objects are executed and coordinated in a unit called a processing element. Distributed Python objects are implemented and allow remote method invocation using message passing. To bypass the GIL, which lets only one thread run at a time, Charm4py launches the Python executable on multiple nodes, or even multiple times on the same node, thanks to this distributed object mechanism. Unfortunately, it no longer seems to run in recent Python environments **.

Released 7/10/2019
∥ https://horovod.readthedocs.io/en/stable/summary_include.html
There are also task-based parallel libraries written mostly or entirely in Python, such as Scalable Concurrent Operations in Python (SCOOP) [36], Parallel Python [57], Celery [21], and Playdoh [72]. SCOOP is similar in spirit to MapReduce frameworks [28], as its API revolves around map and reduce functions. Asynchronicity and parallelization are enabled by a custom implementation of the built-in Python futures class. Specifically, its workers act as independent elements that interact with a broker, which mediates their communications. The documentation gives extensive instructions to facilitate deployment on high-performance clusters. Parallel Python also relies on its own library constructs to express parallelism: jobs are submitted by passing functions and general execution information as parameters. The library is meant to overcome limitations imposed by the GIL when using Python's built-in threading library, and to exploit computation clusters. The authors claim automatic discovery of computational resources and their dynamic allocation. However, the library is available neither on PyPI nor on GitHub, only on a website [57]. Celery is a distributed task queue system, which can be used to complete heavy DS and ML computations, but is also meant to have wider applicability, with a view to supporting large business applications. Its main components are its broker and its backend. The broker is responsible for managing communication between computation threads, and the backend provides the memory storage for queue management. The cost of this flexibility is that deploying Celery is much more complex than the alternative approaches mentioned in this section; it is thus preferred for complex business logic, but probably not for prototypical DS and ML projects. It is actively maintained, and backed by a large community. The main feature of Playdoh [72] is a parallel and distributed map function, as in most libraries in this section. It is mainly oriented towards numerical optimization and Monte Carlo simulation, with some specialized functions in this area. Also, if the tasks are made of PyCUDA or CUDA code, Playdoh can distribute the work to several GPUs in parallel. It is worth noting that Parallel Python and Playdoh are not actively maintained, and not supported by Python 3+ interpreters.

Formerly known as IPython.parallel, Ipyparallel [80] is a Python library for the development of task-based parallel applications. This package leverages IPython engines running in parallel to execute tasks. It has four main components: engine, hub, schedulers, and client. The engine is a subclass of the IPython kernel for Jupyter, and is responsible for running user tasks as commanded by a scheduler. The hub manages the cluster by keeping track of schedulers and clients. With this architecture, Ipyparallel makes it possible to abstract potentially heterogeneous distributed computation facilities, accessed by multiple clients working collaboratively. However, its powerful abstractions require significant boilerplate code, preventing straightforward adaptation of existing DS and ML projects.
** Tested on 30/03/2023 with Python 3.9

Asynchronous function execution is central in Parsl [6]. Its task-based distributed programming model is based on Parsl apps, which may be decorated Python functions (@python_app) or calls to shell commands (@bash_app). As with torcpy [32] and SCOOP [36], task distribution is enabled by a custom futures implementation that manages asynchronous function execution in a distributed context. Its specificity is to facilitate function chaining, which enables the parallelization of complex jobs. An Executor is deployed on each host of a distributed architecture, each managing several local workers. Available resources and task distribution are abstracted by a Data Flow Kernel on the client side. A set of launchers accommodates various high-performance cluster types. Parsl is backed by an active community.
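The combination of decorated functions, futures, and function chaining can be sketched with the standard library. The decorator app_sketch below is hypothetical and is not Parsl's @python_app: it runs functions in a local thread pool and resolves futures by blocking a worker, whereas a real framework tracks dependencies and dispatches tasks to remote executors.

```python
import functools
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def app_sketch(func):
    """Hypothetical decorator: calling the function returns a Future at once."""

    @functools.wraps(func)
    def wrapper(*args):
        def run():
            # Resolve Future arguments before the call: this is what lets
            # the output of one app be passed directly to another app.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)

        return _pool.submit(run)

    return wrapper


@app_sketch
def square(x):
    return x * x


@app_sketch
def add(x, y):
    return x + y


a = square(3)            # returns immediately with a Future
b = square(4)            # independent of a, may run concurrently
total = add(a, b)        # chained: waits for a and b inside the worker
value = total.result()   # blocks until the whole chain has run
_pool.shutdown()
```

The calling code reads like ordinary sequential Python, while the two calls to square are free to run in parallel; this is the property that makes decorator-based task models attractive for existing codebases.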
Ray [50] is a versatile task-based parallelization tool. On the one hand, it provides low-level facilities, based on primitives such as actors and tasks, which make it possible to define and manage distributed computations. Tasks are stateless and executed asynchronously, while actors represent stateful computations. On the other hand, it also provides high-level libraries for deep and reinforcement learning, data processing, and analytics. Ray is widely adopted, serving as a parallel framework for other libraries like Modin, LightGBM, and Mars. While it provides direct support for frameworks such as Tensorflow and PyTorch, it can also act as a communication layer for Horovod. Ray is straightforward to install and to get up and running in its simplest configuration (multi-core CPU), but requires investment for complex configurations with multiple computation nodes, each possibly holding multiple GPUs. It is therefore rather meant for practitioners with production needs.
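The difference between stateless tasks and stateful actors can be illustrated with a small standard-library sketch. The CounterActor class is hypothetical and does not use the Ray API; the actor is approximated by an object whose method calls are serialized on a dedicated worker thread, while tasks are plain functions submitted to a shared pool.

```python
from concurrent.futures import ThreadPoolExecutor


def double(x):
    """Stateless task: the result depends only on the argument."""
    return 2 * x


class CounterActor:
    """Stateful worker: calls are processed one at a time, in submission order."""

    def __init__(self):
        self._count = 0
        # A single worker thread serializes all accesses to the state.
        self._mailbox = ThreadPoolExecutor(max_workers=1)

    def increment(self, by=1):
        return self._mailbox.submit(self._increment, by)

    def _increment(self, by):
        self._count += by
        return self._count

    def stop(self):
        self._mailbox.shutdown()


# Tasks can run on any worker of the pool, in any order.
with ThreadPoolExecutor() as pool:
    task_futures = [pool.submit(double, i) for i in range(4)]
    task_results = [f.result() for f in task_futures]

# Actor calls return futures too, but they all mutate the same state.
actor = CounterActor()
count_futures = [actor.increment() for _ in range(5)]
counts = [f.result() for f in count_futures]
actor.stop()
```

Since tasks share no state, they can be scheduled freely across workers, whereas an actor keeps its state in one place and processes its calls sequentially.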
Dace [10] is a Python library that translates Python code to C++ using the @dace decorator. It transforms the code into a Stateful DataFlow multiGraph (SDFG), supporting a subset of Python code, Numpy operators, and explicit data flows. The SDFG is a directed graph where nodes represent containers or computations, and edges indicate data movement. Dace has two types of containers: data (memory-mapped arrays) and stream (concurrent queues). Computation containers contain stateless functions. The Python to C++ compiler in Dace leverages the Python AST to infer types and shapes, and to perform code analysis. SDFGs enable parallelism by grouping subgraphs, and optimizations are applied through graph transformations. Compilation involves inferring data dependencies, hierarchical code generation, and invoking the compiler for the desired output. While basic usage is simple on multi-core CPUs, complex scenarios require additional development. Dace heavily relies on the host software environment and may be harmed by version clashes. GPU and FPGA support is possible but requires advanced knowledge of the library.

Dask [26,71] is a widely used task-based distributed computing library closely integrated with Numpy and Pandas.
While it shares a similar API with these libraries, and could arguably be considered a drop-in library (see Section 4), it introduces task-based logic and incurs setup overheads. Dask provides APIs for arrays (similar to Numpy), data frames (similar to Pandas), and lists (similar to Python iterators). It utilizes task schedulers to split arrays or dataframes into smaller pieces, distribute work, and merge results. Computation is represented as a directed acyclic graph (DAG) where tasks are defined as function-argument tuples and can be executed concurrently. Dask supports various task schedulers for single or multiple nodes in a cluster, allowing customizable configuration. Although Dask does not directly support GPUs, it can schedule GPU-related work at the task level using Dask-cuDF, which extends the cuDF dataframe library (see Section 4) within Dask.
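The representation of a computation as a graph of function-argument tuples can be sketched in a few lines of plain Python. The get function below is a hypothetical, sequential evaluator written for illustration; it is not Dask's scheduler, which would additionally run independent tasks concurrently and on several nodes.

```python
from operator import add, mul


def _is_key(graph, obj):
    """Return True if obj names another entry of the graph."""
    try:
        return obj in graph
    except TypeError:      # unhashable arguments cannot be keys
        return False


def get(graph, key, cache=None):
    """Recursively evaluate `key` in a dict-based task graph."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # An argument that is a key of the graph is a dependency:
        # it is evaluated first, everything else is passed as a literal.
        values = [get(graph, a, cache) if _is_key(graph, a) else a for a in args]
        result = func(*values)
    else:
        result = task      # literal data stored in the graph
    cache[key] = result
    return result


graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}
w = get(graph, "w")
```

Because the dependencies are explicit in the graph, a scheduler can see that tasks without a path between them are independent and dispatch them to different workers.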
The Tuplex library [77] exhibits an API similar to that of Dask and utilizes optimized LLVM bytecode generation to achieve maximum acceleration. Tuplex performs a dynamic analysis, considering both the code and the data for code generation. Impressive results have been reported, demonstrating up to 91x acceleration in intricate data pre-processing tasks involving User-Defined Functions encompassing operations like regular expressions and query joins. While the present paper primarily focuses on DS and ML jobs with intense numerical computations, it is noteworthy that complex pre-processing and business tasks can assume critical importance in the management of large-scale data within production environments. Therefore, Tuplex remains a relevant tool, although its improvements may not be considered spectacular in the context of typical research ML projects.

3. In this context, it is assumed that the existing codebase relies on one of the most commonly used computation libraries: Numpy, Pandas, or Scikit-learn. Tools and approaches in this section aim at enhancing or replacing these libraries.
In Table 3, we can see that most of the found approaches are drop-in libraries that replace as much as possible the syntax of the original library, keeping the same semantic but providing enhancement.Their usage is sometimes as simple as function call substitution.A few tools provide the exploitation of GPU devices for performance acceleration.
For maximal benefits, they require additional operations relating to memory movement between central and GPU

4. This section surveyed tools which deeply affect an existing codebase, and thus should preferably be used right when the implementation of a DS or ML algorithm starts. In return, they generally provide many primitives which facilitate the work of the practitioner, provided that they stick to the driving principles of the framework. We framed deep learning frameworks in this category, as they come with their very own logic to which the data scientist must adapt. In exchange for this effort, they come with high-level abstractions, and scaffold the access to GPU hardware so that maximal performance is obtained with minimal specific development effort.
In this section, we also gathered distributed computing frameworks. They generally have wider applicability compared to deep learning frameworks, and sometimes act as a back-end for tools summarized in Section 6.1.1. However, when used as the primary tool, they come with specific code constructs which heavily constrain software development, as well as complex setup procedures to deal with variable cluster configurations. As a consequence, it is generally better to involve these tools when implementation starts. Using these frameworks then pays off in terms of the size of the data sets they can handle, which can be orders of magnitude larger than with other tools surveyed elsewhere in this article.

Performance enhancement of programs is a wide subject including parallelization and porting between architectures and languages. Many tools and approaches exist outside the Python world, and beyond ML and DS. However, to deliver a consistent and organized view on the subject, we restrict our scope to the three main scenarios that could occur from a data scientist's point of view. Indeed, this is a partial and oriented view on the subject, leaving space for further explorations.
It is true that software development is a fast-paced domain, and that the best options today may be superseded by their competitors or by new players only a couple of years after this paper has been issued. However, we believe that the scenarios that back the structure of our study are general enough to remain valid even as new tools are developed and existing tools evolve. Also, our study served to highlight a few basic facets of tools, to which new or evolving tools can be fairly straightforwardly attached (e.g., related to task-based frameworks in Section 5.2, or drop-in numerical libraries in Section 4). Therefore, even as time goes by, we believe the insights we delineated will provide useful guidance to practitioners for years to come.
As previously stated, when we delimited our search scope, we deliberately excluded Python interpreters from our study as they are likely to interact with libraries mostly used in DS and ML domains.Yet there are many contributions in this area, which deeply affect vanilla Python efficiency: we briefly review them below.

Python interpreters
CPython serves as the primary testbed for new features and language enhancements in Python, because it is maintained by the same community that designs the language. The performance challenges encountered in Python are directly associated with the CPython implementation. The article by Zhang et al. [86] shows the potential performance improvements achievable through different optimization approaches in standard CPython interpreters. These optimizations include techniques such as dispatch, branch prediction, and array-style access.
There are different implementations of the Python interpreter, offering developers the flexibility to replace the default CPython interpreter with alternative options. The advantage of these alternative interpreters is that they can enhance the speed and performance of Python programs, in principle without any code modification. However, alternative interpreters may have limitations and drawbacks that need to be considered.
Alternative Python interpreters include PyPy [66], Pyston [48], Cinder [29], and IronPython [53]. However, not all of them offer complete coverage of the Python language. Additionally, some implementations are tied to specific Python versions. Cinder and Pyston, for example, support Python 3.8, IronPython supports Python 3.4, and PyPy supports Python 3.9 (which is relatively close to the latest CPython versions). Historically, alternative implementations of Python have often lagged behind in supporting the latest language versions and features.
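Because language coverage and supported versions differ between interpreters, a program may want to check at run time which implementation it is running on. The following is a minimal sketch using only the standard library; the helper names `describe_interpreter` and `supports_language_version` are ours and purely illustrative.

```python
import platform
import sys


def describe_interpreter():
    """Return (implementation name, language version) of the running interpreter."""
    # platform.python_implementation() returns e.g. "CPython", "PyPy" or "IronPython".
    implementation = platform.python_implementation()
    version = "{}.{}".format(sys.version_info[0], sys.version_info[1])
    return implementation, version


def supports_language_version(required=(3, 8)):
    """True if the running interpreter implements at least the given Python version."""
    return tuple(sys.version_info[:2]) >= tuple(required)


if __name__ == "__main__":
    name, version = describe_interpreter()
    print("Running on {}, Python {}".format(name, version))
```

Such a check can guard code paths that rely on recent language features when an alternative interpreter lags behind CPython.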
Beyond being bound to Python 3.8, Pyston has no known compatibility issues ††. Pyston provides a performance gain estimated between 10% and 35% on reported benchmarks, despite some overheads observed on specific tasks such as JSON loading ‡‡. For some time, Pyston has been backed by Anaconda, a popular Python distribution tool. According to recent notes from one of the creators of Pyston, it is likely that their speedups will become fully integrated in a future CPython release §§.
†† https://blog.pyston.org/2022/, accessed on 24/05/2022
Cinder is a JIT interpreter implemented in C++ that includes several performance optimizations such as bytecode inline caching, eager evaluation of coroutines, and a method-at-a-time JIT. According to a report from its maintainers, it can provide a speedup of 1.5 to 4 on many Python performance benchmarks. PyPy [66] is another alternative Python interpreter featuring a JIT compiler. According to benchmark results, the acceleration ratio of PyPy with respect to CPython can range from 0.21 to 4.8 ***, with documented mixed results †††. As already mentioned in Section 3.3, the widely used CPython interpreter does not include a built-in JIT compiler. Nonetheless, third-party libraries or tools, such as Pyston-lite (see Section 3.3.2), can be used to introduce JIT compilation into the Python workflow for performance optimization.
IronPython is an implementation of the Python programming language that runs on the .NET framework, providing seamless integration with .NET technologies and allowing the use of Python libraries. Its performance is comparable to that of CPython, with variations depending on the specific task ‡‡‡. Note that IronPython is primarily designed for Windows environments and does not support importing C extension modules, which limits its compatibility with libraries such as NumPy and SciPy. As a result, some third-party libraries commonly used in DS and ML may not be available in IronPython.
A common problem with alternative Python interpreters is that standard libraries, which may themselves depend on other libraries, are not necessarily compliant outside the CPython implementation and, even when they are, often require building shared libraries from source. This may make it hard to validate the approach for each library and framework, and can be cumbersome for the average practitioner. The Python ecosystem is primarily built around CPython, which means that some community-supported projects, tools, and resources may not work seamlessly, with the added difficulty of keeping up with the latest versions and language features.
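Acceleration ratios such as those quoted above are obtained by timing the same code under two interpreters. The sketch below shows one way to do this with the standard `timeit` module; the workload `sum_of_squares` and the helpers `best_time` and `acceleration_ratio` are illustrative names of ours and are not taken from any of the cited benchmark suites.

```python
import timeit


def sum_of_squares(n):
    """Pure-Python loop: the kind of code a JIT-enabled interpreter may speed up."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def best_time(func, *args, repeat=5, number=10):
    """Best wall-clock time in seconds over `repeat` runs of `number` calls each."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number))


def acceleration_ratio(baseline_seconds, candidate_seconds):
    """Above 1: the candidate is faster than the baseline. Below 1: it is slower."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Run this script once per interpreter and compare the reported times by hand.
    print("best time: {:.6f} s".format(best_time(sum_of_squares, 100000)))
```

A ratio below 1, such as the lower bound of 0.21 reported for PyPy, means that the alternative interpreter is slower than CPython on that particular benchmark.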
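Before committing to an alternative interpreter, a practitioner can check which of the project's dependencies actually import under it. The following standard-library sketch does so; `check_imports` is a hypothetical helper, and the module list shown is only an example.

```python
import importlib


def check_imports(module_names):
    """Map each module name to True if it imports on the running interpreter."""
    report = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            report[name] = True
        except Exception:
            # ImportError, or a failure while loading a compiled extension module.
            report[name] = False
    return report


if __name__ == "__main__":
    for module, ok in check_imports(["json", "numpy", "scipy", "pandas"]).items():
        print("{:10s} {}".format(module, "ok" if ok else "unavailable"))
```

A successful import does not guarantee that every feature of a library works, so this check is only a first filter.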

CONCLUSION
Our article highlighted different approaches to enhance Python performance with regard to three scenarios meant to cover most needs arising in the practice of DS and ML. Each scenario covers a particular profile of developer dealing with ML and DS tasks. These profiles range from a very straightforward way of using Python (i.e., vanilla Python), through the use of standard numerical libraries, up to the use of large integrated frameworks.
By answering our research question (which approaches can be used to improve Python execution performance in the context of one of these three scenarios?), we have looked at the most relevant state-of-the-art approaches, following a narrative review principle. Each scenario calls for specific solutions, which may be addressed by different kinds of techniques. For each scenario, we highlighted how given tools may help practitioners deal with their task. We also highlighted the estimated complexity of setting up those approaches, notably in terms of impact on the original code and of learning curve.
‡‡ https://www.phoronix.com/news/Pyston-2.1-vs-Python-3.8-3.9, accessed on 24/05/2022
§§ https://blog.pyston.org/author/kmodzelewski, consulted on 03/2023
https://github.com/facebookincubator/cinder
*** https://speed.pypy.org/
††† https://pybenchmarks.org/u64q/benchmark.php?test=all&lang=python3&lang2=pypy3&data=u64q
‡‡‡ https://www.python.org/about/success/resolver/
We have shown that for pure Python code acceleration, practitioners have a large choice depending on their level of confidence and on the control they want to have over the performance improvement. For simple and fast, though not optimal, results, they may look at a diverse range of straightforward techniques, some even fully automatic, involving compiler directives, code decorators, or transpilers. Better performance can be obtained with semi-automatic approaches, but they require more involvement from the developer, and a steeper learning curve for maximal gains.
When the codebase relies heavily on well-known numerical libraries, the most natural path is to investigate drop-in libraries. Most of them mimic the API of the library they substitute for, so the learning curve is mild. However, for maximal gains, the practitioner must address subtleties such as memory movements between central and GPU memories.
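The drop-in idea can be sketched as selecting, at import time, the first available library among API-compatible candidates. The helper `load_first_available` below is a hypothetical name of ours; only the standard `importlib` module is required, and the candidate names are assumptions that may not be installed on a given machine.

```python
import importlib


def load_first_available(candidates=("cupy", "numpy")):
    """Return the first importable module among `candidates`, or None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except Exception:
            # Not installed, or installed without a usable device or driver.
            continue
    return None


if __name__ == "__main__":
    xp = load_first_available()
    print("array module:", None if xp is None else xp.__name__)
    # When xp is CuPy, data living in main memory must be copied explicitly:
    # cupy.asarray(host_array) moves it to the GPU, cupy.asnumpy(device_array)
    # brings it back. These copies are the memory movements mentioned above.
```

Code written against the shared API then runs unchanged on either library, while the explicit copies remain the part that the practitioner has to place carefully.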
In the third scenario, the practitioner is starting the development from scratch. Therefore, the approaches surveyed for this scenario are meant to be used right from the start of project development and put heavy constraints on code structure. This initial effort is traded for maximal gains in terms of performance, and a minimal surplus of effort if the driving principles of the frameworks are enforced.
We expect this work to give a comprehensive view and to guide practitioners in choosing among the plethora of existing tools. While playing an important role in the structure of our document, we think that this discussion section can also act as a good starting point in this perspective, with all references then available for deeper inspection. Also, all surveyed tools are reported in summary tables, one per user profile, emphasizing distinctive characteristics and providing usage metrics as of the date this document was written. Though we tried to be as comprehensive as possible, some features of the surveyed tools may not have been covered. Also, we did not run and quantitatively compare the performance of all the surveyed tools, due to their number and diversity. It would be almost impossible to find a suitable common benchmark for every Python acceleration method and task-dedicated tool. We also expect that our work could help new tool designers who aim at enhancing Python performance to get an overview of the current state of the art.
work with datasets larger than the typical RAM of a computer. It works best with files in HDF5, Apache Arrow, and Apache Parquet formats. A multi-core CPU drop-in implementation of Pandas is Modin [61]. Modin can run locally on a single node (multi-core CPUs) or in a cluster environment. Modin is based on a custom version of the Pandas data frame, and treats operations on data frames as user queries which are compiled. With a design similar to that of relational databases, which work with relational algebra, Modin is designed to work with a custom data frame algebra, aiming at simplifying and optimizing operations on a data frame. The Pandas-like API instructions are translated into data frame algebra, with optimizations if possible. Then, the optimized query is passed to a subsystem called Modin Dataframe, which works as a middle layer between the query compiler and the actual execution back-end. A data frame can be partitioned by columns, by rows, or by blocks depending on the operation required and the size of the data, enabling work distribution.
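The three partitioning schemes can be illustrated on a small table stored as a list of rows. This is a sketch in plain Python, not Modin's implementation; the function names are ours, and a non-empty rectangular table is assumed.

```python
def _chunk_size(length, n_parts):
    """Ceiling division, with a minimum of 1 to keep the slicing step valid."""
    return max(1, -(-length // n_parts))


def partition_rows(table, n_parts):
    """Split a list of rows into at most n_parts contiguous groups of rows."""
    size = _chunk_size(len(table), n_parts)
    return [table[i:i + size] for i in range(0, len(table), size)]


def partition_columns(table, n_parts):
    """Split every row at the same column boundaries."""
    n_cols = len(table[0])
    size = _chunk_size(n_cols, n_parts)
    return [[row[j:j + size] for row in table] for j in range(0, n_cols, size)]


def partition_blocks(table, n_row_parts, n_col_parts):
    """Grid of blocks: row partitioning followed by column partitioning."""
    return [partition_columns(rows, n_col_parts)
            for rows in partition_rows(table, n_row_parts)]
```

Row partitions suit operations applied independently to each row, column partitions suit per-column aggregations, and blocks allow both axes to be distributed when the data is large.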

4.3.1 dislib. Dislib [87] is an ML library for Python to be executed in high-performance computing clusters. Dislib is built on top of PyCOMPSs [82] (a task-based parallelization library presented in Section 5.2 in the context of the first scenario) and exposes two main components to developers: 1) an interface for distributed data handling and 2) an estimator-based API. The data handling interface provides an abstraction to handle data as a dataset, which can be divided into multiple subsets to be distributed and handled in parallel. Datasets can be given as Numpy arrays for dense data and as Scipy Compressed Sparse Row matrices for sparse data. The wrapping Dataset format is the input for the ML models.
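The combination of a dataset divided into subsets and an estimator-style interface can be sketched with the standard library alone. The code below does not use dislib or PyCOMPSs and does not reproduce their API: `split_into_subsets`, `partial_stats`, and `MeanEstimator` are hypothetical names, and a thread pool stands in for the distributed task scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subsets(samples, subset_size):
    """Divide a list of samples into consecutive subsets of at most subset_size."""
    return [samples[i:i + subset_size] for i in range(0, len(samples), subset_size)]


def partial_stats(subset):
    """Task run on one subset: returns (count, sum) so that results can be merged."""
    return len(subset), sum(subset)


class MeanEstimator:
    """Toy estimator whose fit() processes the subsets in parallel."""

    def fit(self, subsets, max_workers=4):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(partial_stats, subsets))
        count = sum(c for c, _ in results)
        total = sum(s for _, s in results)
        self.mean_ = total / count
        return self


if __name__ == "__main__":
    subsets = split_into_subsets(list(range(10)), 3)
    print(MeanEstimator().fit(subsets).mean_)
```

The key design point is that each task returns a partial result that can be merged, so that the subsets never need to be gathered in a single place.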
[80] one of the most used Python libraries, as it provides a multi-dimensional array format central to many other libraries. It also includes a set of routines for manipulating arrays with different operations, e.g., mathematical primitives, shape manipulation, and sorting. Numpy exploits BLAS and LAPACK and is therefore much faster than vanilla Python code. However, it under-utilizes parallel computer architectures. Several examples of Numpy drop-in libraries attempt to circumvent this issue.
Legacy drop-in. Distarray [38] is a drop-in library for Numpy, which distributes the execution of Numpy operations across multi-core CPUs, clusters, or supercomputers. It depends on IPython.parallel [80] (surveyed in Section 5.2)
offers very similar features to Tensorflow and PyTorch, primarily aimed at defining and training neural network structures. It has been around since 2007, but its development has been stopped; the latest release dates back
Comments: Used by other libraries and software such as H5py, vtk, dask, and PocketFlow. DS/ML practitioners may prefer using a library that hides low-level MPI instructions to obtain parallelism.
Comments: Used by other projects such as ANUGA, TCRM, and Wind multipliers. Helps to avoid the construction of memory buffers by using Pickle. Not compatible with Python 3+.
Comments: Only works with Linux x86-64 with one specific Python and NumPy version. It is distributed as a forked Numba package with a version out of the main branch. Used mostly in academic environments. The user needs to know task-based programming and use the definitions of the library.
Comments: To easily create a C extension for Python, including parallel processing. Used by many popular libraries such as scipy, pandas, and numpy. Requires manual refactoring of code for performance improvement.
Comments: According to the statistics, it is a highly popular package. It uses decorators to give hints to the compiler and has a dependency on LLVM.
Comments: Used mainly in academic environments as support for other packages such as PyCosmo. Only supports a subset of Python.
Comments: Used by at least 13 packages on Github to accelerate computations. Generates C++ code and optimizes it. The code needs to be compiled to be used in another project as a module. Usage is straightforward, but not frequently updated on PyPI. Needs to be built for a recent version, soon to be incorporated into CPython.
Activity period format: month.year. ★ First and last release on PyPI • * First and last commit on Github • ± First and last update on sourceforge. Popularity: Github stars / PyPI last month downloads. Statistics as of March 2023.

Table 2. Summary information of the tools from first scenario

Second scenario: Accelerating numerical libraries usage. The tools relevant to this user scenario are summarized in Table
memory. In the context of CuPy, it materializes as copying Numpy arrays into CuPy ones. Just as Scikit-learn relies on Numpy and Pandas, cuML relies on CuPy and cuDF to offer a broad coverage of the former. Many drop-in alternatives exist for Numpy, which is explained by the very high popularity of Numpy as a building block for DS and ML code development, and as a dependency in other Python libraries.
Comments: Bohrium is used as support for other packages such as Veros and Weld. Drop-in replacement for Numpy but with limited coverage; it claims to fall back to Numpy, but crashes were observed experimentally.
Comments: Good Numpy coverage but not complete. Used by other libraries and part of the RAPIDS ecosystem. The user must add instructions for data copies between the GPU memory and main memory. CuPy has many packages for different CUDA and AMD ROCm versions.
Comments: Uses the Accelerated Linear Algebra compiler (XLA) [74]. Used by many packages such as ColabDesign, trax, ml-workspace, and TensorNetwork. It can be used as a drop-in library for Numpy. Nonetheless, JAX arrays are always immutable. Additionally, JAX is aimed to work best with functional programming.
Comments: Used by other popular packages such as Pandas, zipline, and osmnx. The API is not as extensive as Numpy. Requires working with large arrays to overcome the overhead of compilation and usage.
Comments: Used by at least 33 Github packages such as neural-lifetimes, geospatial-ml, radis, and optimus. This tool prioritizes memory optimization for handling large datasets, which may impact performance.
Comments: Can use execution engines like Dask [71] or Ray [49]. Used by at least 47 Github packages such as aws-sdk-pandas, ludwig, and pandera. Modin lacks full implementation of certain Pandas functions; invoking unsupported functions incurs overhead, as execution falls back to Pandas and results are transferred back to Modin, requiring data type transformation.
Comments: It is part of rapids.ai, a collection of software. cuDF is restricted to use only with NVIDIA GPUs. It can lead to certain memory constraints. Specifically, due to the relatively smaller size of GPU memory, users may encounter overheads related to memory swaps or copies, which can impact performance.
Comments: Compatible with CuPy for Numpy support. Used by at least 6 packages and 178 repositories on Github. Does not convert code to run on GPUs; helps with deployment and management of Dask workers.
Comments: Must be used within Spark. Part of Spark, installed as pyspark. MLlib is part of Spark, and in PyPI it is named pyspark.
Activity period format: month.year. ★ First and last release on PyPI • * First and last commit on Github • ± First and last update on sourceforge. Popularity: Github stars / PyPI last month downloads. Statistics as of March 2023.

Table 3. Summary information of the tools from second scenario

Third scenario: structuring frameworks. The tools relevant to this user scenario are summarized in Table
Comments: Used by 94 packages and 6,381 repositories on Github. Similar to PyTorch in terms of flexibility, lacks convenient IO primitives, and the release frequency has slowed down recently.
Comments: Used by 6 packages and 326 repositories on Github. Similar to a MapReduce framework, with extensive information available for deployment on high-performance clusters.
Comments: Used by 1,515 packages and 102,009 repositories on Github. Popular distributed task queue system with many features, but can be complex to deploy.
Comments: Used by 3 packages and 6 repositories on Github. Supports both single-machine and distributed computing. Powerful capabilities, but requires significant work overhead for complex use cases and in a GPU context.
Activity period format: month.year. ★ First and last release on PyPI • * First and last commit on Github • ± First and last update on sourceforge. Popularity: Github stars / PyPI last month downloads. Statistics as of March 2023.

Table 4. Summary information of the tools from third scenario

Scope and Limitations of the Study
To mitigate the risk of being biased by our own research, we tried to be as open as possible, following a simple narrative process. In addition, the narrative review allows us to provide DS and ML practitioners with an overall view of the different existing techniques. It is also sufficiently open to interest practitioners from related areas which make occasional use of ML techniques, such as scientific computing.