1. Ruff – A fast linter

Is there anyone not using a linter in 2022?

Over the years, the community has recognized that linters are an important part of the software development process. They analyze source code for potential bugs and styling issues, providing valuable feedback and suggestions for improvement. Ultimately, they help developers write cleaner and more efficient code. To get the most out of this process, it's important to have a fast and effective linter.
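To make the idea concrete, here is a minimal sketch of a single lint rule (flagging unused imports) written with Python's standard ast module. It is only an illustration of what a linter checks, not how Ruff is implemented; the function name unused_imports and the sample source are made up for this example.

```python
import ast

def unused_imports(source: str) -> list:
    """Return imported names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)

sample = "import os\nimport sys\nprint(sys.argv)\n"
print(unused_imports(sample))  # ['os']
```

A real linter handles many more cases (star imports, names re-exported through __all__, scoping rules), which is exactly why a fast, well-tested tool is worth having.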

Ruff is an extremely fast Python linter written in Rust. It is 10-100 times faster than existing linters and can be installed via pip.

Linting the CPython codebase from scratch.

In addition to linting, Ruff can also be used as an advanced code transformation tool, capable of upgrading type annotations, rewriting class definitions, sorting imports, and more.

It is a powerful tool designed to replace a variety of other tools, including Flake8, isort, pydocstyle, yesqa, eradicate, and even a subset of pyupgrade and autoflake, while executing at lightning speed.

Definitely a highlight to add to your arsenal in 2023!

2. python-benedict

Dictionaries are important data structures in Python, but working with complex dictionaries can be a challenge. The built-in dict type is powerful, but it lacks convenient ways to access and manipulate nested values or to convert dictionaries to and from different data formats. If you find yourself struggling with dictionaries in Python, then python-benedict might be the solution you've been looking for.

Benedict is a subclass of the built-in dict type, which means it is fully compatible with existing dictionaries and can be used as a drop-in replacement in most cases.

One of the key features of benedict is its support for keylists and keypaths. This makes it easy to access and manipulate values in complex dictionaries without having to dig through nesting levels manually. For example:

from benedict import benedict

d = benedict()

# set values by keypath
d['profile.firstname'] = 'Fabio'
d['profile.lastname'] = 'Caccamo'
print(d) # -> { 'profile':{ 'firstname':'Fabio', 'lastname':'Caccamo' } }
print(d['profile']) # -> { 'firstname':'Fabio', 'lastname':'Caccamo' }

# check if keypath exists in dict
print('profile.lastname' in d) # -> True

# delete value by keypath
del d['profile.lastname']
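For comparison, this is roughly what keypath access looks like when written by hand against a plain dict. It is a standard-library sketch of the idea only, not benedict's implementation; the helper names get_by_keypath and set_by_keypath are hypothetical.

```python
from functools import reduce

def get_by_keypath(data: dict, keypath: str, separator: str = "."):
    """Follow a dotted keypath through nested dicts; raises KeyError if a key is missing."""
    return reduce(lambda node, key: node[key], keypath.split(separator), data)

def set_by_keypath(data: dict, keypath: str, value, separator: str = ".") -> None:
    """Set a value at a dotted keypath, creating intermediate dicts as needed."""
    *parents, last = keypath.split(separator)
    node = data
    for key in parents:
        node = node.setdefault(key, {})
    node[last] = value

d = {}
set_by_keypath(d, "profile.firstname", "Fabio")
set_by_keypath(d, "profile.lastname", "Caccamo")
print(d)                                      # {'profile': {'firstname': 'Fabio', 'lastname': 'Caccamo'}}
print(get_by_keypath(d, "profile.lastname"))  # Caccamo
```

Having this behaviour built into the dictionary itself, with membership tests and deletion included, is what saves the boilerplate.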

In addition to its keylist and keypath support, benedict provides a wide range of I/O shortcuts for a variety of data formats. You can easily read and write dictionaries to and from formats like JSON, YAML, and INI, as well as more specialized formats like CSV, TOML, and XML. It also supports multiple I/O backends, such as filepath (read/write), url (read-only), and cloud storage (read/write).

3. Memray – Memory profiler

Optimizing your system's memory usage is critical to improving its performance and stability. Memory leaks can cause programs to consume more and more memory, degrade overall system performance, and eventually crash. Although Python is generally easy to use thanks to its garbage collector, it does not shield you from these issues. There is nothing preventing circular references or unreachable objects (for more information, read up on how garbage collection works), and even more so if C/C++ extensions are involved.
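As a small illustration of the circular-reference point, the sketch below builds a two-object cycle and shows that reference counting alone never frees it; only the cycle collector does. It uses just the standard gc module, and the Node class is made up for the example.

```python
import gc

class Node:
    """A tiny object that can point at another Node."""
    def __init__(self):
        self.partner = None

gc.collect()   # start from a clean slate
gc.disable()   # keep the automatic collector out of the picture

a, b = Node(), Node()
a.partner, b.partner = b, a   # a -> b -> a: a reference cycle
del a, b                      # no names left, but the refcounts never reach zero

found = gc.collect()          # an explicit pass finds the unreachable objects
gc.enable()
print(f"unreachable objects found: {found}")
```

Pure-Python cycles like this one are reclaimed eventually, but memory held by native extensions is outside the collector's view, which is where a dedicated profiler earns its keep.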

A memory profiler can help you identify and fix these issues, making it an important tool for optimizing your program's memory usage. This is where Memray comes in handy!

It is a memory profiler that tracks memory allocations in Python code, native extension modules, and the Python interpreter itself, providing a comprehensive view of memory usage. Memray generates a variety of reports, including flame graphs, to help you analyze the collected data and identify issues such as leaks and hot spots. It's fast and works with both Python and native threads, making it a versatile tool for debugging memory issues in multithreaded programs.

Memray can be used as a command-line tool or as a library for more fine-grained profiling tasks. Its live mode allows you to inspect memory usage interactively while a script or module is running, which is useful for debugging complex memory patterns.

A must-have tool from Bloomberg for serious Python programmers.
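To get a feel for what "tracking allocations" means, here is a rough sketch using tracemalloc from the standard library. This is deliberately not Memray's API: tracemalloc only sees allocations made through Python's allocator, whereas the text above describes Memray as also covering native code. The build_table function is a made-up workload.

```python
import tracemalloc

def build_table(n):
    """Made-up workload: allocate a list of n short strings."""
    return [str(i) * 10 for i in range(n)]

tracemalloc.start()
before = tracemalloc.take_snapshot()
table = build_table(50_000)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Rank source lines by how much memory they gained between the two snapshots
top = after.compare_to(before, "lineno")[:3]
for stat in top:
    print(stat)
```

The first entry should point at the list comprehension inside build_table; a flame graph presents the same kind of attribution, but across whole call stacks.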

4. Codon – Python compiler using LLVM

We all love Python because it's simple and easy to use, but sometimes we need a little extra speed. Even with all their optimizations, Python interpreters such as CPython can only go so far. This is where compilers come in when further performance improvements are needed. They convert our Python code into machine code that the processor can execute directly, skipping the interpreter step and giving us some significant performance gains.

Codon is a high-performance, ahead-of-time Python compiler that can even compete with C/C++ in terms of speed, with typical speedups reported to be 10-100x or more (on a single thread). It can be used within larger Python codebases via the @codon.jit decorator, and it can call plain Python functions and libraries from within Codon through Python interoperability.

There is no such thing as a free lunch, so you will likely need to make some changes to your Python code before Codon can compile it. These restrictions on the language are ultimately what make the performance gains possible. The compiler helps along the way, though, by generating detailed error messages that point out incompatibilities and how to resolve them.
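A minimal sketch of the decorator approach is shown below, assuming Codon's Python package is installed and exposes the decorator as codon.jit, as mentioned above. The no-op fallback is an assumption added purely so the snippet still runs on a machine without Codon; count_primes is a made-up example function.

```python
try:
    import codon              # only available if Codon's Python package is installed
    jit = codon.jit
except ImportError:
    def jit(func):            # assumption: no-op fallback so the sketch runs without Codon
        return func

@jit
def count_primes(limit):
    # Count primes below `limit` with plain trial division:
    # tight integer loops like this are where a compiler helps most.
    count = 0
    for n in range(2, limit):
        is_prime = True
        d = 2
        while d * d <= n:
            if n % d == 0:
                is_prime = False
                break
            d += 1
        if is_prime:
            count += 1
    return count

print(count_primes(100))  # 25
```

The rest of the program stays ordinary Python; only the decorated function is handed to the compiler.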

If you're looking to speed up your code, it's definitely worth checking out.

5. LangChain – Build LLM-backed applications

Unless you've been living under a rock, you know that generative AI has been taking the world by storm this year. A large part of this is due to Large Language Models (LLMs).

A lot of the code we've written over the past few years to solve AI problems can be thrown away and replaced with an LLM (e.g. GPT-3 or its evolutions, InstructGPT or ChatGPT; T5; or any other). In addition, we have witnessed the birth of a new programming language for interfacing with LLMs: text-based prompts.

LangChain emerged to help harness the full power of LLMs.

First: in any serious use of LLMs, one usually needs to think of prompts not as one-off things, but as a composition of several parts: templates, user input, and input/output examples that the LLM can use as references. LangChain helps you streamline this "prompt management" by providing interfaces to build different prompts directly from each component.
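The composition idea can be sketched in a few lines of plain Python. This is not LangChain's API (which is best taken from its own documentation); the FewShotPrompt class and its fields are hypothetical and only show how instructions, examples, and user input combine into one prompt.

```python
from dataclasses import dataclass, field

@dataclass
class FewShotPrompt:
    """Hypothetical prompt builder: instructions + examples + templated user input."""
    instructions: str
    examples: list = field(default_factory=list)   # (input, output) pairs
    template: str = "Input: {text}\nOutput:"

    def render(self, user_input: str) -> str:
        shots = [f"Input: {i}\nOutput: {o}" for i, o in self.examples]
        parts = [self.instructions, *shots, self.template.format(text=user_input)]
        return "\n\n".join(parts)

prompt = FewShotPrompt(
    instructions="Classify the sentiment of each input as positive or negative.",
    examples=[("I loved it", "positive"), ("Terrible service", "negative")],
)
print(prompt.render("The food was great"))
```

Keeping the pieces separate means the examples or the template can be swapped without touching the code that collects user input.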

Second, in order to build prompts, you sometimes need to inject external knowledge (or even the output of other models). For example, suppose you need to run a database query to extract a customer's name for a personalized email. This is the concept of chains, and LangChain provides a unified interface for them.

Then there's the concept of fetching and augmenting data so that the LLM can work with your own data, as opposed to the "generic" data on which the model was trained.

You can do much more with LangChain, such as being ready to switch to another model or provider without changing your code, building agents with memory, etc.

Certainly an innovative tool that we expect to see grow significantly in 2023!
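The customer-email scenario can be sketched as a chain of small steps that pass a shared state along. Everything here is a stand-in: CUSTOMERS replaces a real database, fake_llm replaces a real model call, and run_chain is a hypothetical helper rather than anything from LangChain.

```python
CUSTOMERS = {42: "Ada"}   # stand-in for a real database table

def lookup_customer(state: dict) -> dict:
    """Step 1: fetch external knowledge (here, a dictionary lookup)."""
    return {"name": CUSTOMERS[state["customer_id"]]}

def build_prompt(state: dict) -> dict:
    """Step 2: inject the fetched value into the prompt."""
    return {"prompt": f"Write a short welcome email to {state['name']}."}

def fake_llm(state: dict) -> dict:
    """Step 3: placeholder for a model call; it just echoes the prompt."""
    return {"email": f"[model output for: {state['prompt']}]"}

def run_chain(steps, state: dict) -> dict:
    """Run each step in order, merging its output into the shared state."""
    for step in steps:
        state = {**state, **step(state)}
    return state

result = run_chain([lookup_customer, build_prompt, fake_llm], {"customer_id": 42})
print(result["email"])  # [model output for: Write a short welcome email to Ada.]
```

Because every step has the same shape (state in, new fields out), steps can be reordered or replaced independently, which is the appeal of a unified interface.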

6. Fugue – Distributed computing made easy

If you're familiar with Pandas or SQL, you know that these tools are great for working with small to medium-sized data sets. But when dealing with larger amounts of data, distributed computing frameworks like Spark are often needed to process it efficiently. The problem is that Spark is quite different from Pandas or SQL: the syntax and concepts are not the same, and migrating code from one to the other can be a challenge. That's where Fugue comes in.

Fugue is a library that makes it easier to use distributed computing frameworks like Spark, Dask, and Ray. It provides a unified interface for distributed computing, allowing you to execute Python, pandas, and SQL code on Spark, Dask, and Ray with minimal rewrites.

The best place to start is Fugue's transform() function. It allows you to execute a single function in parallel by bringing it to Spark, Dask, or Ray. See the following example, where the map_letter_to_food() function is brought to the Spark execution engine:

import pandas as pd
from typing import Dict
from pyspark.sql import SparkSession
from fugue import transform
 
input_df = pd.DataFrame({"id":[0,1,2], "value": (["A", "B", "C"])})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}
 
def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
    df["value"] = df["value"].map(mapping)
    return df
 
spark = SparkSession.builder.getOrCreate()
 
# use Fugue transform to switch execution to spark
df = transform(input_df,
               map_letter_to_food,
               schema="*",
               params=dict(mapping=map_dict),
               engine=spark
               )
 
print(type(df))
df.show()

<class 'pyspark.sql.dataframe.DataFrame'>
 
+---+------+
| id| value|
+---+------+
|  0| Apple|
|  1|Banana|
|  2|Carrot|
+---+------+

Fugue allows you to maintain a single codebase for Spark, Dask, and Ray projects, with logic and execution completely separated, so that programmers don't have to learn each framework.

The library has some other interesting features and a growing community, so be sure to check it out!

7. Diffusers – Generative artificial intelligence

2022 will forever be remembered as the year when generative AI broke through the frontiers of the AI community and expanded into the outside world. This was mainly driven by diffusion models, which have gained a lot of attention due to their impressive ability to produce high-quality images. DALL·E 2, Imagen, and Stable Diffusion are just a few examples of the diffusion models that have caused a stir this year. Their results inspire discussion and admiration, because their ability to generate images pushes the boundaries of what was previously thought possible, even by AI experts.

Hugging Face's Diffusers library is a collection of tools and techniques for working with diffusion models, including Stable Diffusion models, which have proven particularly effective at generating highly realistic and detailed images. The library also includes tools for optimizing image generation model performance and analyzing image generation experiment results.

Getting started with text-to-image generation using this library might be as simple as dropping in the following lines of code:

import torch
from diffusers import StableDiffusionPipeline
 
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
 
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

But Diffusers doesn't stop at images. Features for audio generation and even molecule generation (!) are coming soon.

It looks like most open source implementations of diffusion models will take advantage of the building blocks provided by the library.

8. LineaPy – Convert notebooks to production pipelines

Jupyter notebooks are a very useful prototyping and code development tool. They make it easy to mix code, plots, media, and interactive widgets into a single document, making it easy to document and understand your code as you develop it. They are the perfect playground for experimenting with code and testing ideas, and are especially handy for data analysis and machine learning tasks. But as an idea progresses, and people are happy with the results and want to put the code into production, the challenges begin to show.

The code in a notebook may not always be structured in a way that is easy to deploy to production. Should you take the notebook's code and rewrite it elsewhere, modularizing it and following best practices? This is the most common approach, but the downside is that if you want to improve or debug something later, you lose the interactivity of the notebook. So now you have to maintain both the notebook and separate production code, which is far from ideal.

Thanks to LineaPy, there's a path to doing better.

LineaPy is a Python library that helps you quickly transition from prototyping to creating robust data pipelines. It can handle your messy notebooks and help clean up and refactor the code to make it easier to run in orchestration systems or job schedulers such as cron, Apache Airflow, or Prefect.

It also helps improve reproducibility: it has the concept of "artifacts," which encapsulate data and code and can help you trace back how a value was created. At a high level, LineaPy tracks the order in which code is executed to form a comprehensive understanding of the code and its context.
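The idea of tracing back how a value was created can be illustrated with a toy "backward slice" over notebook-style code, built on the standard ast module. This is not LineaPy's API or implementation; slice_for is a hypothetical helper that only handles straight-line, top-level statements and ignores side effects and in-place mutation.

```python
import ast

def slice_for(source: str, target: str) -> str:
    """Keep only the top-level statements needed to compute `target`."""
    statements = ast.parse(source).body
    needed = {target}
    kept = []
    for stmt in reversed(statements):
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        assigned = {n.id for n in names if isinstance(n.ctx, ast.Store)}
        if assigned & needed:
            # This statement defines something we need, so its inputs are needed too
            needed |= {n.id for n in names if isinstance(n.ctx, ast.Load)}
            kept.append(stmt)
    return "\n".join(ast.unparse(s) for s in reversed(kept))

notebook = """
raw = [3, 1, 2]
debug = len(raw)
clean = sorted(raw)
print(debug)
total = sum(clean)
"""
print(slice_for(notebook, "total"))
```

For the sample above, the debugging lines drop out and only the three statements that feed total remain, which is the kind of cleanup that makes exploratory code easier to move into a pipeline.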

The best part? The integration is so simple that it only takes two lines of code to run in the notebook environment!

Be sure to check out the documentation, which does a good job of showing the problem the tool solves with very comprehensive examples.

9. whylogs – Model monitoring

As AI models create real value for businesses, their behavior in production must be continuously monitored to ensure that value persists over time. That is, there must be a way to show that the model predictions are reliable and accurate, and that the inputs provided to the model are not significantly different from the type of data used to train it.

But model monitoring isn't limited to AI models: it can be applied to any type of model, including statistical and mathematical models. Hence the usefulness of this pick.

whylogs is an open-source library that lets you log and analyze any kind of data. It offers a range of features, starting with the ability to generate summaries of datasets: whylogs profiles.

A profile captures statistics of the raw data, such as distributions, missing values, and many other configurable metrics. Profiles are computed locally using the library and can even be merged, which allows analysis of distributed and streaming systems. They form a representation of the data that conveniently does not require disclosing the data itself, only the metrics derived from it, which is good for privacy.

Profiles are easy to generate. For example, to generate a profile from a Pandas DataFrame you can do the following:
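Why summaries of this kind combine well across distributed or streaming systems can be seen in a toy example. The ColumnProfile class below is hypothetical and far simpler than a real whylogs profile; it only shows that summary statistics from separate partitions can be merged without ever exchanging the raw values.

```python
import math
from dataclasses import dataclass

@dataclass
class ColumnProfile:
    """Toy summary of one numeric column: only statistics, never the raw values."""
    count: int = 0
    missing: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value) -> None:
        self.count += 1
        if value is None:
            self.missing += 1
            return
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "ColumnProfile") -> "ColumnProfile":
        # Every field is additive or a min/max, so merging needs no raw data
        return ColumnProfile(
            self.count + other.count,
            self.missing + other.missing,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> float:
        present = self.count - self.missing
        return self.total / present if present else math.nan

# Two partitions profiled independently, e.g. on different machines
part_a, part_b = ColumnProfile(), ColumnProfile()
for v in [1.0, None, 3.0]:
    part_a.track(v)
part_b.track(5.0)

merged = part_a.merge(part_b)
print(merged.count, merged.missing, merged.mean)  # 4 1 3.0
```

Only the handful of numbers in each profile ever leaves the machine that computed it, which is the privacy benefit mentioned above.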

import whylogs as why
import pandas as pd
 
#dataframe
df = pd.read_csv("path/to/file.csv")
results = why.log(df)

But profiles are only useful if you decide what to do with them. For this, visualization is a must. You can install the viz module, which can create interactive reports for you:

pip install "whylogs[viz]"

This is an example drift report from a Jupyter Notebook:

Example drift report using the whylogs viz module.

You can also set constraints to be notified when your data doesn't meet expectations, effectively doing some kind of testing on your data/model. After all, isn't that what we were going to do from the start?

10. Mito – A spreadsheet in a notebook

In the age of data science, many people are moving from manually analyzing data in spreadsheets to writing code to do this. But there's no denying that spreadsheets are an attractive tool that offers a streamlined editing interface and instant feedback that allows for rapid iteration.

Many times, we use spreadsheets to process data. But if we want to do the same thing again with new data, we have to start from scratch! Code does this better, saving us valuable time.

Can we have the best of both worlds? Meet Mito.

Mito comes in handy because it's a library that allows you to work with data in Jupyter Notebooks through a spreadsheet-like interface. It allows you to import and edit CSV and XLSX files, generate pivot tables and graphs, filter and sort data, combine datasets, and perform a variety of other data manipulation tasks.

The most important feature: Mito will generate Python code for each of your edits!

How Mito works.

Mito also supports Excel-style formulas and provides summary statistics for columns of data. It aims to be the first tool in the data science toolkit, which means it is built as a user-friendly data exploration and analysis tool.

Tools like Mito lower the barrier to entry into the world of data science, allowing people familiar with software like Excel or Google Sheets (almost everyone?) to start contributing code quickly.
