
Hidden treasures: Transform your machine learning workflow with these Python libraries

author:Medium meter AI

The field of data science offers a vast array of tools and libraries for extracting knowledge from data. However, beneath the surface of commonly used Python libraries such as NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn lies a treasure trove of lesser-known but powerful libraries that can give your machine learning workflow a significant boost.

What are the common machine learning libraries?

Machine learning projects are mainly supported by libraries such as NumPy, Pandas, Matplotlib, Seaborn, and Scikit-Learn. NumPy has powerful array manipulation capabilities and is essential for numerical calculations. Pandas simplifies data processing and analysis through its DataFrame structure. To create engaging visual representations of data, Matplotlib and Seaborn are indispensable resources. Scikit-Learn provides a comprehensive toolkit for a variety of machine learning tasks such as clustering, regression, and classification. When these resources are leveraged together, data scientists are able to effectively review, evaluate, and model data.
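As a quick illustration of how these common libraries fit together, here is a minimal sketch. The column names, synthetic data, and choice of logistic regression are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Build a small synthetic dataset with NumPy, wrap it in a Pandas DataFrame
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
df["target"] = (df["feature_a"] + df["feature_b"] > 0).astype(int)

# Review the data with Pandas, then model it with scikit-learn
print(df.describe())
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["target"], random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

Matplotlib or Seaborn would typically come in at the review stage, e.g. to plot the two features against the target.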


Lesser-known machine learning libraries

➤ Pandas Profiling

The Python pandas-profiling module (renamed ydata-profiling in recent releases) is a powerful tool for automating exploratory data analysis (EDA). It generates an extensive report with insightful information about the dataset, including which variables to keep and which to remove.

from pandas_profiling import ProfileReport  # in newer releases: from ydata_profiling import ProfileReport

profile = ProfileReport(df, title='Dataset Report', explorative=True)
profile.to_notebook_iframe()

When used, the library provides several important insights into the data being processed. Some of these include:

Overview — This section provides an overview of the dataset, including the number of variables and observations and the different types of variables.

Correlation — This section uses heatmaps to illustrate the relationships between variables in the dataset. It allows switching between various correlation measures, including Pearson's r, Kendall's τ, Spearman's ρ, and Phik (φk).

Warnings — This section flags variables that need attention, such as those with a large number of zero or NaN values, and high-cardinality categorical variables.
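To see what these three sections correspond to, the same kinds of insight can be approximated with plain Pandas. This is a rough sketch on a made-up toy DataFrame, not the library's actual implementation:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with a couple of missing values and one categorical column
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [40_000, 52_000, 61_000, np.nan, 45_000],
    "city": ["NY", "LA", "NY", "SF", "LA"],
})

# Overview: number of observations, variables, and their types
print(df.shape)
print(df.dtypes)

# Correlation: pairwise Pearson correlation between numeric columns
print(df.corr(numeric_only=True))

# Warnings: missing-value counts and cardinality of categorical columns
print(df.isna().sum())
print(df["city"].nunique())
```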


➤ Missingno

The Python missingno package is an amazing tool for visualizing missing data in datasets. It provides a number of features to help analysts and data scientists understand the distribution and presence of missing values in their data, with visualizations such as:

Matrix Diagram — This shows where NULL/NaN/missing values occur across the entire dataset.

# Install first: pip install missingno
import missingno as msno

msno.matrix(df)

Bar Chart — This plots the count of non-missing values in each feature column, so columns with missing data show shorter bars.

msno.bar(df, color="dodgerblue", sort="ascending", figsize=(8,6), fontsize=8)
           

Heatmap — This creates a nullity correlation heatmap, showing how the presence or absence of values in one column relates to another.

msno.heatmap(df)
           

Dendrogram — This provides a hierarchical representation of how missing values in different features correlate with one another.

msno.dendrogram(df)
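If you want to try these plots on a quick example, one way is to inject missing values into a random DataFrame first. The column names and missing-value pattern below are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))

# Randomly blank out roughly 20% of the cells in columns 'a' and 'b'
mask = rng.random(size=(100, 2)) < 0.2
df[["a", "b"]] = df[["a", "b"]].mask(mask)

print(df.isna().sum())
# msno.matrix(df), msno.bar(df), etc. can now be called on this frame
```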
           

➤ PyCaret

PyCaret is an open-source, low-code machine learning library for Python that automates the key steps of training, evaluating, and comparing regression and classification models. It is designed to shorten the time it takes to go from hypothesis to insight, so both experienced data scientists and novices can benefit from it.

PyCaret automates many machine learning project phases, including data preprocessing, feature engineering, model training, and evaluation. In addition, it provides the ability to compare multiple machine learning models and fine-tune hyperparameters.

You can install PyCaret using pip:

pip install pycaret
           

To run a classification experiment with PyCaret:

from pycaret.classification import *
clf = setup(data, target='target_column')
best = compare_models()
           

To run a regression experiment:

from pycaret.regression import *
reg = setup(data, target='target_column')
best = compare_models()
           

➤ Pendulum

The Pendulum package is a very useful tool for managing dates and times in Python. It provides a more robust and developer-friendly API for working with dates and times than Python's built-in datetime module. You can use Pendulum to quickly complete tasks such as formatting, parsing, and arithmetic operations on dates and times. It is an effective tool for managing time-related data in your application because it also provides features such as time zone handling and duration calculations.

Some of the key features of the library include:

DateTime instances — You can build a DateTime object using the now() function to get the current date and time, or the datetime() function for a specific date and time. Using the local() function, you can build a DateTime instance in your local time zone.

import pendulum

dt = pendulum.datetime(2020, 11, 27)
print(dt)

local = pendulum.local(2020, 11, 27)
print(local)
print(local.timezone.name)

Time zone conversion — The library's in_timezone() and convert() functions make it easy to convert between time zones.

utc_time = pendulum.now('UTC')
kolkata_time = utc_time.in_timezone('Asia/Kolkata')
print('Current Date Time in Kolkata =', kolkata_time)
sydney_tz = pendulum.timezone('Australia/Sydney')
sydney_time = sydney_tz.convert(utc_time)
print('Current Date Time in Sydney =', sydney_time)
           

Datetime parsing and arithmetic — Date strings can be turned into DateTime instances with the parse() and from_format() functions, and dates can be shifted with the add() and subtract() methods, each of which returns a new DateTime instance.

dt = pendulum.parse('1997-11-21T22:00:00', tz='Asia/Kolkata')
print(dt)

dt = pendulum.from_format('2020/11/21', 'YYYY/MM/DD')
print(dt)
           

Duration and period calculation — The duration() function lets you create durations that can be added to or subtracted from a DateTime instance. The period() function can be used to determine the time interval between two DateTime instances.

time_delta = pendulum.duration(years=2, days=2, hours=10)
print(time_delta)
print('future date =', pendulum.now() + time_delta)

start = pendulum.datetime(2021, 1, 1)
end = pendulum.datetime(2021, 1, 31)
period = pendulum.period(start, end)
print(period.days)

Conclusion

In conclusion, these hidden Python libraries provide valuable tools to boost your machine learning projects. From efficient time handling to enhanced data visualization, they extend your capabilities, and exploring these hidden gems can increase the productivity and accuracy of your work. As Python evolves, experimenting with lesser-known libraries is a smart way to stay ahead of the curve. Have you tried any of the libraries above? I'd love to hear about your favorite low-key Python ML libraries, especially ones not covered in this post. Let me know in the comments!
