
Machine learning fundamentals

Author: Medium meter AI

Any aspiring data scientist has plenty of worries when starting his/her career. Why do we have to focus on machine learning today, and why is machine learning talked about so much? Is machine learning a recent development? There are dozens of definitions available and dozens of reasons why someone should go into it. We don't want to reinvent the wheel here, but we do want to address several important aspects that matter to aspiring data scientists.

Why Machine Learning

We know that a huge amount of data is generated every minute, such as retail payments, GPS, photos, blogs, videos, e-commerce, investments, insurance, healthcare, accounting, logistics, utilities, and many more. Because there is so much data, there is an opportunity to make predictions on all of these fronts. Predicting means preparing for the future by making the right decisions in the present.


Definition

Machine learning draws on computer science, statistics, and mathematics and is used to make predictions or to cluster data. The most widely used definition is that machine learning is an application of artificial intelligence (AI) that enables systems to automatically learn and improve from experience without being explicitly programmed.
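To make "learning from experience without being explicitly programmed" concrete, here is a minimal sketch (Python with scikit-learn and made-up numbers, purely illustrative): instead of hard-coding a rule, we hand the algorithm example inputs and outputs and let it infer the relationship.

```python
# A minimal sketch (scikit-learn, made-up numbers): instead of hard-coding the
# rule "y = 2x", we give the algorithm example inputs and outputs and let it
# learn the relationship from the data.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]      # example inputs
y = [2, 4, 6, 8, 10]               # observed outputs

model = LinearRegression().fit(X, y)   # "learning from experience"
print(model.predict([[6]]))            # ~12.0, without the rule ever being written
```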


Glossary

  1. Independent variables: There is a set of variables/fields (often referred to as features) that, taken together, determine the output. For example, multiple fields/variables can be used to predict rainfall anywhere, such as the geography of the region (tropical, coastal, mountainous, etc.), the month of the year, the state of the previous day, humidity levels, etc. These fields/variables are called independent variables or features. There are usually multiple independent variables in any dataset.
  2. Dependent variable: The data point we intend to predict is the dependent variable. In the example above, the dependent variable would be the rainfall forecast (yes or no) or the amount of rainfall we expect (in mm). There is usually 1 dependent variable in any dataset, but there can also be multiple dependent variables.
  3. Dataset: The combination of independent variables and dependent variables is called a dataset. In other words, the many different data points collected around a business problem are together called a dataset. For basic machine learning problems it is usually in tabular form: each row is a data entry, and each column is a feature (independent variable). In the rainfall forecast example above, 300 examples (300 days of data) as rows and 5 columns (4 features and 1 output) together form a dataset. Each row refers to one day, while the columns refer to the different dimensions such as humidity, temperature, altitude, day of the week, and so on.
  4. Training data: The complete dataset is divided into 3 parts, with training data typically being the largest part. It is called training data because machine learning algorithms work on this set to create their models (technically, equations).
  5. Validation data: This is the second block (of the larger, full dataset) used to verify the accuracy or correctness of the model created. The model or equation (created during training) is run on this validation set, and the hyperparameters are tuned based on the results to further improve accuracy.
  6. Test data: This is the final piece of the dataset, on which the model is run to report its accuracy score.
  7. Fitting data (or training): Whenever someone says that the data is being fitted or trained, it means that a machine learning algorithm is creating a model, i.e., a generalized equation that fits the data. For example, the equation of a circle in two-dimensional space is (x-h)^2 + (y-k)^2 = r^2, where r is the radius and (h,k) is the center. This is a generalized equation: for any choice of h, k, and r it describes a circle. Similarly, once the model is created, it gives the value of the dependent variable whenever we enter new values for the independent variables.
  8. Loss: Loss is the difference between the predicted value and the actual value for a single training record. It measures how far the prediction is from the actual value.
  9. Cost function: The cost function is the average of the losses over all training examples.
  10. Optimization: This is the process of minimizing the loss by adjusting the weights or parameters. It is achieved by taking the partial derivative (differentiation) of the cost function with respect to each weight.
  11. Parameter: A parameter is a weight associated with each independent variable (feature). These weights change with each iteration of optimization. When the weights stop changing, or are no longer updated significantly, we consider the optimization complete. A minimal sketch tying these terms together follows this list.
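Here is a minimal sketch in Python (numpy only, on a made-up synthetic dataset, so every number is illustrative): it splits the data into training, validation, and test sets, then fits a linear model by gradient descent, computing the per-record loss, the cost function, and the parameter (weight) updates at each step, and stopping when the weights no longer change.

```python
# A minimal sketch tying the glossary together: train/validation/test split,
# parameters (weights), loss, cost function, and optimization by gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: 300 rows, 4 independent variables, 1 dependent variable.
X = rng.normal(size=(300, 4))
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=300)

# Split into training, validation, and test data (roughly 60/20/20).
X_train, y_train = X[:180], y[:180]
X_val, y_val = X[180:240], y[180:240]
X_test, y_test = X[240:], y[240:]

w = np.zeros(4)          # parameters (weights), initialised to zero
learning_rate = 0.05

for step in range(500):
    pred = X_train @ w                          # model predictions
    loss = pred - y_train                       # per-record loss (error)
    cost = np.mean(loss ** 2)                   # cost function: mean of squared losses
    grad = 2 * X_train.T @ loss / len(y_train)  # partial derivatives w.r.t. each weight
    new_w = w - learning_rate * grad            # optimization step
    if np.allclose(new_w, w, atol=1e-6):        # weights no longer changing -> stop
        break
    w = new_w

print("learned weights:", w)
print("validation cost:", np.mean((X_val @ w - y_val) ** 2))
print("test cost:", np.mean((X_test @ w - y_test) ** 2))
```

In practice a library such as scikit-learn handles the fitting, but a loop like the one above is what "training a model" means underneath.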


Categories of machine learning

First, note that machine learning works on data, and specifically on numerical data. This means that all text must be converted into numbers before machine learning algorithms can be applied; this will be discussed later in the process section.

  1. Supervised learning: Such algorithms are needed when each set of independent variables (features or columns in the dataset) has an assigned dependent variable (output). The problem is then to predict the dependent variable (output) given a new set of independent variables (features).
  2. Unsupervised learning: This type of algorithm is used when the dataset needs to be clustered or segregated. For example, suppose we divide the students of a school into categories based on their characteristics (address, height, weight, age, scores obtained in the last year, drawing skills, sports medals, musical skills, wearing glasses or not, etc.). Based on these features, an unsupervised model can divide the students into 3 or 4 categories, e.g., a) studious, b) athletes, and c) artists. A short sketch contrasting supervised and unsupervised learning follows this list.

  3. Reinforcement learning: Reinforcement learning is a subset of machine learning in which the model evaluates all possible paths/options to reach the destination and then selects the path/option that provides the maximum reward (positive score) with the least penalty (negative score).
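As an illustration of the first two categories, here is a minimal sketch (Python with scikit-learn on generated data; the datasets and parameters are assumptions for demonstration): a supervised classifier learns from labelled examples, while an unsupervised clustering model groups rows without any labels.

```python
# A minimal sketch (scikit-learn, synthetic data) contrasting supervised and
# unsupervised learning. The datasets here are generated, purely for illustration.
from sklearn.datasets import make_classification, make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: features X come with known outputs y (labels).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)          # learn the mapping features -> label
print("predicted labels:", clf.predict(X[:5]))

# Unsupervised: only features, no labels; the model groups similar rows together.
X_students, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_students)
print("cluster assignments:", clusters[:10])
```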

Algorithms

Below is a list of commonly used algorithms (a short code reference follows the list):

  1. Linear regression
  2. Logistic regression
  3. Decision tree
  4. Support vector machines
  5. Naive Bayes
  6. kNN
  7. K-means
  8. Random forest
  9. Dimensionality reduction algorithms
  10. Gradient boosting algorithms, for example:

a) GBM

b) XGBoost
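For reference, here is a sketch of where these algorithms typically live in scikit-learn (XGBoost is provided by the separate xgboost package); the class names below are scikit-learn's, and this mapping is offered as a convenience rather than something the list above prescribes.

```python
# Most of the algorithms listed above as implemented in scikit-learn.
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.decomposition import PCA  # one common dimensionality reduction algorithm

models = {
    "linear regression": LinearRegression(),
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(),
    "support vector machine": SVC(),
    "naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(),
    "k-means (unsupervised)": KMeans(n_clusters=3, n_init=10),
    "random forest": RandomForestClassifier(),
    "dimensionality reduction (PCA)": PCA(n_components=2),
    "gradient boosting (GBM)": GradientBoostingClassifier(),
}
print(list(models))
```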


Process

We have already explained this process in the data science section, and since machine learning is a part of data science, all the steps of machine learning are similar to the steps in data science. Read this article till the end, because the last step is your bonus item

  1. Data collection: Data collection is the foundational building block of any machine learning problem. Data can be collected in a structured format (databases, available datasets, internet history) or in an unstructured format (videos, blogs, etc.).
  2. Cleaning data: Cleaning/cleansing data refers to making sure the data has no NULL values and not too many outliers, removing irrelevant columns, and so on.
  3. Exploratory data analysis: Visualize the data with charts to identify patterns, outliers, or key insights that require further action in the next step (feature engineering).
  4. Feature engineering: This step is done to arrive at the right feature set with the help of: a) adding more records to the dataset, b) adding more features, c) grouping operations (e.g., maximum, minimum, pivot values), d) normalizing/scaling the data, e) logarithmic or exponential transformations, f) redesigning features for dimensionality reduction or identifying collinearity, g) one-hot encoding, and many more (see the sketch after this list).
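Here is a minimal sketch (Python with pandas on a made-up toy table; the column names are invented for illustration) of a few of the cleaning and feature engineering steps above: dropping NULLs, removing an irrelevant column, one-hot encoding a text column, and scaling a numeric column.

```python
# A minimal cleaning / feature engineering sketch on made-up toy data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["tropical", "coastal", "mountain", "coastal", None],
    "humidity": [0.81, 0.65, np.nan, 0.70, 0.55],
    "temperature": [31.0, 28.5, 18.2, 27.0, 26.1],
    "row_id": [1, 2, 3, 4, 5],          # irrelevant identifier column
    "rainfall": [1, 0, 0, 1, 0],        # dependent variable
})

df = df.drop(columns=["row_id"])        # remove irrelevant column
df = df.dropna()                        # drop rows with NULL values

# One-hot encode the categorical text feature (machine learning needs numbers).
df = pd.get_dummies(df, columns=["region"])

# Scale/normalize a numeric feature (standardization: zero mean, unit variance).
df["temperature"] = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()

print(df.head())
```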

  5. Algorithm selection: There are multiple algorithms that can be applied to a single business problem, so we can test several of them. For example, for a classification problem we can use logistic regression, decision trees, or naïve Bayes, and pick whichever algorithm provides better accuracy (see the sketch below).
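As a sketch of how such a comparison might look (scikit-learn on synthetic data; the three candidates are the ones named above, and the scoring setup is an assumption), 5-fold cross-validation gives each algorithm an accuracy estimate to compare:

```python
# Trying several classification algorithms on the same problem and comparing
# cross-validated accuracy (synthetic data, purely illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```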

  6. Modeling: This includes training the model, which means finding the right set of weights (associated with the columns/features) to create a generalized equation. It also includes tuning the model with the help of cross-validation and updating the hyperparameters. Then the accuracy of the model is evaluated by running it on unseen data, known as test data. Once the accuracy reaches the threshold of business expectations, the model is moved to production.
  7. Production deployment: Production deployment is a very critical element for any model and is rarely talked about or explained. Here are the aspects we need to consider when deploying a model (a minimal serving sketch follows these questions):

a) Was the model developed for web-based or device-based interactions?

b) Do we need real-time scaling of the model (expecting more users over a specific period of time), or will the number of users stay roughly constant?

c) Does it need to be integrated with external devices such as webcams, e-commerce portals, etc.?

d) What are the security implications of using the model?

e) Do we want the customer to initialize the algorithm and call the prediction function every time, or do we want to use a REST API for our model?
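For question e), here is a minimal sketch of the REST API route (using Flask; the framework choice, the /predict endpoint name, and the model.pkl filename are illustrative assumptions, not something this article prescribes): the model is loaded once on the server, and clients send feature values over HTTP instead of initializing the algorithm themselves.

```python
# A minimal model-serving sketch. Flask, the /predict endpoint, and "model.pkl"
# are illustrative assumptions for this example.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup, not on every request.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[0.7, 25.1, 1, 0, 0]]}.
    payload = request.get_json()
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

A client would then POST JSON such as {"features": [[0.7, 25.1, 1, 0, 0]]} to /predict and receive the prediction back, rather than loading the model locally.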


A master plan on how to learn machine learning in detail

  1. Use YouTube (first) to learn about supervised and unsupervised learning. Park reinforcement learning for a few months until you are confident in both supervised and unsupervised learning.
  2. Enroll in a few courses on Coursera, Udemy, or any other online platform. It doesn't hurt to look for free courses or places where you can get financial aid, but take a look at the content and match it against the topics above.
  3. Practice some algorithms on multiple different datasets (you can find a lot of free datasets online).
  4. Important:

a. Create an account on Kaggle and download datasets

b. Create a profile on GitHub and showcase your work there

c. Similarly, update your profile on LinkedIn

This is explained at a very basic level, and we recommend practicing at least 2-3 algorithms for each type of machine learning to start gaining confidence.
