Photo by Wonderlane on Unsplash

Link to Kaggle Notebook

Use case of LSH

A classical application of similarity search is in recommender systems: Suppose you have shown interest in a particular item, for example a news article x. The semantic meaning of a piece of text can be represented as a high-dimensional feature vector, for example computed using latent semantic indexing. In order to recommend other news articles we might search the set P of article feature vectors for articles that are “close” to x.

In this case, for a large textual dataset containing millions of words, the problem is that there may be far too many pairs of items…
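To make that concrete, below is a minimal sketch of the random-hyperplane flavor of LSH in Python. The article vectors and the query x are synthetic placeholders, and the bit count is an arbitrary illustration rather than a tuned choice.

```python
# A minimal sketch of random-hyperplane LSH for cosine similarity.
# `article_vectors` and `x` are hypothetical stand-ins for the
# article feature vectors described above.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(42)

n_articles, dim, n_bits = 10_000, 300, 8
article_vectors = rng.normal(size=(n_articles, dim))  # placeholder features
x = rng.normal(size=dim)                              # query article vector

# Each random hyperplane contributes one bit: which side of it a vector falls on.
hyperplanes = rng.normal(size=(n_bits, dim))

def lsh_signature(v):
    """Return an n_bits-long binary hash of vector v."""
    return tuple((hyperplanes @ v) > 0)

# Bucket all articles by signature, then compare the query only against
# articles in the same bucket instead of against all n_articles items.
buckets = defaultdict(list)
for i, v in enumerate(article_vectors):
    buckets[lsh_signature(v)].append(i)

candidates = buckets[lsh_signature(x)]
print(f"candidates to rank: {len(candidates)} out of {n_articles}")
```

Similar vectors land on the same side of most hyperplanes, so they tend to share a signature; the pairwise search then runs over one small bucket instead of the whole collection.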


Photo by Tiger Lily from Pexels

Link to full Code in Kaggle and Github

First a note on Ensemble Learning

The general principle of ensemble methods is to construct a linear combination of several fits of a model-fitting method, instead of using a single fit of the method. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model. When we try to predict the target variable using any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance, and bias. Ensembling helps to reduce these factors (except noise, which is irreducible…
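As a quick illustration, here is a small sketch using scikit-learn's VotingClassifier; the dataset is synthetic and the three estimators are arbitrary examples of learners that err in different ways.

```python
# A minimal ensembling sketch: combine several learners by (soft) voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ],
    voting="soft",  # average the predicted class probabilities
)

# Each learner makes different mistakes; averaging them reduces variance.
print(cross_val_score(ensemble, X, y, cv=5).mean())
```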


Photo by Nic Wood from Pexels

Full code in my Kaggle Notebook and Github

For the detailed mechanics of how XGBoost works, you may check this blog post of mine.

Here’s a small part from that blog.

What is Boosting

To understand the absolute basics of the need for a boosting algorithm, let's ask a basic question: if a data point is incorrectly predicted by our first model, and then by the next (probably by all models), will combining the predictions provide better results? Such questions are handled by boosting algorithms.

So, Boosting is a sequential technique that works on the principle of an ensemble, where each subsequent…
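To make the sequential idea concrete, here is a toy sketch on synthetic data in which each new tree is fit to the residual errors of the ensemble built so far; this residual-fitting loop is the core mechanism that gradient boosting, and XGBoost in particular, refine.

```python
# A toy sketch of sequential boosting: each new weak learner targets
# the mistakes of the ensemble built so far. Data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate, n_rounds = 0.1, 50
prediction = np.zeros_like(y)   # start from a trivial model
trees = []

for _ in range(n_rounds):
    residual = y - prediction            # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                # the next learner fits the mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("final MSE:", np.mean((y - prediction) ** 2))
```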


Photo source attribution https://www.freepik.com/psd/education

First some fundamental formulas

The number of combinations of n different things taken r at a time is denoted by nCr, and the formula used for its calculation is nCr = n! / (r!(n-r)!).
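A quick sanity check of the formula with Python's standard library:

```python
# Verify nCr = n! / (r!(n-r)!) against the built-in math.comb.
import math

n, r = 5, 2
via_formula = math.factorial(n) // (math.factorial(r) * math.factorial(n - r))
print(via_formula)        # 10
print(math.comb(n, r))    # 10, the built-in equivalent
```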


Image by Dr StClaire from Pixabay

Matrix Basic Definitions

A matrix A over a field K or, simply, a matrix A (when K is implicit) is a rectangular array of scalars usually presented in the following form:
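The display itself did not survive extraction here; assuming the standard presentation, an m × n matrix looks like this:

```latex
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
```

The scalar a_ij is the entry in row i and column j; the m horizontal n-tuples are the rows of A, and the n vertical m-tuples are its columns.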


Image by SeppH from Pixabay

Link to full Code in Kaggle and Github.

Platt Scaling (PS) is probably the most prevalent parametric calibration method. It aims to train a sigmoid function that maps the original outputs of a classifier to calibrated probabilities.

So it is simply a form of probability calibration: a way of transforming classification output into a probability distribution. For example, if the dependent variable in your training data set is 0 or 1, using this method you can convert the model's raw output into a probability.

Platt Scaling is a parametric method. It was originally built to calibrate the support vector machine…
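As a concrete sketch, scikit-learn exposes Platt scaling as sigmoid calibration inside CalibratedClassifierCV; the SVM and data below are placeholders.

```python
# A minimal Platt-scaling sketch: calibrate a margin-based classifier
# (here a linear SVM) so it emits probabilities. Data is synthetic.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svm = LinearSVC()  # emits raw decision scores, not probabilities
# method="sigmoid" fits Platt's sigmoid on held-out cross-validation folds
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

print(calibrated.predict_proba(X_test[:3]))  # calibrated probabilities
```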


Photo by Michael Burrows from Pexels

What is Log Transformation

Original number: x

Log-transformed number: log(x)

For zeros or negative numbers we can't take the log, so we add a constant to each number to make all values positive and non-zero.

Each variable x is replaced with log(x), where the base of the log is left up to the analyst. Common choices are base 10, base 2, and the natural log (ln).

The log transformation is a widely used method of addressing skewed data and one of the most popular transformations in machine learning.

One of the main reasons for using a log scale is that…
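Here is a minimal sketch of the transformation in NumPy; the array is placeholder data, and log1p is one common way of applying the "add a constant" trick with a constant of 1.

```python
# A minimal log-transform sketch on a right-skewed placeholder array.
# np.log1p computes log(1 + x), i.e. it adds the constant 1 so that
# zeros map to 0 instead of -inf.
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])  # placeholder skewed data

log10_x = np.log10(x + 1)   # base 10, shifted to handle the zero
ln_x    = np.log1p(x)       # natural log of (1 + x)
log2_x  = np.log2(x + 1)    # base 2

print(ln_x)
# For strictly positive data, plain np.log(x) needs no shift at all.
```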


Photo by Pixabay from Pexels

Link to full code in Kaggle Notebook

Link to Github with full code

One of the largest publicly available data sets with malware can be found in the Microsoft Malware Classification Challenge. It consists of over 400 GB of data, with both binary and disassembled code produced with the IDA disassembler and debugger. The binary malware has been stripped of the PE header to make it non-executable for security reasons. This does limit the value of the data set, but the organizers have prioritized the potential security implications of making hundreds of gigabytes of executable malware available to anyone. …


Photo Credit: Pexels

Full Code in Kaggle and Github

Scikit-Learn offers two tools for hyperparameter tuning: GridSearchCV and RandomizedSearchCV.

GridSearchCV performs an exhaustive search over specified parameter values for an estimator (or machine learning algorithm) and returns the best-performing hyperparameter combination. So, all we need to do is specify the hyperparameters we want to experiment with and their ranges of values, and GridSearchCV evaluates all possible combinations of hyperparameter values using cross-validation. As such, we naturally limit our choice of hyperparameters and their ranges of values. …
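As a minimal sketch, with a placeholder estimator and an arbitrary illustrative grid:

```python
# A minimal GridSearchCV sketch; the estimator and parameter ranges
# are placeholders for illustration, not a recommendation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}

# 2 x 3 = 6 combinations, each scored with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```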

Rohan Paul

Kaggle Master | ComputerVision | NLP. Ex Fullstack Engineer and Ex International Financial Analyst. https://www.linkedin.com/in/rohan-paul-b27285129/
