# Use case of LSH

A classical application of similarity search is in recommender systems: Suppose you have shown interest in a particular item, for example a news article x. The semantic meaning of a piece of text can be represented as a high-dimensional feature vector, for example computed using latent semantic indexing. In order to recommend other news articles we might search the set P of article feature vectors for articles that are “close” to x.
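As a rough sketch of how LSH makes this search fast, random-hyperplane hashing (SimHash) buckets vectors so that vectors with high cosine similarity tend to share a bucket — the dimensions, bit count, and article vectors below are arbitrary choices for illustration:

```python
import numpy as np

# Random-hyperplane LSH (SimHash) sketch for cosine similarity.
rng = np.random.default_rng(0)
dim, n_bits = 100, 16
planes = rng.standard_normal((n_bits, dim))   # random hyperplanes

def lsh_signature(v):
    """Hash a vector to a bucket key: one bit per hyperplane side."""
    bits = planes @ v > 0
    return bits.tobytes()

# Index article feature vectors into buckets.
articles = rng.standard_normal((1000, dim))
buckets = {}
for i, v in enumerate(articles):
    buckets.setdefault(lsh_signature(v), []).append(i)

# A query vector is only compared against its own bucket's candidates,
# not all pairs; the exact vector always lands in its own bucket, and
# nearby vectors land there with high probability.
x = articles[42]
candidates = buckets.get(lsh_signature(x), [])
print(42 in candidates)  # True
```

The point is that lookup cost depends on the bucket size, not on the full dataset, which is what sidesteps the "too many pairs" problem below.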

In this case, for a large textual dataset containing millions of words, the problem is that there may be far too many pairs of items…

# Why Bootstrapping is useful and Implementation of Bootstrap Sampling in Random Forests

Link to full Code in Kaggle and Github

# First a note on Ensemble Learning

The general principle of ensemble methods is to construct a linear combination of some model-fitting method, instead of using a single fit of that method. The main idea behind an ensemble model is that a group of weak learners comes together to form a strong learner, increasing the accuracy of the model. When we try to predict a target variable using any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance, and bias. Ensembling helps to reduce these factors (except noise, which is irreducible…
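The bootstrap sampling that bagging-style ensembles (such as random forests) rely on can be sketched in a few lines — the data size and seed below are arbitrary:

```python
import numpy as np

# Bootstrap sampling sketch: each base learner in a bagged ensemble
# (e.g. each tree in a random forest) trains on a sample of the same
# size drawn WITH replacement from the original data.
rng = np.random.default_rng(1)
n = 1000
data = np.arange(n)

sample_idx = rng.integers(0, n, size=n)   # n row indices, with replacement
in_bag = np.unique(sample_idx)            # rows the learner trains on
oob = np.setdiff1d(data, in_bag)          # out-of-bag rows, usable for validation

# On average ~63.2% of rows are in-bag: P(in) = 1 - (1 - 1/n)^n ≈ 1 - 1/e
print(len(in_bag) / n, len(oob) / n)
```

The out-of-bag rows are what make the "free" validation estimate in random forests possible, since each tree never saw them during training.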

# XGBoost on Kaggle Donor Choose Dataset

Full code in my Kaggle Notebook and Github

To learn the detailed mechanics of how XGBoost works, you may check this blog post of mine.

Here’s a small part from that blog.

## What is Boosting

To understand the absolute basics of why a boosting algorithm is needed, let's ask a simple question — if a data point is incorrectly predicted by our first model, and then by the next (perhaps by all models), will combining their predictions give better results? Questions like this are what boosting algorithms handle.

So, Boosting is a sequential technique that works on the principle of an ensemble, where each subsequent…
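A minimal sketch of that sequential idea, fitting each new weak learner to the residuals of the ensemble so far — a depth-1 regression stump is used here for brevity, not XGBoost's actual tree learner, and the toy data is arbitrary:

```python
import numpy as np

def fit_stump(x, residual):
    """Find the split on x that best reduces squared error of the residual."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def boost(x, y, n_rounds=20, lr=0.3):
    """Sequential boosting: each stump corrects the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)   # fit what the ensemble still gets wrong
        pred = pred + lr * stump(x)      # shrunken sequential correction
    return pred

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([1., 1., 1., 5., 5., 5.])
pred = boost(x, y)
print(np.round(pred, 2))
```

Each round shrinks the remaining residual by a constant factor here, which is the sequential error-correcting behavior the paragraph above describes.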

# Naive Bayes with Bag of Words on Kaggle Donor Choose Dataset

Link to full Jupyter Notebook in Kaggle and Github

# First some fundamental formulas

The number of combinations of n different things taken r at a time is denoted by nCr, and the formula for calculating it is below.
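The formula nCr = n! / (r! (n−r)!) can be checked directly in code — the values of n and r below are arbitrary:

```python
from math import comb, factorial

# nCr = n! / (r! * (n - r)!)
n, r = 5, 2
by_formula = factorial(n) // (factorial(r) * factorial(n - r))
by_builtin = comb(n, r)   # Python 3.8+ computes nCr directly
print(by_formula, by_builtin)  # 10 10
```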

# Matrix Basic Definitions

A matrix A over a field K or, simply, a matrix A (when K is implicit) is a rectangular array of scalars usually presented in the following form:
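The standard form referred to above, restated here since the original figure may not render — an m × n matrix over K written as:

```latex
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
```

where each entry \(a_{ij} \in K\), with i indexing the m rows and j indexing the n columns.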

# Why Platt Scaling and implementation from scratch

Link to full Code in Kaggle and Github.

Platt Scaling (PS) is probably the most prevalent parametric calibration method. It trains a sigmoid function to map the original outputs of a classifier to calibrated probabilities.

So it is simply a form of probability calibration: a way of transforming classification output into a probability distribution. For example, if the dependent variable in your training data takes the values 0 and 1, this method converts the raw classifier output into a probability.

Platt Scaling is a parametric method. It was originally built to calibrate the support vector machine…
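A minimal from-scratch sketch of that sigmoid fit — note that Platt's original method uses a Newton-style optimizer and a slightly different sign convention for the parameters; plain gradient descent on toy held-out scores is used here for brevity:

```python
import numpy as np

# Platt scaling sketch: learn p = 1 / (1 + exp(-(A*f + B)))
# mapping raw classifier scores f to calibrated probabilities,
# by gradient descent on the log loss over a held-out set.

def platt_fit(f, y, lr=0.1, n_iter=2000):
    A, B = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A * f + B)))
        A -= lr * np.mean((p - y) * f)   # gradient of log loss w.r.t. A
        B -= lr * np.mean(p - y)         # gradient of log loss w.r.t. B
    return A, B

# Toy held-out scores: positives tend to have higher raw scores.
f = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
A, B = platt_fit(f, y)
p = 1.0 / (1.0 + np.exp(-(A * f + B)))
print(np.round(p, 2))
```

The fitted sigmoid is monotone in the raw score, so it recalibrates confidence without changing the ranking of predictions.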

# Why we do log transformation of variables and interpretation of Logloss

## What is Log Transformation

Original number = x

Log-transformed number = log(x)

For zeros or negative numbers we can't take the log, so we add a constant to each number to make all values positive and non-zero.

Each variable x is replaced with log(x), where the base of the log is left up to the analyst. Common choices are base 10, base 2, and the natural log (ln).
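A small sketch of the transformation described above, with an arbitrary constant shift to handle the zero (natural log chosen here; the base is up to the analyst):

```python
import numpy as np

# Log-transform sketch: shift by a constant so all values are positive,
# then take the log.
x = np.array([0.0, 1.0, 9.0, 99.0])
c = 1.0                       # constant added to handle the zero
x_log = np.log(x + c)         # np.log1p(x) is the numerically stable form
print(x_log)
```

Note how the shifted values 1, 2, 10, 100 map to 0, ~0.69, ~2.30, ~4.61: the wide right tail is compressed, which is exactly the skew-reducing effect described above.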

The log transformation, a widely used method to address skewed data, is one of the most popular transformations used in Machine Learning.

One of the main reasons for using a log scale is that…

# Microsoft Malware Detection Kaggle Challenge — BIG-2015

Link to full code in Kaggle Notebook

Link to Github with full code

One of the largest publicly available malware data sets can be found in the Microsoft Malware Classification Challenge. It consists of over 400 GB of data, with both binary and disassembled code produced with the IDA disassembler and debugger. The binary malware has had its PE header removed so the files cannot be executed. This does limit the value of the data set, but the organizers prioritized the security implications of making hundreds of gigabytes of executable malware available to anyone. …

# Implementing Custom GridSearchCV and RandomSearchCV without scikit-learn

Full Code in Kaggle and Github

Scikit-Learn offers two tools for hyperparameter tuning: GridSearchCV and RandomizedSearchCV.

GridSearchCV performs an exhaustive search over specified parameter values for an estimator (or machine learning algorithm) and returns the best-performing hyperparameter combination.
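That exhaustive-search idea can be sketched from scratch in a few lines — `evaluate` below is a hypothetical stand-in for training and scoring a model on one cross-validation fold, and the toy objective is purely illustrative:

```python
import itertools
import numpy as np

def grid_search(param_grid, evaluate, n_folds=3):
    """Enumerate every hyperparameter combination, keep the best CV score."""
    names = list(param_grid)
    best_score, best_params = -np.inf, None
    for values in itertools.product(*(param_grid[k] for k in names)):
        params = dict(zip(names, values))
        # average the score across the cross-validation folds
        score = np.mean([evaluate(params, fold) for fold in range(n_folds)])
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy stand-in objective peaking at C=1, gamma=0.1 (illustrative only;
# a real `evaluate` would fit an estimator and score a held-out fold).
def evaluate(params, fold):
    return -((params["C"] - 1) ** 2 + (params["gamma"] - 0.1) ** 2)

grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
best_params, best_score = grid_search(grid, evaluate)
print(best_params)  # {'C': 1, 'gamma': 0.1}
```

The cost is the product of all grid sizes times the number of folds, which is why the text below notes that we naturally limit the hyperparameters and ranges we try.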
So, all we need to do is specify the hyperparameters we want to experiment with and their ranges of values, and GridSearchCV evaluates all possible combinations of hyperparameter values using cross-validation. As such, we naturally limit our choice of hyperparameters and their ranges of values. …

## Rohan Paul

Kaggle Master | ComputerVision | NLP. Ex Fullstack Engineer and Ex International Financial Analyst. https://www.linkedin.com/in/rohan-paul-b27285129/