Use case of LSH

A classical application of similarity search is in recommender systems: Suppose you have shown interest in a particular item, for example a news article x. The semantic meaning of a piece of text can be represented as a high-dimensional feature vector, for example computed using latent semantic indexing. In order to recommend other news articles we might search the set P of article feature vectors for articles that are “close” to x.

In this case, for a large textual dataset containing millions of words, the problem is that there may be far too many pairs of items to compute the similarity of every pair. Moreover, any two items will typically share only a sparse set of overlapping words. Here LSH can be used to compress the rows into “signatures”, or sequences of integers, which let us compare published papers or news articles without having to compare the entire sets of words. …
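As a rough illustration of the signature idea, here is a minimal MinHash sketch in plain Python. The toy word sets, the CRC32-based hashing, and the number of hash functions are all illustrative choices, not part of any particular library:

```python
import random
import zlib

def minhash_signature(shingles, num_hashes=100, seed=0, prime=2_147_483_647):
    """One MinHash value per random linear hash function h(x) = (a*x + b) % prime;
    the signature is the element-wise minimum over the set's shingles."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_hashes)]
    return [min((a * zlib.crc32(s.encode()) + b) % prime for s in shingles)
            for a, b in coeffs]

def estimate_jaccard(sig_x, sig_y):
    """The fraction of matching signature positions approximates
    the Jaccard similarity of the underlying sets."""
    return sum(x == y for x, y in zip(sig_x, sig_y)) / len(sig_x)

doc_a = {"candy", "cane", "santa", "news", "article"}
doc_b = {"candy", "cane", "santa", "news", "kaggle"}
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
# True Jaccard similarity is 4/6; the estimate should land nearby.
print(estimate_jaccard(sig_a, sig_b))
```

Two documents are now compared via their fixed-length signatures instead of their full word sets, which is what makes the approach scale.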

Basic Multi-Armed Bandit: Kaggle Santa Competition

Link to my Kaggle Kernel. This is one attempt to solve the Kaggle challenge Santa 2020 - The Candy Cane Contest.

Multi-armed bandit problems are some of the simplest reinforcement learning (RL) problems to solve. We have an agent that we allow to choose actions, and each action has a reward that is returned according to a given, underlying probability distribution. The game is played over many episodes (single actions in this case) and the goal is to maximize your reward.

To explain further: how do you most efficiently identify the best machine to play, whilst sufficiently exploring the many options in real time? This problem is not an exercise in theoretical abstraction; it is an analogy for a common problem that organizations face all the time, namely how to identify the best message to present to customers (message is broadly defined here, i.e. webpages, advertising, images) such that it maximizes some business objective (e.g. clickthrough rate, signups).

The classic approach to making decisions across variants with unknown performance outcomes is to perform multiple A/B tests. These are typically run by evenly directing a percentage of traffic across each of the variants over a number of weeks, then performing statistical tests to identify which variant is the best. This is perfectly fine when there are a small number of variations of the message (e.g. 2–4), but can be quite inefficient in terms of both time and opportunity cost when there are many. …
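To make the bandit setting concrete, here is a minimal epsilon-greedy sketch in plain Python. The arm probabilities, epsilon, and step count are made-up illustrative values; the contest itself uses a different reward setup:

```python
import random

def epsilon_greedy(true_probs, steps=10_000, epsilon=0.1, seed=42):
    """With probability epsilon pull a random arm (explore); otherwise pull
    the arm with the best empirical mean reward so far (exploit)."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms        # pulls per arm
    values = [0.0] * n_arms      # running mean reward per arm
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1 if rng.random() < true_probs[arm] else 0    # Bernoulli payout
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total_reward += reward
    return values, total_reward

# Three hypothetical "messages" with unknown clickthrough rates
values, total = epsilon_greedy([0.2, 0.5, 0.8])
```

Unlike an even A/B split, the agent shifts most of its traffic toward the best-performing variant while it is still learning, which is exactly the efficiency argument above.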

Term Frequency and Inverse Document Frequency with scikit-learn

Kaggle Notebook link with all the running code in this blog.

In this post, I shall go over the TF-IDF model and its implementation with scikit-learn.

Traditional (count-based) feature engineering strategies for textual data belong to a family of models popularly known as the Bag of Words model. This includes term frequencies, TF-IDF (term frequency-inverse document frequency), N-grams, topic models, and so on. …

Dimensionality Reduction with t-SNE on MNIST dataset

What is t-SNE?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique developed by Laurens van der Maaten and Geoffrey Hinton in 2008.

The algorithm has two steps:

We initially construct a probability distribution over pairs of high-dimensional objects, in such a way that objects with a higher similarity are assigned a higher probability of being picked as neighbors than objects with a lower similarity.

We then construct a similar probability distribution over the lower-dimensional map so that the Kullback–Leibler divergence between the two distributions, with respect to their location on the map, is minimized.

Usually, the algorithm uses Euclidean distance as the base metric, but it can be changed to suit the programmer's needs. The resulting t-SNE clusters depend on the chosen parameters, and the method may sometimes show clusters in non-clustered data. However, t-SNE is able to recover well-separated clusters when the correct parameters are chosen. …
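A minimal sketch of running t-SNE with scikit-learn, using the small 8x8 digits dataset as a stand-in for MNIST (the subsample size and perplexity are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 500 handwritten-digit images, each a 64-dimensional vector
digits = load_digits()
X, y = digits.data[:500], digits.target[:500]

# Embed the 64-D points into 2-D for visualization
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)   # shape (500, 2)
```

Plotting `X_2d` colored by `y` typically shows the digit classes separating into distinct clusters, which is the standard sanity check for the chosen perplexity.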

Dimensionality Reduction with PCA and Convolutional Neural Networks on MNIST Dataset - Kaggle Competition Notebook

In many practical applications, although the data reside in a high-dimensional space, the true dimensionality, known as intrinsic dimensionality, can be of a much lower value.

For example, in a three-dimensional space, the data may cluster around a straight line, or around the circumference of a circle or the graph of a parabola, arbitrarily placed in R³. In all previous cases, the intrinsic dimensionality of the data is equal to one, as any of these curves can equivalently be described in terms of a single parameter.

The figure below illustrates the three cases. Learning the lower-dimensional structure associated with a given set of data is gaining importance in the context of big-data processing and analysis. Some typical examples are the disciplines of computer vision, robotics, medical imaging, and computational neuroscience. …
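The straight-line case can be sketched numerically: 3-D points scattered around a line have intrinsic dimensionality one, and PCA's explained-variance ratio reveals it (the direction vector and noise scale below are made-up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 points in R^3 clustered around a straight line through the origin
t = rng.uniform(-1, 1, size=(500, 1))             # single latent parameter
direction = np.array([[1.0, 2.0, -1.0]])          # made-up line direction
X = t @ direction + rng.normal(scale=0.02, size=(500, 3))  # line + small noise

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)  # almost all variance in one component
```

A single principal component captures essentially all the variance, matching the claim that one parameter suffices to describe the curve.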

Tensorflow Issue while model Training — Failed to get convolution algorithm

Many have run into this issue, and today I faced it on my Ubuntu 20.04 machine.

Error : Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above

This is normally caused either by an incompatibility between the CUDA, cuDNN and Nvidia driver versions, or by a GPU memory-growth issue. The solution here addresses the memory-growth issue, which was the case for me today.

The following solution worked for me.

Set the TF_FORCE_GPU_ALLOW_GROWTH environment variable to true. In your terminal, run this command.

`export TF_FORCE_GPU_ALLOW_GROWTH=true`
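Alternatively, the same memory-growth behaviour can be enabled from Python, assuming TensorFlow 2.x; this must run before any GPUs are initialized:

```python
import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of grabbing
# all of it up front; must run before the GPU is first used.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```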

Other details about the versions on my machine

My CUDA version — you will know this by running `nvcc —…

Data representations for neural networks — Tensor, Vector and Scalar Basics

Link to Kaggle Notebook for this entire exercise

In general, all current machine-learning systems use tensors as their basic data structure. Tensors are fundamental to the field — so fundamental that Google’s TensorFlow was named after them. Even text and image data are converted to numerical features for processing.

So what’s a tensor?

At its core, a tensor is a container for data — almost always numerical data. So, it’s a container for numbers. …
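A quick NumPy illustration of the rank hierarchy, from scalar up to a rank-3 tensor:

```python
import numpy as np

x = np.array(12)                 # scalar: rank-0 tensor
v = np.array([1, 2, 3])          # vector: rank-1 tensor
m = np.array([[1, 2], [3, 4]])   # matrix: rank-2 tensor
t = np.zeros((2, 3, 4))          # rank-3 tensor, e.g. a small batch of images

print(x.ndim, v.ndim, m.ndim, t.ndim)  # 0 1 2 3
```

The `ndim` attribute is exactly the tensor's rank: the number of axes along which the numbers are arranged.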

Exploratory Data Analysis with Haberman’s Cancer Dataset and the basic plotting techniques

Description of the Data

Haberman’s survival dataset covers cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Label/Attribute Information:

• Age of the patient at the time of operation (numerical)
• Year of operation (year minus 1900, numerical)
• Number of positive axillary nodes detected (numerical; see the note below)
• Survival status (class attribute): 1 means the patient survived 5 years or longer, 2 means the patient died within 5 years
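A minimal pandas sketch of the dataset's layout, using a few made-up rows in the same four-column format (the real file is typically loaded with `pd.read_csv` on the downloaded CSV):

```python
import pandas as pd

cols = ["age", "operation_year", "axillary_nodes", "survival_status"]
# Made-up rows in the dataset's four-column format; the real file would be
# loaded with e.g. pd.read_csv("haberman.csv", header=None, names=cols)
df = pd.DataFrame(
    [[30, 64, 1, 1],
     [34, 66, 9, 2],
     [38, 69, 21, 2],
     [42, 65, 0, 1]],
    columns=cols,
)
print(df["survival_status"].value_counts())
```

Counting the class attribute like this is the usual first EDA step, since it shows how imbalanced the survived/died classes are.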

A note on axillary lymph nodes and their relation to breast cancer diagnosis

Source

The lymphatic system is one of the body’s primary tools for fighting infection. This system contains lymph fluid and lymph nodes, which occur in critical areas in the body. Cancer cells sometimes enter and build up in the lymph system. …

Python Most Common Challenges

Link to Kaggle Notebook for all these exercises together

Q: convert-decimal-to-binary

Performing Short Division by Two with Remainder (For integer part)

This is a straightforward method that involves repeatedly dividing the number to be converted. Let the decimal number be N; divide it by 2, because the base of the binary number system is 2. Note down the remainder, which will be either 0 or 1. Divide the resulting quotient by 2 again, noting the remainder at every step, until the quotient becomes 0. Then write the remainders from bottom to top (in reverse order); this gives the binary equivalent of the given decimal number. …
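The short-division procedure above translates directly into a few lines of Python:

```python
def decimal_to_binary(n):
    """Repeated short division by 2: collect the remainders, then read
    them in reverse order to get the binary representation."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # remainder is 0 or 1
        n //= 2                  # integer quotient for the next step
    return "".join(reversed(bits))

print(decimal_to_binary(13))   # 13 -> "1101"
print(decimal_to_binary(255))  # 255 -> "11111111"
```

Reversing the collected remainders corresponds to the "write remainders from bottom to up" step of the manual method.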