A classical application of similarity search is in recommender systems: suppose you have shown interest in a particular item, for example a news article x. The semantic meaning of a piece of text can be represented as a high-dimensional feature vector, computed for example using latent semantic indexing. To recommend other news articles, we might search the set P of article feature vectors for articles that are “close” to x.

In this case, for a large textual dataset containing millions of words, the problem is that there may be far too many pairs of items…
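As a minimal sketch of the idea, the snippet below does a brute-force nearest-neighbor search over a toy feature matrix using cosine similarity. The function name `most_similar` and the toy vectors are illustrative, not from the original post; real systems avoid the all-pairs cost with approximate methods.

```python
import numpy as np

def most_similar(x, P, k=3):
    """Return indices of the k rows of P closest to x by cosine similarity.
    P is an (n, d) matrix of feature vectors; x is a d-vector."""
    P_norm = P / np.linalg.norm(P, axis=1, keepdims=True)
    x_norm = x / np.linalg.norm(x)
    sims = P_norm @ x_norm           # cosine similarity of x with every row
    return np.argsort(-sims)[:k]     # highest-similarity rows first

# toy example: 4 "articles" in a 3-dimensional feature space
P = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
x = np.array([1.0, 0.05, 0.0])
print(most_similar(x, P, k=2))  # the two rows most aligned with x
```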

**Link to my Kaggle Kernel** and **this one**, which attempt to solve the Kaggle challenge Santa 2020 - The Candy Cane Contest.

To explain further: how do you most efficiently identify the best machine to play, while sufficiently exploring the many options in real time? This problem is not an exercise in theoretical abstraction; it is an analogy for a common problem that organizations face all the time: how to identify the best message to present to customers (message is broadly defined here, e.g. webpages, advertising, images) such that it maximizes some business objective (e.g. click-through rate, signups).
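The explore/exploit trade-off described above is the multi-armed bandit problem. Below is a minimal epsilon-greedy sketch, not the contest solution: the click-through rates, round count, and epsilon are all made-up illustrative numbers.

```python
import random

def epsilon_greedy(true_rates, n_rounds=10000, eps=0.1, seed=42):
    """Epsilon-greedy bandit: with probability eps explore a random arm,
    otherwise exploit the arm with the best observed mean reward."""
    rng = random.Random(seed)
    n = len(true_rates)
    counts = [0] * n        # pulls per arm
    rewards = [0.0] * n     # total reward per arm
    for _ in range(n_rounds):
        if rng.random() < eps:
            arm = rng.randrange(n)  # explore
        else:
            means = [rewards[i] / counts[i] if counts[i] else 0.0
                     for i in range(n)]
            arm = max(range(n), key=means.__getitem__)  # exploit
        reward = 1 if rng.random() < true_rates[arm] else 0
        counts[arm] += 1
        rewards[arm] += reward
    return counts

# three "messages" with hidden click-through rates
counts = epsilon_greedy([0.05, 0.10, 0.20])
print(counts)  # most pulls should concentrate on the 0.20 arm
```

In practice the contest used a more sophisticated setup (decaying rewards, an adversary), but the core tension is the same.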

The…

**Kaggle Notebook link** with all the running code in this blog post.

In this post, I shall go over the TF-IDF model and its implementation with Scikit-learn.

Traditional (count-based) feature engineering strategies for textual data belong to a family of models popularly known as the Bag of Words model. This includes term frequencies, TF-IDF (term frequency-inverse document frequency), N-grams, topic models, and so on. …

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique developed by Laurens van der Maaten and Geoffrey Hinton in 2008.

We initially construct a probability distribution over pairs of high-dimensional objects, in such a way that similar objects are assigned a higher probability than dissimilar ones.

We then construct a similar probability distribution over the points in the low-dimensional map, and minimize the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map.

Usually, the algorithm uses Euclidean distance as the base metric but it…
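A minimal usage sketch with Scikit-learn's `TSNE`: the two synthetic Gaussian blobs below are illustrative stand-ins for real high-dimensional data, and the perplexity value is just the library default made explicit (it must be smaller than the number of samples).

```python
import numpy as np
from sklearn.manifold import TSNE

# toy high-dimensional data: two well-separated Gaussian blobs in R^10
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)),
               rng.normal(8, 1, (50, 10))])

# embed into 2-D; t-SNE preserves local neighborhood structure
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # one 2-D point per input sample
```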

**[99.3% score in Kaggle Digit Recognizer Challenge]**

In many practical applications, although the data reside in a high-dimensional space, the true dimensionality, known as the intrinsic dimensionality, can be much lower.

For example, in a three-dimensional space, the data may cluster around a straight line, or around the circumference of a circle or the graph of a parabola, arbitrarily placed in R³. In each of these cases, the intrinsic dimensionality of the data is equal to one, as any of these curves can equivalently be described in terms of a single parameter.
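The straight-line case can be verified numerically: for points that lie exactly on a line in R³, the singular values of the centered data matrix (the quantities PCA is built on) reveal that only one direction carries any variance. The direction vector and parameter range below are arbitrary choices for the sketch.

```python
import numpy as np

# 3-D points that actually lie on a straight line (intrinsic dimension 1)
rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, 200)                  # the single underlying parameter
X = np.outer(t, np.array([1.0, 2.0, -0.5]))  # a line through the origin in R^3

# PCA via SVD: singular values measure variance along principal directions
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
print(s)  # only the first singular value is essentially non-zero
```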

The figure below…

Many have gone through this issue, and today I faced it on my Ubuntu 20.04 machine.

Error: `Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above`

This is normally caused either by an incompatibility between CUDA, cuDNN, and the Nvidia drivers, or by a GPU memory-growth issue. The solution here addresses the memory-growth issue, which was the case for me today.

**This solution here worked for me.**

Set the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable to `true`. In your terminal, run this command:

`export TF_FORCE_GPU_ALLOW_GROWTH=true`
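If you prefer to keep the fix inside a script or notebook, the same variable can also be set from Python, provided it is set before TensorFlow is imported (TensorFlow reads it when its runtime initializes). The sketch below only sets the variable; the TensorFlow import is left as a comment.

```python
import os

# Equivalent to the shell `export`; must run before `import tensorflow`,
# since the variable is read when the TensorFlow runtime initializes.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import only after the variable is set

print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```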

My…

**Link to Kaggle Notebook for this entire exercise**

In general, all current machine-learning systems use tensors as their basic data structure. Tensors are so fundamental to the field that Google’s TensorFlow was named after them. Even text and image data are converted into numerical features for processing.

At its core, a tensor is a container for data — almost always numerical data. So, it’s a container for numbers. …
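In NumPy terms, a tensor of any rank is just an `ndarray`; a quick sketch of the first few ranks (the particular values are arbitrary):

```python
import numpy as np

scalar = np.array(7)                 # 0-D tensor: a single number
vector = np.array([1, 2, 3])         # 1-D tensor
matrix = np.array([[1, 2], [3, 4]])  # 2-D tensor
cube   = np.zeros((2, 3, 4))         # 3-D tensor, e.g. a stack of matrices

for t in (scalar, vector, matrix, cube):
    print(t.ndim, t.shape)  # rank and shape of each tensor
```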

Haberman’s survival dataset covers cases from a study conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Label/Attribute Information:

- Age of the patient at the time of operation (numerical)
- Patient’s year of operation (year minus 1900, numerical)
- Number of positive axillary nodes detected (numerical; see the note on this below)
- Survival status (the class attribute): 1 means the patient survived 5 years or longer, 2 means the patient died within 5 years
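A minimal pandas sketch of working with these four columns. The three rows below are illustrative values in the dataset’s format, not an excerpt of the real file, and the column names are my own choices:

```python
import pandas as pd

cols = ["age", "operation_year", "axillary_nodes", "survival_status"]
# illustrative rows in the dataset's format (year is year - 1900;
# status 1 = survived 5+ years, 2 = died within 5 years)
data = [(30, 64, 1, 1),
        (34, 59, 0, 2),
        (38, 69, 21, 2)]
df = pd.DataFrame(data, columns=cols)

# map the numeric class attribute to a readable label
df["outcome"] = df["survival_status"].map({1: "survived", 2: "died"})
print(df["outcome"].tolist())
```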

The lymphatic system is one of…

**Link to Kaggle Notebook for all these exercises together**

**Q: convert-decimal-to-binary**

Performing Short Division by Two with Remainder (for the integer part)

This is a straightforward method that involves repeatedly dividing the number to be converted. Let the decimal number be N; divide it by 2, since the base of the binary number system is 2, and note down the remainder, which will be either 0 or 1. Divide the quotient by 2 again, repeating until it becomes 0 and noting the remainder at every step. Then write the remainders from bottom to top (in reverse order), which gives the equivalent binary number of…
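The steps above translate directly into a few lines of Python (the function name `to_binary` is my own; Python’s built-in `bin` does the same job):

```python
def to_binary(n):
    """Convert a non-negative integer to its binary string by
    repeated division by 2, collecting remainders in reverse order."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        remainders.append(str(n % 2))  # each remainder is 0 or 1
        n //= 2                        # continue with the quotient
    return "".join(reversed(remainders))

print(to_binary(13))  # → "1101", matching bin(13)
```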

DataScience | ML | 2x Kaggle Expert. Ex Fullstack Engineer and Ex International Financial Analyst. https://www.linkedin.com/in/rohan-paul-b27285129/