Ilia Zaitsev
{Software Developer & AI Enthusiast}

How to Fail a Coding Interview

Some time ago I attended a coding interview for a Data Scientist position at a start-up. I felt well-prepared and confident, having practiced lots of programming puzzles, coded various Machine Learning techniques from scratch, and gained several years of programming experience. What could go wrong?

This post tells the story of a very unusual coding interview and is my attempt to analyze its results and derive some useful insights even from the failure.

The Best Format to Save Pandas Data Frame

When working on data analysis projects, I usually use Jupyter notebooks and the great pandas library to process and move my data around. To store the data between sessions, I usually use binary formats that preserve data types and encode their content efficiently.

However, there are plenty of binary formats for storing data frames on disk. How can we know which one is better for our purposes? Well, we can try a few of them and compare! In this post, I run a little benchmark to understand which format is best for short-term storage of pandas data.

How to Build a Flexible CLI with Standard Python Library

The Python programming language is quite often used to write various CLI-based utilities and automation tools. There are plenty of third-party libraries that make this process easy and straightforward. However, I’ve recently realized that I still reach for the good old argparse very often when writing my snippets, and that there is a lot of legacy code that uses this package. That’s why I’ve decided to create a single reference point for myself showing how to use it. In this post, we are going to take a close look at the library and gradually build a simple CLI that generates plots with the matplotlib library.
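As a rough illustration of the kind of interface argparse makes possible, here is a minimal sketch of a parser for a plotting tool. The option names (`values`, `--kind`, `--output`, `--grid`) are hypothetical and chosen only for this example; the actual plotting step is left out.

```python
import argparse


def build_parser():
    # Hypothetical options for a plotting tool; the names are illustrative only.
    parser = argparse.ArgumentParser(
        prog="plot", description="Render a simple plot from numbers.")
    # One or more positional numbers, converted to float by argparse.
    parser.add_argument("values", nargs="+", type=float,
                        help="numbers to plot")
    # Restrict the plot type to a fixed set of choices.
    parser.add_argument("-k", "--kind", choices=["line", "bar"],
                        default="line", help="type of plot to draw")
    parser.add_argument("-o", "--output", default="plot.png",
                        help="path of the image file to write")
    # A boolean flag that is False unless given on the command line.
    parser.add_argument("--grid", action="store_true",
                        help="draw a grid behind the plot")
    return parser


# Passing an explicit list instead of reading sys.argv keeps the sketch testable.
args = build_parser().parse_args(["1", "2.5", "4", "--kind", "bar", "--grid"])
print(args.values, args.kind, args.output, args.grid)
```

The parsed `args` namespace could then be handed to whatever function draws the plot.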

Deep Learning Model Training Loop

PyTorch is a fantastic and easy-to-use Deep Learning framework. It provides all the fundamental tools needed to build a machine learning model: CUDA-driven tensor computations, optimizers, neural network layers, and so on. However, to train a model one needs to assemble all these things into a data processing pipeline. The developers recently released version 1.0 of the framework, and I’ve decided it is a good time to try my hand at writing a generic training loop implementation. In this post, I describe this process and share some interesting observations about its development.

Building Simple Recommendation System with PyTorch

Recently I’ve started watching lectures from a great online course on Deep Learning. In one of the lectures, the author discusses building a simple neural-network-based recommendation system and applying it to the MovieLens dataset. The lecture relies on a library developed by the author to run the training process. However, I strongly wanted to learn more about the PyTorch framework, which sits under the hood of the author’s code. In this post, I describe the process of implementing and training a simple embeddings-based collaborative filtering recommendation system using PyTorch, Pandas, and Scikit-Learn.

Classifying Quantized Dataset with Random Forest Classifier (Part 2)

In this post, we’re going to finish the work started in the previous one and finally classify the quantized version of the Wrist-worn Accelerometer Dataset. There are many ways to classify datasets with numerical features, but a Decision Tree is one of the most intuitive and is simple in its underlying implementation. We are going to build a Decision Tree classifier using the NumPy library and generalize it to a Random Forest, an ensemble of randomly generated trees that is less sensitive to noise in the data.

Using K-Means Clustering to Quantize Dataset Samples (Part 1)

Clustering algorithms are used to analyze data in an unsupervised fashion, either when labels are not available or to get new insights about the dataset. K-Means is one of the oldest clustering algorithms: it was developed several decades ago but is still applied in Machine Learning tasks. One way to use it is vector quantization, a process that reduces the dimensionality of the analyzed data. In this post, I’m going to write a simple implementation of K-Means and apply it to the Wrist-worn Accelerometer Dataset.
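To make the idea of quantization concrete, here is a minimal pure-Python sketch of K-Means, not the implementation from the post. It assumes toy 2-D points and uses the first `k` points as initial centroids so that the result is deterministic. Each sample is then replaced by the index of its nearest centroid (its "code").

```python
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, n_iter=10):
    # Assumption: the first k points serve as the initial centroids to keep
    # the sketch deterministic; real implementations usually pick them randomly.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(n_iter):
        # Assignment step: group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


# Toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centroids = kmeans(points, k=2)

# Quantization: every sample is represented by a single centroid index.
codes = [nearest(p, centroids) for p in points]
print(codes)
```

After quantization, a sequence of multi-dimensional samples becomes a sequence of small integers, which is what makes it usable as a compact feature representation.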

Dogs Breeds Classification with Keras

Modern deep learning architectures show quite good results in various fields of artificial intelligence. One of them is image classification. In this post, I am going to see whether one can achieve accurate image classification by applying out-of-the-box deep models from the Keras Python package, pre-trained on ImageNet.

The post was originally published on Medium. Then an updated and improved version was ported to platform.

Generators-Based Data Processing Pipeline

Generators represent a quite powerful concept: gradually consumed (and possibly infinite) streams of data. Almost every Python 3.x developer has encountered them or their close relatives (the lazy range and zip built-ins, for example). However, generators can be used not only as sources of data but also as coroutines organized into data transformation chains. The post shows how to build a generator pipeline that preprocesses images stored on external storage.
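A chain of generator-based coroutines can be sketched as follows. This is only an illustration of the general technique, not the pipeline from the post: the stage names are hypothetical, and plain strings stand in for images.

```python
def coroutine(func):
    """Advance a new generator to its first yield so it can accept send()."""
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start


@coroutine
def skip_blank(target):
    # Stage 1: drop empty items, forward everything else.
    while True:
        item = yield
        if item.strip():
            target.send(item)


@coroutine
def to_upper(target):
    # Stage 2: transform the item (a stand-in for image preprocessing).
    while True:
        item = yield
        target.send(item.upper())


@coroutine
def collect(sink):
    # Final stage: store every received item in the given list.
    while True:
        sink.append((yield))


results = []
pipeline = skip_blank(to_upper(collect(results)))
for name in ["cat.png", "  ", "dog.png"]:
    pipeline.send(name)
pipeline.close()
print(results)
```

Each stage only knows about the next one, so stages can be reordered or replaced without touching the rest of the chain.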

Using Python __new__ Method to Dynamically Switch Class Implementations

Each Python object has a set of magic methods that can be overridden to customize instance creation and behavior. One of the most widely used is __init__, which initializes a newly created instance. But there is one more magic method taking part in object creation, called __new__, which actually creates the class instance. The post explains how to use this method to dynamically switch the implementation of class methods.
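One way this can work is sketched below, under the assumption that the implementation is selected by a constructor argument. The `Serializer` classes and the `fmt` parameter are hypothetical names invented for this example.

```python
import json


class Serializer:
    """Base class whose __new__ returns an instance of a concrete subclass."""

    def __new__(cls, fmt="json"):
        if cls is Serializer:
            # Called as Serializer(...): pick the implementation by name.
            impl = _IMPLEMENTATIONS[fmt]
            return super().__new__(impl)
        # Called on a subclass directly: create it as usual.
        return super().__new__(cls)

    def __init__(self, fmt="json"):
        # Runs after __new__ because the returned object is a Serializer.
        self.requested_format = fmt

    def dumps(self, data):
        raise NotImplementedError


class JsonSerializer(Serializer):
    def dumps(self, data):
        return json.dumps(data, sort_keys=True)


class ReprSerializer(Serializer):
    def dumps(self, data):
        return repr(data)


# Hypothetical registry that maps a format name to an implementation.
_IMPLEMENTATIONS = {"json": JsonSerializer, "repr": ReprSerializer}

a = Serializer("json")
b = Serializer("repr")
print(type(a).__name__, a.dumps({"x": 1}))
print(type(b).__name__, b.dumps({"x": 1}))
```

The caller always writes `Serializer(...)`, while the object it gets back carries the methods of whichever subclass `__new__` selected.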

Deep Learning Machine Software: Ubuntu, CUDA, and TensorFlow

Recently I’ve decided to build a simple deep learning machine with a single GTX 1080Ti GPU, running Ubuntu 16.04. Assembling the machine was quite straightforward, but a few minor issues arose while deploying the required software. It would be helpful to have instructions listing the performed actions in case the system ever requires re-deployment.