PyCon & PyData Berlin 2022: Notes in the Margin

After a long pandemic break, PyCon DE & PyData is back! I attended previous conferences remotely, and this year was the first time I traveled to Berlin to be there in person. The conference was great, with plenty of amazing talks and insights from leading Python engineers and data scientists. While the memories are still fresh, I would like to organize my notes and briefly summarize the presentations I attended.

Stress-Free Machine Learning

Building machine learning models isn’t easy. Heavy datasets and tricky data formats. A ton of hyper-parameters, models, and optimization algorithms. Not to mention generic programming adventures like debugging, exception handling, and logging. This is especially true for R&D (Research and Development) work, where models, approaches, and sometimes even the data itself can change very quickly. You don’t want to invest a large amount of time and effort into building something that could become irrelevant very soon. But you also don’t want to turn your project into a pile of Jupyter notebooks and ad hoc scripts, with files scattered across the file system. In this post, I share an approach that worked well for me on several quickly evolving projects I’ve worked on recently. We’ll pick up a couple of image datasets and try to find a model architecture that works best for them all.

Python Errors Done Right

The exception mechanism is widely adopted among modern programming languages, and Python is no exception. (Pun intended!) Though the topic may seem obvious (wrap your code in try-except clauses, and that’s all), there are some minor but important details, and taking them into account should make your code a bit cleaner. In this post, I go through some guidelines on structuring error processing in Python that I derived from my personal experience.
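
As a minimal sketch of where such guidelines lead (the ConfigError type and the file name here are made up for illustration): catch the narrowest exception you can, wrap it in a domain-specific type, and log it once at the boundary.

```python
import json
import logging

logger = logging.getLogger(__name__)

class ConfigError(Exception):
    """Domain-specific error raised when a config file cannot be used."""

def load_config(path):
    try:
        with open(path) as fp:
            return json.load(fp)
    except FileNotFoundError as exc:
        # Catch the narrowest exception possible and re-raise it
        # as a domain-specific type, preserving the original cause.
        raise ConfigError(f"config file not found: {path}") from exc
    except json.JSONDecodeError as exc:
        raise ConfigError(f"malformed config file: {path}") from exc

try:
    config = load_config("settings.json")
except ConfigError:
    # Log the failure once, with the traceback, and recover.
    logger.exception("falling back to defaults")
    config = {}
```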

Education, Title, Salary: How Are They Related?

A few months ago I took part in my first analytical competition on Kaggle. The competition was about analyzing the results of the 2019 Kaggle survey, which asked members about their levels of education, jobs, and salaries. In this notebook, I share my (rather simple) analysis of the dataset using box plots, pie charts, and heatmaps.

Time Series Classification with PyTorch

The CareerCon 2019 data competition was all about time series classification. This kind of problem seems to be a great fit for Deep Learning models. However, even deep models cannot magically give you good results if the data isn’t properly prepared. In this notebook, we build a simple solution based on one of the winning kernels, using PyTorch to classify the type of surface a small mobile robot moves on from its sensor measurements.
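
To give a taste of what such a model can look like, here is a minimal 1D-CNN classifier sketch; the channel count, sequence length, and number of classes below are illustrative, not the competition’s actual dimensions.

```python
import torch
import torch.nn as nn

class SurfaceClassifier(nn.Module):
    """Toy 1D-CNN over multi-channel sensor sequences."""
    def __init__(self, channels=10, n_classes=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, channels, time)
        return self.head(self.features(x).squeeze(-1))

model = SurfaceClassifier()
logits = model(torch.randn(32, 10, 128))  # -> shape (32, 9)
```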

How to Fail a Coding Interview

Some time ago I attended a coding interview for a Data Scientist position at a start-up. I felt well-prepared and confident, having practiced lots of programming puzzles, coded various Machine Learning techniques from scratch, and accumulated several years of programming experience. What could go wrong?

This post tells the story of a very unusual coding interview and is my attempt to analyze its results and derive some useful insights even from a failure.

The Best Format to Save Pandas Data Frame

When working on data analysis projects, I usually use Jupyter notebooks and the great pandas library to process and move my data around. To store the data between sessions, I usually use binary formats that preserve data types and encode content efficiently.

However, there are plenty of binary formats for storing data frames on disk. How can we know which one is better for our purposes? Well, we can try a few of them and compare! In this post, I run a little benchmark to understand which format is best for short-term storage of pandas data.
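
As a hedged illustration of the benchmark’s shape (not the post’s exact code), a round-trip timing harness might look like this; note that the parquet and feather formats need pyarrow (or, for parquet, fastparquet) installed:

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10),
                  columns=[f"col_{i}" for i in range(10)])

# Rough save/load round-trip timing per format.
formats = {
    "pickle":  (df.to_pickle,  pd.read_pickle,  "df.pkl"),
    "parquet": (df.to_parquet, pd.read_parquet, "df.parquet"),
    "feather": (df.to_feather, pd.read_feather, "df.feather"),
}
for name, (save, load, path) in formats.items():
    start = time.perf_counter()
    save(path)
    load(path)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```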

How to Build a Flexible CLI with Standard Python Library

The Python programming language is quite often used to write CLI-based utilities and automation tools. There are plenty of third-party libraries that make this process easy and straightforward. However, I’ve recently realized that I very often reach for the good old argparse when writing my snippets, and many legacy projects also rely on this package. That’s why I’ve decided to create a single reference point for myself showing how to use it. In this post, we take a close look at the library and gradually build a simple CLI for generating plots with the matplotlib library.
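
For a flavor of where the post ends up, here is a minimal argparse sketch; the CLI built in the post is richer, and the flag names below are just examples:

```python
import argparse

import matplotlib.pyplot as plt
import numpy as np

def main():
    parser = argparse.ArgumentParser(description="Plot a sine wave.")
    parser.add_argument("-f", "--frequency", type=float, default=1.0,
                        help="wave frequency in Hz")
    parser.add_argument("-o", "--output", default="plot.png",
                        help="where to save the figure")
    args = parser.parse_args()

    x = np.linspace(0, 2 * np.pi, 500)
    plt.plot(x, np.sin(args.frequency * x))
    plt.savefig(args.output)

if __name__ == "__main__":
    main()
```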

Deep Learning Model Training Loop

PyTorch is a fantastic and easy-to-use Deep Learning framework. It provides all the fundamental tools to build a machine learning model: CUDA-driven tensor computations, optimizers, neural network layers, and so on. However, to train a model, one needs to assemble all these things into a data processing pipeline. The developers recently released version 1.0 of the framework, and I decided it was a good time to try writing a generic training loop implementation. In this post, I describe this process and share some interesting observations made along the way.
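
The post builds a far more generic implementation with callbacks and metrics; as a baseline, the bare-bones shape of such a loop looks roughly like this:

```python
import torch
from torch import nn, optim

def fit(model, loader, epochs=3, lr=1e-3, device="cpu"):
    """A bare-bones training loop: forward, loss, backward, step."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: loss={total / len(loader):.4f}")
```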

Building a Simple Recommendation System with PyTorch

Recently I started watching the fast.ai lectures, a great online course on Deep Learning. In one of the lectures, the author discusses building a simple neural-network-based recommendation system applied to the MovieLens dataset. The lecture relies on a library developed by the author to run the training process. However, I really wanted to learn more about the PyTorch framework that sits under the hood of the author’s code. In this post, I describe the process of implementing and training a simple embeddings-based collaborative filtering recommendation system using PyTorch, Pandas, and Scikit-Learn.
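
A minimal sketch of the embeddings idea (the architectures in the post and the fast.ai lecture differ in details, such as output scaling):

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Dot-product model: a rating is predicted from the interaction
    of learned user and movie embeddings, plus per-id biases."""
    def __init__(self, n_users, n_movies, dim=50):
        super().__init__()
        self.users = nn.Embedding(n_users, dim)
        self.movies = nn.Embedding(n_movies, dim)
        self.user_bias = nn.Embedding(n_users, 1)
        self.movie_bias = nn.Embedding(n_movies, 1)

    def forward(self, user_ids, movie_ids):
        dot = (self.users(user_ids) * self.movies(movie_ids)).sum(dim=1)
        return (dot + self.user_bias(user_ids).squeeze(1)
                    + self.movie_bias(movie_ids).squeeze(1))

model = EmbeddingNet(n_users=1000, n_movies=2000)
preds = model(torch.tensor([0, 1]), torch.tensor([10, 20]))  # shape (2,)
```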

Classifying Quantized Dataset with Random Forest Classifier (Part 2)

In this post, we finish the work started in the previous one and finally classify the quantized version of the Wrist-worn Accelerometer Dataset. There are many ways to classify datasets with numerical features, but the Decision Tree algorithm is one of the most intuitively understandable and simple approaches in terms of its underlying implementation. We build a Decision Tree classifier using the NumPy library and generalize it to a Random Forest: an ensemble of randomly grown trees that is less prone to data noise.
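
The post implements the tree itself from scratch in NumPy; purely to illustrate the bagging idea on top of a ready-made tree, a sketch could look like this (assumes NumPy arrays and non-negative integer class labels):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleForest:
    """Bagging in a nutshell: each tree sees a bootstrap sample and a
    random feature subset; predictions are decided by majority vote."""
    def __init__(self, n_trees=10, seed=0):
        self.n_trees = n_trees
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.n_trees):
            idx = self.rng.integers(0, n, size=n)  # bootstrap sample
            tree = DecisionTreeClassifier(max_features="sqrt")
            self.trees.append(tree.fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X) for t in self.trees]).astype(int)
        # majority vote across trees for each sample
        return np.apply_along_axis(
            lambda v: np.bincount(v).argmax(), 0, votes)
```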

Using K-Means Clustering to Quantize Dataset Samples (Part 1)

Clustering algorithms analyze data in an unsupervised fashion, either when labels are not available or to gain new insights into a dataset. K-Means is one of the oldest clustering algorithms; it was developed several decades ago but is still applied in modern Machine Learning tasks. One way to use the algorithm is vector quantization, a process that reduces the dimensionality of the analyzed data. In this post, I implement a simple version of K-Means and apply it to the Wrist-worn Accelerometer Dataset.
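
For reference, the core of such an implementation fits in a couple dozen lines of NumPy; this is a simplified sketch (it doesn’t handle empty clusters), not the post’s exact code:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means: alternate between assigning points to the
    nearest centroid and moving each centroid to its cluster mean."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # pairwise distances: (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```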

Dog Breeds Classification with Keras

Modern deep learning architectures show quite good results in various fields of artificial intelligence, one of them being image classification. In this post, I check whether one can achieve accurate image classification by applying out-of-the-box ImageNet pre-trained deep models from the Keras Python package.

The post was originally published on Medium; an updated and improved version was later ported to the Able.bio platform.
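
For a quick impression of the out-of-the-box approach, here is a minimal sketch using the modern tensorflow.keras import path (the original post predates it and uses plain keras); dog.jpg is a placeholder file name:

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# Pre-trained on ImageNet; no fine-tuning at all.
model = ResNet50(weights="imagenet")

img = image.load_img("dog.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
# Top-3 ImageNet classes; many of them are specific dog breeds.
print(decode_predictions(model.predict(x), top=3)[0])
```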

Generators-Based Data Processing Pipeline

Generators represent a quite powerful concept: gradually consumed (and sometimes infinite) streams of data. Almost every Python 3.x developer has encountered generators when using the range or zip functions. However, generators can serve not only as sources of data but also as coroutines organized into data transformation chains. This post shows how to build a generators-based data pipeline that processes images from external storage.
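
A toy version of such a chain, unrelated to the post’s exact pipeline: each stage is a generator that lazily consumes the previous one, so nothing touches the disk until the final loop runs.

```python
from pathlib import Path

def list_images(folder):
    """Source: lazily yield image paths from a folder."""
    yield from Path(folder).glob("*.jpg")

def read_bytes(paths):
    """Stage: consume paths, yield (path, raw bytes) pairs."""
    for path in paths:
        yield path, path.read_bytes()

def keep_small(items, limit=1_000_000):
    """Stage: drop files larger than `limit` bytes."""
    for path, data in items:
        if len(data) <= limit:
            yield path, data

# The chain is just nested generators; iteration drives the work.
pipeline = keep_small(read_bytes(list_images("photos")))
for path, data in pipeline:
    print(path, len(data))
```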

Using Python's __new__ Method to Dynamically Switch Class Implementations

Every Python object has a set of magic methods that can be overridden to customize instance creation and behavior. One of the most widely used is __init__, which initializes a newly created instance. But one more magic method takes part in object creation: __new__, the method that actually creates the instance of a class. This post explains how to use it to dynamically switch the implementation of a class’s logic while hiding it under the same class name.
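
A small illustrative sketch of the trick (the Storage classes below are invented for this example, not taken from the post):

```python
class Storage:
    """Callers always write Storage(...), but __new__ picks the
    concrete implementation based on the argument."""
    def __new__(cls, url):
        if cls is Storage:
            impl = _S3Storage if url.startswith("s3://") else _LocalStorage
            return super().__new__(impl)
        return super().__new__(cls)

    def __init__(self, url):
        self.url = url

class _LocalStorage(Storage):
    def read(self):
        return f"reading local file {self.url}"

class _S3Storage(Storage):
    def read(self):
        return f"downloading from bucket {self.url}"

print(type(Storage("s3://bucket/key")).__name__)  # _S3Storage
print(Storage("/tmp/data.bin").read())            # reading local file ...
```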

Deep Learning Machine Software: Ubuntu, CUDA, and TensorFlow

Recently I decided to build a simple deep learning machine with a single GTX 1080 Ti GPU, running Ubuntu 16.04. Assembling the machine was quite straightforward, but a few minor issues arose while deploying the required software. It would be helpful to have instructions listing the performed steps in case the system ever needs to be re-deployed.