Overparametrized Linear Models

Posted on July 09 2017 in Statistics • Tagged with overparametrization, linear models, statistics, r

Surprisingly, quite a few data scientists overlook the importance of linear regression and the problem of overparametrization. In this post, I describe mathematically what it means for a model to be overparametrized and the general strategies a statistician uses to resolve the resulting non-uniqueness.
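As a small illustration of the non-uniqueness in question, consider a design matrix with a duplicated column: it is rank-deficient, so infinitely many coefficient vectors produce identical fitted values. The numbers below are illustrative, not taken from the post, which treats the topic in R.

```python
import numpy as np

# Columns 2 and 3 of X are identical, so X is rank-deficient: the model
# is overparametrized and the least-squares solution is not unique.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0],
              [1.0, 5.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Two different coefficient vectors with the same fitted values X @ beta:
beta1 = np.array([-1.0, 1.0, 0.0])
beta2 = np.array([-1.0, 0.0, 1.0])
print(np.allclose(X @ beta1, X @ beta2))   # True: the fit cannot distinguish them

# One common resolution: the minimum-norm solution via the pseudoinverse.
beta_min = np.linalg.pinv(X) @ y
print(np.allclose(X @ beta_min, y))        # True: it still fits y exactly
```

The pseudoinverse picks the solution of smallest Euclidean norm; other strategies (dropping redundant columns, imposing identifiability constraints) are what the post goes on to discuss.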

Ranks, Subspaces, and Bases

Suppose …

Continue reading

Bayesian Zero-Inflated Poisson Model

Posted on July 05 2017 in Bayesian Statistics • Tagged with zero inflated poisson, mcmc, bayesian statistics, statistics, r

Wikipedia defines it as follows:

The zero-inflated Poisson model concerns a random event containing excess zero-count data in unit time. For instance, the number of insurance claims within a population for a certain type of risk would be zero-inflated by those people who have not taken out insurance against the risk and thus …
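The insurance example above can be sketched directly: a fraction \(\pi\) of the population are structural zeros (uninsured, so they can never file a claim), and the rest generate Poisson counts. The parameter values below are toy assumptions, and the sketch is in Python rather than the R the post uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: pi is the probability of a structural zero
# (an uninsured person), lam is the Poisson claim rate for the insured.
pi, lam, n = 0.4, 2.0, 100_000

structural_zero = rng.random(n) < pi              # uninsured: always zero claims
counts = np.where(structural_zero, 0, rng.poisson(lam, n))

# ZIP probability of a zero count: P(X = 0) = pi + (1 - pi) * exp(-lam),
# i.e. structural zeros plus ordinary Poisson zeros.
expected_p0 = pi + (1 - pi) * np.exp(-lam)
print(round((counts == 0).mean(), 3), round(expected_p0, 3))
```

The simulated proportion of zeros matches the ZIP formula, which is exactly the "excess zeros" the model is built to capture.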

Continue reading

Foursquare Location-Content-Aware Recommender System

Posted on June 26 2017 in Bayesian Statistics • Tagged with latent dirichlet allocation, hierarchical bayes, recommender system, statistics, r

Foursquare uses its unique location technology and foot traffic panel to produce personalized recommendations of spatial items such as restaurants. Recently I read the paper LCARS: A Spatial Item Recommender System and implemented it from scratch in R.

The recommender system combines the querying user's interest and the local preference …

Continue reading

Predicting Free Public WiFi Locations in Seoul with Point Process Models using Foursquare API

Posted on June 25 2017 in Spatial Statistics • Tagged with point process models, spatial statistics, r

Introduction

The field of spatial statistics is experiencing rapid growth, though less so than machine learning and artificial intelligence, driven by the advent of location technology and efficient solution techniques for high-performance computing, which have begun to replace classical frequentist inference procedures developed years …

Continue reading

Part-of-Speech Tagging with Trigram Hidden Markov Models and the Viterbi Algorithm

Posted on June 07 2017 in Natural Language Processing • Tagged with pos tagging, markov chain, viterbi algorithm, natural language processing, machine learning, python

Hidden Markov Model

The hidden Markov model (HMM) is a probabilistic sequence model that assigns a label to each unit in a sequence of observations. The model computes a probability distribution over possible label sequences and chooses the sequence that maximizes the probability of …
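Finding that maximizing label sequence is what the Viterbi algorithm does. Below is a minimal decoder for a *bigram* HMM, a simplified sketch of the trigram version the post covers; the tags and probability tables are toy values, not taken from the post.

```python
# Viterbi decoding for a bigram HMM: V[t][s] holds the probability of the
# best tag path ending in state s at position t, and back[t][s] the state
# that path came from, so the best sequence can be traced backwards.
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(obs[t], 0.0), p)
                for p in states)
            V[t][s], back[t][s] = prob, prev
    last = max(V[-1], key=V[-1].get)        # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):    # trace back the best path
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy two-tag example (illustrative probabilities):
states = ("NOUN", "VERB")
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1},
          "VERB": {"dogs": 0.1, "bark": 0.6}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # ['NOUN', 'VERB']
```

The trigram version conditions transitions on the previous two tags instead of one, which enlarges the state space but leaves the dynamic-programming recursion the same.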

Continue reading

Generating Movie Reviews in Korean with Language Modeling

Posted on March 23 2017 in Natural Language Processing • Tagged with language modeling, markov chain, natural language processing, python

MovieReviews

Statistical language modeling is the task of computing the probability of a sentence or sequence of words from a corpus. The standard approach is a trigram language model, where the probability of the next word depends only on the previous two words. But the state of the art as of this writing is achieved …
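The trigram assumption can be sketched in a few lines: estimate \(P(w \mid u, v)\) by maximum likelihood from trigram and bigram counts. The tiny English corpus below is purely illustrative, not the Korean movie-review data the post uses.

```python
from collections import defaultdict

# Toy corpus of tokenized sentences (illustrative only).
corpus = [["the", "movie", "was", "great"],
          ["the", "movie", "was", "boring"],
          ["the", "movie", "was", "great"]]

tri = defaultdict(int)   # count(u, v, w)
bi = defaultdict(int)    # count(u, v)
for sent in corpus:
    toks = ["<s>", "<s>"] + sent + ["</s>"]   # pad so every word has two predecessors
    for u, v, w in zip(toks, toks[1:], toks[2:]):
        tri[(u, v, w)] += 1
        bi[(u, v)] += 1

def p(w, u, v):
    # MLE: P(w | u, v) = count(u, v, w) / count(u, v)
    return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

print(p("great", "movie", "was"))   # 2 of the 3 continuations of "movie was"
```

Sampling the next word from these conditional distributions, starting from `("<s>", "<s>")`, is what turns the model into a text generator.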

Continue reading

My Goals for 2017

Posted on March 19 2017 in Notes • Tagged with personal

I know it's a little late to make my goals for this year explicit, but think of them as my goals for the remaining three quarters of 2017. 2016 was a year of transitions, as I left my first job to pursue my second in Korea. Moving from Boston to …

Continue reading

K-Nearest Neighbors from Scratch in Python

Posted on March 16 2017 in Machine Learning • Tagged with k-nearest neighbors, classification, python

MNIST

The \(k\)-nearest neighbors algorithm is a simple yet powerful machine learning technique used for classification and regression. The basic premise is to use the closest known data points to make a prediction; for instance, if \(k = 3\), we'd use the 3 nearest neighbors of a point in the test set …
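The premise fits in a few lines: rank training points by distance and take a majority vote among the \(k\) closest. The toy 2-D clusters below stand in for the MNIST digits the post uses.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]         # majority label

# Two well-separated toy clusters (illustrative data, not MNIST):
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X, y, np.array([0.15, 0.1])))   # 'a'
print(knn_predict(X, y, np.array([5.0, 5.1])))    # 'b'
```

For regression, the same neighbor set is averaged instead of voted on; with images like MNIST, each 28×28 digit is simply flattened into a 784-dimensional point before the distances are computed.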

Continue reading