First ever analytics project!!

Jose
4 min read · Jul 16, 2020

Soo Analytics…

Analytics deeply intrigues me. To me, there’s something almost magical in taking a bunch of numbers, processing them through a computer and then producing a conclusion from them. The process is also challenging, because at times the field can feel so ethereal that it is hard to pin down exactly what it is and what an analytics project should look like.

So in this, my very first entry on Medium, I intend to talk about my first project: how I approached it, what I was expecting to learn from it and what my findings were. I hope that anyone who reads it can give me insights to improve my skills and keep on learning about this wonderful field.

First some ground rules…

For this project I decided that I would not use any analytics libraries or frameworks on top of my Anaconda Python distribution. Why? Because I figured that, as I’m a student of Analytics and not a professional, I should make sure that I understood the basic concept of the algorithm I was implementing, and I should be able to explain every line of code written and every implementation decision taken. It is harder to accomplish this with frameworks because they come prepackaged and can become something of a “black box” whose results are hard to interpret. For this reason, I coded the entire model myself; the only Python modules I used were csv to read the data and matplotlib to plot a couple of nice graphs.

The Model

So for my first Analytics project I chose to answer this question: since I live in Colombia, is it possible to predict, given my gender and location, the most likely crime I would be a victim of? I chose three types of crime: rape, mugging and murder. They are violent crimes, and people should, in general, be concerned about encountering these types of assault. They are also crimes that are deeply influenced by location and gender, so naturally a model that includes these variables should be able to capture their influence on crime statistics.

The prediction is then governed by a Naïve Bayes algorithm.
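In standard notation, the rule can be written roughly as below. This is a reconstruction from the description in this post, not a copy of the original formula: m (the number of features) and q[i] (the i-th feature value of the query q) are notation assumed here.

```latex
M(q) = \arg\max_{l \,\in\, \mathrm{levels}(t)}
       \left( \prod_{i=1}^{m} P\big(q[i] \mid t = l\big) \right) \times P(t = l)
```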

Wow, math! Basically it means that my prediction M(q) comes from multiplying together the conditional probabilities of each feature value in the query q, given a certain target level t=l. This product is then multiplied by the overall probability of that target level occurring, P(t=l). My prediction is the target level for which this Naïve Bayes calculation gives the maximum value. To illustrate, my analytics dataset looks something like this:

So if I want to predict the most likely crime I would be a victim of, given that I’m a male and I live in Medellin, my prediction model would look like this:
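Written out for the three crimes, the calculation would take roughly the following form. The feature names gender and city and the label "score" are chosen here for illustration only; the structure follows the Naïve Bayes rule described above.

```latex
\begin{aligned}
\mathrm{score}(\text{rape})    &= P(\text{gender}=\text{male} \mid t=\text{rape})    \times P(\text{city}=\text{Medellin} \mid t=\text{rape})    \times P(t=\text{rape}) \\
\mathrm{score}(\text{mugging}) &= P(\text{gender}=\text{male} \mid t=\text{mugging}) \times P(\text{city}=\text{Medellin} \mid t=\text{mugging}) \times P(t=\text{mugging}) \\
\mathrm{score}(\text{murder})  &= P(\text{gender}=\text{male} \mid t=\text{murder})  \times P(\text{city}=\text{Medellin} \mid t=\text{murder})  \times P(t=\text{murder})
\end{aligned}
```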

The maximum of these three values gives me the most likely crime that I can expect to be a victim of if I’m a male in Medellin.
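To make the idea concrete, here is a minimal sketch of how such a prediction could be coded with plain Python and no frameworks. It is not the code from my project: the function names train and score are hypothetical, and the rows in ROWS are invented toy data, not figures from the police records.

```python
# Toy rows of (gender, city, crime). Invented for illustration only;
# these are NOT real figures from the police data.
ROWS = [
    ("male", "Medellin", "mugging"),
    ("male", "Medellin", "mugging"),
    ("male", "Medellin", "murder"),
    ("female", "Medellin", "mugging"),
    ("female", "Bogota", "rape"),
    ("male", "Bogota", "murder"),
    ("female", "Bogota", "mugging"),
    ("male", "Bogota", "mugging"),
]


def train(rows):
    """Count each crime, and each feature value within each crime."""
    target_counts = {}
    feature_counts = {}  # (feature index, value, crime) -> count
    for row in rows:
        *features, crime = row
        target_counts[crime] = target_counts.get(crime, 0) + 1
        for i, value in enumerate(features):
            key = (i, value, crime)
            feature_counts[key] = feature_counts.get(key, 0) + 1
    return target_counts, feature_counts


def score(query, target_counts, feature_counts):
    """Return the Naive Bayes score of every crime for one query."""
    total = sum(target_counts.values())
    scores = {}
    for crime, n in target_counts.items():
        s = n / total  # P(t = crime)
        for i, value in enumerate(query):
            # P(feature i = value | t = crime), estimated from counts
            s *= feature_counts.get((i, value, crime), 0) / n
        scores[crime] = s
    return scores


target_counts, feature_counts = train(ROWS)
scores = score(("male", "Medellin"), target_counts, feature_counts)
prediction = max(scores, key=scores.get)  # crime with the highest score
print(prediction)  # mugging (for this toy data only)
```

Note that this sketch applies no smoothing, so any feature value that never appears together with a crime drives that crime's score to zero.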

The Data

The Colombian National Police keeps extensive and clear records of a variety of crimes committed in the country, dating back to 2010. It is a rather complete data set and, thus, I figured it could be useful to answer my question.

The raw data structure is shown here. What we basically have is a number of Excel spreadsheets for each year and crime listed in the database. Each crime has records dating back 10 years.

The dataset contains a great number of features, notably the municipality in which the crime was committed, the date and time, whether the victim was male or female, and the victim’s age, amongst other relevant data. The abundance of features gave me the idea that I could probably predict the probability of a crime happening at a certain time, date and place. Each line of the data set records one instance of a crime, and each Excel spreadsheet is exclusive to a single type of crime, which means that the dataset looks something like this:

My first goal was to filter the data into a more manageable format. To that effect, I converted all the tables to CSV and deleted some of the redundant content, such as the letterheads and some metrics the police calculate (total number of crimes committed and such), which were of no particular use for my analysis. Afterwards I wrote a simple Python script to merge all the files into a single CSV, which I intended to use as my Analytics Base Table.
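A merge step like this could be sketched as follows, using only the csv module. This is an illustration under assumptions, not my actual script: the function name merge_crime_tables, the column names gender and city, and the in-memory sample tables are all hypothetical stand-ins for the real per-crime files.

```python
import csv
import io


def merge_crime_tables(sources, out_file, fields=("gender", "city")):
    """Merge per-crime tables into one table with an added 'crime' column.

    sources maps a crime label to an open CSV file (or any iterable of
    lines) whose first row is a header that contains the chosen fields.
    Columns not listed in fields are dropped.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(list(fields) + ["crime"])
    rows_written = 0
    for crime, lines in sources.items():
        for record in csv.DictReader(lines):
            writer.writerow([record[name] for name in fields] + [crime])
            rows_written += 1
    return rows_written


# In-memory stand-ins for two per-crime CSV files (made-up content).
mugging = io.StringIO("gender,city,total\nmale,Medellin,1\nfemale,Bogota,1\n")
murder = io.StringIO("gender,city,total\nmale,Bogota,1\n")

merged = io.StringIO()
count = merge_crime_tables({"mugging": mugging, "murder": murder}, merged)
print(merged.getvalue())
```

With real files, each source would be opened with open(path, newline="") as the csv module documentation recommends, and the crime label would come from whichever file the rows were read from.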

Now that we’ve discussed the basics, in my next entry I will talk about the fun part: results!!!

Jose

Engineer, Data and AI enthusiast. Amateur programmer