K-means Clustering and its real use-case in the Security Domain
What is K-means Clustering?
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. In unsupervised learning, you do not provide a target or outcome; instead, the model is trained on the stored historical data alone. Here, it forms clusters based on similarities in the data.
A cluster refers to a collection of data points aggregated together because of certain similarities. You’ll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of the cluster.
Every data point is allocated to exactly one cluster so as to reduce the within-cluster sum of squares, that is, the sum of the squared Euclidean distances between each point and the centroid of its cluster. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest one, while keeping the clusters as compact as possible.
The ‘means’ in K-means refers to averaging the data, that is, finding the centroid.
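To make the within-cluster sum of squares concrete, here is a minimal sketch in plain Python. The function names and the sample points are assumptions made for illustration; they do not come from any particular library or from the crime dataset discussed later.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Mean of a non-empty list of points, taken coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_ss(clusters):
    """Sum, over all clusters, of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(squared_distance(p, c) for p in points)
    return total


# Hypothetical 2-D points, grouped in two different ways.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
mixed = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(within_cluster_ss(tight))  # grouping nearby points together
print(within_cluster_ss(mixed))  # grouping distant points together
```

The grouping that keeps nearby points together yields the smaller value, which is the quantity K-means tries to reduce.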
Use-case in the Security Domain:
Crime analysis is defined as the set of analytical processes that provide relevant information about crime patterns and trend correlations, to assist personnel in planning the deployment of resources for the prevention and suppression of criminal activities. It is important to analyze crime for the following reasons:
1. Analyze crime to inform law enforcers about general and specific crime trends in a timely manner
2. Analyze crime to take advantage of the abundance of information existing in the justice system and the public domain. Crime rates change rapidly, and improved analysis can find hidden patterns of crime, if any, without explicit prior knowledge of these patterns.
The main objectives of crime analysis include:
- Extraction of crime patterns by analysis of available crime and criminal data
- Prediction of crime based on spatial distribution of existing data and anticipation of crime rate using different data mining techniques
- Detection of crime
ARCHITECTURE:
An open-source data mining tool is needed so that the approach can be implemented and the analysis carried out easily. Here, crime analysis is performed on a crime dataset by applying the k-means clustering algorithm in the RapidMiner tool.
The procedure is given below:
1. Take the crime dataset
2. Filter the dataset as required and create a new dataset containing only the attributes needed for the analysis
3. Open RapidMiner, read the Excel file of the crime dataset, apply the “Replace Missing Values” operator to it, and execute the operation
4. Apply the “Normalize” operator to the resulting dataset and execute the operation
5. Perform k-means clustering on the normalized dataset and execute the operation
6. In the plot view of the result, plot the data between crimes to obtain the required clusters
7. Analyze the clusters that were formed.
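Steps 3 and 4 of the procedure above can be sketched outside RapidMiner as well. The following plain-Python sketch is not the RapidMiner implementation; it assumes that missing values are replaced by the column mean and that normalization means min-max scaling to [0, 1], both of which are choices made here for illustration. The function names and the sample rows are hypothetical.

```python
def replace_missing_with_mean(rows):
    """Replace None entries with the mean of the known values in the same column."""
    columns = list(zip(*rows))
    means = []
    for col in columns:
        known = [v for v in col if v is not None]
        means.append(sum(known) / len(known) if known else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def min_max_normalize(rows):
    """Rescale every column to the range [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical rows: [thefts, assaults] per district; None marks a missing value.
raw = [[10, 2], [None, 4], [30, None], [20, 6]]
filled = replace_missing_with_mean(raw)
scaled = min_max_normalize(filled)
print(scaled)
```

Normalizing before clustering matters because k-means relies on Euclidean distance, so an attribute with a large numeric range would otherwise dominate the result.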
Approach Used:
K-means clustering is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
Process:
1. Initially, the number of clusters must be known; let it be k.
2. The first step is to choose a set of k instances as the centers of the clusters.
3. Next, the algorithm considers each instance and assigns it to the cluster whose center is closest.
4. The cluster centroids are recalculated either after a whole cycle of re-assignment or after each individual instance assignment.
5. This process is iterated until the assignments stop changing or a maximum number of iterations is reached.
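The process above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the project described below: it assumes the batch variant (centroids recalculated after a whole cycle of re-assignment), picks the initial centers at random, and uses hypothetical names and sample points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Batch k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k instances as the initial centers.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Step 3: assign every instance to its closest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid after the full re-assignment cycle.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centroids[j] = mean_point(members)
    return centroids, labels


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Which group receives label 0 depends on the random initial centers, so only the grouping itself should be relied on, not the label values.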
Example:
The dataset imported from the Chicago Police Department includes 6.54M rows of committed crimes from 2011 until the present and 22 columns (id, date, type, location…). We import this data into NetBeans and implement the k-means algorithm in Java, aiming for optimized code that is fast as well as accurate.
The next step is to choose the number of clusters, which is generally somewhat subjective. Because the dataset is very large, we chose 10 clusters for better prediction.
We can then visualize the clusters created by k-means on Google Maps to determine the crime-prone areas; each cluster is denoted by a number that indicates the crimes included in it. The map also shows where a specific crime occurs (for example, more theft incidents occur at tourist destinations). This is the main result, because it helps the police focus on particular areas and take more precautions to prevent future crimes and hence reduce the crime rate. All crime locations in the city can also be shown on the map for a better understanding of the situation.
An additional feature shows the percentage of a specific crime that occurs in a specific part of the city; for example, “rape” incidents are twice as frequent in villages as in cities. Performance-monitoring features are also needed to track whether the crime rate is increasing or decreasing. Finally, reports containing the crime rate graphs and map images must be generated and made available for download.
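The percentage feature described above amounts to simple counting once every incident carries a cluster label. The sketch below is written in Python for brevity rather than in the Java used by the project, and it is only an assumption about how such a feature could work; the function name, the crime type strings, and the sample records are hypothetical and are not taken from the Chicago data.

```python
from collections import Counter


def crime_distribution(records, crime_type):
    """Percentage of incidents of `crime_type` that fall in each cluster.

    `records` is an iterable of (cluster label, crime type) pairs.
    """
    per_cluster = Counter(cluster for cluster, kind in records if kind == crime_type)
    total = sum(per_cluster.values())
    if total == 0:
        return {}  # no incidents of this type were recorded
    return {cluster: 100.0 * n / total for cluster, n in per_cluster.items()}


# Hypothetical (cluster label, crime type) pairs.
records = [
    (0, "THEFT"), (0, "THEFT"), (0, "BATTERY"),
    (0, "THEFT"), (1, "BATTERY"), (1, "THEFT"),
]
theft_share = crime_distribution(records, "THEFT")
print(theft_share)
```

Running the same function on records from successive time periods would give a rough way to see whether the share of a crime in an area is rising or falling.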
K-means clustering is a widely used technique for cluster analysis, and it delivers training results quickly. However, its results are often less competitive than those of more sophisticated clustering techniques, because slight variations in the data can lead to high variance in the resulting clusters.
Conclusion:
This system clusters the crime dataset with the k-means algorithm for the purpose of crime analysis, visualizes the clustering results and crime incidents on a map, and prepares crime rate reports. The process is designed to help predict future crime based on the historical dataset. More functionality could be added to this project in future research to make it more useful and to provide more knowledge about crimes and offenders. Other algorithms could also be applied to crime analysis, both to better predict outcomes and to compare the performance and efficiency of various machine learning algorithms.