30th October

Hierarchical clustering is a method in data analysis and statistics used to group similar data points or objects into clusters. It’s called “hierarchical” because it organizes the data in a tree-like hierarchy, where clusters can contain sub-clusters.

Here’s how it works:

  1. Data Points: We start with our individual data points or objects. In our project, these could be individual records of police shootings.
  2. Pairwise Distances: Next we calculate the distances between all pairs of data points. In our case, this might be the geographical distance between different locations of police shootings.
  3. Agglomeration: Initially, each data point is considered as its own cluster. Then, the two closest clusters or data points are merged into a new cluster. This process continues iteratively, gradually forming a hierarchy of clusters.
  4. Dendrogram: As we merge clusters, we create a visual representation called a dendrogram. A dendrogram is like a tree diagram that shows how clusters are formed and nested within each other.
Hierarchical clustering helps us group data in a way that shows relationships and hierarchies. It’s like organizing objects based on their similarities, where smaller groups are nested within larger ones. This technique can be useful for exploring patterns and structures within our dataset, such as identifying different categories or groupings of police shootings.
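
To make this concrete, here is a minimal Wolfram Language sketch using randomly generated 2D points in place of the real shooting coordinates; FindClusters with the "Agglomerate" method and the Dendrogram function are standard Mathematica tools used here for illustration, not the project’s actual code.

  points = RandomReal[{0, 10}, {15, 2}];            (* toy points standing in for incident locations *)
  Dendrogram[points]                                (* tree diagram showing how clusters merge and nest *)
  FindClusters[points, Method -> "Agglomerate"]     (* the flat clusters produced by agglomerative merging *)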

27th October

The Monte Carlo approach is a method used to solve complex problems through random sampling. In simple terms, it’s like making repeated random guesses to find answers. We can use this approach with our dataset to simulate and analyze different scenarios. For example, if we’re studying the impact of a policy change on our dataset, we can create multiple random variations of our data, apply the policy change to each variation, and see how it affects the outcomes. By running many simulations, we get a sense of the range of possible results and their probabilities, helping us make informed decisions or predictions based on our dataset. It’s like estimating the odds of a die by rolling it many times, rather than working the probabilities out on paper.
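
A tiny illustrative sketch in Mathematica, not tied to the real dataset: estimating the probability of rolling a 6 by simulating many die rolls.

  rolls = RandomInteger[{1, 6}, 100000];      (* simulate 100,000 rolls of a fair die *)
  N[Count[rolls, 6]/Length[rolls]]            (* estimated probability; converges toward 1/6 *)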

23rd October

There are different methods of clustering: k-means, k-medoids, and DBSCAN.

k-means is a centroid-based clustering algorithm. It works by first randomly selecting k centroids, which are representative points of the data set. Then, each data point is assigned to the closest centroid. The centroids are then updated to be the average of the data points assigned to them. This process is repeated until the centroids no longer change.

k-medoids is a closely related partitioning algorithm, but instead of using the average of the data points assigned to a cluster, it uses the medoid, which is the most centrally located actual data point in the cluster. This makes k-medoids more robust to outliers than k-means.

DBSCAN is a density-based clustering algorithm. It works by identifying regions of high density in the data set. A data point is considered to be part of a cluster if it is within a certain distance (epsilon) of at least a minimum number of other data points (minPts). DBSCAN is able to identify clusters of arbitrary shape and size, and it is also robust to outliers.
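
As a rough sketch, all three methods can be tried through Mathematica’s FindClusters; the toy data below is made up, and DBSCAN is run with its default neighborhood settings rather than tuned epsilon and minPts values.

  data = RandomReal[{0, 10}, {100, 2}];            (* made-up 2D points *)
  FindClusters[data, 3, Method -> "KMeans"]        (* centroid-based, k = 3 *)
  FindClusters[data, 3, Method -> "KMedoids"]      (* like k-means but uses actual data points as centers *)
  FindClusters[data, Method -> "DBSCAN"]           (* density-based; no k required *)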

20th October

Cohen’s d is a statistic used to measure the size of the difference between two groups. Rather than telling us whether a difference is statistically significant, it tells us how large that difference is relative to the variability in the data (an effect size).

To calculate Cohen’s d, you need the means and standard deviations of the two groups. The formula subtracts the mean of one group from the mean of the other and divides by the pooled standard deviation. The result measures how far apart the two group means are in units of standard deviations. A larger Cohen’s d indicates a more substantial difference between the groups, while a smaller value suggests a smaller difference.

For example, a Cohen’s d of 0.5 indicates a moderate (medium) difference between the groups, and 0.8 or higher is conventionally considered a large difference. On the other hand, a value close to 0 suggests that there’s not much of a difference between the groups, and any variation is likely due to random factors.
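
A small sketch of the calculation in Mathematica, using made-up samples; the pooled standard deviation follows the standard formula described above.

  cohensD[a_List, b_List] := Module[{n1 = Length[a], n2 = Length[b], sPooled},
    sPooled = Sqrt[((n1 - 1) Variance[a] + (n2 - 1) Variance[b])/(n1 + n2 - 2)];
    (Mean[a] - Mean[b])/sPooled]

  group1 = RandomVariate[NormalDistribution[30, 8], 200];   (* made-up reference group *)
  group2 = RandomVariate[NormalDistribution[34, 8], 200];   (* made-up comparison group *)
  cohensD[group2, group1]                                   (* should come out near 0.5 *)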

18th October

In this dataset, there are a total of 7,085 locations where police shootings occurred, each represented by latitude and longitude coordinates. To work with these coordinates, we are using Mathematica. We first convert the latitude and longitude pairs into geographic position objects using the GeoPosition function, which lets us work with the data in a geographic context. We can then create a geographic plot of all these locations, which helps us visualize where the incidents took place on a map. Additionally, we can use the GeoHistogram function to create a geographic histogram showing the distribution of these locations. This information is valuable for understanding the geographic patterns of police shootings in the continental United States.
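
A minimal sketch of these steps, with a handful of made-up coordinate pairs standing in for the 7,085 real locations; GeoListPlot is assumed here as the mapping function, since the entry above only names GeoPosition and GeoHistogram.

  coords = {{34.05, -118.24}, {40.71, -74.01}, {41.88, -87.63}, {29.76, -95.37}};
  positions = GeoPosition /@ coords;   (* {lat, lon} pairs -> geographic position objects *)
  GeoListPlot[positions]               (* the locations plotted on a map *)
  GeoHistogram[positions]              (* binned view of where the locations concentrate *)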

16th October

Clustering is a technique in data analysis and machine learning used to group similar data points or objects together based on certain features or characteristics. The goal of clustering is to identify patterns and structures within the data, such that data points within the same cluster are more similar to each other than to those in other clusters.

The process of clustering police shooting locations in the continental US uses geodesic distance, which measures the distance between latitude and longitude coordinates along the Earth’s surface. In our case, Mathematica provides a convenient automatic clustering procedure called FindClusters. What sets it apart is that it doesn’t require us to predefine the number of clusters.

To calculate the geodesic distance between two locations from their latitude and longitude, we can use the GeoDistance function. It takes the coordinates of two points and calculates the distance between them, taking into account the curvature of the Earth’s surface.
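
A hedged sketch of both ideas on made-up coordinates; the way GeoDistance is wired into FindClusters as a DistanceFunction is an assumption for illustration, not the project’s actual code.

  la = GeoPosition[{34.05, -118.24}];
  ny = GeoPosition[{40.71, -74.01}];
  GeoDistance[la, ny]                  (* geodesic distance along the Earth's surface *)

  coords = {{34.05, -118.24}, {40.71, -74.01}, {41.88, -87.63}, {29.76, -95.37}};
  FindClusters[coords, DistanceFunction ->
    (QuantityMagnitude[GeoDistance[GeoPosition[#1], GeoPosition[#2]], "Kilometers"] &)]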

13th October

The project involves analyzing a dataset of fatal police shootings in the United States to gain insights and information about these incidents. The dataset contains information about individuals who were fatally shot by the police, including details such as their names, dates of the incidents, manner of death, whether they were armed, their age, gender, race, city, state, and more.

The dataset also documents the manner of death, specifying whether the individual was shot, shot and Tasered, or otherwise. It further classifies whether the individual was armed and with what, covering a range from guns and knives to being unarmed. The dataset provides insight into the presence of mental illness signs among the individuals and the perceived threat level they posed. It also notes whether the individuals were fleeing the scene and if body cameras were used during these encounters. Additionally, geographical coordinates (longitude and latitude) offer precise incident locations.
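
A hedged sketch of loading and inspecting such a file in Mathematica; the file name is hypothetical, and the column layout is assumed to follow the fields listed above.

  shootings = Import["fatal-police-shootings.csv", "Dataset", HeaderLines -> 1];   (* hypothetical file name *)
  shootings[[1 ;; 5]]        (* preview the first few records and their fields *)
  Length[shootings]          (* total number of incidents in the file *)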

11th October

For the second project we are using the k-NN method. k-nearest neighbors (k-NN) is a non-parametric supervised machine learning algorithm used for classification and regression tasks. In classification, k-NN predicts the class of a new data point based on the classes of its k nearest neighbors. In regression, it predicts the value of a new data point based on the values of its k nearest neighbors.

Here’s how it works:

  1. Imagine you have a bunch of data points in space, and each point represents something with certain characteristics.
  2. When you want to know something about a new data point, k-NN looks at the “k” nearest data points to that new point.
  3. It then checks what those nearest data points are like (their characteristics), and based on the majority opinion of those neighbors, it predicts or classifies the new data point, as sketched below.
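
A minimal sketch of k-NN classification in Mathematica on made-up two-class data; Classify’s "NearestNeighbors" method is used here purely as an illustration, not the project’s actual code.

  training = Join[
    Thread[RandomVariate[NormalDistribution[0, 1], {20, 2}] -> "A"],
    Thread[RandomVariate[NormalDistribution[3, 1], {20, 2}] -> "B"]];
  knn = Classify[training, Method -> "NearestNeighbors"];   (* nearest-neighbor classifier *)
  knn[{2.5, 2.5}]    (* class predicted from the new point's nearest neighbors *)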

4th October 2023

In this project, we began by preprocessing and transforming the data obtained from the Centers for Disease Control and Prevention (CDC) on U.S. county rates of diabetes, obesity, and inactivity (lack of physical activity). Following this, we merged the datasets on FIPS code and year, creating a consolidated dataset for comprehensive analysis. My primary focus was on exploring the relationship between the percentage of diabetic individuals and the percentages of obesity and physical inactivity. Using linear regression, we built a model to predict the percentage of diabetic individuals from the percentages of obesity and inactivity. Training the model involved splitting the data into training and testing sets, with predictions then made and evaluated on the test set. Visualizing the relationship between these variables was an important step, and we used a three-dimensional scatter plot to provide a comprehensive overview.
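
A hedged Wolfram Language sketch of the modeling step on synthetic data; the 80/20 split, the synthetic numbers, and the use of LinearModelFit and ListPointPlot3D are illustrative assumptions, since the project’s own code is not shown here.

  SeedRandom[1];
  obesity = RandomReal[{20, 45}, 300];
  inactivity = RandomReal[{15, 40}, 300];
  diabetes = 2 + 0.15 obesity + 0.10 inactivity + RandomVariate[NormalDistribution[0, 0.8], 300];
  rows = Transpose[{obesity, inactivity, diabetes}];        (* each row: {obesity %, inactivity %, diabetes %} *)

  {train, test} = TakeDrop[RandomSample[rows], 240];        (* 80/20 train/test split *)
  model = LinearModelFit[train, {x, y}, {x, y}];            (* diabetes % ~ obesity % + inactivity % *)
  predictions = model @@@ test[[All, 1 ;; 2]];
  Sqrt[Mean[(predictions - test[[All, 3]])^2]]              (* RMSE on the held-out test set *)

  ListPointPlot3D[rows, AxesLabel -> {"Obesity %", "Inactivity %", "Diabetes %"}]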

2nd October 2023 - Mon class

Regularization is a technique that prevents overfitting in statistical analysis. Overfitting occurs when a statistical model learns the training data too well and is unable to generalize to new data. Regularization works by penalizing the model for being too complex. This forces the model to keep only the most important features of the data, which makes it more likely to generalize well to new data. There are many different regularization techniques; two common ones are:

  • L1 regularization: L1 regularization penalizes the model for the sum of the absolute values of its coefficients. This pushes the less important coefficients toward exactly zero, so the model effectively keeps only the most important features.
  • L2 regularization: L2 regularization penalizes the model for the sum of the squared values of its coefficients. This shrinks the coefficients toward zero without usually making them exactly zero, which makes the model less likely to overfit the data (a small sketch of this follows below).
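
A minimal sketch of the L2 (ridge) idea via its closed-form solution on made-up data; the penalty weight lambda is chosen arbitrarily, and L1 is only described in a comment because it has no closed form.

  SeedRandom[2];
  X = RandomReal[{-1, 1}, {50, 3}];
  y = X . {1.5, -2.0, 0.5} + RandomVariate[NormalDistribution[0, 0.1], 50];
  lambda = 0.5;
  ridge = LinearSolve[Transpose[X] . X + lambda IdentityMatrix[3], Transpose[X] . y]
  (* larger lambda shrinks the coefficients toward zero (L2); an L1 penalty, lambda Total[Abs[coeffs]],
     instead tends to set some coefficients exactly to zero, but requires an iterative solver *)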