Here I used fatal police shooting locations and police station locations from two different datasets. Plotting both sets of locations on one map helps show the distance between each shooting location and the nearest police station. Police stations are shown in red and shooting locations in blue, and the zoom option allows a closer look. Clicking a red point shows the police station count number, and clicking a blue point shows the county or area name in a tooltip. From this visualization it appears that most of the police shootings took place near police stations.
A decision tree in statistical analysis is a graphical representation of a decision-making process. It is a tree-like model where each internal node represents a decision or test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label or a decision. Decision trees are used for both classification and regression analysis.
In a decision tree for classification, the goal is to classify an instance into one of several predefined classes. The tree is built by recursively splitting the data based on the most significant attribute at each node, creating decision rules that lead to the final classification.
In regression decision trees, the goal is to predict a continuous variable instead of a class label. The process is similar, with each node representing a decision based on an attribute, and the leaves containing the predicted values.
Decision trees are popular in machine learning and statistical analysis because they are easy to understand and interpret. They provide a visual representation of decision-making processes and can be a powerful tool for both analysis and prediction.
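To make the idea of "splitting on the most significant attribute" concrete, here is a minimal sketch of how a single node could choose its test. It assumes one numeric attribute, Gini impurity as the splitting criterion (other criteria exist), and a made-up toy dataset; the function names are hypothetical and not taken from any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all labels agree)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' test."""
    best = (None, float("inf"))
    # Skip the largest value: it would leave the right-hand branch empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: one numeric attribute and a binary class label.
ages = [18, 22, 25, 31, 40, 47, 52, 60]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
threshold, impurity = best_split(ages, labels)
print(threshold, impurity)
```

A full tree would apply the same search recursively to the left and right subsets until the leaves are pure or a depth limit is reached.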
A t-test is a statistical hypothesis test used to determine if there’s a significant difference between the means of two groups or datasets. In the context of our dataset on police shootings, we can apply a t-test to compare specific characteristics of two groups and assess whether the observed differences are statistically significant.
For example, suppose we want to investigate whether there’s a significant difference in the ages of individuals involved in police shootings based on gender (male and female). To do this, we must first separate our dataset into these two groups. Then, we can apply a t-test to evaluate whether the mean ages of these groups differ significantly. The null hypothesis in this case would be that there’s no difference in mean age between the two gender groups, and the alternative hypothesis would be that there is a difference.
By performing a t-test, we can quantify the extent of the difference and determine whether it’s likely due to random chance or if it represents a meaningful distinction between the two groups. This allows us to draw statistical conclusions about the impact of gender on the ages of individuals involved in police shootings, providing valuable insights based on the characteristics of our dataset.
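As a rough sketch of the calculation behind such a comparison, the code below computes Welch's t-statistic, a variant of the two-sample t-test that does not assume equal variances. The age lists are invented for illustration and are not values from the dataset. Turning the statistic into a p-value needs the t distribution, which the Python standard library does not provide, so that step is left out.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic and approximate degrees of freedom for two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical ages for the two groups.
male_ages = [34, 29, 41, 37, 25, 45, 31, 38]
female_ages = [36, 40, 33, 44, 39, 35]
t_stat, dof = welch_t(male_ages, female_ages)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A t-statistic far from zero, judged against the t distribution with the computed degrees of freedom, is what leads to a small p-value.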
ANOVA, or Analysis of Variance, is a statistical test used to compare the means of three or more groups or datasets to determine if there are statistically significant differences between them. It’s a powerful tool to assess whether variations between groups are due to actual differences or if they could have occurred by chance. In the context of our dataset of police shooting records, here’s how you can use ANOVA:
Suppose we want to analyze whether there are statistically significant differences in the ages of individuals involved in police shootings across different threat levels (e.g., “attack,” “other,” etc.). To do this, we must first divide our dataset into groups based on threat levels. Then, we can apply ANOVA to assess whether the mean ages differ significantly across these groups.
In this way, ANOVA helps you identify if there are meaningful distinctions in the ages of individuals involved in police shootings, which could provide valuable insights into whether certain threat levels are associated with specific age groups.
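The core of one-way ANOVA is the F-statistic, which compares the variation between group means with the variation inside the groups. Below is a minimal sketch of that calculation. The group values are invented for illustration, and the function name is hypothetical.

```python
def one_way_f(groups):
    """F-statistic for one-way ANOVA over a list of numeric groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical ages grouped by threat level.
ages_by_threat = {
    "attack": [31, 36, 42, 28, 39],
    "other": [35, 44, 30, 41, 37],
    "undetermined": [29, 33, 45, 38, 40],
}
f_stat = one_way_f(list(ages_by_threat.values()))
print(f"F = {f_stat:.3f}")
```

A large F value means the group means differ more than the within-group spread would suggest; the p-value comes from the F distribution with the two degrees of freedom computed above.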
Hierarchical clustering is a method in data analysis and statistics used to group similar data points or objects into clusters or groups. It’s called “hierarchical” because it organizes data in a tree-like or hierarchy structure, where clusters can contain sub-clusters.
Here’s how it works:
- Data Points: We start with our individual data points or objects. In our project, these could be individual records of police shootings.
- Pairwise Distances: Here we calculate the distances between all data points. In our case, it might be the geographical distances between different locations of police shootings.
- Agglomeration: Initially, each data point is considered as its own cluster. Then, the two closest clusters or data points are merged into a new cluster. This process continues iteratively, gradually forming a hierarchy of clusters.
- Dendrogram: As we merge clusters, we create a visual representation called a dendrogram. A dendrogram is like a tree diagram that shows how clusters are formed and nested within each other.
Hierarchical clustering helps us group data in a way that shows relationships and hierarchies. It’s like organizing objects based on their similarities, where smaller groups are nested within larger ones. This technique can be useful for exploring patterns and structures within our dataset, such as identifying different categories or groupings of police shootings.
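The steps listed above can be sketched in a few lines. This version assumes single linkage (the distance between two clusters is the distance between their closest members) and plain Euclidean distance on made-up 2D points; for real latitude and longitude data a geographic distance would be more appropriate. The list of merges it returns is the information a dendrogram draws.

```python
import math

def single_linkage(points):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest clusters
        del clusters[j]
    return merges

# Hypothetical 2D points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = single_linkage(points)
for a, b, d in merges:
    print(a, b, round(d, 2))
```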
The Monte Carlo approach is a method used to solve complex problems through random sampling. In simple words, it’s like making guesses repeatedly to find answers. We can use this approach with our dataset to simulate and analyze different scenarios. For example, if we’re studying the impact of a policy change on our dataset, we can create multiple random variations of our data, apply the policy change to each variation, and see how it affects the outcomes. By running many simulations, we can get a sense of the range of possible results and their probabilities, helping us to make informed decisions or predictions based on our dataset. It’s like letting a computer roll a die many times to understand the likelihood of different outcomes without having to physically roll it every time.
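The dice analogy can be turned into a small simulation. The sketch below estimates the probability that two dice sum to seven by repeating the experiment many times; the function name and the fixed seed are choices made for this illustration only.

```python
import random

def estimate_prob_sum_seven(trials, seed=0):
    """Estimate P(two dice sum to 7) by simulating `trials` rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials)
               if rng.randint(1, 6) + rng.randint(1, 6) == 7)
    return hits / trials

estimate = estimate_prob_sum_seven(100_000)
print(f"estimated: {estimate:.4f}, exact: {1 / 6:.4f}")
```

The estimate gets closer to the exact value of 1/6 as the number of trials grows, which is the same reason more simulations give a more reliable picture of possible outcomes for a dataset.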
There are different methods of clustering: k-means, k-medoids, and DBSCAN.
k-means is a centroid-based clustering algorithm. It works by first randomly selecting k centroids, which are representative points of the data set. Then, each data point is assigned to the closest centroid. The centroids are then updated to be the average of the data points assigned to them. This process is repeated until the centroids no longer change.
k-medoids works in a similar way, but instead of using the average of the data points assigned to a cluster as its center, it uses the medoid, which is the most centrally located actual data point in the cluster. This makes k-medoids more robust to outliers than k-means.
DBSCAN is a density-based clustering algorithm. It works by identifying regions of high density in the data set. A data point is treated as a core point of a cluster if at least a minimum number of other data points (minPts) lie within a certain distance (epsilon) of it, and points reachable from core points join the same cluster. DBSCAN is able to identify clusters of arbitrary shape and size, and it is also robust to outliers because points in low-density regions are labeled as noise.
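Of the three methods, k-means is the simplest to write out. The sketch below follows the loop described above (pick k starting centroids, assign each point to the nearest one, recompute the averages, repeat until nothing changes). The points are made up, and the function name and seed are assumptions for this example.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise average; keep the old one if a cluster is empty.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```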
Cohen’s d is an effect size statistic used to measure the size of the difference between two groups. Unlike a p-value, it does not tell us whether the difference is statistically significant; it tells us how large the difference is.
To calculate Cohen’s d, you need the means and standard deviations of two groups. The formula involves subtracting the mean of one group from the mean of the other and then dividing by the pooled standard deviation. The result gives a measure of how far apart the means of the two groups are in terms of standard deviations. A larger Cohen’s d indicates a more substantial difference between the groups, while a smaller value suggests a smaller difference.
For example, by the commonly used benchmarks, a Cohen’s d of about 0.2 indicates a small difference, about 0.5 a moderate difference, and 0.8 or higher a large difference. If it’s close to 0, the means of the two groups are nearly the same relative to the spread of the data.
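The formula described above is short enough to write out directly. This is a minimal sketch using the pooled sample standard deviation; the two lists are invented numbers chosen so the result is easy to check by hand.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

# Hypothetical groups: means 4 and 3, both with sample standard deviation 2.
group_a = [2, 4, 6]
group_b = [1, 3, 5]
d = cohens_d(group_a, group_b)
print(d)
```

Here the means differ by 1 and the pooled standard deviation is 2, so the groups are half a standard deviation apart.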
In this dataset, there are a total of 7,085 locations where police shootings occurred. These locations are represented by latitude and longitude coordinates. To work with these coordinates, we are using Mathematica, a mathematical software tool. We must then convert the latitude and longitude pairs into geographic position objects using the GeoPosition function. This allows us to work with the data in a geographic context. We can then create a geographic plot of all these locations, which helps us visualize where these incidents took place on a map. Additionally, we can use the GeoHistogram function to create a geographic histogram to see the distribution of these locations. This information is valuable for understanding the geographic patterns of police shootings in the continental United States.
Clustering is a technique in data analysis and machine learning used to group similar data points or objects together based on certain features or characteristics. The goal of clustering is to identify patterns and structures within the data, such that data points within the same cluster are more similar to each other than to those in other clusters.
The process of clustering police shooting locations in the continental US involves using geodesic distance, which calculates distances between latitude and longitude coordinates. In our case, Mathematica provides a convenient automatic clustering procedure called FindClusters. What sets it apart is that it doesn’t require us to predefine the number of clusters.
To calculate the geodesic distance, which represents the distance along the Earth’s surface, between two locations using latitude and longitude data, we can use the GeoDistance function. This function takes the coordinates of two points and calculates the distance between them, taking into account the curvature of the Earth’s surface.
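For readers without Mathematica, the idea of a distance along the Earth's surface can be approximated with the haversine formula. Note that this is a swapped-in technique: it treats the Earth as a perfect sphere with an assumed radius of 6371 km, so its results will differ slightly from GeoDistance. The coordinates below are illustrative only.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))

# Hypothetical pair of locations (latitude, longitude in degrees).
distance = haversine_km(42.36, -71.06, 40.71, -74.01)
print(f"{distance:.1f} km")
```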
The project involves analyzing a dataset of fatal police shootings in the United States to gain insights and information about these incidents. The dataset contains information about individuals who were fatally shot by the police, including details such as their names, dates of the incidents, manner of death, whether they were armed, their age, gender, race, city, state, and more.
The dataset also documents the manner of death, specifying whether the individual was shot, shot and Tasered, or otherwise. It further classifies whether the individual was armed and with what, covering a range from guns and knives to being unarmed. The dataset provides insight into the presence of mental illness signs among the individuals and the perceived threat level they posed. It also notes whether the individuals were fleeing the scene and if body cameras were used during these encounters. Additionally, geographical coordinates (longitude and latitude) offer precise incident locations.
For the second project we are using the k-NN method. K-nearest neighbors (k-NN) is a non-parametric supervised machine learning algorithm. It is used for classification and regression tasks. In classification, k-NN is used to predict the class of a new data point based on the classes of its k nearest neighbors. In regression, k-NN is used to predict the value of a new data point based on the values of its k nearest neighbors.
Here’s how it works:
- Imagine you have a bunch of data points in space, and each point represents something with certain characteristics.
- When you want to know something about a new data point, KNN looks at the “k” nearest data points to that new point.
- It then checks what those nearest data points are like (their characteristics), and based on the majority opinion of those neighbors, it predicts or classifies the new data point.
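The three steps above can be sketched as a small classifier. It assumes Euclidean distance and a simple majority vote; the training points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority label of the k nearest."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Hypothetical labelled points in 2D.
train = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 7)))
```

For regression, the majority vote would be replaced by the average of the neighbours' values.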
In this project, we began by preprocessing and transforming the data obtained from the Centers for Disease Control and Prevention (CDC) on U.S. county rates of diabetes, obesity, and inactivity (lack of physical activity). Following this, we merged the datasets based on FIPS code and year, creating a consolidated dataset for analysis. Our primary focus was on exploring the relationship between the percentage of diabetic individuals and the percentages of obesity and physical inactivity. Using linear regression, we built a model to predict the percentage of diabetic individuals based on the percentages of obesity and inactivity. Training the model involved splitting the data into training and testing sets, with predictions then made and evaluated on the test set. Visualizing the correlation between these variables was a pivotal step, and we used a three-dimensional scatter plot to give an overview of their relationships.
Regularization is a technique that helps prevent overfitting in statistical analysis. Overfitting occurs when a statistical model learns the training data too well, including its noise, and is unable to generalize to new data. Regularization works by penalizing the model for being too complex. This pushes the model to rely only on the most important features of the data, which makes it more likely to generalize well to new data. There are many different regularization techniques; two common ones are:
- L1 regularization: L1 regularization penalizes the model for the sum of the absolute values of its coefficients. This forces the model to learn only the most important features of the data, because it will be penalized for including too many features.
- L2 regularization: L2 regularization penalizes the model for the sum of the squared values of its coefficients. This forces the model to learn smaller coefficients, which makes it less likely to overfit the data.
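To see what the two penalties look like numerically, here is a minimal sketch. The penalty functions follow the definitions above; `ridge_slope` is a hypothetical helper showing the effect of the L2 penalty in the simplest possible case, a one-feature model with no intercept, where the penalized least squares solution has a closed form. The symbol `lam` (the penalty strength) is an assumption of this example.

```python
def l1_penalty(coefs, lam):
    """L1 penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefs)

def l2_penalty(coefs, lam):
    """L2 penalty: lam times the sum of squared coefficient values."""
    return lam * sum(c * c for c in coefs)

def ridge_slope(xs, ys, lam):
    """Slope b minimising sum((y - b*x)^2) + lam * b^2 (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Hypothetical data lying exactly on y = 2x.
xs, ys = [1, 2, 3], [2, 4, 6]
print(ridge_slope(xs, ys, lam=0))   # no penalty: ordinary least squares slope
print(ridge_slope(xs, ys, lam=14))  # with penalty: slope is shrunk toward zero
```

Increasing `lam` shrinks the coefficient further, which is the sense in which L2 regularization "forces the model to learn smaller coefficients".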
To picture cross-validation, imagine learning to play a game; to figure out how good you’re becoming, you decide to do a “5-round practice test.” In each round, you play the game a bit differently, learning from your mistakes and adjusting your strategy. After the 5 rounds, you look at your scores in each round and calculate the average score. This average score gives you a sense of how well you’re doing overall. In machine learning, we do something similar with cross-validation. Instead of training a model just once, we train it several times on different parts of our data, each time checking it on the part it did not see, and look at how well it does on average. It’s like making sure the model doesn’t just remember the training data but can also handle new situations effectively.
To picture the bootstrap method, suppose you want to know the average height of students in a school, but measuring everyone is too time-consuming. Here’s where the bootstrap idea comes in. Instead of measuring every student, we measure a random sample of students once. We then repeatedly draw new samples from those recorded heights, picking one height at a time and putting it back before the next pick (sampling with replacement), and note the average each time. By looking at how these averages vary, we can make a good guess about the average height of all students and about how uncertain that guess is. Bootstrap is like taking sneak peeks at smaller groups, putting them back into the “height pool,” and using those glimpses to figure out the average height without measuring every single student. It’s a clever way to estimate things without doing the whole task, making it much more practical, especially with large groups of data.
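The height example can be simulated in a few lines. The sketch below resamples a small, invented list of heights with replacement and reports a simple percentile interval for the mean; the function name, the number of resamples, and the seed are all assumptions made for this illustration.

```python
import random
import statistics

def bootstrap_mean_interval(sample, n_resamples=2000, seed=0):
    """Approximate 95% percentile interval for the mean via bootstrap resampling."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(sample, k=len(sample)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples)]
    return lower, upper

# Hypothetical heights (cm) of the students we actually measured.
heights_cm = [150, 152, 155, 158, 160, 161, 163, 165, 168, 170, 172, 175]
low, high = bootstrap_mean_interval(heights_cm)
print(f"sample mean = {statistics.mean(heights_cm):.1f} cm, interval = ({low:.1f}, {high:.1f})")
```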
In today’s class we learnt about how we use tools like Least Squares Error (LSE) and Mean Squared Error (MSE) to measure how close our predictions are to reality. LSE helps find the best-fit line, and MSE gives us an average of how much our predictions differ from the actual values. Training Error tells us how well our model learned from the data it was trained on, while Test Error checks if our model can handle new, unseen data. It’s like checking if we studied well for a test (Training Error) and if we can apply that knowledge to questions we’ve never seen before (Test Error). Keeping an eye on these measures helps us make our models better over time.
Imagine you’re teaching a computer to make predictions, like guessing a person’s height based on some information. Least Squares Error and Mean Squared Error are like tools that help the computer learn the best way to make these guesses. Training Error is like checking how well the computer memorized the examples it learned from, while Test Error is like testing if it can correctly guess the height of someone it’s never seen before. It’s a bit like making sure the computer not only learns from its training data but can also adapt to new challenges.
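Here is a minimal sketch of how training error and test error can be computed for a straight-line model. The data are made up, the split into training and test points is arbitrary, and `fit_line` and `mse` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def mse(xs, ys, intercept, slope):
    """Mean squared error of the line's predictions on (xs, ys)."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: the first five points are for training, the rest for testing.
train_x, train_y = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
test_x, test_y = [6, 7, 8], [12.3, 13.8, 16.4]

intercept, slope = fit_line(train_x, train_y)
print("training MSE:", mse(train_x, train_y, intercept, slope))
print("test MSE:", mse(test_x, test_y, intercept, slope))
```

A test error much larger than the training error is a sign that the model memorized its training examples rather than learning a pattern that carries over to new data.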
Cross-validation and bootstrap are two resampling methods used to evaluate statistical models: cross-validation is mainly used to estimate generalization error, and bootstrap to estimate the variance of parameter estimates. Cross-validation splits the data into folds, holds out one fold at a time for validation, and trains the model on the remaining folds. Bootstrap draws resamples of the data with replacement and trains the model on each resample. In K-fold cross-validation, the data is split into K folds and the model is trained on K-1 folds, using the remaining fold for validation. This process is repeated K times, and the average performance of the model on the validation folds is used to evaluate its overall performance.
Suppose we have a dataset of images of cats and dogs, and we want to train a machine learning model to classify the images correctly. We can use K-fold cross-validation to evaluate the performance of our model. First, we split the dataset into K folds, where K is a positive integer. Then, we train the model on K-1 folds and evaluate its performance on the remaining fold. This process is repeated K times, and the average performance of the model on the validation folds is used to evaluate its overall performance. For example, if we choose K=5, we would split the dataset into 5 folds. Then, we would train the model on 4 folds and evaluate its performance on the remaining fold. This process would be repeated 5 times, and the average performance of the model on the validation folds would be used to evaluate its overall performance.
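The fold bookkeeping described above can be sketched as follows. To keep the example self-contained, the "model" is deliberately trivial (it predicts the mean of the training values), which is an assumption of this sketch and not the image classifier from the example. In practice the data are usually shuffled before being split into folds.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) pairs for k folds over n items."""
    indices = list(range(n))
    # Fold sizes differ by at most one when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_val_mse(ys, k):
    """Average validation MSE of a mean-predictor 'model' across k folds."""
    scores = []
    for train, val in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)  # "train" on k-1 folds
        scores.append(sum((ys[i] - prediction) ** 2 for i in val) / len(val))
    return sum(scores) / len(scores)

# Hypothetical target values.
values = [3.0, 4.5, 5.1, 2.8, 4.0, 3.7, 5.5, 4.2, 3.9, 4.8]
print(cross_val_mse(values, k=5))
```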
Today I learnt about the t-test. A t-test is like a detective tool in statistics. Imagine you have two groups of data, and you want to know if they are really different or if it’s just a coincidence. The t-test helps you find out. It starts with a guess called the “null hypothesis,” which says there’s no real difference between the groups. Then, you collect data and calculate a special number called the “t-statistic.” This number shows how far apart the group means are relative to the variation within the groups. If the t-statistic is large in size, you get a small “p-value,” which is like a clue. A small p-value suggests that the groups are probably different for real, not just by chance. If the p-value is smaller than a number you choose (usually 0.05), you can say the groups are different, and you reject the null hypothesis. But if the p-value is big, it means the groups might not be that different, and you don’t have enough evidence to reject the null hypothesis. So, the t-test helps you decide if what you see in your data is a real difference or just a trick of chance.
In today’s class we learnt that a linear model can successfully capture the relationship between two variables even when those variables exhibit characteristics that do not follow traditional assumptions. Here both of the variables involved in the model are non-normally distributed, meaning their data points do not follow the familiar bell-shaped curve. Furthermore, these variables may be skewed, indicating an asymmetry in their distribution, and they could exhibit high variance, implying that data points are spread widely across the range. Additionally, high kurtosis suggests that the data has heavy tails or outliers.
The crab molt model relates to the process of molting in crabs, particularly focusing on the sizes of a crab’s shell before and after molting. “Pre-molt” indicates the size of a crab’s shell before it undergoes molting. “Post-molt” indicates the size of the crab’s shell after molting. A t-test is also performed to determine whether there is a significant difference between the means of the two groups.
Linear regression helps us understand the relationship between variables. The independent variables are also called predictor variables. When we use two or more predictor variables, we call it multiple linear regression. It extends simple linear regression by considering multiple predictors, allowing us to model complex relationships and make predictions based on the combined influence of these variables.
Correlation tells us if the two predictors are related to each other. If they go up together, it’s a positive correlation. If one goes up while the other goes down, it’s negative. Strong correlations mean a tight connection, while weak ones mean they’re not closely linked.
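A common way to put a number on this is the Pearson correlation coefficient, which ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). The sketch below computes it from scratch; the two lists are invented percentages, not values from any dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values for two predictors.
predictor_a = [18.2, 20.1, 22.4, 25.0, 27.3]
predictor_b = [14.9, 16.0, 17.8, 19.1, 21.5]
print(f"r = {pearson_r(predictor_a, predictor_b):.3f}")
```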
Sometimes, the relationship between variables isn’t a straight line; it’s more like a curve. The quadratic model helps us deal with these curvy relationships, like when things go up and then down or the other way around. It’s like using a flexible ruler instead of a rigid one.
In summary, linear regression with two predictors helps us make predictions based on multiple factors. Understanding the correlation between predictors helps us see how they’re connected, and the quadratic model allows us to handle more complex relationships in the data. These tools are valuable for making sense of real-world data.
Today I tried to analyze the data from the CDC (Centers for Disease Control) datasets. The datasets contain information about diabetes, obesity, and inactivity rates in different counties of various states in the year 2018.
We can use linear regression to understand the data for this project. By using this method we can understand how the variables diabetes, obesity, and inactivity are affected by other variables such as FIPS (Federal Information Processing Standards) code, county, and state. We can also examine the correlations and differences between them. Here diabetes, obesity, and inactivity are the dependent variables, and FIPS, county, and state are the independent variables. We also compute some summary statistics such as the mean, median, maximum, and minimum rates, and the standard deviation of the data, which help in understanding the data better. We can also visualize the data as a histogram to understand its distribution and to spot outliers. Outliers can also be detected using the best fit line from linear regression: they are the data points that lie far from the predicted values.
In today’s class, I learnt a new topic called the p-value, which is the probability of seeing data at least as extreme as what we observed, assuming the null hypothesis is true. The null hypothesis (H0) states that there is no relationship between variables. The alternative hypothesis (H1) states that there is a relationship between variables. A commonly used significance level is 0.05. For example, when we toss a fair coin, the probability of a tail is 0.5 and the probability of a head is 0.5. Let’s take the null hypothesis H0: the coin is fair, and H1: the coin is not fair. If we toss a coin 20 times and get a tail every time, the p-value becomes very small, which means such a result would be very unlikely with a fair coin, so we should reject the null hypothesis. If the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis; this does not prove that the coin is fair, only that the data give no strong evidence against it.
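The coin example can be worked out exactly. The sketch below computes a two-sided p-value for a binomial experiment by adding up the probabilities of all outcomes that are no more likely than the one observed, which is one common convention for "at least as extreme". The function name is hypothetical.

```python
from math import comb

def two_sided_binomial_p(n, k, p=0.5):
    """Exact two-sided p-value for observing k successes in n trials under H0."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Sum every outcome that is no more likely than the observed one
    # (the small factor absorbs floating-point rounding).
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))

p_all_tails = two_sided_binomial_p(20, 20)
p_half_tails = two_sided_binomial_p(20, 10)
print(p_all_tails)   # far below 0.05: reject "the coin is fair"
print(p_half_tails)  # far above 0.05: no evidence against a fair coin
```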
In the previous class I was confused about heteroscedasticity, but what I understood today is that heteroscedasticity is a condition where the data points keep fanning out from the best fit line; that is, the spread of the errors (residuals) changes, often increasing, as the value of the variable increases. In simple terms, heteroscedasticity occurs when the variance of the difference between actual and predicted values is not constant across the data. Looking at a plot of the residuals is a visual way of finding heteroscedasticity in the model. Statistically, we use the Breusch-Pagan test to detect the presence of heteroscedasticity in a linear regression model. This test uses a null hypothesis (the residual variance is constant, i.e. homoscedasticity) and an alternative hypothesis (the residual variance is not constant).
Steps to perform the Breusch-Pagan test:
1. Fit the linear regression model and compute the residuals.
2. Calculate the squared residuals of the regression model.
3. Fit a new linear regression model that uses the squared residuals as the response values, and record its R-squared (R^2).
4. Calculate the chi-square test statistic as n*R^2, where n is the total number of observations and R^2 is the R-squared of the new linear regression model.
If the p-value of this statistic is less than 0.05, we reject the null hypothesis and conclude that heteroscedasticity is present; otherwise we fail to reject the null hypothesis and have no evidence against homoscedasticity.
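The steps above can be sketched for the simplest case of a single predictor. With one predictor the test statistic is compared against a chi-square distribution with 1 degree of freedom, whose upper tail probability can be written with the complementary error function, so no extra libraries are needed. The data are constructed so that the error grows with x; all names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def r_squared(xs, ys):
    """R-squared of the least squares line of ys on xs."""
    intercept, slope = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def breusch_pagan(xs, ys):
    """Return (n * R^2, p-value) for a one-predictor Breusch-Pagan test."""
    intercept, slope = fit_line(xs, ys)                       # step 1
    squared_residuals = [(y - (intercept + slope * x)) ** 2   # step 2
                         for x, y in zip(xs, ys)]
    lm = max(0.0, len(xs) * r_squared(xs, squared_residuals))  # steps 3 and 4
    p_value = math.erfc(math.sqrt(lm / 2))  # chi-square upper tail, 1 degree of freedom
    return lm, p_value

# Hypothetical data whose error grows with x (heteroscedastic by construction).
xs = list(range(1, 21))
ys = [2 * x + 0.5 * x * (-1) ** x for x in xs]
lm, p_value = breusch_pagan(xs, ys)
print(f"n*R^2 = {lm:.2f}, p = {p_value:.6f}")
```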
In linear regression we build a model and use the best fit line to predict the value of one variable from the value of another, and we describe the relationship between them with the formula y=β0+β1x+ε, where x is the independent variable, y is the dependent variable, β0 is the intercept, β1 is the slope, and ε is the error term. The variable we predict is called the dependent variable and the variable we use to predict is called the independent variable. Because real data points do not fall exactly on a straight line, each point has an error, which is the vertical distance between the data point and the regression line, so we include the error term in the formula. We use the least squares method, which chooses the line that minimizes the sum of the squared errors.
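For a single predictor, the least squares method has a standard closed-form solution. In the notation below, the hats mark estimated values and the bars mark sample means over the n data points; these symbols are added here for illustration and extend the formula above.

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The slope is the covariance of x and y divided by the variance of x, and the intercept is chosen so the line passes through the point of means.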
We explored the CDC diabetes data, which has 3 variables: diabetes, obesity, and inactivity. To build a model that truly fits the data and is reliable, we need to find the relationship between the variables using linear regression and use R-squared to evaluate the model.
In today’s class I also got to know what a residual is and learned about a new topic, heteroscedasticity, which I will try to understand better by the next class.