29th September 2023- Friday class

A helpful analogy for cross-validation is learning to play a game. To figure out how good you’re becoming, you decide to do a “5-round practice test.” In each round, you play under slightly different conditions and note your score. After the 5 rounds, you average your scores, and this average gives you a sense of how well you’re doing overall. In machine learning, we do something similar with cross-validation. Instead of training and evaluating a model just once, we train it several times on different parts of our data and average how well it does on the parts it did not see. It’s like making sure the model doesn’t just remember the training data but can also handle new situations effectively.

As an example of the bootstrap method, suppose you want to know the average height of students in a school, but measuring everyone is too time-consuming. Here’s where the bootstrap idea comes in. Instead of measuring every student, we measure a random sample of students once. We then repeatedly draw students from that sample at random, recording each height and putting the student back (sampling with replacement) until we have a resample as large as the original sample, and we compute its average height. Repeating this many times gives a collection of averages. Their center is our guess for the average height of all students, and their spread tells us how uncertain that guess is. Bootstrap is like taking repeated sneak peeks at the “height pool” we already measured and using those glimpses to judge the estimate without measuring every single student. It’s a practical way to estimate things, especially when collecting more data is expensive.
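The height example can be sketched in a few lines of Python. This is only an illustration: the heights are made-up values, and `bootstrap_mean` is a hypothetical helper name, not something from class. It resamples the measured heights with replacement and reports the average of the resampled means together with a rough 95% percentile interval.

```python
import random
import statistics

def bootstrap_mean(sample, n_resamples=2000, seed=0):
    """Estimate the mean and a rough 95% interval by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n values from the sample, putting each one back after drawing it.
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return statistics.fmean(means), (low, high)

# Made-up heights (cm) of 12 sampled students, for illustration only.
heights_cm = [150, 152, 155, 158, 160, 161, 163, 165, 168, 170, 172, 175]
estimate, (low, high) = bootstrap_mean(heights_cm)
print(f"estimated mean: {estimate:.1f} cm, rough 95% interval: {low:.1f} to {high:.1f}")
```

The width of the interval is the useful part: a narrow interval suggests the sample pins down the average height well, while a wide one suggests more students should be measured.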

27th September 2023- Wed class

In today’s class we learnt how Least Squares Error (LSE) and Mean Squared Error (MSE) measure how close our predictions are to reality. The least squares criterion finds the best-fit line by minimizing the sum of squared errors, and MSE is the average of the squared differences between our predictions and the actual values. Training Error tells us how well our model fits the data it was trained on, while Test Error checks if our model can handle new, unseen data. It’s like checking if we studied well for a test (Training Error) and if we can apply that knowledge to questions we’ve never seen before (Test Error). Keeping an eye on both measures helps us make our models better over time.

Imagine you’re teaching a computer to make predictions, like guessing a person’s height based on some information. Least Squares Error and Mean Squared Error are like tools that help the computer learn the best way to make these guesses. Training Error is like checking how well the computer memorized the examples it learned from, while Test Error is like testing if it can correctly guess the height of someone it’s never seen before. It’s a bit like making sure the computer not only learns from its training data but can also adapt to new challenges.
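To make training and test error concrete, here is a small sketch using only the standard library. The data is synthetic (a made-up relationship between arm span and height), and the helper names `fit_line` and `mse` are my own. The line is fitted by least squares on the first 40 points, and MSE is then computed on those points (training error) and on 20 held-out points (test error).

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(xs, ys, intercept, slope):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data, for illustration only: height depends on arm span plus noise.
rng = random.Random(1)
arm_span = [rng.uniform(150, 190) for _ in range(60)]
height = [10 + 0.95 * x + rng.gauss(0, 3) for x in arm_span]

# First 40 points are for training, the last 20 are held out for testing.
train_x, train_y = arm_span[:40], height[:40]
test_x, test_y = arm_span[40:], height[40:]

intercept, slope = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, intercept, slope)
test_mse = mse(test_x, test_y, intercept, slope)
print(f"training MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}")
```

A test MSE far above the training MSE would be a sign that the model memorized its examples instead of learning a pattern that carries over to new data.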

25th September 2023 – Monday class

Cross-validation and bootstrap are two resampling methods used to evaluate statistical models; they are typically used to estimate generalization error and parameter variance, respectively. Cross-validation splits the data into folds, holds out one fold at a time for validation, and trains the model on the remaining folds. Bootstrap draws resamples of the data with replacement and fits the model on each resample. In K-fold cross-validation specifically, the data is split into K folds and the model is trained on K-1 folds, using the remaining fold for validation. This process is repeated K times, so that each fold is used for validation exactly once, and the average performance of the model on the validation folds is used to evaluate its overall performance.

Suppose we have a dataset of images of cats and dogs, and we want to train a machine learning model to classify the images correctly. We can use K-fold cross-validation to evaluate the performance of our model. First, we split the dataset into K folds, where K is an integer of at least 2. Then, we train the model on K-1 folds and evaluate its performance on the remaining fold. This process is repeated K times, with a different fold held out each time, and the average performance on the validation folds is used as the overall estimate. For example, if we choose K=5, we split the dataset into 5 folds, train the model on 4 of them, and evaluate it on the fifth. Doing this 5 times gives 5 validation scores, and their average is our estimate of how well the model performs.
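The K=5 procedure can be sketched as follows. To keep it runnable without any image data, each "image" is replaced by a single made-up feature value, and the model is a very simple nearest-centroid classifier; both are assumptions for illustration only, as are the helper names.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_centroids(xs, ys):
    """Toy model: remember the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, x):
    """Predict the class whose mean is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def cross_validate(xs, ys, k=5):
    """Return the validation accuracy obtained on each of the k folds."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i, then validate on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train_centroids([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        correct = sum(predict(model, xs[j]) == ys[j] for j in val_idx)
        scores.append(correct / len(val_idx))
    return scores

# Made-up one-dimensional "image feature" for 50 cats and 50 dogs.
rng = random.Random(42)
xs = [rng.gauss(4.0, 1.0) for _ in range(50)] + [rng.gauss(8.0, 1.0) for _ in range(50)]
ys = ["cat"] * 50 + ["dog"] * 50

scores = cross_validate(xs, ys, k=5)
mean_accuracy = sum(scores) / len(scores)
print("accuracy per fold:", scores)
print(f"mean accuracy: {mean_accuracy:.2f}")
```

Every data point lands in exactly one validation fold, so each one is used for validation once and for training K-1 times.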

22nd September 2023- Friday class

Today I learnt about the t-test. A t-test is like a detective tool in statistics. Imagine you have two groups of data, and you want to know if their means are really different or if it’s just a coincidence. The t-test helps you find out. It starts with a guess called the “null hypothesis,” which says there’s no real difference between the groups. Then, you collect data and calculate a special number called the “t-statistic.” This number shows how large the difference between the group means is compared to the variability within the groups. If the t-statistic is large in magnitude, you get a small “p-value,” which is like a clue. A small p-value means that a difference this large would be unlikely if the null hypothesis were true. If the p-value is smaller than a threshold you choose beforehand (usually 0.05), you reject the null hypothesis and say the groups are different. But if the p-value is big, you don’t have enough evidence to reject the null hypothesis. So, the t-test helps you decide if what you see in your data is a real difference or just a trick of chance.
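As a rough sketch of the calculation, the snippet below computes the t-statistic for two made-up groups using Welch's version of the t-test, which does not assume equal variances (my choice for the example, not necessarily the variant used in class). Turning the t-statistic into a p-value needs the t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) would report it directly.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic and approximate degrees of freedom for two samples."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up measurements for two groups, for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.2]
group_b = [6.1, 6.4, 5.9, 6.8, 6.3, 6.6, 6.2, 6.5]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, degrees of freedom = {df:.1f}")
```

A t-statistic far from zero, as in this example, corresponds to a small p-value, which is the case where the null hypothesis of equal means is rejected.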

20th September 2023- Wednesday class

In today’s class we learnt that a linear model can fit data well and capture the relationship between two variables even when those variables do not follow the traditional assumptions. In the example we looked at, both variables in the model are non-normally distributed, meaning their data points do not follow the familiar bell-shaped curve. Furthermore, these variables may be skewed, indicating an asymmetry in their distribution, and they can have high variance, meaning the data points are spread widely across the range. Additionally, high kurtosis suggests that the data has heavy tails or outliers.

The crab molt model describes molting in crabs, focusing on the size of a crab’s shell before and after molting. “Pre-molt” is the size of the shell before the crab molts, and “post-molt” is the size of the shell after molting. A t-test is also performed to determine whether there is a difference between the means of the two groups.

18th September 2023- Monday class

The linear regression method helps us understand the relationship between variables. The independent variables are also called predictor variables. When we use two or more predictor variables, we call it multiple linear regression. It extends simple linear regression by considering multiple predictors, allowing us to model more complex relationships and make predictions based on the combined influence of these variables.

Correlation tells us whether the two predictors are related to each other. If they go up together, it’s a positive correlation. If one goes up while the other goes down, it’s negative. Strong correlations mean a tight connection, while weak ones mean the variables are not closely linked.
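A small sketch of the Pearson correlation coefficient shows the positive and negative cases. The numbers are made up for illustration, and `pearson_r` is my own helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: between -1 and +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values, for illustration only.
x = [1, 2, 3, 4, 5]
rises_with_x = [2.1, 3.9, 6.2, 8.1, 9.8]
falls_with_x = [9.5, 8.1, 6.0, 3.9, 2.2]

print(f"positive correlation: {pearson_r(x, rises_with_x):.3f}")
print(f"negative correlation: {pearson_r(x, falls_with_x):.3f}")
```

Values close to +1 or -1 indicate a tight linear connection, and values near 0 indicate a weak one.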

Sometimes, the relationship between variables isn’t a straight line; it’s more like a curve. The quadratic model helps us deal with these curvy relationships, like when things go up and then down or the other way around. It’s like using a flexible ruler instead of a rigid one.

In summary, linear regression with two predictors helps us make predictions based on multiple factors. Understanding the correlation between predictors helps us see how they’re connected, and the quadratic model allows us to handle more complex relationships in the data. These tools are valuable for making sense of real-world data.
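For reference, the two kinds of model in this entry can be written as equations. The symbols are my own notation, assuming a response y, predictors x_1 and x_2 (or a single predictor x), coefficients β, and an error term ε:

```latex
% Multiple linear regression with two predictors
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon

% Quadratic model in a single predictor
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon
```

The quadratic model is still linear in the coefficients, so it can be fitted with the same least squares machinery by treating x^2 as an extra predictor.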

15th September 2023

Today I tried to analyze the data from the CDC (Centers for Disease Control) datasets. The datasets contain information about diabetes, obesity, and inactivity rates in different counties of various states in the year 2018.

We can use the linear regression method to understand the data for this project. With this method we can examine how the variables diabetes, obesity, and inactivity are affected by the other variables such as FIPS (Federal Information Processing Standards) code, county, and state. We can also look at the correlations and differences between them. Here diabetes, obesity, and inactivity are the dependent variables, and FIPS, county, and state are the independent variables. We also compute some summary statistics, such as the mean, median, maximum, and minimum rates and the standard deviation, which help in understanding the data better. We can also visualize the data as a histogram to understand its distribution and to spot outliers. Outliers can also be detected using the best-fit line of the linear regression: they are the data points that lie far from the value the line predicts.
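The summary statistics mentioned above can be computed with Python's `statistics` module. The rates below are made-up numbers, not values from the CDC data, and flagging points more than two standard deviations from the mean is just one simple rule of thumb that I am assuming here for illustration.

```python
import statistics

# Made-up county rates (%), for illustration only; not taken from the CDC data.
rates = [7.2, 8.1, 8.4, 8.8, 9.0, 9.3, 9.9, 10.4, 11.0, 17.5]

summary = {
    "mean": statistics.fmean(rates),
    "median": statistics.median(rates),
    "min": min(rates),
    "max": max(rates),
    "stdev": statistics.stdev(rates),  # sample standard deviation
}

# Assumed rule of thumb: flag values more than 2 standard deviations from the mean.
outliers = [r for r in rates if abs(r - summary["mean"]) > 2 * summary["stdev"]]

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print("possible outliers:", outliers)
```

Comparing the mean with the median is a quick check for skew: a single large value pulls the mean up much more than the median.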

13th September 2023- Wednesday class

In today’s class, I learnt a new topic called the p-value, which is the probability of getting a result at least as extreme as the one we observed, assuming the null hypothesis is true. The null hypothesis (H0) states that there is no relationship between the variables. The alternative hypothesis (H1) states that there is a relationship between the variables. The commonly used significance level is 0.05. For example, when we toss a fair coin, the probability of a tail is 0.5 and the probability of a head is 0.5. Let’s take the null hypothesis H0 – the coin is fair, and H1 – the coin is not fair. If we toss a coin 20 times and get a tail every time, the p-value keeps decreasing with each toss, which means something is off: such a run would be extremely unlikely with a fair coin, so we should reject the null hypothesis. If the p-value is 0.05 or larger, we fail to reject the null hypothesis. This does not prove the coin is fair; it only means we do not have enough evidence to say it is unfair.
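The coin example can be worked out exactly with the binomial distribution. The sketch below computes a two-sided p-value under the null hypothesis that the coin is fair: the probability of a tail count at least as far from half the tosses as the count we observed. The function name is my own.

```python
from math import comb

def two_sided_p_value(tails, tosses):
    """P(result at least as far from tosses/2 as observed), assuming a fair coin."""
    observed_deviation = abs(tails - tosses / 2)
    # Count the outcomes that are at least as extreme as the observed one.
    extreme = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - tosses / 2) >= observed_deviation
    )
    return extreme / 2 ** tosses

p_all_tails = two_sided_p_value(20, 20)  # 20 tails in 20 tosses
p_balanced = two_sided_p_value(11, 20)   # 11 tails in 20 tosses

print(f"20 tails out of 20: p = {p_all_tails:.7f}")
print(f"11 tails out of 20: p = {p_balanced:.4f}")
```

Twenty tails in a row gives a p-value far below 0.05, so the fair-coin hypothesis is rejected, while 11 tails out of 20 is entirely consistent with a fair coin.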

In the previous class I was confused about heteroscedasticity, but what I understood today is that heteroscedasticity is a condition where the data points keep fanning out from the best-fit line, i.e. the spread of the errors (residuals) increases as the value of the variable increases. In simple terms, heteroscedasticity occurs when the size of the difference between the actual value and the predicted value is not roughly constant but changes across the range of the predictor. Looking for this fan shape in a plot is a visual way of finding heteroscedasticity in the model. As a statistical way, we use the Breusch-Pagan test to detect the presence of heteroscedasticity in a linear regression model. This test uses a null hypothesis (the residuals have constant variance) and an alternative hypothesis (they do not).

Steps to perform the Breusch-Pagan test

1. Fit the linear regression model and compute the residuals.

2. Calculate the squared residuals of the regression model.

3. Fit a new linear regression model using the squared residuals as the response values and the original predictors as the independent variables.

4. Calculate the Chi-Square test statistic as nR²

where n is the total number of observations and R² is the R-squared of the new linear regression model.

If the p-value is less than 0.05, we reject the null hypothesis and conclude that heteroscedasticity is present. If not, we fail to reject the null hypothesis, and the data are consistent with homoscedasticity.
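The steps above can be sketched in plain Python for the simplest case of a single predictor. This is only an illustration on synthetic data whose noise grows with x; the helper names are my own. With one predictor the statistic nR² is compared to a chi-square distribution with 1 degree of freedom, whose upper-tail probability can be written with `math.erfc`.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def breusch_pagan(xs, ys):
    """nR^2 form of the Breusch-Pagan test for a single predictor."""
    # Steps 1 and 2: fit the model and square its residuals.
    b0, b1 = fit_line(xs, ys)
    squared_residuals = [(y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)]
    # Step 3: regress the squared residuals on the original predictor.
    a0, a1 = fit_line(xs, squared_residuals)
    r2 = r_squared(xs, squared_residuals, a0, a1)
    # Step 4: test statistic n * R^2 (clamped at 0 against rounding error).
    lm = max(len(xs) * r2, 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value

# Synthetic data, for illustration only: the noise level grows with x.
rng = random.Random(0)
xs = [1 + 9 * i / 199 for i in range(200)]
ys = [2 + 3 * x + rng.gauss(0, 0.5 * x) for x in xs]

lm_stat, p_value = breusch_pagan(xs, ys)
print(f"n * R^2 = {lm_stat:.2f}, p-value = {p_value:.2g}")
```

With more than one predictor the auxiliary regression in step 3 becomes a multiple regression and the degrees of freedom equal the number of predictors, so the chi-square tail would need a statistics library.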



11th September 2023 – Monday class

In linear regression we build a model and use the best-fit line to predict the value of one variable from the value of another, and we also describe the relationship between the variables with the formula y=β0+β1x+ε, where x is the independent variable, y is the dependent variable, β0 is the intercept, β1 is the slope, and ε is the error term. The variable we predict is called the dependent variable, and the variable we use to predict it is called the independent variable. Because real data points rarely fall exactly on a straight line, each point has an error, which is the vertical distance between the data point and the regression line, so we include the error term in the formula. We use the least squares method to choose the line that minimizes the sum of the squared errors.
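For a single predictor, the least squares estimates have a standard closed form. In the notation of the formula above, with x̄ and ȳ denoting the sample means (symbols I am adding for this note) and n data points:

```latex
% Least squares chooses the intercept and slope that minimize the sum of squared errors
\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2

% Resulting estimates
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The fitted line always passes through the point (x̄, ȳ), which is what the second estimate says.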

We explored the CDC diabetes data, which has 3 variables: diabetes, obesity, and inactivity. To build a reliable model that truly fits the data, we need to find the relationship between the variables using linear regression and use R-squared to evaluate the model.
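As a reminder of what R-squared measures, here is its usual definition, writing ŷ_i for the value the model predicts for observation i and ȳ for the mean of the observed values (my notation):

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A value near 1 means the model explains most of the variation in y, while a value near 0 means it does little better than always predicting the mean.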

Also, in today’s class I got to know what a residual is and was introduced to a new topic, heteroscedasticity, which I will try to learn more about and understand better by the next class.