Wednesday, December 2, 2015

Lab 5: Regression Analysis


Nathaniel Krueger

Part 1
    
       A news station is attempting to link the number of free school lunches to the crime rate, claiming that as the number of free school lunches increases, so does crime. Town X had a study done that collected the percent of kids who get free lunch and the crime rate per 100,000 people in each area. After running a linear regression in SPSS, the results do not support the news station's claim. The R-value of .416 indicates only a weak positive relationship between the crime rate and the percent of kids getting free lunch. The regression equation has the form y = a + bx, where y is the crime rate and x is the free lunch value. For a new area of town where 23.5% of kids get free lunch, the equation gives y = 21.819 + 1.685(.235) = 22.21. Because the correlation is weak, one would not be very confident in this predicted crime rate. The relationship also appears to be spurious: the news station assumed that there would be a connection between the two, but the regression does not support a meaningful link. The data discussed above are found in the tables below.
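The prediction above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the intercept and slope are the values reported in the text, while the function name and constant names are made up for illustration.

```python
# Coefficients as reported in the text from the SPSS output: y = a + b*x
A_INTERCEPT = 21.819  # constant (a)
B_SLOPE = 1.685       # slope (b)


def predict_crime_rate(free_lunch: float) -> float:
    """Return the predicted crime rate per 100,000 for a free lunch value x."""
    return A_INTERCEPT + B_SLOPE * free_lunch


# The text enters 23.5% as the proportion .235
predicted = predict_crime_rate(0.235)
print(round(predicted, 2))  # 22.21
```

Note that the result depends on how the free lunch variable is scaled. The text plugs in .235, so the sketch does the same.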







Part 2





 Intro
             The UW system has asked to have enrollment analyzed across its schools because the committee wants to know why students chose the schools that they did. The data that could be used for this is almost endless, so a few variables were selected and investigated in this portion of the lab. There is no way to truly determine why a student chose a particular school, but certain variables may help uncover clues as to why they made that choice. The schools investigated in this lab are UW Eau Claire and UW Green Bay. A series of regression analyses were done in SPSS to determine if there is a link between each variable and where students go to school.





Methods
              The operations described below required a fair amount of setup to get the data into a user-friendly form. The enrollment data for all the UW schools from all 72 counties was found on the Q drive. Along with the enrollment data there is data from a few different categories in order to broaden the findings. An education variable was given, which was the percent of people with a bachelor’s degree in each county. An income variable was given, which was the median household income per county. The final important data examined was the distance from each school to the center of each county and the number of students attending the different UW schools. The variables chosen to help find out why students selected the schools they did were median household income and percent of people with bachelor’s degrees. It is reasonable to infer that counties with more people holding bachelor’s degrees would have more people who also decide to go to college. The second variable is median household income. It is a common belief that the more education one has, the more money one will make, and this variable helps test that idea.

            From the lab 5 folder on the Q drive, the Microsoft Excel file called UW system was opened. Only the data that was needed was copied into a new Excel file in order to make it easier to work with: UW Eau Claire and UW Green Bay enrollments, median household income, population, and percent of people with bachelor’s degrees. The first step was to normalize county population by the distance from each university. To do this, the county population was divided by the distance from the university, once for UW Eau Claire and once for UW Green Bay. With this done, the regression analyses could be run in SPSS. Under the Analyze tab there is a Regression submenu, and under Regression the Linear option was selected. The number of students attending the university being investigated was always used as the dependent variable. Three separate regressions were run for each school. The independent variables used in the regression analyses were Population/Distance, percent with a bachelor’s degree, and median household income.
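The population/distance normalization described above is a simple division per county. A minimal sketch is shown below. The county names, populations, and distances are invented placeholders and are not the lab's data, and the distance unit (miles) is an assumption.

```python
# Illustrative values only; these are NOT the lab's data.
counties = [
    {"name": "County A", "population": 100_000, "distance_miles": 50.0},
    {"name": "County B", "population": 30_000, "distance_miles": 120.0},
    {"name": "County C", "population": 60_000, "distance_miles": 15.0},
]


def population_over_distance(population: float, distance: float) -> float:
    """Normalize county population by its distance from the university."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return population / distance


# Add the normalized variable that would be used as an independent variable.
for county in counties:
    county["pop_dist"] = population_over_distance(
        county["population"], county["distance_miles"]
    )
    print(county["name"], county["pop_dist"])
```

A large, nearby county gets a high value and a small, distant county gets a low value, which is the behavior the normalized variable is meant to capture.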

            For the purposes of this investigation, only variables that had statistical significance were mapped in ArcMap. A variable was deemed statistically significant when the null hypothesis, that there is no relationship between the variable and enrollment, was rejected. After the regression analyses were completed, four of the six regressions were found to be statistically significant. To map these in ArcMap, each regression was run again, but this time the residuals were saved for each county. This made it possible to use the data in ArcMap. The Excel data was joined to the county layer in ArcGIS by GeoID, because the county names varied a little bit (mostly in uppercase letters) and caused problems when joining on names.
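To make the idea of saved residuals concrete, here is a rough sketch of a one-variable least squares fit and the residual for each observation. SPSS was used in the lab itself. The numbers below are invented for illustration and the function names are hypothetical.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed value minus predicted value for each observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up example: one independent variable and student counts for 4 counties.
pop_dist = [1.0, 2.0, 3.0, 4.0]
students = [3.0, 5.0, 6.0, 10.0]

a, b = fit_simple_ols(pop_dist, students)
res = residuals(pop_dist, students, a, b)
print(a, b)
print(res)
```

A positive residual means a county sends more students than the regression predicts, and a negative residual means it sends fewer. These are the values that get joined to the county layer and mapped.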







Results
            Below are the tables that display the regression analyses run for each variable in SPSS and four maps that display the variables that had statistical significance. Below the maps are the two regression analyses that were deemed not statistically significant.
Map 1: Eau Claire Percent of People with Bachelors Degree
     

            Map 1 above displays the percent of people in each county that have bachelor's degrees in relationship to Eau Claire. The darker colors are more standard deviations above the average, so the darker the shade, the further above the average the county is. This helps to illustrate where a good number of the students attending Eau Claire are coming from. For example, Dane County, which contains the state capital, is well over one standard deviation above the average, and this could be attributed to the fact that the Madison area tends to have a higher median income than surrounding areas.


Table 1: Eau Claire Percent of People with Bachelors Degree
         

           Table 1 above relates to Map 1 and gives some basis for what the map is displaying. The constant (B) is -126.472, which is why Eau Claire County is below the average on the map. One possible explanation is that a good percentage of the work force in Eau Claire and the surrounding areas commutes from out of county. The significance level of .003 leads us to reject the null hypothesis: the percentage of people with bachelor's degrees in a county is in fact related to the number of students it sends to Eau Claire.














Map 2: Eau Claire Population/ Distance
       

             Map 2 above displays the population of each county divided by the distance from the center of the county to UW Eau Claire, which in turn gives a relative idea of how many people live close to the university. This map is extremely similar to Map 1. In fact, at first glance the only variation is in the southeast corner of the state.



Table 2: Eau Claire Population/ Distance
         With an R square value of .945 coupled with a significance level of .03, one can see why this variable is significant. There is a positive constant value of 8.518. This points to a positive relationship between population/distance and the number of students attending UW Eau Claire.








Map 3: Green Bay Population/ Distance 


          Map 3 above illustrates the Green Bay population divided by distance, which gives the relative position of the university in relationship to the rest of the counties in the state. In the map above, the dark red color is well above the average and the dark orange is just slightly over the mean. The light orange is under the mean and the yellow is even further under the average. What can be taken from this map is that Green Bay is more towards the center of its county in comparison to the other counties in Wisconsin.



Table 3: Green Bay Population/ Distance



         Table 3 above goes with the map directly above it. An R square value of .961 shows that there is a strong relationship for Green Bay's Population/Distance, which in turn means that the university is more in the center of the county by population compared to other counties. This is most likely because Green Bay is the biggest city in Brown County and the university is located within the city.







Map 4: Green Bay Median Household Income



         Map 4 shows many similarities with Map 3 at first glance. Counties with a higher median household income appear to be sending their kids to Green Bay. Brown County is a dark red color, which shows that it is over the average for median household income.


Table 4: Green Bay Median Household Income


         Table 4 displays the regression for median household income for Green Bay. The significance value of .044 is very close to the .05 cutoff that is used when evaluating variables at a 95% confidence level.


Table 5: Eau Claire Median Household Income


        Table 5 above shows the median household income regression for Eau Claire. At a 95% confidence level, a significance level of .104 was deemed not statistically significant.





Table 6: Green Bay Percent of People with a Bachelors Degree


          Table 6 above is the regression for the percent of people with a bachelor's degree, linked with UW Green Bay. A significance level of .085 at 95% confidence was deemed not statistically significant.






Conclusion



           After examining multiple variables for two different UW schools, one would conclude that each school yielded very similar maps across its own variables. The Eau Claire maps were all similar to one another, as were the Green Bay maps, but the maps were drastically different when comparing the two schools. The regression output suggested that there was in fact a link between where students chose to go to school and the variables, though there could potentially be thousands of other reasons why they chose to go to school where they did. This was a good exercise in thinking critically about all that goes into selecting a college, and it helped build familiarity with SPSS and ArcGIS. An extended study could be done with many more variables to find a truer link and give more concrete reasoning as to why students chose the school they did. I think that ultimately distance and population are large influences on the appeal of a certain school. Median household income is also important because it can more or less tell you which counties on average have enough money to send their kids to school.























Wednesday, November 11, 2015

Lab 4 Correlation and Spatial Autocorrelation


        Lab 4 displays a wide array of skills involving correlation and spatial autocorrelation. This lab helped to demonstrate skills in Excel: entering data and making a scatterplot with a labeled trend line. The program SPSS was also used to run correlations, and the correlations were then put into a scatterplot and discussed. Additional practice was gained in using the U.S. Census website to download data and shapefiles. From the Census data, GEOIDs were identified to help give an exact location to the data. Throughout this lab various data sets were joined so the data could be displayed more easily on one map.

Part 1: Correlation

          Table 1 shows distance in relation to sound level. The null hypothesis states that there is no linear relationship between distance and sound level. There is a fairly strong negative relationship between distance in feet and sound level in decibels: as the distance increases, the sound level goes down. The significance level makes us nearly 100% sure that there is a negative correlation between distance and sound. The Pearson correlation of -.896 confirms that there is in fact a strong negative correlation between distance and sound.
Table 1: Correlation between Distance in Feet and Sound level in Decibels
Scatterplot 1: Comparison Between Distance and Sound Level
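
As a rough illustration of what SPSS computes for the Pearson correlation, the sketch below implements the formula directly. The distance and sound values are invented for the example and are not the lab's data, so the result will not match the -.896 reported above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Made-up example values: sound level falls as distance grows.
distance_ft = [10, 20, 30, 40, 50]
sound_db = [90, 80, 75, 60, 58]

r = pearson_r(distance_ft, sound_db)
print(round(r, 3))
```

A value near -1 means the points fall close to a downward sloping line, which is the pattern described for distance and sound level.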

Part 2: Census Tracts and Population in Detroit, MI
            
           For part 2 there is a correlation matrix that displays how various ethnicities relate to several other variables. The categories include median household income, number with a bachelor’s degree, median home value, and the number that work in manufacturing, retail and finance. The four ethnicities in the matrix are White, Black, Asian, and Hispanic. The median home value is highest for Whites, then Asians, followed by Blacks and Hispanics. The same trend holds for the number of people with bachelor’s degrees and for median household income.
            The matrix does a good job of showing how education and income are very different across ethnicities. What can be inferred is that, generally speaking, Whites tend to be the wealthiest, then Asians, then Blacks, and lastly Hispanics.
Table 2: Correlations between Ethnicity in Detroit Michigan 

Part 3: Spatial Autocorrelation

         Introduction: In this section, data from the 1980 and 2012 elections is available from the Texas Election Commission (TEC). The data given is the percent of Democratic votes for both elections as well as the voter turnout for each election. The TEC wants to know if there are voting patterns and clusters throughout the state and if they have changed over the 32 year span. To determine the differences between 1980 and 2012, GeoDa and SPSS were both used to create scatterplots showing Moran’s I and LISA maps displaying whether there was a high-high, low-low, low-high or high-low relationship.

        Methodology: The first step was to go to the US Census website to obtain data on the Hispanic population in 2010 because it was not given. Under the advanced search option on the left, all the counties within Texas were selected along with the Hispanic populations for 2010. Not all of the downloaded data was needed, so all but one column was deleted in order for the later join to be successful. The county shapefile was downloaded from the Census site and the data was joined to it in ArcMap using the Geo ID instead of a FIPS code. Once the joins were complete, the data was exported as a new shapefile that is compatible with GeoDa.
Next, in GeoDa, the new Texas shapefile was opened under File, then Open Project. The goal was to see if there is spatial autocorrelation for both elections, voter turnout and Hispanic population. In order to do this, a spatial weights file had to be created. Under the Tools tab, Weights was selected and then Create. Next an ID variable was added, and Poly ID was used as the new variable name. This step only had to be done once. Once the weights were created, the Moran’s I scatterplots and LISA cluster maps could be created.
Two types of tests were done using the data: Moran's I and LISA (local indicators of spatial association) maps. To get the Moran’s I scatterplot, the corresponding tool at the top of the screen was selected, the desired variable was clicked, and the weights created above were used. The same step was then repeated for the remaining variables. Next came the LISA cluster maps, which use another icon near the top with similar steps to the Moran’s I. The maps below display four different colors: dark red, light red, light blue and dark blue. Red is high and blue is low: dark red is high-high, dark blue is low-low, light red is high-low, and light blue is low-high.
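
To show what the Moran's I statistic measures, here is a small sketch of the global Moran's I formula on a toy example. GeoDa did the real calculation in the lab. The four "counties" in a row, their values, and the simple 0/1 neighbor weights are all assumptions made for illustration (GeoDa normally row-standardizes its weights).

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    # Sum of all weights (often written S0)
    s0 = sum(sum(row) for row in weights)
    # Cross products of deviations for every pair of neighbors
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in deviations)
    return (n / s0) * (cross / variance_sum)


# Toy example: 4 areas in a row, each one a neighbor of the areas next to it.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 2, 3, 4], chain_weights)      # similar values side by side
alternating = morans_i([1, -1, 1, -1], chain_weights)  # high next to low

print(clustered, alternating)
```

Positive values mean similar values cluster together (high-high and low-low), and negative values mean high and low values sit next to each other.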

Results:
Voter Turnout 1980
Voter Turnout 1980

Voter Turnout 2012
Voter Turnout 2012
Hispanic Population 2010
Hispanic Population 2010
Democratic Vote 1980

Above is the Moran's I scatterplot, which shows a relatively strong positive relationship for the 1980 Democratic presidential election data.
Democratic Vote 1980

Above is the LISA map for the 1980 election displaying the Democratic votes. In the northwestern part of Texas we see a cluster of blue counties, which represent low-low.

Democratic Vote 2012

The scatterplot shows a strong positive trend in the data for the 2012 Democratic Presidential Data.
Democratic Vote 2012

In the map above, the low-low Democratic votes shifted slightly from the northwest to the center of Texas. There are also more red counties in the south in comparison to the data from 1980, meaning that there were more Democratic votes in the south.


Conclusion: There are many things that can be taken from the LISA maps and Moran’s I. There are patterns of voting in the state of Texas, and over the 32 year span there has been clustering throughout the state. First, when examining the Hispanic population, it is easy to see that the high clusters on the LISA maps tend to be in the southernmost parts of Texas. This is most likely because it is the area closest to Mexico. In 1980 there was a high voter turnout in the northern part of the Texas panhandle. A similar pattern continues in 2012, though the northern cluster is considerably smaller than in 1980. An interesting pattern occurs when looking at the Democratic votes throughout the state. In 1980 the southern part of Texas voted majority Democratic and the north tended to be low-low. In 2012 the pattern becomes even more defined: more votes shifted toward the Democrats in the south and more went low in the north. Also, from 1980 to 2012 the low Democratic vote cluster shifted from the western part of the state to a more central location.











Monday, October 26, 2015

Lab 3: Significance Testing





Part 1: T & Z Tests

Below are some terms and operations that are crucial to understanding what was done later in the lab. Both the calculations and the terminology are needed to determine the differences between Northern and Southern Wisconsin.

     Interval Type   Confidence Level   n     Sig. Level   z or t   z or t value
A    Two Tailed      90                 45    0.10         Z        ±1.65
B    Two Tailed      95                 12    0.05         T        ±2.201
C    One Tailed      95                 36    0.05         Z        1.65
D    Two Tailed      99                 180   0.01         Z        ±2.58
E    One Tailed      80                 60    0.20         Z        0.84
F    One Tailed      99                 23    0.01         T        2.508
G    Two Tailed      99                 15    0.01         T        ±2.977
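
The Z rows in the table above can be checked with Python's standard library. This is a sketch rather than the method used in the lab, which relied on printed tables. The helper name `z_critical` is made up for illustration, and the T rows are not covered because the standard library has no t distribution.

```python
from statistics import NormalDist


def z_critical(confidence_level: float, tails: int) -> float:
    """Critical z value for a confidence level (e.g. 0.95) and 1 or 2 tails."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    alpha = 1.0 - confidence_level
    # A two-tailed test splits alpha between both tails.
    return NormalDist().inv_cdf(1.0 - alpha / tails)


print(round(z_critical(0.90, 2), 3))  # row A, two tailed
print(round(z_critical(0.95, 1), 3))  # row C, one tailed
print(round(z_critical(0.99, 2), 3))  # row D, two tailed
print(round(z_critical(0.80, 1), 3))  # row E, one tailed
```

For a two-tailed row the value applies on both the positive and negative side.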



A Department of Agriculture and Live Stock Development organization in Kenya estimates that yields in a certain district should approach the following amounts in metric tons (averages based on data from the whole country) per hectare: groundnuts, 0.50; cassava, 3.70; and beans, 0.30.  A survey of 100 farmers had the following results:

                            μ        σ
            Ground Nuts     0.40     1.07
            Cassava         3.40     1.42
            Beans           0.33     0.14

a.       Test the hypothesis for each of these products.  Assume that each are 2 tailed with a Confidence Level of 95% *Use the appropriate test

b.      Be sure to present the null and alternative hypotheses for each as well as conclusions

c.       What are the probabilities values for each crop? 

d.      What are the similarities and differences in the results


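A.    Z-Score = (Sample Mean - Country Mean) / (Standard Deviation / Sqrt(n)). At a 95% confidence level with a two tailed test, the critical values are ±1.96, so the null hypothesis is rejected when the Z-score is below -1.96 or above 1.96.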


Ground Nuts= -.9346

Fail to Reject

Cassava= -2.1127

Reject

Beans= 2.1429

Reject


B.     The null hypothesis is that, at a 95% confidence level, there is no difference between the average crop yield for Kenya as a whole and the average yield of the 100 sampled farmers (ground nuts, cassava, beans).


The alternative hypothesis is that, at a 95% confidence level, there is a difference between the average yield of the 100 sampled farmers and the average crop yield for Kenya as a whole (ground nuts, cassava, beans).


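C.     Ground Nuts: Z = -.9346, two tailed probability of about .350 (No difference)

Cassava: Z = -2.1127, two tailed probability of about .035 (Difference)

Beans: Z = 2.1429, two tailed probability of about .032 (Difference)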


D.    There are similarities and differences in the data that I calculated. Two out of the three crops fell outside of the range of ±1.96, which classifies them as different from the country average. As far as Z-scores go, the numbers varied more. Cassava, at -2.1127, was about two standard errors below the country average. Ground nuts, at -.9346, was almost one standard error below the country average, which is not enough to be significant. Beans, at 2.1429, was over two standard errors above the country average. So cassava and beans are similar in that both are significantly different, but they differ in direction.
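
The Z-score calculations for the three crops can be reproduced with the sketch below. The means, standard deviations and n = 100 come from the problem statement. The function names are made up for illustration, and the two tailed probability is computed from the standard normal distribution with `math.erf`.

```python
from math import erf, sqrt


def z_score(sample_mean, country_mean, std_dev, n):
    """Z = (sample mean - country mean) / (standard deviation / sqrt(n))."""
    return (sample_mean - country_mean) / (std_dev / sqrt(n))


def two_tailed_p(z):
    """Two tailed probability for a z-score under the standard normal curve."""
    phi = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    return 2.0 * (1.0 - phi)


N_FARMERS = 100
CRITICAL_Z = 1.96  # two tailed, 95% confidence level

# crop: (sample mean, country mean, standard deviation)
crops = {
    "ground nuts": (0.40, 0.50, 1.07),
    "cassava": (3.40, 3.70, 1.42),
    "beans": (0.33, 0.30, 0.14),
}

results = {}
for crop, (sample_mean, country_mean, std_dev) in crops.items():
    z = z_score(sample_mean, country_mean, std_dev, N_FARMERS)
    reject = abs(z) > CRITICAL_Z
    results[crop] = (z, two_tailed_p(z), reject)
    print(crop, round(z, 4), round(results[crop][1], 3), "Reject" if reject else "Fail to Reject")
```

The rejection decision only depends on whether the Z-score falls outside ±1.96, which is the same as the two tailed probability falling below .05.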




An exhaustive survey of all users of a wilderness park taken in 1960 revealed that the average number of persons per party was 2.8.  In a random sample of 25 parties in 1985, the average was 3.7 persons with a standard deviation of 1.45 (one tailed test, 95% Con. Level) (5 pts)


a.       Test the hypothesis that the number of people per party has changed in the intervening years.  (State null and alternative hypotheses)

b.      What is the corresponding probability value


A.    The null hypothesis at a 95% confidence level is that there is no difference in the average number of people per party in 1960 in comparison to the 1985 sample.

The alternative hypothesis at a 95% confidence level is that there is a difference in the average number of people per party in 1960 in comparison to the 1985 sample.


B.     1960=2.8


Sample in 1985=3.7

Standard Deviation of 1.45

N(1985)=25

The T-score of 3.1034 is greater than the critical value of 1.711 (one tailed, 95% confidence level, 24 degrees of freedom), which leads us to reject the null hypothesis. What these numbers tell us is that there is a difference between the average number of people per party among all park users in 1960 and in the 1985 sample.
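
The T-score above can be checked with the short sketch below. The inputs are the values given in the problem, and the critical value 1.711 is the one quoted in the text. The function name is made up for illustration. Python's standard library has no t distribution, so the critical value is entered by hand rather than computed.

```python
from math import sqrt


def t_statistic(sample_mean, population_mean, std_dev, n):
    """t = (sample mean - population mean) / (standard deviation / sqrt(n))."""
    return (sample_mean - population_mean) / (std_dev / sqrt(n))


MEAN_1960 = 2.8     # exhaustive survey, treated as the population mean
MEAN_1985 = 3.7     # sample mean
STD_DEV_1985 = 1.45
N_1985 = 25
T_CRITICAL = 1.711  # value quoted in the text (one tailed, 95%, df = 24)

t = t_statistic(MEAN_1985, MEAN_1960, STD_DEV_1985, N_1985)
degrees_of_freedom = N_1985 - 1
reject_null = t > T_CRITICAL

print(round(t, 4), degrees_of_freedom, reject_null)
```

A t test is used here instead of a z test because the sample has only 25 parties.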


Part 2: What and Where is up North?


Introduction


      In this lab we were tasked with determining what separates the north from the south in Wisconsin. I am sure my opinion of up north is much different than others'. For the purpose of this assignment I used Highway 29 as my divider between north and south. The objective of this assignment is to learn how to calculate Chi-Square and then understand how it relates back to hypothesis testing. It was also important to understand how to relate a spatial output to the Chi-Square statistics and then relate that back to the real world. Finally, we have to make sense of all the numbers and calculations and relate them back to geography. No matter where I looked, there is no clear cut definition of Up North. Each individual person's perspective influences what they think is up north. To determine where up north is in Wisconsin I used three different data sets: non-resident gun deer licenses sold per county, acres of lake per county, and non-resident fishing licenses sold per county.


Methods


      I first went onto the US Census site and brought in the Wisconsin counties. After the 72 counties were displayed in ArcMap, I selected all the counties north of Highway 29 and all the counties south of Highway 29. If a county had Highway 29 going through it, I assigned it to the side that contained the majority of the county. My split may vary from others, but I found 28 counties north of Highway 29 and 44 counties south of Highway 29. The counties north of 29 are a light shade of red while the counties south of 29 are a baby blue. Each of the 3 variables that I used is ranked for each county on a scale of 1-4. It is sort of backwards in that 4 is the least and 1 is the most. We were provided SCORP data on the Q drive, which gave us options for the data we wanted to map. Once we decided which variables we wanted to use, 3 separate joins were performed.


The Map above simply illustrates how I split the state for this lab. Red is Northern Wisconsin and Blue is Southern Wisconsin.




The map above is a representation of the number of non-resident fishing licenses sold per county in the state of Wisconsin. The darker the green, the more licenses sold in that county, and the lighter the color, the fewer licenses sold. It is easy to see the cluster of dark green counties in the northwest portion of the state and then again slightly east. I believe the reasoning behind this is that there are simply more species of fish, such as walleye. Fish numbers are also higher, and the fish are less pressured and generally bigger. This makes the area an obvious attraction for out-of-state anglers.



The map above shows the number of non-resident gun deer licenses sold per county in the state of Wisconsin. We see a pattern very similar to the map of non-resident fishing licenses. The northwest corner of the state sells more tags than any other area. I believe this is because there is an immense amount of public land in that area. The Chequamegon-Nicolet National Forest is close by, and many people own cabins up north for other recreational activities along with hunting.
The map above shows the number of acres of inland lakes that each county in the state of Wisconsin has. Here we see a slightly different trend than in the previous two maps. The north doesn't necessarily dominate the map.