If a formula is complex I will post it as a screenshot, and where there are multiple answers I will separate them with divider lines. Comments only, please, no replies -- thank you all for your cooperation! I hope everyone bookmarks this thread; it covers all things statistics, so come back often and broaden your knowledge! (Learn statistics from the thread and practise English at the same time -- why not?)
Odds and odds ratios in logistic regression
I am having difficulty understanding one explanation of logistic regression. The logistic regression relates temperature to whether fish die or survive.
The slope of the logistic regression is 1.76, so the odds that fish die increase by a factor of exp(1.76) = 5.8. In other words, the odds that fish die increase by a factor of 5.8 for each 1 degree Celsius change in temperature.
Because 50% of fish died in 2012, a 1 degree Celsius increase over the 2012 temperature would raise the fish death rate to 82%.
A 2 degree Celsius increase over the 2012 temperature would raise the fish death rate to 97%.
A 3 degree Celsius increase -> 100% of fish die.
How do we calculate 1, 2 and 3? (82%, 97% and 100%)
answer1
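The arithmetic behind those percentages: convert the 50% baseline to odds, multiply the odds by exp(1.76) for each degree, and convert back to a probability. A sketch (in Python; the slope 1.76 is from the question, everything else is plain arithmetic -- note the one-degree figure comes out near 85% rather than the quoted 82%, which presumably reflects different rounding in the source):

```python
import math

slope = 1.76
odds_ratio = math.exp(slope)          # ~5.8 per 1 degree Celsius

p0 = 0.50                             # 50% of fish died in 2012
odds0 = p0 / (1 - p0)                 # = 1

for degrees in (1, 2, 3):
    odds = odds0 * odds_ratio ** degrees
    p = odds / (1 + odds)             # convert odds back to a probability
    print(degrees, round(p, 3))       # ~0.853, ~0.971, ~0.995
```

The 97% and "100%" figures match (the last is really about 99.5%, rounded up).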
Posted by: 论坛COO  Time: 2013-7-8 08:32
F test and t test in linear regression model
F tests and t tests are performed in regression models.
In linear model output in R, we get fitted values and expected values of the response variable. Suppose I have height as the explanatory variable and body weight as the response variable for 100 data points.
Each coefficient in the linear model (one per explanatory variable, in a multiple regression model) is associated with a t-value and its p-value. How is this t-value computed?
There is also an F test at the end; again, I am curious about its computation.
I have also seen an F-test in the ANOVA output after fitting a linear model.
Although I am a new statistics learner and not from a statistical background, I have gone through lots of tutorials on this. Please do not suggest that I go through basic tutorials, as I have already done that. I am only curious about the t and F test computations, using some basic example.
Thanks !!
The misunderstanding is your first premise, "F test and t-test are performed between two populations"; this is incorrect, or at least incomplete. The t-test next to a coefficient tests the null hypothesis that that coefficient equals 0. If the corresponding variable is binary (for example 0 = male, 1 = female), then that does describe two populations, with the added complication that you also adjust for the other covariates in your model. If the variable is continuous (for example years of education), you can think of comparing someone with 0 years of education to someone with 1 year, someone with 1 year to someone with 2 years, and so on, with the constraint that each step has the same effect on the expected outcome, and again with the complication that you adjust for the other covariates in your model.
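The computations themselves can be shown in a few lines of linear algebra. A sketch (plain NumPy; the height/weight data are simulated, not real): the t-value is the coefficient divided by its standard error, and the overall F compares the full model's residual sum of squares with that of the intercept-only model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
height = rng.normal(170, 10, n)
weight = 0.9 * height + rng.normal(0, 8, n)   # made-up relationship

X = np.column_stack([np.ones(n), height])     # design matrix with intercept
beta = np.linalg.lstsq(X, weight, rcond=None)[0]
resid = weight - X @ beta
df = n - X.shape[1]                           # residual degrees of freedom
s2 = resid @ resid / df                       # residual variance estimate
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

t_height = beta[1] / se[1]                    # t = coefficient / std. error

# The F test compares the full model to the intercept-only model:
rss0 = np.sum((weight - weight.mean()) ** 2)  # restricted (intercept-only) RSS
rss1 = resid @ resid
F = ((rss0 - rss1) / 1) / (rss1 / df)

print(round(t_height, 2), round(F, 2))
```

With a single predictor, the F statistic is exactly the square of the t statistic, which is a handy sanity check against any lm()-style summary.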
An F-test after linear regression tests the null hypothesis that all coefficients in your model except the constant are equal to 0. So the groups you are comparing are even more complex.
Posted by: 论坛COO  Time: 2013-7-8 08:39
Explain data visualization
How would you explain data visualization and why it is important to a layman?
When I teach very basic statistics to secondary school students, I talk about evolution: we have evolved to spot patterns in pictures rather than in lists of numbers, and data visualisation is one of the techniques we use to take advantage of this fact.
Plus I try to talk about recent news stories where statistical insight contradicts what the press is implying, making use of sites like Gapminder to find the representation before choosing the story.
Data visualization is taking data and making a picture out of it. This lets you see and understand relationships within the data far more easily than by looking at the raw numbers.
I would show them the raw data of Anscombe's Quartet (JSTOR link to the paper) in a big table, alongside another table showing the Mean & Variance of x and y, the correlation coefficient, and the equation of the linear regression line. Ask them to explain the differences between each of the 4 datasets. They will be confused.
Then show them 4 graphs. They will be enlightened.
From Wikipedia: Data visualization is the study of the visual representation of data, meaning "information which has been abstracted in some schematic form, including attributes or variables for the units of information"
Data viz is important for visualizing trends in data, telling a story - See Minard's map of Napoleon's march - possibly one of the best data graphics ever printed.
Also see any of Edward Tufte's books - especially Visual Display of Quantitative Information.
For me, the Illuminating the Path report has always been a good point of reference.
For a more recent overview, you can also have a look at the good article by Heer and colleagues.
But what would explain better than visualization itself?
Posted by: 论坛COO  Time: 2013-7-8 08:46
Sample problems on logit modeling and Bayesian methods
I'm looking for worked out solutions using Bayesian and/or logit analysis similar to a workbook or an annal.
The worked out problems could be of any field; however, I'm interested in urban planning / transportation related fields.
The UCLA Statistical Computing site has a number of examples in various languages (SAS, R, etc). In particular, see the following pages (look among the links titled logistic regression, categorical data analysis and generalized linear models):
Textbook Examples
Posted by: 论坛COO  Time: 2013-7-8 08:55
Examples to teach: Correlation does not mean causation
We all know the old saying "Correlation does not mean causation". When I'm teaching I tend to use these standard examples to illustrate this point:
Number of storks and birth rate in Denmark;
Number of priests in America and alcoholism
In the early 20th century it was noted that there was a strong correlation between 'number of radios' and 'number of people in insane asylums'.
and my favourite: pirates cause global warming
However, I don't have any references for these examples and whilst amusing, they are obviously false.
Does anyone have any other good examples?
It might be useful to explain that "causes" is an asymmetric relation (X causes Y is different from Y causes X), whereas "is correlated with" is a symmetric relation.
For instance, homeless population and crime rate might be correlated, in that both tend to be high or low in the same locations. It is equally valid to say that homeless population is correlated with crime rate, or that crime rate is correlated with homeless population. To say that crime causes homelessness, or that homeless populations cause crime, are different statements, and correlation does not imply that either is true. For instance, the underlying cause could be a third variable such as drug abuse or unemployment.
The mathematics of statistics is not good at identifying underlying causes, which requires some other form of judgement.
Sometimes correlation is enough. For example, in car insurance, male drivers are correlated with more accidents, so insurance companies charge them more. There is no way you could actually test this for causation: you cannot experimentally change the genders of the drivers. Google has made hundreds of billions of dollars without caring about causation.
To find causation, you generally need experimental data, not observational data. Though, in economics, they often use observed "shocks" to the system to test for causation, like if a CEO dies suddenly and the stock price goes up, you can assume causation.
Correlation is a necessary but not sufficient condition for causation. To show causation requires a counterfactual.
I have a few examples I like to use.
When investigating the cause of crime in New York City in the 80s, when they were trying to clean up the city, an academic found a strong correlation between the amount of serious crime committed and the amount of ice cream sold by street vendors! (Which is the cause and which is the effect?) Obviously, there was an unobserved variable causing both. Summers are when crime is the greatest and when the most ice cream is sold.
The size of your palm is negatively correlated with how long you will live (really!). In fact, women tend to have smaller palms and live longer.
[My favorite] I heard of a study a few years ago that found the amount of soda a person drinks is positively correlated with the likelihood of obesity. (I said to myself: that makes sense, since it must be due to people drinking sugary soda and getting all those empty calories.) A few days later more details came out. Almost all the correlation was due to increased consumption of diet soft drinks. (That blew my theory!) So which way does the causation run? Do diet soft drinks cause one to gain weight, or does a gain in weight cause increased consumption of diet soft drinks? (Before you conclude it is the latter, see the study in which a controlled experiment with rats showed that the group fed yogurt with artificial sweetener gained more weight than the group fed normal yogurt.)
The number of Nobel prizes won by a country (adjusting for population) correlates well with per capita chocolate consumption
A correlation on its own can never establish a causal link. David Hume (1711-1776) argued quite effectively that we cannot obtain certain knowledge of causality by purely empirical means. Kant attempted to address this; the Wikipedia page for Kant seems to sum it up quite nicely:
Kant believed himself to be creating a compromise between the empiricists and the rationalists. The empiricists believed that knowledge is acquired through experience alone, but the rationalists maintained that such knowledge is open to Cartesian doubt and that reason alone provides us with knowledge. Kant argues, however, that using reason without applying it to experience will only lead to illusions, while experience will be purely subjective without first being subsumed under pure reason.
In other words, Hume tells us that we can never know that a causal relationship exists just by observing a correlation, but Kant suggests that we may be able to use our reason to distinguish correlations that do imply a causal link from those that don't. I don't think Hume would have disagreed, so long as Kant were writing in terms of plausibility rather than certain knowledge.
In short, a correlation provides circumstantial evidence of a causal link, but the weight of the evidence depends greatly on the particular circumstances involved, and we can never be absolutely sure. The ability to predict the effects of interventions is one way to gain confidence (we can't prove anything, but we can disprove by observational evidence, so we have at least attempted to falsify the theory of a causal link). Having a simple model that explains why we should observe the correlation, and that also explains other forms of evidence, is another way we can apply our reasoning as Kant suggests.
Caveat emptor: It is entirely possible I have misunderstood the philosophy, however it remains the case that a correlation can never provide proof of a causal link.
Posted by: 论坛COO  Time: 2013-7-8 08:58
R packages for seasonality analysis
What R packages should I install for seasonality analysis?
You don't need to install any packages because this is possible with base-R functions. Have a look at the arima function.
This is a basic part of Box-Jenkins analysis, so you should consider reading one of the R time-series textbooks for an overview; my favorite is Shumway and Stoffer, "Time Series Analysis and Its Applications: With R Examples".
Posted by: 论坛COO  Time: 2013-7-8 09:01
Finding the PDF given the CDF
How can I find the PDF (probability density function) of a distribution given the CDF (cumulative distribution function)?
As user28 said in the comments above, the pdf is the first derivative of the cdf for a continuous random variable, and the difference of successive cdf values for a discrete random variable.
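A quick numerical check of the derivative relationship, using the standard normal as the example distribution (Python sketch):

```python
import math

def cdf(x):
    # standard normal cdf, written via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def pdf(x):
    # the known closed-form density, for comparison
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

x, h = 1.0, 1e-5
numeric = (cdf(x + h) - cdf(x - h)) / (2 * h)   # central difference
print(round(numeric, 6), round(pdf(x), 6))       # both ~0.241971
```

The central difference of the cdf reproduces the density to many decimal places, which is the continuous-case statement above in numbers.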
Wherever the cdf has a jump discontinuity, the distribution has an atom (a point mass) there; Dirac delta "functions" can be used to represent these atoms.
Posted by: 论坛COO  Time: 2013-7-8 09:03
How can I adapt ANOVA for binary data?
I have four competing models which I use to predict a binary outcome variable (say, employment status after graduating, 1 = employed, 0 = not-employed) for n subjects. A natural metric of model performance is hit rate which is the percentage of correct predictions for each one of the models.
It seems to me that I cannot use ANOVA in this setting as the data violates the assumptions underlying ANOVA. Is there an equivalent procedure I could use instead of ANOVA in the above setting to test for the hypothesis that all four models are equally effective?
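To make the setting concrete: tabulating each model's correct and incorrect predictions gives a 4x2 table that standard categorical tests can handle. A toy sketch (Python; the counts are made up, and note that since all four models predict for the same subjects the data are paired, so Cochran's Q test is the stricter choice in practice):

```python
# Hit/miss counts for four models, each predicting for the same 200 subjects.
hits   = [150, 155, 148, 120]
misses = [200 - h for h in hits]

# Chi-squared statistic for the 4x2 table under "all hit rates equal".
total_hits = sum(hits)
n = 200 * 4
chi2 = 0.0
for h, m in zip(hits, misses):
    e_hit  = 200 * total_hits / n        # expected hits under equal rates
    e_miss = 200 - e_hit
    chi2 += (h - e_hit) ** 2 / e_hit + (m - e_miss) ** 2 / e_miss

# df = (4-1)*(2-1) = 3; the 5% critical value is about 7.81.
print(round(chi2, 1))
```

Here the statistic lands well above 7.81, driven mostly by the fourth model's lower hit rate, so "all four models are equally effective" would be rejected for these made-up counts.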
Contingency table (chi-square test). Logistic regression is also your friend: use dummy variables.
Posted by: 论坛COO  Time: 2013-7-8 09:05
Multivariate Interpolation Approaches
Is there a good, modern treatment covering the various methods of multivariate interpolation, including which methodologies are typically best for particular types of problems? I'm interested in a solid statistical treatment including error estimates under various model assumptions.
Say we're sampling from a multivariate normal distribution with unknown parameters. What can we say about the standard error of the interpolated estimates?
I was hoping for a pointer to a general survey addressing similar questions for the various types of multivariate interpolations in common use.
Sorry, no quick answer. There are thick books dedicated to answering this question. Here's a 600-page example: Harrell's Regression Modeling Strategies.
Posted by: 论坛COO  Time: 2013-7-8 09:15
What's the purpose of window function in spectral analysis?
I'd like to see an answer with a qualitative view of the problem, not just a definition. Examples and analogies from other areas of applied math would also be good.
I understand my question is silly, but I can't find a good, intuitive introductory textbook on signal processing; if someone could suggest one, I would be happy.
Thanks.
It depends on where you apply the window function. If you do it in the time domain, it's because you only want to analyze the periodic behavior of the function in a short duration. You do this when you don't believe that your data is from a stationary process.
If you do it in the frequency domain, then you do it to isolate a specific set of frequencies for further analysis; you do this when you believe that (for instance) high-frequency components are spurious.
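A small numerical illustration of the time-domain case (Python sketch; the 256-point length and test frequency are arbitrary): a sinusoid whose frequency falls between FFT bins "leaks" energy into distant bins, and multiplying by a Hann window before the FFT concentrates that leakage near the true peak.

```python
import numpy as np

N = 256
n = np.arange(N)
x = np.sin(2 * np.pi * 10.5 * n / N)   # 10.5 cycles: falls between bins

spec_rect = np.abs(np.fft.rfft(x))                 # no window (rectangular)
spec_hann = np.abs(np.fft.rfft(x * np.hanning(N))) # Hann-windowed

# Relative energy far from the true frequency (bin 50 vs. each spectrum's peak):
leak_rect = spec_rect[50] / spec_rect.max()
leak_hann = spec_hann[50] / spec_hann.max()
print(leak_rect, leak_hann)   # the unwindowed leakage is far larger
```

The trade-off is a wider main lobe (poorer frequency resolution) in exchange for much lower sidelobes, which is the usual reason for choosing one window over another.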
The first three chapters of "A Wavelet Tour of Signal Processing" by Stephane Mallat are an excellent introduction to signal processing in general, and chapter 4 gives a very good discussion of windowing and time-frequency representations in both continuous and discrete time, along with a few worked-out examples.
Posted by: 论坛COO  Time: 2013-7-8 09:17
What are good basic statistics to use for ordinal data?
I have some ordinal data gained from survey questions. In my case they are Likert style responses (Strongly Disagree-Disagree-Neutral-Agree-Strongly Agree). In my data they are coded as 1-5.
I don't think means would mean much here, so what basic summary statistics are considered useful?
A frequency table is a good place to start. You can do the count, and relative frequency for each level. Also, the total count, and number of missing values may be of use.
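A minimal sketch of such a frequency table (Python; the responses are made up), with the median added since order statistics are safer than the mean for ordinal codes:

```python
from collections import Counter
import statistics

responses = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 4, 3, 2]   # coded 1-5
labels = {1: "Strongly Disagree", 2: "Disagree", 3: "Neutral",
          4: "Agree", 5: "Strongly Agree"}

counts = Counter(responses)
n = len(responses)
for level in range(1, 6):
    c = counts[level]
    print(f"{labels[level]:>17}: {c:2d}  ({100 * c / n:.0f}%)")

# Median (and quartiles) respect the ordering without assuming equal spacing:
print("median:", statistics.median(responses))
```

The same counts divided by the total give the relative frequencies, and a row for missing values is easy to add.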
You can also use a contingency table to compare two variables at once, and display it with a mosaic plot.
Posted by: 论坛COO  Time: 2013-7-8 09:20
Can someone please explain the back-propagation algorithm?
What is the back-propagation algorithm and how does it work?
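In one sentence: back-propagation computes the gradient of the loss with respect to every weight by applying the chain rule backwards through the network, reusing the activations saved during the forward pass. A minimal sketch (Python; one input, one sigmoid hidden unit, one output weight, squared-error loss; all numbers made up), with the analytic gradients checked against numerical differentiation:

```python
import math

def forward(w1, w2, x):
    h = 1 / (1 + math.exp(-w1 * x))   # hidden activation (sigmoid)
    return h, w2 * h                   # hidden value and network output

def loss(w1, w2, x, y):
    _, out = forward(w1, w2, x)
    return 0.5 * (out - y) ** 2

def backprop(w1, w2, x, y):
    h, out = forward(w1, w2, x)
    d_out = out - y                    # dL/d(out) for squared error
    d_w2 = d_out * h                   # chain rule through out = w2 * h
    d_h = d_out * w2                   # error propagated back to h
    d_w1 = d_h * h * (1 - h) * x       # sigmoid derivative is h * (1 - h)
    return d_w1, d_w2

w1, w2, x, y = 0.5, -0.3, 1.2, 1.0
g1, g2 = backprop(w1, w2, x, y)

# Sanity check: compare against central-difference numerical gradients.
eps = 1e-6
num1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
num2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(abs(g1 - num1) < 1e-6, abs(g2 - num2) < 1e-6)
```

Real networks do exactly this layer by layer in matrix form, then take a gradient-descent step with the resulting gradients.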
Posted by: 论坛COO  Time: 2013-7-8 09:29
PCA on correlation or covariance?
What are the main differences between performing Principal Components Analysis on a correlation and covariance matrix? Do they give the same results?
You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales. Using the correlation matrix standardises the data.
In general they give different results. Especially when the scales are different.
As an example, take a look at the R heptathlon data set. Some of the variables have an average value of about 1.8 (the high jump), whereas other variables (the 200m) are around 20.
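A tiny numeric illustration of the scale effect (Python; the numbers are made up, with one variable on a small scale and one on a large scale): covariance-based PCA is dominated by the large-scale variable, while correlation-based PCA weights both equally.

```python
import numpy as np

x = np.array([0.01, 0.02, 0.03, 0.04, 0.05])   # small scale (high-jump-like)
y = np.array([10.0, 50.0, 20.0, 40.0, 30.0])   # large scale (200m-like)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)
corr = np.corrcoef(data, rowvar=False)

# eigh returns eigenvalues in ascending order, so the last column is PC1.
pc1_cov = np.linalg.eigh(cov)[1][:, -1]
pc1_corr = np.linalg.eigh(corr)[1][:, -1]

print(np.round(np.abs(pc1_cov), 3))    # ~[0, 1]: nearly all weight on y
print(np.round(np.abs(pc1_corr), 3))   # ~[0.707, 0.707]: equal weight
```

Standardising first (the correlation matrix) is just PCA on z-scores, which is why the loadings become scale-free.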
Notice that the outlying individuals (in this data set) are outliers regardless of whether the covariance or correlation matrix is used.
Posted by: 论坛COO  Time: 2013-7-8 09:34
How would you explain Markov Chain Monte Carlo (MCMC) to a layperson?
Maybe the concept, why it's used, and an example.
First, we need to understand what a Markov chain is. Consider the following weather example from Wikipedia. Suppose that the weather on any given day can be classified into only two states: sunny and rainy. Based on past experience, we know the following:
Probability(Next day is sunny | Given today is rainy ) = 0.50
Since the next day's weather is either sunny or rainy, it follows that:
Probability(Next day is Rainy | Given today is rainy ) = 0.50
Similarly, let:
Probability(Next day is rainy | Given today is sunny ) = 0.10
Therefore, it follows that:
Probability(Next day is sunny | Given today is sunny ) = 0.90
The above four numbers can be compactly represented as a transition matrix, which gives the probabilities of the weather moving from one state to another:

            S     R
P  =  S [ 0.9   0.1 ]
      R [ 0.5   0.5 ]
We might ask several questions whose answers follow:
Q1: If the weather is sunny today then what is the weather likely to be tomorrow?
A1: Since we do not know for sure what is going to happen, the best we can say is that there is a 90% chance it will be sunny and a 10% chance it will be rainy.
Q2: What about two days from today?
A2: The one-day prediction is 90% sunny, 10% rainy. Therefore, two days from now:
The first day can be sunny and the next day sunny as well. The chance of this happening is 0.9 * 0.9.
Or
The first day can be rainy and the second day sunny. The chance of this happening is 0.1 * 0.5.
Therefore, the probability that the weather will be sunny in two days is:
Prob(Sunny two days from now) = 0.9 * 0.9 + 0.1 * 0.5 = 0.81 + 0.05 = 0.86
Similarly, the probability that it will be rainy is:
Prob(Rainy two days from now) = 0.9 * 0.1 + 0.1 * 0.5 = 0.09 + 0.05 = 0.14
If you keep forecasting the weather like this, you will notice that eventually the nth-day forecast, where n is large (say 30), settles to the following 'equilibrium' probabilities:
Prob(Sunny) = 0.833 Prob(Rainy) = 0.167
In other words, your forecasts for the nth day and the (n+1)th day remain the same. In addition, you can check that the 'equilibrium' probabilities do not depend on today's weather: you get the same forecast whether you start by assuming that today is sunny or rainy.
The above example will only work if the state transition probabilities satisfy several conditions which I will not discuss here. But, notice the following features of this 'nice' markov chain (nice = transition probabilities satisfy conditions):
Irrespective of the initial starting state we will eventually reach an equilibrium probability distribution of states.
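You can verify this convergence directly by iterating the transition matrix above from both possible starting states (Python sketch):

```python
import numpy as np

P = np.array([[0.9, 0.1],    # today sunny -> (sunny, rainy) tomorrow
              [0.5, 0.5]])   # today rainy -> (sunny, rainy) tomorrow

for start in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    dist = start
    for _ in range(30):          # 30 "days" of forecasting
        dist = dist @ P
    print(np.round(dist, 3))     # both starts print [0.833 0.167]
```

Both starting states converge to the same equilibrium, Prob(Sunny) = 5/6 and Prob(Rainy) = 1/6, matching the 0.833/0.167 figures quoted above.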
Markov Chain Monte Carlo exploits the above feature as follows:
We want to generate random draws from a target distribution. We then identify a way to construct a 'nice' Markov chain such that its equilibrium probability distribution is our target distribution.
If we can construct such a chain, then we arbitrarily start from some point and iterate the Markov chain many times (just as we forecasted the weather above). Eventually, the draws we generate will appear as if they are coming from our target distribution.
We then approximate the quantities of interest (e.g., the mean) by taking the sample average of the draws, after discarding a few initial draws; this is the Monte Carlo component.
There are several ways to construct 'nice' Markov chains (e.g., the Gibbs sampler, the Metropolis-Hastings algorithm).
Posted by: 论坛COO  Time: 2013-7-8 10:03
What is the best way to identify outliers in multivariate data?
Suppose I have a large set of multivariate data with at least three variables. How can I find the outliers? Pairwise scatterplots won't work as it is possible for an outlier to exist in 3 dimensions that is not an outlier in any of the 2 dimensional subspaces.
I am not thinking of a regression problem, but of true multivariate data. So answers involving robust regression or computing leverage are not helpful.
One possibility would be to compute the principal component scores and look for an outlier in the bivariate scatterplot of the first two scores. Would that be guaranteed to work? Are there better approaches?
I think Robin Girard's answer would work pretty well for 3 and possibly 4 dimensions, but the curse of dimensionality would prevent it working beyond that. However, his suggestion led me to a related approach which is to apply the cross-validated kernel density estimate to the first three principal component scores. Then a very high-dimensional data set can still be handled ok.
In summary, for i = 1 to n:
Compute a density estimate of the first three principal component scores obtained from the data set without Xi.
Calculate the likelihood of Xi under the density estimated in step 1; call it Li.
end for
Sort the Li (for i = 1, ..., n); the outliers are those with likelihood below some threshold. I'm not sure what would be a good threshold -- I'll leave that for whoever writes the paper on this! One possibility is to do a boxplot of the log(Li) values and see what outliers are detected at the negative end.
Posted by: 论坛COO  Time: 2013-7-26 16:29
How do I order or rank a set of experts?
I have a database containing a large number of experts in a field. For each of those experts I have a variety of attributes/data points, like:
number of years of experience.
licenses
num of reviews
textual content of those reviews
The 5 star rating on each of those reviews, for a number of factors like speed, quality etc.
awards, associations, conferences, etc.
I want to provide a rating for these experts, say out of 10, based on their importance. Some of the data points might be missing for some of the experts. Now my question is: how do I come up with such an algorithm? Can anyone point me to some relevant literature?
I am also concerned that, as with all ratings/reviews, the numbers might bunch up near certain values. For example, most experts might end up getting an 8 or a 5. Is there a way to amplify the little differences in some of the attributes into larger differences in the score?
Some other discussions that i figured might be relevant: http://stats.stackexchange.com/questions/1848/bayesian-rating-system-with-multiple-categories-for-each-rating
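Before anything sophisticated, a naive weighted-score baseline is worth writing down. A sketch (Python; the attributes, weights, and full-score scales below are entirely made up), where missing attributes simply drop out and the remaining weights are renormalised:

```python
experts = {
    "A": {"years": 12, "stars": 4.5, "reviews": 80},
    "B": {"years": 3,  "stars": 4.9, "reviews": 15},
    "C": {"years": 20, "stars": 4.0},          # 'reviews' missing
}
# attribute -> (weight, value that counts as a "full" score of 1.0)
schema = {"years": (0.3, 20.0), "stars": (0.5, 5.0), "reviews": (0.2, 100.0)}

def score(attrs, out_of=10):
    present = {k: v for k, v in schema.items() if k in attrs}
    total_w = sum(w for w, _ in present.values())      # renormalise weights
    s = sum(w / total_w * min(attrs[k] / full, 1.0)    # cap each piece at 1
            for k, (w, full) in present.items())
    return round(out_of * s, 2)

for name, attrs in experts.items():
    print(name, score(attrs))
```

To spread out scores that bunch up, the per-attribute mapping min(value/full, 1) can be replaced by a steeper transform (e.g. a power or rank-based one); the multi-attribute valuation literature gives principled ways to choose those transforms and the weights.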
People have invented numerous systems for rating things (like experts) on multiple criteria: visit the Wikipedia page on Multi-criteria decision analysis for a list. Not well represented there, though, is one of the most defensible methods out there: multi-attribute valuation theory. This includes a set of methods to evaluate trade-offs among sets of criteria in order to (a) determine an appropriate way to re-express values of the individual variables and (b) weight the re-expressed values to obtain a score for ranking. The principles are simple and defensible, the mathematics is unimpeachable, and there's nothing fancy about the theory. More people should know and practice these methods rather than inventing arbitrary scoring systems.
Posted by: 论坛COO  Time: 2013-7-26 16:32
Is adjusting p-values in a multiple regression for multiple comparisons a good idea?
Let's assume you are a social science researcher/econometrician trying to find relevant predictors of demand for a service. You have 2 outcome/dependent variables describing the demand (using the service yes/no, and the number of occasions). You have 10 predictor/independent variables that could theoretically explain demand (e.g., age, sex, income, price, race, etc.). Running two separate multiple regressions will yield 20 coefficient estimates and their p-values. With enough independent variables in your regressions, you will sooner or later find at least one statistically significant association between the dependent and independent variables.
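That "sooner or later" is plain arithmetic. A sketch (Python; assuming, for simplicity, 20 independent tests of true null hypotheses at alpha = 0.05):

```python
alpha, m = 0.05, 20
p_any = 1 - (1 - alpha) ** m      # chance of at least one false positive
print(round(p_any, 2))            # 0.64: more likely than not

# Bonferroni shrinks the per-test cutoff to keep the family-wise rate ~alpha:
bonf = alpha / m
print(bonf, round(1 - (1 - bonf) ** m, 3))   # 0.0025, ~0.049
```

Regression coefficients from the same model are not actually independent, so the 0.64 is only an approximation, but it shows why uncorrected significance hunting over 20 coefficients is risky.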
My question: is it a good idea to correct the p-values for multiple tests if I want to include all independent variables in the regression? Any references to prior work are much appreciated.
It seems your question more generally addresses the problem of identifying good predictors. In this case, you should consider using some kind of penalized regression (methods dealing with variable or feature selection are relevant too), with e.g. L1, L2, or a combination thereof (the so-called elastic net) penalties (look for related questions on this site, or the R penalized and elasticnet packages, among others).
Now, about correcting p-values for your regression coefficients (or equivalently your partial correlation coefficients) to protect against over-optimism (e.g., with Bonferroni or, better, step-down methods): this seems relevant only if you are considering one model and seek those predictors that contribute a significant part of explained variance, that is, if you don't perform model selection (with stepwise selection or hierarchical testing). This article may be a good start: Bonferroni Adjustments in Tests for Regression Coefficients. Be aware that such a correction won't protect you against multicollinearity, which affects the reported p-values.
Given your data, I would recommend using some kind of iterative model selection technique. In R, for instance, the stepAIC function allows you to perform stepwise model selection by exact AIC. You can also estimate the relative importance of your predictors based on their contribution to R2 using the bootstrap (see the relaimpo package). I think that reporting an effect size measure or the % of explained variance is more informative than a p-value, especially in a confirmatory model.
It should be noted that stepwise approaches have their drawbacks too (e.g., Wald tests are not adapted to the conditional hypotheses induced by the stepwise procedure). As Frank Harrell put it on the R mailing list, "stepwise variable selection based on AIC has all the problems of stepwise variable selection based on P-values. AIC is just a restatement of the P-Value" (though AIC remains useful if the set of predictors is already defined). A related question -- Is a variable significant in a linear regression model? -- raised interesting comments (by @Rob, among others) about the use of AIC for variable selection. I append a couple of references at the end (including papers kindly provided by @Stephan); there are also many other references on P.Mean.
Frank Harrell authored a book, Regression Modeling Strategies, which includes a lot of discussion and advice around this problem (§4.3, pp. 56-60). He also developed efficient R routines for generalized linear models (see the Design or rms packages). So I think you definitely have to take a look at it (his handouts are available on his homepage).
Posted by: 论坛COO  Time: 2013-7-26 16:36
How can I test the fairness of a d20?
How can I test the fairness of a twenty sided die (d20)? Obviously I would be comparing the distribution of values against a uniform distribution. I vaguely remember using a Chi-square test in college. How can I apply this to see if a die is fair?
Here's an example by simulation: throw a fair die many times, tabulate the counts of each face, and run a chi-squared goodness-of-fit test against the uniform distribution; for a fair die the test statistic is unremarkable. Now bias the die and repeat.
For the biased die, p < 0.05 and we are starting to see evidence of bias. You can use similar simulations to estimate the level of bias you can expect to detect and the number of throws needed to detect it at a given p-level.
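A sketch of that simulation (in Python here; the seed, throw counts, and bias level are arbitrary). With 20 faces there are 19 degrees of freedom, and the 5% critical value of the chi-squared distribution is about 30.14:

```python
import random
from collections import Counter

def chi2_stat(rolls, faces=20):
    # Chi-squared goodness-of-fit statistic against the uniform expectation.
    counts = Counter(rolls)
    expected = len(rolls) / faces
    return sum((counts[f] - expected) ** 2 / expected
               for f in range(1, faces + 1))

random.seed(7)
fair = [random.randint(1, 20) for _ in range(2000)]
# Biased die: face 20 comes up with probability ~0.2 instead of 0.05.
biased = [20 if random.random() < 0.2 else random.randint(1, 19)
          for _ in range(2000)]

print(round(chi2_stat(fair), 1))     # typically near the df mean of 19
print(round(chi2_stat(biased), 1))   # far beyond 30.14: clear bias
```

Rerunning with smaller biases or fewer throws shows how quickly the test loses power, which is exactly the calibration exercise described above.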
Wow, 2 other answers even before I finished typing.
Posted by: 论坛COO  Time: 2013-7-26 16:40
Dealing with missing data due to variable not being measured over initial period of a study
I was recently consulting a researcher in the following situation.
Context:
data were collected over four years at around 50 participants per year (participants had a specific diagnosed clinical psychology disorder and were difficult to obtain in large numbers); participants were only measured once (i.e., it's not a longitudinal study)
all participants had the same disorder
the study involved participants completing a set of 10 psychological scales
the 10 scales measured various things like symptoms, theorised precursors, and related psychopathology: the measures tended to intercorrelate around r=.3 to .7.
in the first year one of the scales was not included
the researcher wanted to run structural equation modelling on all 10 scales on the entire sample. Thus, there was an issue that around a quarter of the sample had missing data on one scale.
The researcher wanted to know:
What is a good strategy for dealing with missing data like this? What tips, references to applied examples, or references to advice regarding best practice would you suggest?
I had a few thoughts, but I was keen to hear your suggestions.
I like the partial identification approach to missing data of Manski. The basic idea is to ask: given all possible values the missing data could have, what is the set of values that the estimated parameters could take? This set might be very large, in which case you could consider restricting the distribution of the missing data. Manski has a bunch of papers and a book on this topic. This short paper is a good overview.
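A minimal numeric sketch of the worst-case (Manski-style) bounds (Python; the numbers are made up to mirror the study, with about a quarter of the sample missing one measure): for a bounded outcome, the full-sample mean must lie between "every missing value is at the minimum" and "every missing value is at the maximum".

```python
n, n_obs = 200, 150        # 50 participants missing the scale
mean_obs = 0.6             # observed mean, on a 0-1 scale
q = n_obs / n              # fraction observed

lower = mean_obs * q + 0.0 * (1 - q)   # all missing values at the minimum
upper = mean_obs * q + 1.0 * (1 - q)   # all missing values at the maximum
print(round(lower, 2), round(upper, 2))   # 0.45 0.7
```

The width of the interval, 1 - q, depends only on the missingness rate; any assumption about the missing-data mechanism (e.g. missing at random) then narrows these bounds, which is the trade-off Manski's framework makes explicit.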
Inference in partially identified models can be complicated and is an active area of research. This review (ungated pdf) is a good place to get started.
Posted by: 论坛COO  Time: 2013-7-26 16:43
A good way to show lots of data graphically
I'm working on a project that involves 14 variables and 345,000 observations for housing data (things like year built, square footage, price sold, county of residence, etc). I'm concerned with trying to find good graphical techniques and R libraries that contain nice plotting techniques.
I'm already looking at what in ggplot and lattice will work nicely, and I'm thinking of violin plots for some of my numerical variables.
What other packages would people recommend for displaying a large amount of either numerical or factor-typed variables in a clear, polished, and most importantly, succinct manner?
The best "graph" is so obvious nobody has mentioned it yet: make maps. Housing data depend fundamentally on spatial location (according to the old saw about real estate), so the very first thing to be done is to make a clear, detailed map of each variable. To do this well with a third of a million points really requires an industrial-strength GIS, which can make short work of the process. After that it makes sense to go on and make probability plots and boxplots to explore univariate distributions, and to plot scatterplot matrices and wandering schematic boxplots, etc., to explore dependencies -- but the maps will immediately suggest what to explore, how to model the data relationships, and how to break up the data geographically into meaningful subsets.
Posted by: 论坛COO  Time: 2013-7-26 16:54
Adjusting for covariates in ROC curve analysis
[image attachment: 1.png (50.82 KB)]