Panel Data Analysis: A Survey on Model-Based Clustering of Time Series
An Academic presentation by Dr. Nancy Agens, Head, Technical Operations, Statswork Group
www.statswork.com Email: info@statswork.com
TODAY'S DISCUSSION
Outline of Topics
In Brief
Dirichlet Prior
Longitudinal Data
MCMC Simulation
Model Based Clustering
Conclusion
Example on Model Based Clustering
In Brief
In statistical analysis, clustering is used to identify subsets (clusters) in the data using a specified distance measure. We discuss some of the methods used for modeling longitudinal or panel data with cluster analysis.
Longitudinal Data
Longitudinal data is a sample of observations measured repeatedly over time. Nowadays, longitudinal (repeated measures) or panel data arise in all areas of applied statistics, such as finance, psychology, economics and the social sciences. Most studies deal with analyzing homogeneity in such time series data. The most common method of capturing heterogeneity is to assume the presence of latent classes, with each class stratified using covariates.
Model Based Clustering
Measuring the distance between time series is often not appropriate, so a model-based clustering strategy built on finite mixture models is adopted, with class membership assigned using Bayes' rule. Model-based clustering treats each time series as a single unit belonging to an unknown latent class. An excellent review of finite mixture models for longitudinal data, especially in psychology, biostatistics and other applied areas, is given in Vermunt (2010).
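The Bayes' rule step described above can be sketched in a few lines of Python. This is a minimal illustration only, under assumptions that are not taken from the presentation: each latent class is described by a first-order Markov chain over three hypothetical states, and the class weights and transition probabilities are made-up numbers.

```python
import math

def log_likelihood(series, trans):
    """Log-likelihood of a categorical series under a first-order Markov
    chain, conditioning on the first observation."""
    return sum(math.log(trans[a][b]) for a, b in zip(series, series[1:]))

def class_posteriors(series, weights, kernels):
    """Bayes' rule: P(class h | series) is proportional to
    weight_h * p(series | class h). Computed on the log scale for stability."""
    log_post = [math.log(w) + log_likelihood(series, k)
                for w, k in zip(weights, kernels)]
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical class weights and transition matrices (assumptions, not estimates).
weights = [0.6, 0.4]
kernels = [
    [[0.90, 0.08, 0.02], [0.50, 0.40, 0.10], [0.30, 0.30, 0.40]],  # class 0: tends to stay in state 0
    [[0.50, 0.30, 0.20], [0.10, 0.50, 0.40], [0.05, 0.15, 0.80]],  # class 1: drifts toward state 2
]

series = [0, 0, 1, 2, 2]  # one made-up time series
post = class_posteriors(series, weights, kernels)
print(post)
```

The series is assigned to the class with the larger posterior probability; in a full analysis the weights and kernels would themselves be estimated rather than fixed.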
Example on Model Based Clustering
The data consist of 237 teenagers whose marijuana use was recorded for the years 1976-1980. Marijuana use is categorized into three levels: never, not more than once a month, and more than once a month. The following figure shows a sample of 10 observed response series on marijuana use among the 237 teenagers. The model considered for analyzing marijuana usage is based on a generalized transition model.
Figure: Model Based clustering
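For a transition-type model, each teenager's five yearly responses can be summarized by how often each category is followed by each other category. The sketch below assumes a coding of 0 = never, 1 = not more than once a month, 2 = more than once a month; the coding and the example series are illustrative assumptions, not values from the data set.

```python
def transition_counts(series, n_states=3):
    """Count transitions a -> b between consecutive observations."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(series, series[1:]):
        counts[a][b] += 1
    return counts

# Made-up response series for one teenager over 1976-1980.
example = [0, 0, 1, 2, 2]
counts = transition_counts(example)
for row in counts:
    print(row)
```

Row j of the count matrix records where the series moved after being in category j, which is the information a first-order transition model uses.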
Dirichlet Prior
A Dirichlet prior is chosen in this case since the observed response variable is categorical in nature.
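One reason the Dirichlet prior is convenient for categorical data is conjugacy: a Dirichlet prior combined with multinomial counts gives a Dirichlet posterior whose parameters are the prior parameters plus the observed counts. The following sketch shows that update for a single row of a transition matrix; the prior values and counts are hypothetical.

```python
def dirichlet_posterior(counts_row, prior_row):
    """Dirichlet(prior) combined with multinomial counts gives Dirichlet(prior + counts)."""
    return [a + c for a, c in zip(prior_row, counts_row)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior_row = [1.0, 1.0, 1.0]  # hypothetical uniform prior over three categories
counts_row = [6, 3, 1]       # hypothetical transition counts out of one category

posterior_row = dirichlet_posterior(counts_row, prior_row)
mean = dirichlet_mean(posterior_row)
print(posterior_row, mean)
```

With these numbers the posterior is Dirichlet(7, 4, 2), so the posterior mean shrinks the raw proportions slightly toward the uniform prior.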
Five different kernel classes are considered, and each model is evaluated under a Dirichlet prior distribution; the results are presented in the following table. The clustering kernels M2 to M5 indicate a common behaviour in marijuana usage. If the reported value is smaller than one, one may conclude that the model is overfitting; in this case, the H3 class of kernel appears to be overfitting.
Table: Dirichlet Prior Distribution
MCMC Simulation
An MCMC simulation is carried out for M3 with H2, and the following figure shows a sample of boxplots of the posterior probabilities for the male and female groups. Comparing the likelihood value from the gender-specific table (598.5) with that from the previous table (596.5), the two are close, so the stratified model-based clustering effectively reduces to standard model-based clustering, which suggests that marijuana use is not associated with gender. From these results, it is concluded that teenagers may be clustered into two groups by marijuana use: never-users and more frequent users.
Figure: Boxplots for MCMC Simulation
Table: Gender Specific Posterior Inference
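To make the idea of an MCMC simulation for model-based clustering concrete, here is a minimal Gibbs sampler sketch for a mixture of first-order Markov chains. It is not the procedure used to produce the results above: the function name, the prior parameters `e0` and `a0`, the toy data and the number of iterations are all assumptions chosen for illustration.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def transition_counts(series, n_states):
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(series, series[1:]):
        counts[a][b] += 1
    return counts

def gibbs_markov_mixture(data, n_classes, n_states, n_iter=200,
                         e0=4.0, a0=1.0, seed=0):
    """Hypothetical Gibbs sampler; returns the last draw only."""
    rng = random.Random(seed)
    per_series = [transition_counts(s, n_states) for s in data]
    labels = [rng.randrange(n_classes) for _ in data]
    for _ in range(n_iter):
        # Step 1: class weights given labels ~ Dirichlet(e0 + class sizes).
        sizes = [labels.count(h) for h in range(n_classes)]
        weights = sample_dirichlet([e0 + n for n in sizes], rng)
        # Step 2: each transition row given labels ~ Dirichlet(a0 + pooled counts).
        kernels = []
        for h in range(n_classes):
            rows = []
            for j in range(n_states):
                pooled = [a0 + sum(c[j][k] for c, z in zip(per_series, labels) if z == h)
                          for k in range(n_states)]
                rows.append(sample_dirichlet(pooled, rng))
            kernels.append(rows)
        # Step 3: labels given weights and kernels, via Bayes' rule.
        for i, c in enumerate(per_series):
            logp = [math.log(weights[h])
                    + sum(c[j][k] * math.log(kernels[h][j][k])
                          for j in range(n_states) for k in range(n_states) if c[j][k])
                    for h in range(n_classes)]
            m = max(logp)
            probs = [math.exp(lp - m) for lp in logp]
            labels[i] = rng.choices(range(n_classes), weights=probs)[0]
    return labels, weights, kernels

# Toy data (made up): some series stay at 0, others move toward 2.
data = [[0, 0, 0, 0, 0]] * 6 + [[0, 1, 2, 2, 2], [1, 2, 2, 2, 2], [0, 1, 1, 2, 2]] * 2
labels, weights, kernels = gibbs_markov_mixture(data, n_classes=2, n_states=3)
print(labels)
```

A real analysis would store the draws after a burn-in period, summarize the posterior class probabilities across draws, and deal with label switching; this sketch keeps only the final draw to stay short.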
Conclusion
To sum up, model-based clustering with a Bayesian flavor can yield better results, since it offers an answer to some of the most troublesome problems in cluster analysis. In longitudinal or panel data studies, the use of Euclidean distance may not be valid, and hence kernel-based clustering for time series data is considered, with the best model selected using different information criteria. An MCMC simulation is carried out to find the optimal clustering.