βETA MET 2015


MET Special Edition βETA 2015


βETA Special 2015


Get the most out of your thesis! Date: 16 November 2015 Place: Erasmus Paviljoen Info: beta-rotterdam.nl


Contents

Modeling and Forecasting Electricity Spot Prices Using the Generalized Autoregressive Score Framework Barbora Dlouhá

On Projected Principal Component Analysis Nishad Matawlie

Integrated Duty Assignment and Crew Rostering Thomas Breugem

Social Media: A Proxy for Health Care Quality Indicators Daisy van Oostrom

Optimization Under Privacy Preservation: Possibilities and Trade-offs. Rowan Hoogervorst

Index of Advertisers: Econometrie.com, Veneficus (back cover)

Colophon
Medium Econometrische Toepassingen (MET) is the scientific journal of FAECTOR (Faculty Association Econometrics & Operations Research), a faculty association for students of the Erasmus University Rotterdam. Website: www.faector.nl/met
©2015 - No portion of the content may be directly or indirectly copied, published, reproduced, modified, displayed, sold, transmitted, rewritten for publication or redistributed in any medium without permission of the editorial board.
Address | Erasmus University Rotterdam | Medium Econometrische Toepassingen | Room H11-02 | P.O. Box 1738 | 3000 DR Rotterdam | The Netherlands | marketing@faector.nl
Acquisition: Sjoerd Baardman | +31 - (0)10 - 408 14 39
Final Editing: Kauther Yahya, Mitchell van Cittert | Design: Haveka, de grafische partner, +31 - (0)78 - 691 23 23 | Printer: Nuance Print, +31 - (0)10 - 592 33 62 | Circulation: 100 copies



Special Edition

Dear reader,

Following the successful previous editions of the Best Econometric Thesis Award (βETA), FAECTOR will continue to organise this prestigious event in cooperation with Veneficus! Even though the MET itself no longer exists, we honour the tradition of releasing the βETA-MET this year as well. The βETA is an award organised for our Master students in Econometrics, who worked incredibly hard on their theses. We want to reward the four students, from different specializations, who wrote the best Master theses. After all, your Master thesis is the icing on the cake, and that is why we are honoured to hand out this award. After last year's successful pilot of the Bachelor βETA, we will continue hosting this category this year.

Students who deem their theses fit for competition can sign up through our website. A selection procedure determines which theses make it into the nominee phase. Three Bachelor theses and four Master theses are nominated to compete against each other, of course within their respective study. The nominations are made based on the following criteria: innovation, reproducibility, added scientific contribution, potential for publication and entrepreneurship potential.

I proudly present the Master nominees of the βETA 2015:
Barbora Dlouhá – Econometrics
Nishad Matawlie – Quantitative Finance
Thomas Breugem – Operations Research and Quantitative Logistics
Daisy van Oostrom – Quantitative Marketing

A warm congratulations to the authors of the nominated Bachelor theses:
Jean-Paul van Brakel – Quantitative Finance
Bob de Waard – Quantitative Finance
Rowan Hoogervorst – Operations Research and Quantitative Logistics

During the event itself, our exclusive jury will announce the winners in both categories during the award ceremony. This jury consists of top members of the Econometric Institute together with a representative of our sponsor Veneficus. Veneficus is a specialist in transforming complex data analyses into clear, visual output. They obtain the very best from your numbers and furthermore provide an improved integration of IT, finance and marketing processes.

In this MET you will find adjusted versions of the theses written by the nominees. Are you a Master or Bachelor student who would like to know how a good Master thesis should be written? Read the theses, because before you know it, you will have to write one yourself! The winner and the nominees will not only earn eternal glory but will also be awarded a monetary prize during the ceremony.

Interested in the βETA? Don't hesitate to join the ceremony, which will take place on Monday the 16th of November 2015. The ceremony of the βETA is going to be a great experience: speeches will be given by the nominees, prizes will be handed out and there will be a celebratory drink afterwards. If you are a Master student, you can also think about subscribing for next year, and maybe you will be the winner of the βETA 2016!

Please enjoy reading this βETA-MET and I hope to see you at the ceremony,

Kauther Yahya
Educational Officer of the 50th Board

Special thanks to:

The Master thesis coordinators of the Econometric Institute:
Prof. dr. Richard Paap – Econometrics
Dr. Wilco van den Heuvel – Operations Research and Quantitative Logistics
Dr. Erik Kole – Quantitative Finance
Prof. dr. Dennis Fok – Quantitative Marketing

The Master jury:
Prof. dr. Dick van Dijk
Dr. Remy Spliet
Prof. dr. Patrick Groenen
Mr. Joost van der Zon

The Bachelor jury:
Dr. Christiaan Heij
Dr. Twan Dollevoet
Mr. Hoksan Yip

Veneficus:
Mr. Robbert Bos
Mr. Joost van der Zon





Modelling and Forecasting Electricity Spot Prices Using the Generalized Autoregressive Score Framework
Barbora Dlouhá

The thesis introduces the Generalized Autoregressive Score (GAS from now on) framework designed directly for electricity spot prices, which, to the best of the author's knowledge, has not been done before. The dynamics of electricity spot prices are very distinct from those of stock prices, stock indices or even other commodities, due to the non-storability of electricity, technical aspects of electricity transmission, the strategic bidding mechanism in the respective electricity markets, as well as seasonal changes in electricity demand and supply. Sudden unexpected increases in demand or shortfalls in supply cause spikes in the otherwise mean-reverting evolution of the electricity spot price. The electricity spot price process might also exhibit drops as a consequence of very low demand or, for example, regulatory measures.

Why the research topic is relevant
The appropriate modelling and forecasting of electricity spot price processes is of importance for several reasons. Firstly, companies trading in electricity markets need accurate forecasts to bid and to hedge against electricity spot price volatility. The prediction of price spikes is crucial for electricity retailers, who cannot pass the spikes on to final customers. Last but not least, since the world-wide deregulation of the power industry, electricity spot price dynamics have been of interest for the competition analysis of electricity markets.

Electricity spot price modelling
The industry standard is to decompose the electricity spot price into a long-term deterministic trend, a short-term deterministic seasonal component and a stochastic process. As for the latter, the Markov Regime Switching model (MRS from now on) is considered by the literature to be a natural choice, since the regime switch is a latent variable. Therefore one of the latest MRS models in the energy economics literature, introduced by Janczura & Weron (2010), has been adopted as the benchmark model in the thesis, against which the properties and performance of the proposed GAS models have been compared. The MRS model belongs to the parameter-driven time series models, whereas the GAS model is a representative of the observation-driven time series models. As a result, the comparison of the two models can also be perceived as a confrontation of the two classes of models as established by Cox et al. (1981).

The MRS model
The MRS model of Janczura & Weron (2010) combines a mean-reverting and heteroskedastic base regime with a spike regime modelled by a shifted log-normal distribution. The shift q1 of the right heavy-tailed (here log-normal) distribution of the spike regime was introduced by Janczura & Weron (2010) in order to improve the correct classification of the deseasonalized electricity spot prices into the base and spike regime; it is typically set equal to the median of the dataset. Given that the regimes are independent from each other and the base regime is modelled by an AR(1) process, the value of the base regime becomes unobservable whenever the spike or drop regime is active. As a result, once the electricity price gets back to the base regime, its value depends on this latent quantity. Janczura & Weron (2010) circumvent this problem by replacing the unobserved lagged base-regime value by its expectation conditional on the past observations. The vector of parameters is estimated using the EM algorithm.
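To fix ideas, a sketch of this two-regime specification in the spirit of Janczura & Weron (2010) is given below; the exact parameterization used in the thesis (for instance the heteroskedasticity exponent) may differ in details.

\[
Y_t = \begin{cases} X_{t,b}, & \text{base regime},\\ X_{t,s}, & \text{spike regime},\end{cases}
\qquad
X_{t,b} = \alpha + (1-\beta)\,X_{t-1,b} + \sigma_b\,|X_{t-1,b}|^{\gamma}\,\varepsilon_t,
\qquad
\log\!\big(X_{t,s} - q_1\big) \sim \mathcal{N}(\mu_s,\sigma_s^2),
\]

with \(\varepsilon_t \sim \mathcal{N}(0,1)\), an unobserved Markov chain governing the regime, and \(q_1\) the shift parameter, typically set to the median of the dataset.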

The GAS models
The GAS framework, introduced by Creal et al. (2008) and utilized in the thesis, is very flexible, allowing for the selection of the data distribution, the parameterization of the time-varying parameters, as well as the scaling of the score. This feature becomes particularly useful for electricity spot price modelling, since electricity price series in general exhibit time-varying means, time-varying volatility and significant outliers in the form of spikes and drops, as already mentioned. By defining a base and a spike regime and by a suitable choice of data distribution in the respective regimes, a score can be obtained that either updates the time-varying mean proportionally to the changes happening in the time series (in case of a normally distributed base regime) or in a robust way with respect to large deviations caused by price spikes or drops (in case of a skewed Student's t distributed spike regime).

To account for potentially both unexpected excessive upward and downward price movements in the spike regime, the skewed Student's t distribution of Fernandez & Steel (1998) has been introduced and derived for the first time within the GAS framework. The switch between the regimes is ensured by an indicator function, which can be evaluated for example using a fixed price threshold, a variable price-change threshold, or probabilities of spike occurrence obtained from the autoregressive conditional hazard model, as employed in the energy economics literature by Christensen et al. (2012). As a result, a time-varying mean without excessively erratic dynamics is determined throughout all regimes and not only in the base regime, as is the case for the MRS model. The first proposed GAS model is constituted by the time-varying mean and two error terms - normally and skewed Student's t distributed - that represent the innovations of the model in the base and spike regime.


Switching between these two regimes is enabled by the indicator function, which is evaluated here using the fixed price threshold.

The daily electricity spot price process is expressed as the sum of a short-term deterministic seasonal component and a strictly stationary stochastic process. In the general GAS framework, the time-varying parameters are driven by the scaled score of the conditional observation density; the time-varying mean process is obtained as a special case of the general updating recursion sketched below.
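For reference, the generic GAS(1,1) updating equation of Creal et al., of which the mean and volatility recursions in the thesis are special cases, has the following form; the inverse-information scaling shown here is the usual choice and is an assumption, not necessarily the one adopted in the thesis.

\[
f_{t+1} = \omega + A\,s_t + B\,f_t,
\qquad
s_t = S_t\,\nabla_t,
\qquad
\nabla_t = \frac{\partial \ln p(y_t \mid f_t, \mathcal{F}_{t-1};\theta)}{\partial f_t},
\qquad
S_t = \mathcal{I}_{t\mid t-1}^{-d},\ d\in\{0,\tfrac12,1\},
\]

where \(f_t\) collects the time-varying parameters (here the mean, and in the extended models also the volatility) and \(\mathcal{I}_{t\mid t-1}\) is the conditional Fisher information.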

Assuming a log-normally distributed spike regime, the time-varying mean and volatility processes follow from this recursion, and analogously for a skewed Student's t distributed spike regime. The model has been further extended by considering time-varying volatility either in the base regime only, or in both the base and spike regime, with additional parameters being introduced for the time-varying volatility in both the base and spike regime. For the specifications and derivations of all proposed models the reader is referred to the thesis itself.

Since Janczura & Weron (2010) do not cover out-of-sample forecasting of the MRS model, the forecast expressions have been derived. Furthermore, it has been outlined how the transition probabilities in out-of-sample forecasting should be computed. On the other hand, because the spike regime of the MRS model is assumed to be log-normally distributed, this distribution has been adopted in the spike regime of the proposed GAS models as well, in addition to the skewed Student's t distributed spike regime.



The theoretical part also elaborates on the stochastic properties of GAS processes. Since the research papers concerning GAS models in general do not consider multiple-days-ahead forecasting, the derivation of the corresponding expressions was needed.

Case study & findings
Consequently, the proposed GAS models have been applied together with the MRS model to New South Wales electricity spot prices 2002 - 2014 in order to compare their in-sample statistical fit as well as their predictive accuracy with and without a structural break present in the forecasting period. By confronting the properties of these models, the dominant position of the MRS model as depicted in the energy economics literature was questioned, and potential shortcomings of the GAS framework were revealed. The case study might also serve as a guide for energy practitioners on how to use the GAS framework as an intuitive and flexible approach to electricity spot price modelling.

As for the models without time-varying volatility, the threshold of 100 AU$/MWh, commonly used in the energy economics literature for identifying Australian price spikes, turned out to be too high to compete with the MRS model. When decreasing the threshold to a level at which the number of spikes identified by the GAS model coincided with the number of spikes identified by the MRS model, the GAS model with log-normally distributed spike regime was found to be superior to the MRS model in terms of the AIC, the overall log-likelihood and also the average log-likelihood of the respective regimes. The GAS model with skewed Student's t distributed spike regime showed comparable results for the threshold of 100 AU$/MWh, but performed worse for lower thresholds, when the parameter restrictions became binding.


However, the superiority of the GAS framework was not confirmed within the models with the 'type 1' time-varying volatility. In this class the MRS model performed better in the base regime. Part of the difference was caused by the better fit of the base regime specification of the MRS model; the other part resulted from the MRS model mis-classifying some of the price spikes into the base regime, which is possible since the regime switch is treated as a latent variable. The overall log-likelihood seemed to increase when moving from the GAS model with an optimized price threshold of 33.7 AU$/MWh and the 'type 1' time-varying volatility through 'type 2' to 'type 3'. Even though this finding is not theoretically grounded by the concept of nested models, it makes sense that, for instance, introducing time-varying volatility into the spike regime represents additional flexibility of the GAS framework and can therefore lead to a log-likelihood improvement. The GAS model with the 'type 3' time-varying volatility, a log-normally distributed spike regime and an optimized price threshold of 33.9 AU$/MWh managed to outperform the MRS model with q1 set equal to the median in terms of all monitored in-sample fit criteria, but not the MRS model with an optimized q1. Nevertheless, it is difficult to judge to what extent the above-mentioned findings are reliable, since the log-likelihood functions of the employed GAS models are highly nonlinear. This makes the search for the maximum of the log-likelihood function highly sensitive to the choice of initialization values.

The out-of-sample forecasting performance of the MRS and GAS models has been assessed based on the RMSE, MAE and MAPE criteria. The indicator function has been evaluated either using the observed electricity prices - which would not be possible in reality, but here it served as an upper bound for the predictive accuracy of the GAS models - or it has been evaluated endogenously. The results depended profoundly on the choice of criterion. As for the models without time-varying volatility, the RMSE criterion marked all forecasts from the GAS models with the real prices used to evaluate the indicator function as more accurate than those from the GAS models with an endogenously evaluated indicator function. The MRS model's forecasts lay in between these two groups up to two-days-ahead forecasts, before becoming the least accurate. Arguably, this is caused by the fact that the MRS model forecasts spikes by rolling the spikes observed in the past into the future. Whereas the choice of the price threshold did matter for predictive accuracy assessed by means of the RMSE, the MAE criterion did not mark the GAS models with the lower price threshold as performing considerably worse. In the thesis we also argue that the MAPE criterion is not suitable for the assessment of electricity spot price forecasts, since it diminishes the gravity of not forecasting the price spikes, which does have serious consequences in reality.

Predictive accuracy of the models without time-varying volatility has also been assessed under the presence of a structural break in the out-of-sample forecasting period, in the form of temporarily increased electricity spot prices. Whereas the GAS model with the threshold of 100 AU$/MWh still classifies the increased prices into the base regime and hence captures the actual price evolution, the GAS model with a 40 AU$/MWh threshold overestimates the prices by treating them as spike regime observations. The MRS model suffers from a similar problem, although the roots of the price overestimation are different. The thesis has also examined whether the introduction of time-varying volatility into the base regime, or into both the base and spike regime, of the respective models yields more accurate forecasts for the time span 2009 - 2011. The conclusions again depend on the choice of criterion and are in general not as straightforward as in the case of the models without time-varying volatility.
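The three accuracy criteria are standard; the toy numbers below are purely illustrative (not taken from the thesis) and show how MAPE barely reacts when a single large spike is missed, which is the argument made above.

import numpy as np

def rmse(actual, forecast):
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mae(actual, forecast):
    return np.mean(np.abs(actual - forecast))

def mape(actual, forecast):
    # Mean absolute percentage error: each error is measured relative to the actual price level.
    return np.mean(np.abs((actual - forecast) / actual)) * 100

# Illustrative week of prices (AU$/MWh) with one spike on day 4.
actual = np.array([35.0, 38.0, 40.0, 300.0, 37.0, 36.0, 39.0])
flat   = np.array([36.0, 37.0, 39.0, 40.0, 38.0, 36.0, 38.0])   # forecast that misses the spike

print(rmse(actual, flat), mae(actual, flat), mape(actual, flat))
# RMSE (about 98) is dominated by the missed spike, while MAPE (about 14%) stays modest
# because the 260 AU$/MWh error is divided by the spike level itself.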



ON PROJECTED PRINCIPAL COMPONENT ANALYSIS

Nishad Matawlie Erasmus Universiteit Rotterdam

We investigate a newly proposed method for conducting principal component analysis. This method is based on the projection of the data matrix onto a sieve space spanned by explanatory characteristics before performing the actual principal component analysis. The method is therefore referred to as the projected Principal Component (projected-PC or PPC) method. We illustrate the proposed method with both simulated data and excess returns of components of the S&P 500 index. Furthermore, a Bayesian approach is used to examine the effects of parameter uncertainty when estimating loadings. We also explore the robustness of the new method in the presence of missing data. Finally, the factor structure is used to study forecasting performance with both classical and Bayesian methods. We find that the projected-PC method has advantages over regular PC when estimating factors and loadings, especially in low-sample-size, high-dimensionality cases. For larger sample sizes, however, its performance is not explicitly better than the regular principal components method.


Introduction
The modelling of economic output variables such as the excess returns of assets has been studied very extensively in the econometric literature. To understand the common dependence among multivariate outputs, a tool called factor analysis is often employed. With factor analysis we model observed data as a linear combination of unobservable common factors and factor loadings, plus idiosyncratic error terms. The unknown loadings and factors are commonly estimated by the method of principal components; see, among others, Stock and Watson (2002), Bai and Ng (2002), Choi (2012) and Lam et al. (2012). The factor model also has many applications outside the field of finance. In such applications we often have to deal with very high dimensionality, while consistent estimation of the parameters generally requires a relatively large sample size. Nevertheless, due to several issues it is not always possible to use a large sample size. One way to tackle these kinds of problems is to model the loadings in turn by additional explanatory variables.

We apply the method of projected Principal Components (projected-PC) of Fan et al. (2014). This estimation method differs from the traditional methods for factor analysis in the sense that principal component analysis is applied after the data is projected onto a sieve space spanned by explanatory characteristics. We use the methodology in several applications. In particular, we compare its performance to that of classical principal component estimators. We examine the effect of parameter uncertainty on the projected loadings by considering a Bayesian approach. We perform specification tests for the applicability of projected-PC. Further, we investigate the feasibility of projected-PC versus regular PC in the presence of missing data. Finally, we use the factor structure to construct forecasts with both classical and Bayesian methods. Our research shows that, especially in low-sample-size cases, projected-PC exhibits several advantages over regular PC.



Theoretical framework

Setting
Suppose that we possess observed data on excess returns of several assets, y_{it} for i = 1, ..., p and t = 1, ..., T, with p and T respectively the dimension and sample size of the data. We adopt the following model
y_{it} = \lambda_i' f_t + u_{it},    (1)
where f_t \in \mathbb{R}^K represents the unobservable common factors, the corresponding factor loadings for variable i are denoted by \lambda_i, and u_{it} finally represents the idiosyncratic error term. Let the d-dimensional vector X_i contain the explanatory variables associated with the i-th output variable. We construct the following semi-parametric factor model
\lambda_i = g(X_i) + \gamma_i,    (2)
y_{it} = (g(X_i) + \gamma_i)' f_t + u_{it},    (3)
where \gamma_i is the residual component of the factor loading that cannot be explained by the covariates X_i.

The Estimators
If we let Y be the p x T matrix of y_{it}, F the T x K matrix of f_t, G(X) the p x K matrix of g(X_i), \Gamma the p x K matrix of \gamma_i and U the p x T matrix of u_{it}, then we can rewrite the model as
Y = (G(X) + \Gamma) F' + U.    (4)
The projected-PC method is implemented by first projecting the data Y onto the sieve space spanned by the characteristics and then applying principal component analysis on the projected data.

Semiparametric Factor Model
Recall that X_i is a vector of characteristics corresponding to the i-th variable. We want to estimate the function g non-parametrically. We assume that the function is additive: for every k \le K we have non-parametric functions g_{k1}, ..., g_{kd} with
g_k(X_i) = \sum_{l=1}^{d} g_{kl}(X_{il}),    (5)
for k \le K and i \le p. We estimate every additive component of g_k by the sieve method (Chen (2007)). Let \{\phi_1, \phi_2, ...\} be a set of basis functions - such as Fourier series, polynomial series, B-splines or wavelets - which spans a dense linear subspace of the functional space for g_{kl}. Then for the additive components we have
g_{kl}(X_{il}) = \sum_{j=1}^{J} b_{j,kl} \phi_j(X_{il}) + R_{kl}(X_{il}),    (6)
where the b_{j,kl} represent the sieve coefficients of the l-th additive component of g_k(X_i), corresponding to the k-th factor loading. Further, R_{kl} is a 'remainder function' and J denotes the number of sieve terms, which only grows slowly as T grows. Collecting the basis functions evaluated at the characteristics in the matrix \Phi(X) and the sieve coefficients in a matrix B, the matrix form of (6) is given by
G(X) = \Phi(X) B + R(X).    (7)

We construct the projection matrix as P = \Phi(X) (\Phi(X)'\Phi(X))^{-1} \Phi(X)'. We obtain the columns of the factor estimate by taking the eigenvectors corresponding to the first K largest eigenvalues of the projected data matrix. The estimator for the loading functions G(X) is then obtained by regressing the projected data on the estimated factors. Next we estimate the remainder of the loading component \Gamma, the part that cannot be explained by the covariates, from the part of the data that is left after projection. Finally, we estimate the full loadings as the sum of these two components.
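A compact sketch of these estimation steps, following the description above and Fan, Liao and Wang (2014); the polynomial sieve basis (the thesis uses cubic B-splines), the normalizations and the function names are illustrative assumptions rather than the thesis code.

import numpy as np

def sieve_basis(X, J=4):
    # Polynomial sieve basis: for each of the d standardized characteristics,
    # the columns x_l, x_l^2, ..., x_l^J. X has shape (p, d).
    p, d = X.shape
    return np.column_stack([X[:, l] ** j for l in range(d) for j in range(1, J + 1)])

def projected_pc(Y, X, K, J=4):
    # Projected principal components: Y is the (p, T) panel of excess returns,
    # X the (p, d) matrix of characteristics, K the number of factors.
    p, T = Y.shape
    Phi = sieve_basis(X, J)
    P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)       # projection onto the sieve space
    PY = P @ Y
    eigval, eigvec = np.linalg.eigh(PY.T @ PY)          # eigenvalues in ascending order
    F_hat = np.sqrt(T) * eigvec[:, ::-1][:, :K]         # factors from the top-K eigenvectors
    G_hat = PY @ F_hat / T                              # loadings explained by the characteristics
    Gamma_hat = (Y - PY) @ F_hat / T                    # remainder the covariates cannot explain
    return F_hat, G_hat, Gamma_hat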


Estimating the number of factors
Many consistent estimators of the number of factors K can be obtained in various ways, see for instance Hallin and Liška (2007) and Alessi et al. (2010). More recently, Ahn and Horenstein (2013) proposed a tuning-parameter-free estimator. Following the thought that the observable characteristics possess explaining power for the factor loadings, it can be effective to work with projected data instead. Ordering the eigenvalues of the projected data matrix from largest to smallest, the projected-PC (or PPC) estimator of K is defined as the value of k, up to a maximum number of factors chosen by the researcher, that maximizes the ratio of the k-th to the (k+1)-th largest eigenvalue. Analogously, in terms of the same notation, the estimator of Ahn and Horenstein (2013) is defined as the maximizer of the same eigenvalue ratio computed from the original, unprojected data.
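A sketch of both eigenvalue-ratio estimators in this notation; the only difference is whether the eigenvalues are computed from the projected or the raw data (the function name and interface are ours).

import numpy as np

def eigenvalue_ratio_K(Y, P=None, k_max=10):
    # With P=None this is the Ahn and Horenstein (2013) estimator on the raw data;
    # with a projection matrix P it is the projected-PC (PPC) variant described above.
    Z = Y if P is None else P @ Y
    eig = np.sort(np.linalg.eigvalsh(Z @ Z.T))[::-1]    # eigenvalues, descending
    ratios = eig[:k_max] / eig[1:k_max + 1]
    return int(np.argmax(ratios)) + 1                   # argmax over k = 1, ..., k_max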

Hypothesis testing
Recall our model in matrix form, Y = (G(X) + \Gamma) F' + U. This gives rise to two hypothesis tests. First we consider testing whether the covariates are able to explain parts of the loadings, which can be conducted by testing the null hypothesis that G(X) = 0. Next we determine whether the loadings can be fully explained by the covariates by testing the null hypothesis that \Gamma = 0. The first test can be considered a diagnostic tool to determine whether the use of projected-PC is appropriate; the second test examines the adequacy of the semi-parametric factor model.

The test statistic for the first hypothesis requires a consistent estimator of the realized factors. Recall that under the null hypothesis the characteristics have no explaining power on the loadings; therefore we choose the regular PC estimator (e.g. Stock and Watson (2002)), that is, the columns of the factor estimate are obtained by taking the eigenvectors corresponding to the first K eigenvalues of the data matrix. We reject the null hypothesis when the test statistic is large; after standardization the statistic has a normal limiting distribution. To test the second hypothesis we employ a weighted test statistic based on the projected-PC estimator of the factors. For the weight matrix we take the inverse of a diagonal covariance matrix of the idiosyncratic errors under the null, following the assumption that these errors are uncorrelated, and we estimate this matrix from the residuals. We reject the null hypothesis for large values of the statistic; its standardized limiting distribution is again normal.

Data
We collect stocks from the S&P 500 constituents which have complete daily closing prices from 2005 to 2013. Furthermore, we also retrieve the corresponding market capitalizations and book values; stocks with incomplete data are deleted. This gives us a total of 438 stocks, for which we calculate daily excess returns using 1-month T-bills. We consider four characteristics: from our data we construct size, value, momentum and volatility numbers. Size is calculated as the logarithm of the market capitalization (in millions) on the day before the data-analysis window. We take the ratio of the market value to the book value as a measure of the value of a firm. Momentum is constructed by taking the cumulative half-year return over the previous 126 days. Volatility, finally, is the standard deviation of the daily returns over the previous 126 days. We standardize all four characteristics to have unit variance and zero mean. Stocks that display extreme values on the characteristics are removed, such that 400 stocks are left. As data-analysis window we chose the first quarter of 2006.
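A sketch of how the four characteristics described above can be computed from daily data; the input layout, column names and the exact handling of the estimation window are illustrative assumptions.

import numpy as np
import pandas as pd

def build_characteristics(prices, market_cap, book_value, window_start):
    # prices: DataFrame (days x stocks) of closing prices;
    # market_cap, book_value: Series per stock; window_start: first day of the analysis window.
    returns = prices.pct_change()
    pre = returns.loc[:window_start].iloc[-127:-1]      # the 126 trading days before the window
    size = np.log(market_cap / 1e6)                     # log market capitalization in millions
    value = market_cap / book_value                     # market-to-book ratio
    momentum = (1 + pre).prod() - 1                     # cumulative half-year return
    volatility = pre.std()                              # standard deviation of daily returns
    chars = pd.DataFrame({"size": size, "value": value,
                          "momentum": momentum, "volatility": volatility})
    return (chars - chars.mean()) / chars.std()         # standardize: zero mean, unit variance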



Application and Conclusions

Calibrating a model with real data
To fit the loading functions we first projected the data onto the sieve space spanned by the characteristics and subsequently calculated the several estimators as outlined in the section 'The Estimators'. We chose the sieve space to be the univariate cubic spline space and used B-splines as basis functions; their usefulness lies in the fact that any spline function of a certain order can be expressed as a linear combination of B-splines of the same order (see De Boor (1978)). We fitted loading functions for a chosen number of factors and sieve dimension; the additive components were obtained by using the relevant submatrices of the estimated sieve coefficients. We used real data to calibrate a model that could be used for simulation studies: we treated the loading functions that we obtained as 'true' loading functions and used these in a data generating process. Loadings and noise components were constructed from normal and gamma distributions, with parameters calibrated from the real data, and factors were constructed by considering a VAR(1) process.

To compare the projected-PC method with the classical PC method we considered some matrix error measures: the max-norm and the Frobenius norm of the difference between real and estimated matrices. We ran 500 simulations for numbers of assets ranging from 300 to 1000 stocks and for several sample sizes, and for accuracy we considered the average norms over all simulations. We found that the estimation errors for the loadings barely increased, and even decreased, as the number of assets grew. In general projected-PC outperformed regular PC in estimating the loadings. For the factors we found some interesting results: projected-PC was still better with small samples, but with a relatively larger sample size projected-PC was definitely outperformed by regular PC when it came to factor estimation. Projected-PC thus has little added value over regular PC when estimating factors if the sample size is large enough, which is very remarkable.

Bayesian inference on Projected Loadings
To examine the effect of parameter uncertainty on the estimation errors, we considered a Bayesian approach for the estimation of the projected loadings. This could perhaps give us a more realistic view on the rate at which the estimation errors increase. With Bayesian statistical modelling we can integrate out parameter uncertainty by considering a probability distribution for the loadings, rather than a point estimate (e.g. Rachev et al. (2008)). We considered the matricvariate normal distribution (conditional on the covariance matrix) and the matricvariate Student's t distribution (unconditional) to draw loadings from, and investigated the effect of parameter uncertainty for the low-sample-size cases. In every simulation, we estimated the projected loadings both by least squares and by drawing loadings from a posterior distribution. Ordinary projected-PC displayed an ideal image of very slowly increasing estimation errors. The first Bayesian approach (matrix normal) showed a much faster increasing rate; the second approach (matrix t) showed more or less the same rate as ordinary projected-PC, however with relatively larger errors. We conclude that the errors can be considerably larger than the ideal image represented by ordinary projected-PC once parameter uncertainty is taken into account, which can influence other results.

Estimating the number of factors
Furthermore, we used simulated data to estimate the number of factors. We compared the performance of the projected-PC estimator against the earlier developed eigenvalue-ratio estimator of Ahn and Horenstein (2013). Both methods work in a similar way; the difference lies in the use of projected data versus original data. We found that estimating the number of factors with the projected data estimator yields better results in general. With the projected-PC estimator we were almost always able to retrieve the correct number of factors, and it also displayed less variability in the estimated number of factors. Especially in the low-sample-size cases, the projected data estimator showed its effectiveness over the more traditional estimator.

Hypothesis tests
We used the real dataset to conduct hypothesis tests. Recall that we test for the applicability of the projected-PC method and the adequacy of the semi-parametric factor model. We conducted the hypothesis tests using a moving window scheme with several



window lengths varying between 10 days, a month, a quarter and a half year. For the calculation of every test statistic the characteristics were reconstructed and the sieve dimension was set anew. The number of factors was estimated at every turn by the eigenvalue-ratio estimators. We found that in almost all cases it was possible to significantly explain (parts of) the loadings by the observed covariates, which provides a theoretical basis for the appropriateness of projected-PC. Furthermore, we found that in almost none of the cases was it possible to fully explain the loadings. Hence we can state that the characteristics certainly have explaining power for the loadings, yet not so much that the loadings can be fully explained. This is very reasonable, as we often have to deal with noise when modelling output variables.

Missing data
We also studied the robustness of projected-PC and its performance compared to regular PC when we have to deal with missing values in the data. Ilin and Raiko (2010) studied some practical approaches to principal component analysis in the presence of missing data; we extended one of their methods to make it applicable to projected-PC, for a data matrix in which only part of the entries is observed and the remaining entries are missing.

We applied an iterative procedure based on the EM algorithm, which performs PCA in a direct fashion; we will refer to this procedure as the imputation algorithm. It alternates between filling up the missing entries, to construct a complete matrix, and applying (projected) PCA to the filled matrix. Initially, we can replace the missing values, for instance, by the row-wise means of the observed values. Then we calculate the usual estimators as outlined in the section 'The Estimators'. Next the filled matrix is updated by replacing the missing entries with their fitted values from the estimated factor structure, and PCA can be applied again to this updated matrix. This process of constructing a filled matrix and applying PCA is repeated until convergence is obtained. Ilin and Raiko (2010) show that the imputation algorithm minimizes a certain cost function and that it actually is an implementation of the EM algorithm for a simple probabilistic model. We found that the projected-PC method clearly outperformed traditional PC in this setting. In one specific case, projected-PC was even able to estimate loadings more accurately while dealing with missing data than regular PC applied to full data. However, once again we found that this mainly holds for the low-sample-size case; for a larger sample size the performance of both methods is very similar, in line with what we had already seen for the full-data case.
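A minimal sketch of this imputation algorithm, written here around the projected_pc routine sketched earlier; the initialization by row-wise means follows the text, while the convergence tolerance and the interface of fit_pc are assumptions.

import numpy as np

def pca_with_missing(Y, X, K, fit_pc, max_iter=100, tol=1e-6):
    # EM-style imputation: alternate between filling the missing entries and (projected) PCA.
    # Y: (p, T) data with np.nan for missing entries; fit_pc(Y_full, X, K) must return
    # (F_hat, G_hat, Gamma_hat) as in the projected_pc sketch above.
    missing = np.isnan(Y)
    Y_filled = Y.copy()
    row_means = np.nanmean(Y, axis=1, keepdims=True)
    Y_filled[missing] = np.broadcast_to(row_means, Y.shape)[missing]   # initial fill

    for _ in range(max_iter):
        F_hat, G_hat, Gamma_hat = fit_pc(Y_filled, X, K)
        Y_hat = (G_hat + Gamma_hat) @ F_hat.T          # low-rank reconstruction of the panel
        update = Y_filled.copy()
        update[missing] = Y_hat[missing]               # refill only the missing cells
        if np.max(np.abs(update - Y_filled)) < tol:    # stop once the fill stabilizes
            return update, F_hat, G_hat, Gamma_hat
        Y_filled = update
    return Y_filled, F_hat, G_hat, Gamma_hat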

Forecasting performance
Lastly, we used the factor structure to construct and evaluate forecasts with both PC methods, proceeding in a Bayesian as well as a frequentist fashion. We saw once again that the power of the projected-PC method comes forward in the case of a low sample size: for a low sample size we found significant differences in MSPEs between the projected-PC and regular PC methods, which indicates that the projected-PC method is preferable in such a setting as it gives more accurate forecasts. We also again examined the effect of parameter uncertainty on the forecasting performance and discovered that the effect of parameter uncertainty on regular PC is considerably larger than on projected-PC. Finally, we compared forecasts of the same PC method across the different approaches. We found significant differences in MSPEs between the Bayesian and frequentist approaches for the regular PC method, but did not observe much difference for the projected-PC method. This implies that parameter uncertainty clearly affects regular PC more than projected-PC.

All in all, projected-PC has clear advantages over regular PC when it comes to estimating factors, loadings and even the number of factors in low-sample-size cases. However, with relatively larger sample sizes its performance is not explicitly better than the classical method of principal components.

References
S. C. Ahn and A. R. Horenstein. Eigenvalue ratio test for the number of factors. Econometrica, 81(3):1203-1227, 2013.
L. Alessi, M. Barigozzi, and M. Capasso. Improved penalization for determining the number of factors in approximate factor models. Statistics & Probability Letters, 80(23):1806-1813, 2010.
J. Bai and S. Ng. Determining the number of factors in approximate factor models. Econometrica, 70(1):191-221, 2002.
X. Chen. Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6:5549-5632, 2007.
I. Choi. Efficient estimation of factor models. Econometric Theory, 28(02):274-308, 2012.
C. De Boor. A practical guide to splines. Mathematics of Computation, 1978.
J. Fan, Y. Liao, and W. Wang. Projected principal component analysis in factor models. Available at SSRN 2450770, 2014.
M. Hallin and R. Liška. Determining the number of factors in the general dynamic factor model. Journal of the American Statistical Association, 102(478):603-617, 2007.
A. Ilin and T. Raiko. Practical approaches to principal component analysis in the presence of missing values. The Journal of Machine Learning Research, 11:1957-2000, 2010.
C. Lam, Q. Yao, et al. Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics, 40(2):694-726, 2012.
S. T. Rachev, J. S. Hsu, B. S. Bagasheva, and F. J. Fabozzi. Bayesian methods in finance, volume 153. John Wiley & Sons, 2008.
J. H. Stock and M. W. Watson. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167-1179, 2002.



Integrated Duty Assignment and Crew Rostering Thomas Breugem

We propose an integrated model for the Duty Assignment and the Crew Rostering problem. Both problems are part of the crew planning process at Netherlands Railways (NS). The Duty Assignment problem consists of finding a 'fair' allocation (according to some measure) of the duties among the roster groups. The Crew Rostering problem is well known in the literature and consists of finding good rosters given a set of duties. Our model integrates the above two problems and hence involves large-scale optimization, since all duties have to be assigned simultaneously. Our research evaluates the effectiveness of an integrated approach compared to, e.g., a sequential approach. We also propose a second model to counter the problem of weak lower bounds, a problem well known for rostering problems. We show that this new model leads to promising results.

Introduction

As the largest railway operator in the Netherlands, NS faces many challenging operational problems, among which are, e.g., the design of the timetable, the dispatching of rolling stock and the planning of the crew. These are all challenging problems, which are further complicated by the high density of demand in the Netherlands. It is therefore no surprise that OR techniques are used intensively to support decision making at NS.

This thesis concerns the crew planning process, which can be separated into different steps. The first step in the planning process is Crew Scheduling, i.e., creating a set of duties that cover the set of tasks (indivisible blocks of work). The duties should satisfy many labor rules, e.g., a minimal meal break time and a maximum duty length. Once the duties are constructed, they are assigned to the roster groups, i.e., the Duty Assignment problem. A roster group is a group of employees who execute the same roster. The goal of the assignment problem is to find a solution that not only satisfies all constraints, but is also optimal in some sense. The model proposed in Abbink [2014] does this by minimizing the dis-balance of the solution, i.e., we try to find a solution where all groups are assigned approximately the same set of duties (according to certain attributes of the duties). Among such attributes are the duty length and the type of rolling stock of the trips. The above problem closely resembles the Generalized Load Balancing problem, see e.g., Caragiannis [2008], which is known to be NP-hard.

When the duties are assigned to the different depots, the rosters for the different roster groups are created. The problem of rostering the assigned duties is known as the Crew Rostering problem and is NP-hard. In general, the Crew Rostering problem can be divided into two types, a cyclic and a non-cyclic variant. We consider the cyclic variant, abbreviated CCRP. In the CCRP we construct one roster for multiple employees, and each employee cycles through the roster. The CCRP is often split up into two phases, as first proposed in Sodhi and Norris [2004]. In the first step a rota schedule is created, which is a roster in which only the roster day types are specified. These types specify either the type of the duty (i.e., early, late, night) or other types such as rest and reserve days. In the second step the actual duties are assigned to days in the roster.


Our main contribution is to integrate the Duty Assignment and Crew Rostering problems at the depot level. That is, we want to take the balancing of the different duty properties into account simultaneously with the rostering of the roster groups. We note that current models (Hartog et al. [2009], Abbink [2014], Xie and Suhl [2014], Mesquita et al. [2013]) assume that the duties are already assigned to the roster groups before the rostering starts, or do not consider any restrictions on the allocation to the roster groups. We assume that the duties are already assigned to the different depots and that the rota schedule for each group is known.

Mathematical Models
We now discuss our proposed model; first we formalize the two problems separately. The model for the Duty Assignment problem is based on Abbink [2014]. A binary variable indicates whether a duty is assigned to a group. For every combination of duty type and weekday (e.g., early and Monday) we know the set of duties matching that combination, as well as the number of duties of that type in the rota schedule of each group. Every duty d has a certain score for each attribute, e.g., its duty length, and for each group we know the number of duties that need to be rostered. Two decision variables per attribute indicate the minimum and maximum average score over all groups on that attribute; both variables are bounded by a lower bound and an upper bound. The model reads as follows; a sketch of the formulation is given below.
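A sketch consistent with this description (the notation is ours, not necessarily that of the thesis; the numbering (1)-(8) mirrors the references in the text below). Here x_{dg} indicates assignment of duty d to group g, s_d^a is the score of duty d on attribute a, D_c the set of duties of type-weekday combination c, n_{gc} the number of such duties in the rota schedule of group g, and N_g the number of duties to roster for group g.

\begin{align*}
\min\ & \sum_{a} w_a\,(M_a - m_a) && (1)\\
\text{s.t.}\ & \sum_{g} x_{dg} = 1 \quad \forall d && (2)\\
& \sum_{d \in D_c} x_{dg} = n_{gc} \quad \forall g,\, c && (3)\\
& m_a \le \frac{1}{N_g} \sum_{d} s_d^{a}\, x_{dg} \quad \forall g,\, a && (4)\\
& \frac{1}{N_g} \sum_{d} s_d^{a}\, x_{dg} \le M_a \quad \forall g,\, a && (5)\\
& l_a \le m_a, \quad M_a \le u_a \quad \forall a && (6)\\
& x_{dg} \in \{0,1\} \quad \forall d,\, g && (7)\\
& m_a,\, M_a \in \mathbb{R} \quad \forall a && (8)
\end{align*}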

The objective (1) expresses that we minimize a weighted sum of the 'spreads' of the scores, with one weight per attribute. Constraints (2) assure that all duties are assigned to exactly one group. Constraints (3) enforce that we assign the correct number of duties of each type. Constraints (4), (5) and (6) set the correct values of the spread variables and enforce the bounds, respectively. Finally, constraints (7) and (8) specify the domain of the decision variables. We refer to this model as the DAPR (Duty Assignment Problem with Rota schedules).

The model we propose for the Crew Rostering problem is based on the work of Abbink [2014] and Hartog et al. [2009]. They consider a model for a single roster group, in which the duties are assigned to the different days whilst taking into account undesirable properties of the roster; we immediately extend the model to multiple groups. A binary decision variable now indicates whether a duty is assigned to a group on a given day. For each group we know the set of days to which a duty needs to be assigned (a subset of all days in the roster), the set of feasible duty-group-day combinations, and, for each day and group, the required duty type. The undesirable properties of the rosters are penalized using patterns, which can be expressed as linear restrictions: for each pattern, a variable indicates whether, or to what extent, the pattern is violated, and every pattern is enforced by a constraint on the rostering variables of the relevant group, together with a prescribed domain for the violation variable. One could imagine that a pattern penalizes a total weekly workload that exceeds, say, 45 hours; its constraint would then sum the lengths of all duties in that week minus 45, and the domain of the violation variable would be the non-negative values. In the thesis we discuss all patterns in detail. Using the above notation, the model reads as follows.



The objective (9) minimizes the penalties incurred from the patterns. Constraints (10) and (11) assure that every day is assigned a duty of the correct type and that every duty is assigned exactly once. Constraints (12) model the different patterns. Finally, constraints (13) and (14) specify the decision variables. We refer to this model as the CCRP (Cyclic Crew Rostering Problem).

The above two formulations can be integrated as follows. We state the integrated formulation as a bi-objective optimization problem, since we want to find rosters with low cost, but we also want a 'fair' allocation of duties to the roster groups. Observe that the assignment variables of the DAPR and the rostering variables of the CCRP are connected by equation (15): a duty is assigned to a group exactly when it is rostered on one of that group's days. Constraints (2) and (3) of the DAPR are already enforced by constraints (10) and (11) of the CCRP. Hence it suffices to add only the constraints corresponding to the spread variables to the CCRP. Using (15) to eliminate the assignment variables, we obtain the integrated model by merging the CCRP constraints with the new DAPR constraints.


We will refer to this model as the ICCRP (Integrated Cyclic Crew Rostering Problem). We also developed a model in which we directly assign a set of duties to numerous days. The general idea is that we 'cut' a roster into a set of smaller 'pieces', which we call clusters; here we only discuss the main idea behind this formulation. We proved that, under general conditions, the new model leads to a tighter formulation of the CCRP. Consider the set of all clusters, e.g., the set of all weeks in the roster. An assignment of a set of duties to a cluster is called a duty sequence; for each cluster we have the set of sequences that can be assigned to it, and for each duty the set of all sequences containing that duty. Some pattern violations can be modeled a priori, i.e., the violation only depends on whether or not we select a certain sequence. An example of such a pattern would be one that penalizes a high workload in a week, if the clusters are the weeks. The cost of the patterns that depend on solely one cluster can therefore be modeled by introducing a sequence cost. The remaining patterns still need to be incorporated into the model, which is done similarly to the CCRP. We are now able to state the new model, to which we refer as the CCRP2.



The objective function (22) expresses that we minimize the total cost of the penalties occurring in the sequences plus the penalties of the remaining patterns. Constraints (23) and (24) assure that we assign a sequence to each cluster and that all duties are assigned, respectively. Constraints (25) enforce the remainder of the patterns; the corresponding pattern functions are easily obtained from those of the CCRP, as we show in the thesis. Finally, constraints (26) and (27) define the decision variables. This model can be integrated with the DAPR similarly to the ICCRP.

Solution Approaches
We use budget constraints to reduce the ICCRP to a one-dimensional optimization problem. A budget constraint eliminates one of the objective functions by adding it as a constraint to the model: we can bound the DAPR objective by a parameter γ by adding budget constraint (28), and similarly we can bound the CCRP objective by adding constraint (29).
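In symbols (our notation, matching the DAPR sketch given earlier), the two budget constraints simply read

\[
\sum_{a} w_a\,(M_a - m_a) \;\le\; \gamma_{\text{DAPR}},
\qquad
\sum_{p} c_p\, v_p \;\le\; \gamma_{\text{CCRP}},
\]

where \(v_p\) measures the violation of pattern \(p\), \(c_p\) its penalty, and the \(\gamma\)'s are the chosen budgets.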

Using such constraints allows us to make a trade-off between the two objectives. The aim of our analysis was twofold. First, we analyzed the impact of the sequence of optimization, i.e., first minimizing the DAPR objective or the CCRP objective and then using a budget constraint; we extend this analysis with a Pareto-like approach, in which we solve the ICCRP for varying γ. Secondly, we analyzed the performance of 'large scale' optimization (i.e., solving the ICCRP for all groups at once) versus 'small scale' optimization (i.e., rostering per group or per multiple groups). We considered bounded running times, as our initial experiments showed that proving optimality was very difficult.

Results and Conclusion
We applied our approaches to the rostering process at base Utrecht. We compared the CCRP2, using weeks as clusters, to the CCRP model on two instances of approximately 100 duties. The CCRP2 was able to find optimal solutions in seconds or a few minutes, while the CCRP was not able to prove optimality within 30 minutes for either instance. Hence the CCRP2 is a promising addition to the literature; a more advanced implementation using, e.g., column generation could lead to a fast solution method for larger instances.

Our results for the ICCRP were somewhat mixed. For four roster groups, i.e., roughly 150 duties, the impact of the sequence of optimization was clearly visible: if we minimized the DAPR objective first we found relatively bad rosters, and vice versa. This implies that using tight budget constraints or fixing variables in an early stage of the solution process, as is done in such approaches, might not be a good idea in practice. Using the ICCRP model we were able to bridge the gap between such methods, i.e., the flexibility of this model allowed us to make a trade-off between finding good rosters and assuring a fair allocation. When rostering all groups at once, i.e., roughly 600 duties, we were not able to find good solutions if we treated the problem as one large-scale optimization problem. We did find evidence that decomposing the problem is beneficial in this case; especially our method that divided the set of 16 roster groups into sets of 4 groups was able to find good solutions from a rostering perspective. We conclude that solving the integrated problem for multiple roster groups is beneficial, but that we are not yet able to cope with very large instances, as we ideally want.



References
E. Abbink. Crew Management in Passenger Rail Transport. PhD thesis, Erasmus Research Institute of Management (ERIM), October 2014.
I. Caragiannis. Better bounds for online load balancing on unrelated machines. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 972-981. Society for Industrial and Applied Mathematics, 2008.
A. Hartog, D. Huisman, E. Abbink, and L. Kroon. Decision support for crew rostering at NS. Public Transport, 1(2):121-133, 2009.
M. Mesquita, M. Moz, A. Paias, and M. Pato. A decomposition approach for the integrated vehicle-crew-roster problem with days-off pattern. European Journal of Operational Research, 229(2):318-331, 2013.
M. Sodhi and S. Norris. A flexible, fast, and optimal modeling approach applied to crew rostering at London Underground. Annals of Operations Research, 127(1-4):259-281, 2004.
L. Xie and L. Suhl. Cyclic and non-cyclic crew rostering problems in public bus transit. OR Spectrum, 37(1):99-136, 2014.



Social Media: A Proxy for Health Care Quality Indicators
Daisy van Oostrom

In this research we discuss to what extent social media can act as an indicator of the health care quality of Dutch hospitals. For this purpose Twitter and Facebook are used as sources of free text messages, and indicators are retrieved from the Quality Windows ("Kwaliteitsvensters"), which display the scores of hospitals concerning certain topics. Sentiment analysis is used to evaluate correlations between sentiments on Facebook and Twitter and the Quality Window indicators. A model is proposed which bases the overall quality of a hospital on several features; the features chosen are based on the Quality Window indicators. The results show that of all features only one, capturing how quickly you can go to a hospital after making an appointment, is significantly correlated with one of the indicators.


Introduction
Hospitals in the Netherlands want to be open about their health care quality and allocate a lot of money each year to investigating patients' opinions on the provided service. In order to present this information, the experiences of the patients are measured (Miletus); the measured values are called quality indicators. Such indicators are used, e.g. by Elsevier, to rank 91 Dutch hospitals using 542 indicators. On the 13th of May 2014 the NVZ launched quality windows, containing information provided by the hospitals. The windows provide the same type of information for each hospital, allowing patients to compare the general quality across hospitals. Secondly, patients can now see how their hospital scores relative to previous years and the national average. Consequently, hospital information becomes more accessible for the Dutch population.

As the collection of quality information about hospitals is costly, alternative sources of information may be considered. One source to consider is the free text messages found on the Web. Health care is a service which every citizen of the Netherlands has the right to use, so it can be assumed that health care quality is widely discussed on social media. This makes social media sites valuable sources of people's opinions and sentiments (Pak and Paroubek, 2010). Retrieving opinions and sentiments belongs to the research field of opinion mining. An important part of our information-gathering behaviour has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise, as people now can, and do, actively use information technologies to seek out and understand the opinions of others. Hospitals want to know what patients think of their hospital, and growing opinion-rich resources such as Twitter and Facebook provide new opportunities to seek the opinion of patients. Therefore, an upcoming topic within this field is the use of sentiment analysis. With sentiment analysis it is possible to retrieve opinions from text messages; sentiment analysis deals with the computational treatment of sentiment in text (Pang and Lee, 2008).



dicators Previous studies indicate that there is a relationship between information on social media and quality of health care (Verhoef et al., 2014; Bardach et al., 2013; Greaves et al., 2013). For example, a study done by Timian et al. (2013) shows that Facebook “likes” have a strong negative association with mortality rates and a positive association with patient recommendation. But these studies only considered the relation between the likes and ratings on health service websites and objective measures of hospital quality (i.e., Consumer Quality Index measures). A new step would be to look whether ratings can be drawn from actual messages people place on the Web. Another aspect here is to see whether the information available about this topic changes over time. The changes in the available data over time are called trends. Trends of Twitter users are typically driven by emerging events, breaking news and general topics that attract the attention of a large fraction of users (Vaca et al., 2014). This thesis discusses to what extent we can use social media as an indicator for health care quality of Dutch hospitals. We use the social media sources Twitter and Facebook as indicator for health care quality by using the sentiments of people concerning the quality of hospital care. A difficulty is that these sources are not directly connected with health care, as they are platforms where lots of people can express their opinions about all their experiences. Data collection In this section we briefly want to discuss the data extraction and the tools that are used to extract data. Firstly, a web crawler tool called Visual Web Ripper2 used to collect information from the Facebook pages of the hospitals and to collect data from the Dutch quality windows. The quality windows (in Dutch “kwaliteitsvensters”) provide information about hospitals within the Netherlands. These windows provide the same type of information for each hospital, so that the patients can compare the general quality across hospitals. Each window provides information on 10 topics (indicators). Information is gathered for 81 hospitals in the Netherlands and ranges over the years 2010 to 2013. The number of variables of which values are known differs per hospital. Visual Web Ripper is a powerful web page scraper used for extracting website data and automatically harvests MET | Volume 22 | ßETA Special | 2015

content from web pages. Secondly, the messages from Facebook and Twitter are retrieved using Coosto, provided by Totta. Coosto is a tool for social media monitoring and webcare. All messages about hospitals in the Netherlands are considered; a selection of messages is made automatically using keywords. For the time frame of 2013, 1,569,608 messages were retrieved from Twitter and Facebook, of which 210,575 Twitter messages were provided with geo-locations. For the analysis, the message and its date and geo-location (only for Twitter), if available, are used.

Overview approach
The overall idea is to use sentiment analysis to create scores that represent hospital quality, which can then be used to calculate the correlations between these scores and already validated scores (in this case the data from the quality windows). The system overview in Figure 1 shows all the steps from collecting the data to eventually calculating the correlations.

Sentiment analysis
In this section the sentiments of patients with respect to a hospital are determined. In order to determine these sentiments, opinion mining is needed. The overall sentiment of a hospital can depend on different aspects on which people express their opinion. Normally, a first step in such an analysis would be the identification of these different aspects, which we call features. As this work focuses on a narrow field and explicitly wants to know the impact of already given indicators (the quality window indicators) on the overall assessment of a hospital, the features are based on the information behind these indicators. Therefore, before detecting the sentiments, the messages need to be processed according to the hospitals and the features.

Clustering based on hospitals
As we investigate the general quality across hospitals, the messages need to be clustered per hospital. First, the name of the hospital is used as a keyword: messages are linked to the hospital that is mentioned in the message. Second, if the message does not contain a hospital name, the geo-location belonging to the message is used, and the message is assigned



to the closest hospital. Last, if the message contains neither a hospital name nor a geo-location, the message is checked for place names; this only applies to cities that have one hospital.

Splitting messages based on features
After determining to which hospital a message belongs, the sentiment per feature of each message is calculated. The given quality indicators of the quality windows are used as features for the sentiment analysis, and for each hospital the number of messages assigned to it is recorded. To be able to find the sentiment per feature, the messages are analysed according to the features. Only the part of the message concerning a feature is used for determining the sentiment of that feature. This is done by first splitting each message into sentences. After this step a bag-of-words is made for each feature, containing the words of the sentences that include keywords of the feature. The keywords are formed from the descriptions of the indicators and their synonyms; the synonyms are retrieved from Cornetto, a lexical semantic database for Dutch. It is possible that a sentence is about multiple features; in that case the sentiment of the sentence is attributed to all those features.
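To illustrate the three assignment steps and the feature splitting, a minimal sketch is given below; the helper inputs (hospital name list, hospital coordinates, city-to-hospital map and feature keyword lists) are hypothetical and only stand in for the data described above.

```python
import re

def assign_hospital(message, hospital_names, hospital_coords, city_to_hospital):
    """Sketch of the three assignment steps: hospital name, geo-location, city name.
    `message` is assumed to be a dict with keys 'text' and optionally 'geo' = (lat, lon)."""
    text = message["text"].lower()
    for hospital in hospital_names:                        # 1) hospital name as keyword
        if hospital.lower() in text:
            return hospital
    if message.get("geo") is not None:                     # 2) closest hospital by geo-location
        lat, lon = message["geo"]
        return min(hospital_coords,
                   key=lambda h: (hospital_coords[h][0] - lat) ** 2
                               + (hospital_coords[h][1] - lon) ** 2)
    for city, hospital in city_to_hospital.items():        # 3) city name (cities with one hospital)
        if re.search(r"\b" + re.escape(city.lower()) + r"\b", text):
            return hospital
    return None                                            # message cannot be assigned


def split_by_feature(text, feature_keywords):
    """Per feature, collect the words of the sentences that contain one of its keywords."""
    sentences = re.split(r"[.!?]+", text.lower())
    bags = {feature: [] for feature in feature_keywords}
    for sentence in sentences:
        for feature, keywords in feature_keywords.items():
            if any(keyword in sentence for keyword in keywords):
                bags[feature].extend(sentence.split())
    return bags
```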


Furthermore, all words with hashtags are also taken into account, as it is assumed that they represent the overall topic of a message. Knowing which features are discussed per message gives us knowledge about the distribution of the discussed features. In total 10 features are considered, so the number of features stays constant over time. Storing this information per time period gives, for each hospital, a matrix in which an element of 0 indicates that the feature was not discussed in a message and 1 indicates that it was. The analysis is conducted for the year 2013 and we take time periods of a month, which means the analysis contains 12 time periods. The overall format of the output per hospital is given in Figure 2.

Sentiment Analysis
After splitting the messages based on features, the bag-of-words for each feature is used to conduct the sentiment analysis (Figure 1). The sentiment analysis gives an output of 0, 1 or -1: 0 when the sentiment is neutral, 1 when it is positive and -1 when it is negative. Storing this information per time period again gives matrices in the overall format per hospital shown in Figure 2.
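Continuing the sketch above, the per-message feature indicators (0/1) and sentiments (-1/0/+1) described here could be collected as follows; `score_sentence` is a placeholder for the SentiStrength-style scoring used in the thesis, and `split_by_feature` is the helper from the previous sketch.

```python
import numpy as np

def message_matrices(messages, feature_keywords, score_sentence):
    """For one hospital in one monthly period, build the per-message feature
    indicators (0/1) and sentiments (-1/0/+1)."""
    features = list(feature_keywords)
    discussed = np.zeros((len(messages), len(features)), dtype=int)
    sentiment = np.zeros((len(messages), len(features)), dtype=int)
    for m, text in enumerate(messages):
        bags = split_by_feature(text, feature_keywords)   # sketch above
        for f, feature in enumerate(features):
            if bags[feature]:
                discussed[m, f] = 1
                sentiment[m, f] = int(np.sign(score_sentence(bags[feature])))
    # Summing these per-message rows over all messages of a hospital gives the
    # aggregated per-hospital matrices used in the remainder of the analysis.
    return discussed, sentiment
```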



The sentiment analysis is performed with a Java tool called SentiStrength. In order to create a sentiment analysis that is adjusted to the topic of hospital quality, extra input is given to this tool. The input consists of lists of words and their positive or negative scores. Furthermore, the tool is set to give higher scores to words when a hashtag is used. The lists of words and their scores are created by first retrieving all messages with smileys. SentiStrength provides a list with all smileys and their scores, and the ten most used words in sentences with a smiley are given the same score as that smiley. Again, the synonyms of words within the list get the same score as the word in the list and are retrieved using Cornetto.

Output
All information needed about the messages and their sentiment for all hospitals over time is stored in two matrices. These matrices are obtained by aggregating the matrices of the format given in Figure 2 over the messages: for each hospital the sum of all the values is taken. The first matrix contains the distribution of the messages over the hospitals at each time period, with every element the sum over the messages per hospital at that time (see Figure 3). The second matrix contains the sentiment per hospital per feature at each time period, again with every element the sum over the messages per hospital at that time (see Figure 3).
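Referring back to the smiley-seeded lexicon construction described above, a minimal sketch with hypothetical inputs could look as follows.

```python
from collections import Counter
import re

def build_lexicon(messages, smiley_scores, top_n=10):
    """For each smiley, give its score to the `top_n` most used words
    in sentences that contain that smiley."""
    lexicon = {}
    for smiley, score in smiley_scores.items():
        counts = Counter()
        for text in messages:
            for sentence in re.split(r"[.!?]+", text):
                if smiley in sentence:
                    # count words in the sentence, ignoring the smiley token itself
                    counts.update(w for w in sentence.lower().split() if w != smiley)
        for word, _ in counts.most_common(top_n):
            lexicon[word] = score
    return lexicon

# Toy usage with two smileys and a handful of (Dutch) messages.
msgs = ["geweldig ziekenhuis, vriendelijk personeel :)", "lang wachten en slechte info :("]
print(build_lexicon(msgs, {":)": 2, ":(": -2}))
```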

Model
Now a score per feature per hospital at time t and the number of times a feature is discussed per hospital at time t are defined. With this information, the sentiment per feature per hospital at time t is defined as the score per feature per hospital corrected for the number of times this feature is discussed at time t. The data matrix of scores is treated as an element-wise product of two factors, both of size equal to the number of hospitals times the number of features: the sentiment score of a hospital per feature at time t, which can be interpreted as the average score of a hospital per feature at that time, and the number of times each feature is discussed. Note that the sentiment score always lies in the range [-1, 1], as the sentiment of a message per feature per hospital is -1, 0 or 1. It is further assumed that the number of messages per hospital per feature at time t is explained by the number of messages per hospital at time t and the weights of the features at time t, where the weights represent the division of the messages over the features. The weights matrix can in turn be decomposed into two non-negative matrices such that their product approximates it well. Furthermore, two transition matrices are introduced in which the current distributions are linearly explained from the previous ones. Knowing whether the distributions linearly depend on the previous ones gives information about how these distributions change over time, which helps to recognise patterns in the distributions over time. From these descriptions, Equations (1)-(4) are derived, involving the element-wise product and the following quantities: the score per feature per hospital at time t; the sentiment score per feature per hospital at time t; the number of times a feature is discussed per hospital at time t; the total number of messages per hospital at time t; the weight per feature at time t; and the two transition matrices that explain how much the current distribution is linearly explained from the previous one.

Time based factorization
Opinions can change, and at time t we only want to know someone's opinion at that time. So if we assign a score to a hospital at time t, we want to know the opinion of people at that specific point in time. One is faced with a trade-off between past and present observations: completely forgetting the past might result in a loss of crucial contextual information, while focusing only on the past ignores the fact that opinions can change over time. Both the past and the present contain valuable information, and therefore the trade-off between the two is modelled. The aim is to decompose the weight matrix into two non-negative factors of conformable dimensions; the present decomposition at time t is given in Equation (5).

Only considering the present decomposition, one factor is defined as the sentiment score per hospital at time t and the other holds the weights of the features. Without taking past information into account, the present decomposition is simply equal to the current weight matrix. Therefore, the feature distribution is decomposed in terms of the previous distributions and the current distribution, leading to the past decomposition in Equation (6). Here it is stated that the weight factor depends on two parts: the first part is the present weight, and the second part is the vector containing weights based on past information. The trade-off between the past and present weights is expressed by two parameters. By restricting these two parameters to sum to 1, they represent the percentages of how strongly the model depends on the past and on the present; this restriction also helps to ensure that the constraints of the decomposition hold. If the model strongly depends on the past, the sentiments are considered constant over time, meaning that the past is a good prediction of the future. If the model strongly depends on the present, the sentiments change radically over time and the past has no predictive value for the future. By modelling the trade-off between past and present, the present decomposition can be interpreted differently: it now gives the corrected sentiment score per hospital and contains information from the past. To summarise, a collective decomposition is performed in Equation (7), subject to the constraints in Equations (8)-(10).
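As a sketch, and with notation assumed here rather than taken from the thesis (X for the scores, S for the sentiment scores, D for the feature counts, m for the message counts, W for the weights, F and G for the non-negative factors, T for the transition matrices, and α and β for the trade-off parameters), the relations described above can be summarised as:

$$
\begin{aligned}
X^t &= S^t \circ D^t, \qquad D^t \approx \operatorname{diag}(m^t)\,W^t, \qquad W^t \approx F^t G^t,\\
W^t &\approx W^{t-1} T^t_W, \qquad m^t \approx m^{t-1} T^t_m, \qquad
G^t = \alpha\, G^t_{\text{present}} + \beta\, G^t_{\text{past}}, \qquad \alpha + \beta = 1,\ \ \alpha, \beta \ge 0 .
\end{aligned}
$$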

In order to solve the problem, a loss function needs to be defined which quantifies the distance between the data and the decompositions of Equation (7). The resulting optimisation problem aims to minimise this loss function, given in Equation (11), subject to the constraints above, where the Frobenius norm of a matrix is the square root of the sum of its squared elements and the corresponding vector norm is defined analogously over the elements of a vector. From the loss function in Equation (11), the gradients with respect to each parameter are derived in Equations (12)-(14). Substituting these gradients into the Karush-Kuhn-Tucker first-order conditions of the problem shows that the loss function is invariant under the updates if and only if the factor matrices are at a stationary point of the function, and from these conditions the update equations are derived. Algorithm 1 (Joint Past Present Decomposition) shows how the update equations are applied: the factor matrices are given a random non-negative initialization, the remaining parameters a random initialization, and the update equations are then applied repeatedly until convergence. With the update equations we are able to create a sentiment score that takes both past and present information into account. Between these sentiment scores and the retrieved indicator variables, the correlations are calculated.
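As an illustration of the style of multiplicative updates used in such non-negative factorizations, a minimal NumPy sketch of the classical update rules for a Frobenius-norm loss is given below; this is not the thesis' Algorithm 1, only an indication of how such an iteration looks.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Classical multiplicative updates for non-negative matrix factorization,
    minimising ||V - W H||_F^2 subject to W, H >= 0; shown only to illustrate
    the kind of update rule applied in a joint past-present decomposition."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H while keeping it non-negative
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W while keeping it non-negative
    return W, H

# Example: factorize a small non-negative matrix into rank-2 factors.
V = np.array([[1.0, 0.5, 0.0], [0.8, 0.4, 0.1], [0.0, 0.2, 0.9]])
W, H = nmf_multiplicative(V, rank=2)
print(np.round(W @ H, 2))
```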

Results and conclusion
Heatmaps are used to observe the distributions, as shown in Figure 4. At first glance it seems that larger hospitals have more messages available to analyse, but the changes over time are quite big. The distributions of the messages per hospital per feature are also not constant over time. There is reason to assume this is due to the topic sensitivity of social media: when a topic appears in the news and is much discussed, more and more people tend to have an opinion about it. The distributions of discussed features are used to assign weights to the features. The weights assigned to the features do stay more or less constant over time, because the relative distribution of the features stays the same. The evaluation of doctors and pain prevention after surgery have the biggest weights and are thus the topics that are discussed most on social media. For the time-based factorization model it is assumed that the sentiments change gradually over time. This condition is not met, which is in line with the previous conclusions: the sentiments in one period have no clear relation with the sentiments in the previous period. This could explain why the sentiment scores are not correlated with their indicator variables. Finally, the results show that of all features, the feature capturing how quickly you can go to a hospital after making an appointment is significantly correlated with one of the indicator variables. This indicator variable holds the percentage of patients that had to wait more than four weeks to receive treatment per hospital.



References
N.S. Bardach, R. Asteria-Penaloza, W.J. Boscardin, and R.A. Dudley. The relationship between commercial website ratings and traditional hospital performance measures in the USA. BMJ Quality and Safety, 22, 2013.
F. Greaves, D. Ramirez-Cano, C. Millett, A. Darzi, and L. Donaldson. Use of sentiment analysis for capturing patient experience from free-text comments posted online. Journal of Medical Internet Research, 15(11), 2013.
A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. 2010.
B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2, 2008.
A. Timian, S. Rupcic, S. Kachnowski, and P. Luisi. Do patients like good care? Measuring hospital quality via Facebook. American Journal of Medical Quality, 25(5), 2013.
C. Vaca, A. Mantrach, A. Jaimes, and M. Saerens. A time-based collective factorization for topic discovery and monitoring in news. International World Wide Web Conference Committee (IW3C2), 2014.
L.M. Verhoef, T.H. van de Belt, L.J.L.P.G. Engelen, L. Schoonhoven, and R.B. Kool. Social media and rating sites as tools to understanding quality of care: A scoping review. Journal of Medical Internet Research, 16(2), 2014.







Optimization Under Privacy Preservation: Possibilities and Trade-offs
Rowan Hoogervorst
Erasmus University Rotterdam

We investigate the trade-off between privacy and solution quality that occurs when a privacy-preserving database is used as input data to a bin-packing problem. To minimize this trade-off we first suggest different methods of privacy preservation, under the privacy criteria of k-anonymity and differential privacy. Here we investigate the effects of using different global recoding techniques for k-anonymity. Secondly, we suggest optimization methods that minimize the extent to which the data perturbation affects the feasibility and objective value of the found solution. We utilize the frameworks of robust and stochastic optimization to enforce chance constraints for the case of k-anonymity, while we propose a heuristic based on the Next-Fit bin-packing heuristic to minimize the effect of perturbation in the case of interactive differential privacy. In an empirical evaluation of the proposed methods we find that for k-anonymity the combination of K-Optimize and stochastic programming provides the best overall solution quality. For differential privacy, by contrast, we find that enforcing feasibility comes at great cost in terms of the number of bins used, and that our suggested heuristic performs best at ensuring feasibility.


Introduction
The last decades have seen a trend in which ever more data is being collected about individuals and in which ever more of such data is made public. The combination of these two trends has led to privacy concerns throughout society, as adversaries may be able to obtain sensitive information about individuals in the data by joining the published data with available background information in cases where no adequate privacy measures are taken. A convincing example is given by Sweeney (2002), who identified the medical records of the governor of Massachusetts in a publicly available medical database. Different methods have thus been proposed to properly preserve privacy in published databases, where the two most common frameworks are that of recoding the data through generalization and suppression, and that of adding random noise to the data (Iyengar, 2002). In both cases the original data is perturbed, affecting the further usability of the data. An area in which this effect may be especially important is decision making, where the added data uncertainty due to perturbations may pose a serious concern for decision quality. This paper considers more specifically the effects of using a privacy-preserving database as input to an optimization problem. The problem is considered from both the perspective of privacy preservation and that of optimization, where our main focus is on the natural trade-off that occurs between the level of privacy provided by the privacy-preserving database and the solution quality of the bin-packing problem in terms of objective value and feasibility of the found solution. As previous literature on using privacy-preserving data in optimization is absent, a major contribution of this paper is to propose a framework in which the effect of enforcing the required privacy criteria on solution quality is as small as possible.

The Problem Setting
The optimization problem we focus on is the well-known bin-packing problem, which is at the core of many operational research problems. The problem




considered in bin-packing is to find an allocation of items over bins such that the minimum number of bins is used. Here each item $j$ has weight $w_j$ and each bin $i$ has capacity $c$. If we introduce the decision variables

$$y_i = \begin{cases} 1 & \text{if bin } i \text{ is used,}\\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

$$x_{ij} = \begin{cases} 1 & \text{if item } j \text{ is assigned to bin } i,\\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$

the formulation of the bin-packing problem as proposed by Martello (1990) is given by

$$\min \sum_i y_i \qquad (3)$$

$$\text{s.t.} \quad \sum_j w_j\, x_{ij} \le c\, y_i \quad \forall i, \qquad (4)$$

$$\sum_i x_{ij} = 1 \quad \forall j, \qquad (5)$$

$$x_{ij},\, y_i \in \{0,1\} \quad \forall i, j. \qquad (6)$$
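As a concrete illustration of formulation (1)-(6), the sketch below solves a small instance with an off-the-shelf MIP solver; the choice of the PuLP library with its bundled CBC solver is an assumption made for illustration only, not the software used in the thesis.

```python
import pulp

def solve_bin_packing(weights, capacity):
    """Solve the bin-packing MIP (1)-(6) for a list of item weights."""
    n = len(weights)                                                       # at most n bins needed
    prob = pulp.LpProblem("bin_packing", pulp.LpMinimize)
    y = pulp.LpVariable.dicts("y", range(n), cat="Binary")                 # (1) bin i used
    x = pulp.LpVariable.dicts("x", (range(n), range(n)), cat="Binary")     # (2) item j in bin i
    prob += pulp.lpSum(y[i] for i in range(n))                             # (3) minimise bins used
    for i in range(n):                                                     # (4) capacity constraints
        prob += pulp.lpSum(weights[j] * x[i][j] for j in range(n)) <= capacity * y[i]
    for j in range(n):                                                     # (5) assign every item once
        prob += pulp.lpSum(x[i][j] for i in range(n)) == 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [[j for j in range(n) if pulp.value(x[i][j]) > 0.5]
            for i in range(n) if pulp.value(y[i]) > 0.5]

# Example: six items packed into bins of capacity 10.
print(solve_bin_packing([4, 8, 1, 4, 2, 1], capacity=10))
```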

Considering this problem, the privacy-sensitive information is assumed to be the weights of the items. There are now two important challenges in applying the combination of privacy preservation and optimization. First of all, the data publisher faces the challenge of ensuring a certain level of privacy while maximizing the further usability of the data for optimization. The optimizer, in turn, aims to find a solution that reduces the effects of data uncertainty as much as possible while ensuring a good objective value. These two challenges form the core of our methodological exploration.

Methodology
Privacy preserving data publishing
An important choice for the data publisher is the privacy criterion that will be used to anonymize the data, as this criterion determines the protection for individuals in the data. In our case we choose the privacy criteria of k-anonymity and differential privacy, which counter two fundamental types


of attacks that an adversary could employ to compromise privacy: the linkage attack and the probabilistic attack (Fung, Wang, Fu, & Philip, 2010). Before we introduce these two types of attacks, we formalize our setting. Let the available data be given by a table of rows and columns, where the rows represent the data entries and the columns the different variables of interest in the table, which are also called the attributes; the table can then be represented in terms of these attributes. It is common to subdivide the attributes into explicitly identifying, quasi-identifying, sensitive and non-sensitive attributes (Fung et al., 2010). An explicit identifier immediately identifies an individual and a quasi-identifier may be used to identify an individual when combined with possible background information, while sensitive attributes are the attributes that should remain private. Lastly, non-sensitive attributes are all remaining attributes. A linkage attack now refers to the situation in which an adversary is able to link an individual to either a certain entry in the table, a given attribute of the table or even to the table as a whole (Fung et al., 2010). A probabilistic attack refers to the situation in which the publication of the dataset allows the adversary to change his belief significantly that some individual has some property, or more formally, if his prior belief that the individual has the property is significantly different from his posterior belief. The basic idea of the k-anonymity privacy criterion (Sweeney, 2002) is to prevent linkage attacks by recoding the data such that every entry is indistinguishable from at least k-1 other data entries. Thus, assuming that a table is already scrubbed from any directly identifying attributes, we can define k-anonymity as:

Definition 1 (k-Anonymity (Machanavajjhala, Kifer, Gehrke, & Venkitasubramaniam, 2007)). A table satisfies k-anonymity if every record in the table is indistinguishable from at least k-1 other records with respect to every set of quasi-identifier attributes.
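As a small illustration of Definition 1, the sketch below checks k-anonymity of a toy table with pandas; the column names and data are hypothetical and the quasi-identifiers are assumed to be already generalized.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Check whether every combination of quasi-identifier values occurs at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Toy table with generalized quasi-identifiers (age bracket, partial postcode).
records = pd.DataFrame({
    "age":      ["20-30", "20-30", "20-30", "30-40", "30-40", "30-40"],
    "postcode": ["30**",  "30**",  "30**",  "31**",  "31**",  "31**"],
    "weight":   [4, 8, 1, 4, 2, 1],   # the sensitive attribute (item weights)
})
print(is_k_anonymous(records, ["age", "postcode"], k=3))   # True
print(is_k_anonymous(records, ["age", "postcode"], k=4))   # False
```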



One can achieve k-anonymity in a database through various ways of recoding the data, where the two most common techniques are generalization and suppression (Bayardo & Agrawal, 2005). Here generalization refers to making a given entry more general, while suppression amounts to replacing the entry by a predefined suppression token. As finding an optimal k-anonymous generalization is in general an NP-hard problem (Meyerson & Williams, 2004), different methodologies and methods have been developed to perform the recoding. Our focus in this paper is on global recoding techniques; more specifically, we consider full-domain generalization, partition-based single-dimensional recoding and partition-based multi-dimensional recoding. The implementations we choose for these global recoding techniques are, respectively, the Flash algorithm (Kohlmayer, Prasser, Eckert, Kemper, & Kuhn, 2012), the K-Optimize algorithm (Bayardo & Agrawal, 2005) and the Mondrian algorithm (LeFevre, DeWitt, & Ramakrishnan, 2006). Based on theoretical arguments we argue that full-domain generalization in general perturbs the data the most, but handles large datasets best of all methods. Single-dimensional and multi-dimensional recoding theoretically offer the least perturbation, but are either time-consuming, as in the case of exact single-dimensional recoding, or can only be solved heuristically, as in the case of multi-dimensional recoding.

The second privacy criterion considered is that of differential privacy (Dwork, 2006), which targets probabilistic attacks. To derive this privacy criterion, Dwork (2006) reasons that given certain background information it is possible to learn sensitive information about someone when a database is published, regardless of whether this individual is included in the database. She thus argues that achieving absolute protection against probabilistic attacks is infeasible. To overcome this problem, differential privacy does not require that the publication of a database by itself should not make one liable to a probabilistic attack, but instead that the inclusion of someone in the database should not make him significantly more liable to any probabilistic attack:

Definition 2 (ε-differential privacy (Dwork, 2006)). A randomized function K gives ε-differential privacy if for all tables D and D' differing on at most one element, and all S ⊆ Range(K),

$$\Pr[K(D) \in S] \le e^{\varepsilon}\, \Pr[K(D') \in S]. \qquad (7)$$
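A common way to achieve Definition 2 for numeric queries is the Laplace mechanism, which is also used later in this paper. A minimal sketch, assuming NumPy and a bounded-sum query whose sensitivity is known, is:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release a numeric query answer with Laplace noise of scale sensitivity/epsilon added."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
weights = np.array([4.0, 8.0, 1.0, 4.0, 2.0, 1.0])
# For a sum query over weights bounded by the bin capacity (here 10), adding or
# removing one record changes the sum by at most 10, so the sensitivity is 10.
noisy_total = laplace_mechanism(weights.sum(), sensitivity=10.0, epsilon=0.5, rng=rng)
print(noisy_total)
```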

We thus require that the inclusion or exclusion of a single data entry does not significantly change the output of the random function for any output set S, making someone indifferent, on the basis of privacy, between being included in or excluded from the database. To achieve differential privacy we make use of the Laplace mechanism (Dwork, 2008) and the Exponential mechanism (McSherry & Talwar, 2007). We argue that the first method is mostly applicable in the case of numeric output, while the second method is more suitable when the output is non-numeric.

Optimization
The choice of optimization method depends on the method of anonymization that is used. Moreover, it depends on the summary statistics released about every item (observation) in the data. One possibility would be to directly use the query results from the privacy-preservation methods as input to the bin-packing problem. While this might be a fruitful approach in some contexts, it generally disregards the uncertainty that is involved with these summary statistics, which are used as weights, and this may lead to poor solution quality. The data uncertainty in the case of k-anonymity relates to the generalization applied to the actual weight of an item, while in the case of differential privacy it relates to the random noise added to the query of interest. We consider three approaches in this paper to reduce the effects of data uncertainty. The first two apply chance-constrained optimization, which enforces a constraint to be satisfied with some probability 1 - ε, for a given ε in (0, 1); these are the so-called chance constraints (Charnes, Cooper, & Symonds, 1958). We assume that the true realizations of the weight vector follow some (partially) known distribution.


We consider the chance constraints

$$P\!\left(\sum_j w_j\, x_{ij} \le c\, y_i\right) \ge 1 - \varepsilon \quad \forall i \qquad (9)$$

for the set of capacity constraints (4). Note that these are the only constraints involving the weights and thus the only uncertain constraints. An intuitive way to solve a problem involving the chance constraints (9) is to consider the actual support of the distribution of the weights. Assuming that this support is finite, each of its elements indicates a possible realisation of the weight vector, and the problem can be written as a stochastic programming problem over these realisations. In this formulation, binary variables are introduced that allow the capacity of a bin to be violated for some of the realisations, while a knapsack-type constraint enforces that such violations do not occur more often than with probability ε, so that the chance constraints (9) are satisfied.
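Purely as an illustration, one way such a scenario-based formulation could be written is sketched below; the violation indicators $z_{is}$ (per bin $i$ and scenario $s$), the scenario probabilities $p_s$ and the violation budget $\varepsilon$ are assumed notation, and the thesis' exact formulation may differ:

$$
\begin{aligned}
\min\ & \sum_i y_i\\
\text{s.t.}\ & \sum_j w^s_j\, x_{ij} \le c\, y_i + c\, z_{is} && \forall i, s,\\
& \sum_i x_{ij} = 1 && \forall j,\\
& \sum_s p_s\, z_{is} \le \varepsilon && \forall i,\\
& x_{ij},\, y_i,\, z_{is} \in \{0,1\} && \forall i, j, s .
\end{aligned}
$$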

However, due to the large increase in model size implied by the stochastic programming approach, we alternatively also consider a robust optimization approach (Ben-Tal, El Ghaoui, & Nemirovski, 2009). Under reasonable assumptions on the distribution of the weights, we derive deterministic constraints that enforce the chance constraints (9) for an appropriately chosen safety level. We furthermore note that these constraints can be represented as a linear system, making the resulting model significantly smaller than in the case of stochastic optimization. However, as we now make use of probability bounds, these constraints may potentially be less accurate.

Lastly, we note that for the case of differential privacy the above methods are in general not easily applicable, as the Laplace-distributed noise added by the mechanism implies that the support of the perturbed weights is not finite. Instead, we consider a heuristic approach to solve the bin-packing problem in the case of differential privacy, which reduces the number of queries made on the database; in this way the sensitivity of the queries becomes lower, reducing the amount of noise added to the query answers. The heuristic we suggest is a variation on the Next-Fit bin-packing heuristic. The idea of Next-Fit is to pack the items one by one, adding items to a bin until it is full; if an item no longer fits, the bin is closed and the next bin is considered. A naive implementation of Next-Fit would however not be the best choice, as it would require a query for the total weight in the current bin every time an element is added. Let us instead assume that the optimizer has a reasonable idea of the mean weight of the items, or obtains it from a query on the database. A good starting point is then to expect roughly the bin capacity divided by the mean weight of items to fit into a bin, from which we can move up or down depending on whether these items actually fit. This is the main idea behind the Mean Adjusted Next-Fit algorithm: we initially try to fit this expected number of items, after which we iteratively decrease or increase the number of items to be fitted by a step size m, depending on whether the initial estimate fitted or not. The evaluations that determine whether the chosen number of items fits are made through the Laplace mechanism.
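A sketch of this heuristic is given below; the way the noisy "does this set of items fit" test is queried, the sensitivity used and the step handling are assumptions made for illustration and may differ from the exact algorithm in the thesis.

```python
import numpy as np

def mean_adjusted_next_fit(weights, capacity, mean_estimate, epsilon, step=1, seed=0):
    """Mean Adjusted Next-Fit style heuristic: guess how many of the next items fit
    into a bin based on the mean weight, then adjust the guess up or down in steps
    of `step`, using noisy sum queries (Laplace mechanism) to test feasibility."""
    rng = np.random.default_rng(seed)
    n = len(weights)

    def noisy_fits(start, count):
        # Noisy answer to "do items start .. start+count-1 fit into one bin?".
        # The sensitivity of the bounded-sum query is taken to be the capacity.
        noisy_sum = sum(weights[start:start + count]) + rng.laplace(scale=capacity / epsilon)
        return noisy_sum <= capacity

    bins, start = [], 0
    while start < n:
        count = min(max(1, int(capacity // mean_estimate)), n - start)   # initial guess
        if noisy_fits(start, count):
            # the guess fits: greedily try to add `step` more items at a time
            while count < n - start and noisy_fits(start, min(count + step, n - start)):
                count = min(count + step, n - start)
        else:
            # the guess does not fit: remove `step` items at a time (keep at least one)
            while count > 1 and not noisy_fits(start, count):
                count = max(1, count - step)
        bins.append(list(range(start, start + count)))
        start += count
    return bins

# Example: items with mean weight about 3 packed into bins of capacity 10.
print(mean_adjusted_next_fit([4, 8, 1, 4, 2, 1], capacity=10, mean_estimate=3.0, epsilon=1.0))
```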

Experimental evaluation
After introducing methods for both the privacy preservation and the optimization phase, and giving theoretical arguments supporting their use, the aim of this section is to evaluate these methods empirically. An important aim in this respect is to check the methods for robustness, such that they perform well in a wide range of bin-packing settings. To do so, we evaluate five different bin-packing settings, which are presented in Table 1. Here we assume that



the weights are distributed uniformly on the interval for settings I-IV, while of the weights are and of the distributed uniformly on weights uniformly on in case of setting V.

We now evaluate the performance of the different anonymization methods for the cases of k-anonymity and differential privacy first. These results are displayed in Figure 1 for the case of k-anonymity and in Figure 2 for the case of differential privacy. In these evaluations we consider only the standard model of optimisation for bin-packing. The summary statistics chosen for the case of k-anonymity are that of the mean and upper bound statistic, which represent respectively the mean and upper bound of a class of values that are indistinguishable from each other. For the case of differential privacy, the summary statistic considered is simply the noisy estimate as returned by the Exponential mechanism. To simplify the analysis, we here present the performance ratio, which is the ratio of the obtained objective value with privacy preservation over that without privacy preservation. To a large extent these results

confirm our theoretical analysis. The method of Flash has the biggest impact on the objective value, while Mondrian and K-Optimize perform considerably better. However, the performance of these methods depends on the problem instance at hand: Flash performs especially badly in cases where the item weights are large relative to the bin capacity and in the case of the upper bound summary statistic. We argue that this behaviour can be explained by the structure of the problem. Moreover, we also see that the difference between K-Optimize and Mondrian is especially visible for instances I and V, whereas for the other settings the results are more similar. Interesting to note is that the running times for Flash and Mondrian were mostly comparable, while K-Optimize is considerably slower for smaller k. Lastly, we find that for the case of differential privacy the objective value seems to be correct on average, but the deviations may be substantial. More importantly, the feasibility of the solutions is poor; in some cases none of the found solutions is feasible when considering the true weights instead of the noisy weights that were used to obtain the solution.

Next we evaluate the performance of the different optimization methods suggested. The results for the case of k-anonymity are displayed in Figure 3 and for differential privacy in Figure 4. Here we only consider the K-Optimize algorithm for the case of k-anonymity, as this has shown to be the

34

MET | Volume 22 | Ă&#x;ETA Special | 2015


best performing. What we observe is that for the case of k-anonymity the method of robust optimization performs poorly: the returned results are in every instance simply equal to the results given by the upper bound summary statistics, resulting in a poor performance ratio. The best method for k-anonymity thus seems to be the stochastic programming approach. For most problem instances its objective value is considerably lower, while the chance constraints are satisfied at the intended probability level. However, for the larger problem instances this method sometimes performs poorly due to an inability to find optimal solutions within the given time. For the case of differential privacy we note that the method of Mean Adjusted Next-Fit also performs somewhat mixed.


Its performance is quite similar to that of simply using the Exponential mechanism when the privacy requirement is weak, but often poorer when the privacy requirement is strong. However, we derive that with some minor adjustments to this algorithm better feasibility can be achieved, though only at substantial cost to the objective value.

Conclusion
Considering the results of the experiments, we found that one can obtain reasonable estimates of the true objective value at reasonable cost in terms of uncertainty when one is not interested in feasibility. However, enforcing feasibility comes at greater cost to the objective value, especially when considering a strong



criterion of privacy such as differential privacy. K-anonymity with stochastic programming and K-Optimize provided better results, but mostly in the settings where the level of k was not too large or where the weights were reasonably small. Hence we conclude that when the optimizer needs to obtain feasible solutions without increasing the objective value unrealistically, this can only be done when the level of privacy protection is low. In this sense our paper shows that the work in this field is certainly not finished and that more advanced methods may still be necessary in cases where the privacy requirements are high.

References
Bayardo, R. J., & Agrawal, R. (2005). Data privacy through optimal k-anonymization. In Proceedings of the 21st International Conference on Data Engineering (pp. 217-228).
Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust optimization. Princeton University Press.
Charnes, A., Cooper, W. W., & Symonds, G. H. (1958). Cost horizons and certainty equivalents: An approach to stochastic programming of heating oil. Management Science, 4(3), 235-263.
Dwork, C. (2006). Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, & I. Wegener (Eds.), Automata, Languages and Programming (Vol. 4052, pp. 1-12).
Dwork, C. (2008). Differential privacy: A survey of results. In Theory and Applications of Models of Computation (pp. 1-19). Springer.
Fung, B. C., Wang, K., Fu, A. W.-C., & Philip, S. Y. (2010). Introduction to privacy-preserving data publishing: Concepts and techniques. CRC Press.
Kohlmayer, F., Prasser, F., Eckert, C., Kemper, A., & Kuhn, K. A. (2012). Flash: Efficient, stable and optimal k-anonymity. In Proceedings of the 4th IEEE International Conference on Information Privacy, Security and Trust (PASSAT) (pp. 708-717).
LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian multidimensional k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) (pp. 25-25).
Machanavajjhala, A., Kifer, D., Gehrke, J., & Venkitasubramaniam, M. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 3.
Martello, S., & Toth, P. (1990). Knapsack problems: Algorithms and computer implementations. Chichester; New York: J. Wiley & Sons.
McSherry, F., & Talwar, K. (2007). Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07) (pp. 94-103).
Meyerson, A., & Williams, R. (2004). On the complexity of optimal k-anonymity. In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 223-228).
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557-570.




Congratulations to the βETA winners!




