

PUB 607 TECH PROJECT REPORT:

Team “DAT!Analysis” Content Analysis of MPub Project Reports and TKBR Essays

Alison Strobel, Monica Miller, Zoe Wake Hyde, Josh Oliveira and Alice Fleerackers


Part I: Introduction

Research Questions & Vision

This project was conceived as part of a wider undertaking by the Master of Publishing class of 2015 to explore the creation of a new Journal of MPub, showcasing the work they have produced throughout the year. Our team's contribution was to collect the current work, as well as that of past cohorts, and see what insights could be drawn from it. The following report details the process by which our team collected, processed and analyzed three years' worth of essays from the PUB 800 and PUB 802 courses hosted on the school's TKBR site, as well as fifteen years' worth of project reports from the SFU institutional repository.

Our goal was to gain a deeper understanding of the content generated by SFU's Master of Publishing students. We were curious how, or perhaps whether, MPub writing has transformed over the years to reflect changing societal and industry trends. We wanted to understand how events such as the rise of social media or the financial crisis of 2008 have altered the kinds of things we write about as publishing students and the way we write about them.

We were also curious about how the kind of writing published on TKBR differs from that published in SFU's Summit Research Repository. Are TKBR posts more blog-like? Are they more closely tied to current events? How does the sentence length in TKBR posts compare to that in the project reports? The TKBR posts also present a unique opportunity to conduct more detailed analysis using the associated metadata. Does the time of day a post was published affect its sentence length? Are the tags students assign truly effective descriptions of their essay topics? What factors (if any) influence the number of links and sources used throughout a post?

These are just some of the questions we chose to investigate over the course of our analysis, though not all were achievable within the timeframe. We hope that the results of our investigation will provide greater insight into what is being written in the publishing program and why, findings that could be used to inform acquisition strategy and site organization for the forthcoming Journal of MPub 2016.



Part II: Methodology

2.1 Collection

TKBR Data Collection

Our initial choice for the TKBR content extraction was between communicating with the WordPress API and downloading the content directly from the database. When we realized that the WP JSON API was trickier than we had anticipated, we decided to go directly to the database. This process began with initial advice from Juan:

Figure 2.1.1 Juan's initial advice

In order to access the database, we downloaded Sequel Pro, which allowed us to connect directly to the TKBR MySQL database on local and remote servers (see Fig 2.1.2).



Figure 2.1.2 Downloading Sequel Pro

Using the credentials that Juan provided, we were able to access the database over a Secure Shell (SSH) connection, a network protocol for secure data communication and remote command execution (see Fig 2.1.3). This was very handy for us because we could access the database from home instead of having to be on campus.

Figure 2.1.3 Using a Secure Shell connection to access the database

Once inside the database, we were able to see all of the blog IDs and determine which WordPress sites held the content we needed for our project (see Fig 2.1.4). Our desired content was housed in blog IDs 9 and 12, which corresponded to the courses PUB 802 and PUB 800, respectively.



Figure 2.1.4 Blog IDs

We accessed the 802 post content by selecting wp_9_posts and filtering for post_type: post, so that unpublished drafts, deleted posts, and other revisions would not skew our data (see Fig 2.1.5). We did the same for the 800 posts (wp_12_posts), then exported our filtered results as comma-separated value (CSV) files.
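For anyone repeating this step, the same extraction can also be scripted rather than done through Sequel Pro's interface. The sketch below is only illustrative: the table name and post_type filter come from the process described above, but the connection details are placeholders, and it assumes an SSH tunnel to the TKBR server is already open.

# Illustrative only: export published PUB 802 posts to CSV over an existing SSH tunnel.
# Host, user, password, and database name are placeholders, not real credentials.
import csv
import pymysql  # third-party: pip install pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="mpub_user", password="********",
                       database="tkbr_wordpress")

query = """
    SELECT ID, post_date, post_title, post_content
    FROM wp_9_posts            -- blog ID 9 = PUB 802
    WHERE post_type = 'post'   -- skip revisions, menu items, etc.
"""

with conn.cursor() as cur, open("pub802_posts.csv", "w", newline="", encoding="utf-8") as f:
    cur.execute(query)
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur.fetchall())

conn.close()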

Figure 2.1.5 Filtered wp_9_posts (PUB 802 posts)

As per Juan's advice, we downloaded OpenRefine, a data cleanup and format transformation tool, from GitHub, following the installation documentation. After importing our CSVs, we found that our post content was split across several columns, making it very hard to manage (see Fig 2.1.6). This error was caused by in-text commas within the posts, which terminated fields early and split sentences into new columns instead of keeping all related content in one post_content column. This may have been due to a bug in MySQL, OpenRefine, or both.

Figure 2.1.6 802 posts in OpenRefine split into various columns and rows

Attempting to fix this problem, we imported the CSV file into Sublime Text (see Fig 2.1.7) to search through the content and HTML code and eliminate the offending characters (rogue commas). From there, we could construct a regular expression to eliminate all such rogue characters.

Figure 2.1.7 802 posts in Sublime Text

Eventually, Juan came up with a workaround for us: re-export the file from the database, but terminate fields with an @ instead of a comma (see Fig 2.1.8). The character choice made sense, as a stray @ was unlikely to appear in the posts in a way that would break a field. As we learned throughout this project, workarounds of this kind are used frequently when cleaning and analyzing data.

Figure 2.1.8 Re-exporting 802 posts with an @ as the termination character

Juan had determined that certain characters were breaking the posts into excessive columns, so to fix that we imported the PUB 802 CSV file back into Sublime Text and did a search and replace, changing \\"\n to \\" so that our post content would remain in one column (see Fig 2.1.9).

Figure 2.1.9 Search and replace in Sublime Text



We then re-imported the CSV into OpenRefine with the column-separating character changed to @, which fixed the problem: all post content now remained in the proper column, under post_content, enabling further data manipulation (see Figure 2.1.10).

Figure 2.1.10 All 802 post content in the proper column!

Unfortunately, the same operations did not resolve identical issues with the PUB 800 content. We tried other unlikely termination characters (such as ~) to see if we could get all of the content to remain in one column, but these attempts were ultimately unsuccessful (see Fig 2.1.11).

Figure 2.1.11 PUB 800 post content not cooperating

At this point, Juan suggested changing our initial filtering of the posts from post_type: post to post_status: publish (see Fig 2.1.12). We then exported these results as a CSV file and did not need to manipulate them with Sublime Text before re-importing into OpenRefine.


Figure 2.1.12 800 posts with new filtering criteria

The PUB 800 posts presented another unique issue. While the 802 posts had all been published to the TKBR site with the post type post, the 800 essays had initially been posted as pages. Thus, in the post_type column, we had to create a custom facet for our 800 posts that included both post and page in order to capture the data from previous cohorts (see Fig 2.1.13). Consistency in essay posting would have made this simpler. We also excluded unnecessary entries, such as nav_menu_item and (blank) items.

Figure 2.1.13 Using a custom facet for post_type

After all these steps and a lot of Juan's help, we were finally able to get all of our PUB 800 post content in the proper column (see Fig 2.1.14). We had successfully collected and extracted all of the content from TKBR, and could pass the data along for further cleaning.

Figure 2.1.14 PUB 800 posts in one column. Success at last!

Summit Data Collection

In addition to the TKBR data, we decided to explore the collection of past project reports housed in the SFU institutional repository. The reports are stored as PDF files and are accessible through the Summit website, which led us to access them directly from the website rather than contacting the repository managers. On Juan's advice, we decided to write a web scraping script in Python. This script would automatically loop through the pages of the Publishing Program collection, locate the URLs of the PDF files, collect them in a list, and finally open each one in order to download the file. Starting with Juan's web scraping tutorial, the documentation for various Python libraries, and a number of StackOverflow and other internet resources, we went through the following stages (a simplified sketch of the final script appears after the list):

1. Access a page and identify all .pdf links in source code
This was done using the Python libraries requests and BeautifulSoup. The first sends an HTTP request and the second allows for easy searching within the HTML that the request returns. In order to identify the correct links within the code, we used a simple regular expression to find href attributes ending in .pdf.



2. Write links to a list
Once the links were located, we collated them in a list called 'Links', which could then be accessed again at a later stage.

3. Automate a loop through several pages to find links
Now able to locate the links and write them to a list, we needed to do so for all 28 pages in the collection without manually entering each URL. We found that the first page had a unique URL, but all subsequent pages followed a pattern, so we could create a loop that inserted the page number automatically.

4. Automate opening each link to download the file
The final stage was to access each of the saved links and download the file. This proved challenging; after several unsuccessful attempts we instead tried writing the list to a text file, removing the HTML tags and feeding the URLs into a download manager. However, this was still only partially successful, as just 254 of the 281 identified links would download, and the process was unwieldy and required too many steps. We returned to the goal of automating the process and eventually identified a method using urllib, setting the file name to the final section of the URL (e.g. http://summit.sfu.ca/system/files/iritems1/16237/etd9432_ASmith.pdf).



With these four stages working correctly, we ran the final script and successfully downloaded all 281 PDF files from the Summit repository, which could then be prepared for analysis.
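The sketch below pulls the four stages together. It is a reconstruction rather than our exact script: the collection URL and paging pattern are assumptions for illustration, but the overall flow (requests and BeautifulSoup to find .pdf links, a loop over the result pages, and urllib to save each file under the last section of its URL) mirrors the process described above.

# Reconstruction of the scraping workflow; the URL and paging pattern are illustrative.
import re
import urllib.request
import requests
from bs4 import BeautifulSoup

BASE = "http://summit.sfu.ca/collection/131"  # placeholder for the Publishing collection URL
links = []

# Stages 1-3: loop through the result pages and collect every link ending in .pdf
for page in range(0, 28):
    url = BASE if page == 0 else "{}?page={}".format(BASE, page)  # assumed paging pattern
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for a in soup.find_all("a", href=re.compile(r"\.pdf$")):
        links.append(a["href"])

# Stage 4: download each file, naming it after the final section of the URL
for link in links:
    filename = link.split("/")[-1]  # e.g. etd9432_ASmith.pdf
    urllib.request.urlretrieve(link, filename)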

2.2 Data Processing/Cleaning

TKBR Data Cleanup

Upon successfully exporting the TKBR data, we needed to clean it up and run some calculations before beginning analysis. All of these steps were completed in parallel on both the PUB 802 and PUB 800 data, to ensure consistency.

CHARACTER ENCODING
As discussed in Section 2.1, the 800 and 802 files were CSV files with @ as the field separator. Upon importing them into OpenRefine, we first had to deal with character encoding. Where OpenRefine didn't recognize a character, it inserted a replacement character (a black diamond with a question mark): �.

Figure 2.2.1 Character encoding errors

Depending on how the data was exported, we found different encoding errors. Double quotes, em-dashes, and sometimes hyphens had been misidentified, as well as HTML entities such as ampersands. Within OpenRefine, you can select the character encoding; through trial and error (and some re-exporting of the data), we got OpenRefine to read the files correctly as UTF-8.


Figure 2.2.2 OpenRefine parse data settings

Once the TKBR data was in OpenRefine, we set about cleaning it up. First, we needed to deal with the remaining HTML entities, such as &nbsp; (non-breaking space) and &amp; (ampersand). Fortunately, OpenRefine's documentation recipes on GitHub offer a convenient GREL expression to resolve HTML entities:

value.unescape("html")

Figure 2.2.3 Unescape HTML entities

If we had also had any XML entities, the same unescape function could have been used with "xml" in place of "html".


STRIP HTML
Next, we wanted to be able to analyze sentence and essay length, so we needed to strip out the HTML tags. However, we also wanted to be able to count the hyperlinks and embedded images later, so we didn't want to overwrite the original information. Using the dropdown on the post_content column, we selected Edit Column > Add column based on this column. In the resulting dialogue box, we entered a GREL expression to create a new column with the HTML stripped out. Through trial and error (a lot of internet searching and help from Juan), we arrived at the following command:

replace(value, /<[^>]+>/, '')

Figure 2.2.4 Stripping out HTML

DATE STAMP
Next, in order to analyze the content by date, we needed to separate out the date stamp, which exported in the format YYYY-MM-DD HH:MM:SS. In OpenRefine, we used Edit Column > Split Column with the space as the separator to get the date and time into two columns, while still retaining the original.



Figure 2.2.5 Split column post_date

We then repeated this step using the dash as the separator to split the full date into distinct year, month, and day columns. We then looked at the month facet and renamed its values to the corresponding month names instead of numbers. Although we could have written code to do this, OpenRefine makes renaming facets very straightforward.

Fig 2.2.6 Renaming month facets

LINK AND IMAGE COUNT
Our next step was to return to the original post_content column (with the HTML tags still in place) to calculate the number of links and images used. We found a tutorial stating that the GREL expression value.split("pickyourword").length() would return the number of times a value (pickyourword) appeared in a cell. However, the tutorial noted that when the cell did not contain the value at all, the expression would still return 1 instead of 0. The workaround was to subtract one at the end of the expression:

value.split("href").length() - 1

And:

value.split("img").length() - 1

We chose the strings "href" and "img" because those two terms appear in all the relevant HTML tags for hyperlinks and images, respectively. Again, to ensure we did not overwrite the original data, we created a new column based on the post_content column.

Figure 2.2.7 Add link-count column based on post_content



Figure 2.2.8 Add img-count column based on post_content

ESSAY LENGTH
The next step returned us to the post_content_noHTML column to do a total word count. Some early attempts used GREL to count word instances by matching the regular expressions \w and [:word:] and returning .length(). The final solution was to count the words as the pieces of text separated by spaces. Once we settled on this, we had to make sure that all double spaces were replaced with single spaces. Although this left some room for error, we believed the margin would be minuscule. We used the following GREL command:

replace(value, "  ", " ")

Then we calculated the total word count:

value.split(" ").length()



Figure 2.2.9 Add column to calculate Word Count

SENTENCE LENGTH
The last cleanup calculation was to determine sentence length so we could compare averages over time. To do this, we needed to strip out any punctuation that wasn't the end of a sentence (. ? !), as well as stray mis-encoded characters, using a GREL command along these lines:

replace(value, /[,\/#!$%\^&\*;:{}=\-_`~()"']/, " ")

We quickly scanned through the data to check that any special characters, such as ¿ or £, had also been removed. Then we calculated the total number of sentences per entry, based on the post_content_noHTML column, by counting the instances of a period, exclamation point, or question mark:

length(split(value, /\. |\! |\? /))

Using the Word Count generated earlier and the Sentence Count per essay, we created a GREL formula to calculate the average sentence length per entry. This expression divides the total word count of the essay by the number of sentences:

cells["essay_length"].value / cells["sentence_count"].value

The names essay_length and sentence_count must correspond exactly to the column names.



Figure 2.2.10 Calculate average sentence length

Summit Data Processing

In order to analyze the project reports, they needed to be converted from PDF to plain text files. Initially we explored doing this with Python, to simply add another step to our web scraping script, but weren't successful. Instead, we decided to run pdftotext, an open source command-line tool. In order to do so, however, we first had to install Bash, a Unix shell. We could then write and run a simple script to loop over the PDFs and convert each file.
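The original bash script is not reproduced in this text, but the idea can be sketched in a few lines; the version below is a Python equivalent, assuming pdftotext is installed and on the PATH and that the downloaded PDFs sit in a local folder named summit_pdfs (a placeholder name).

# Assumes pdftotext is installed and the downloaded PDFs are in ./summit_pdfs.
import pathlib
import subprocess

pdf_dir = pathlib.Path("summit_pdfs")
txt_dir = pathlib.Path("summit_txt")
txt_dir.mkdir(exist_ok=True)

for pdf in pdf_dir.glob("*.pdf"):
    out = txt_dir / (pdf.stem + ".txt")
    # pdftotext <input.pdf> <output.txt>; fails on PDFs that forbid text extraction
    result = subprocess.run(["pdftotext", str(pdf), str(out)])
    if result.returncode != 0:
        print("Could not convert:", pdf.name)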

During this process, however, we hit a roadblock when we discovered that the majority of the PDF files did not allow text extraction.



Figure 2.2.11 Permission errors returned for protected PDF files

The only solutions we found required the documents' password, which we hoped would be the same for all of them, having been set by the library that manages the repository. Thankfully, Juan was able to find another solution and remove the restrictions on all the files. He then placed these unlocked PDFs into a shared Google Drive that we could access, at which point two more issues arose. The first was that, upon our attempting to download the contents of the folder, Drive somehow removed them instead, and they were unrecoverable from the trash bin.

Figure 2.2.12 No. No I did not.



It also became apparent that there were several files that were not related to the publishing program at all. The first glimpse of this came in an initial trial with topic modelling (see below) that revealed a word cluster completely unrelated to the publishing industry. Upon further investigation, we realized that a significant portion of the 281 PDF files were not project reports from our program, but an assortment of master's and doctoral theses from the Department of Archaeology. We first attempted to remove them manually but, given the scale, instead waited until all files had been converted to text, indexed the folder they were located in, enabled searching within the files, and filtered them out by the keyword archaeology. We also deleted a handful of duplicates and appendix files that had been captured in the web scraping process. In the end, this left us with 147 files to work with.
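A scripted version of that filter could look like the sketch below. It is only an illustration of the idea (scan each converted text file and flag any that mention archaeology), not the indexed-search workflow we actually used; the folder name is a placeholder.

# Illustration only: flag converted text files that mention "archaeology".
import pathlib

flagged = []
for txt in pathlib.Path("summit_txt").glob("*.txt"):
    text = txt.read_text(encoding="utf-8", errors="ignore").lower()
    if "archaeology" in text:
        flagged.append(txt.name)

print(len(flagged), "files flagged for removal:", flagged)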

Figure 2.2.13 Successful conversion from PDF to text files

With the text files converted, we could now prepare them for analysis. Topic modelling could have been run on the files as they were, but they were not identified by year, so we chose to first run a stemming script, at which time we could also extract the year of completion for each file. Stemming reduces each word in a document to its stem, or root form, so that different forms of a word, such as organize, organizes, and organizing, are all counted as one (Manning, Raghavan & Schutze, 2008). There are limitations to this method, particularly as it counts words like publisher and publishing as the same even though they have quite different meanings, but we decided that a keyword search could recover some of that lost information and that, overall, stemming might give a more interesting picture of a small data set. However, we acknowledge that this is a potential limitation of our results. We also attempted stemming on the TKBR data, but the output CSV file flattened each of the 800 and 802 corpuses to a single string, negating the ability to make comparisons over time. As such, we chose not to proceed with the stemmed version of the TKBR data.



For the project reports, we used NLTK, a widely used Python platform for natural language processing. The stemming process first involves tokenizing the input text, which breaks the block of text in a file into individual units, or tokens, in this case by word (Manning, Raghavan & Schutze, 2008). The list of tokens can then be filtered for stop words: common, repeated words such as the, is, at, which, and on. While there is no universal list of stop words, we imported NLTK's standard version. After being tokenized and filtered, the words are stemmed and joined back together to form a new string of text.
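A minimal sketch of that pipeline, using the Porter stemmer we employed (see the Limitations section) and NLTK's standard English stop word list, might look like the following; the file path is a placeholder.

# Tokenize, remove stop words, and stem one report; assumes the NLTK data packages are available.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # standard English stop word list

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

with open("summit_txt/example_report.txt", encoding="utf-8", errors="ignore") as f:
    text = f.read()

tokens = word_tokenize(text.lower())
kept = [t for t in tokens if t.isalpha() and t not in stop_words]
stemmed_text = " ".join(stemmer.stem(t) for t in kept)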

Following this, we also had to extract the year of completion so our analyses could include comparisons over time. We had hoped that the repository's file naming conventions would allow for this, but they did not. Instead, we had to find a pattern within the text files that would let us locate the year of completion without confusing it with any other date. We managed to find a regular expression that located it in all but a handful of reports, which we could then fix manually.
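The exact expression is not reproduced in this report, but the idea can be illustrated as follows; the pattern below, which looks for a four-digit year next to a term phrase on the title page, is a hypothetical example rather than our actual regular expression.

# Hypothetical example of locating a report's year of completion on its title page.
import re

def find_completion_year(text):
    # e.g. a title-page line such as "... SIMON FRASER UNIVERSITY   Fall 2012"
    match = re.search(r"(?:Spring|Summer|Fall)\s+((?:19|20)\d{2})", text)
    return match.group(1) if match else None  # None means: fix manually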



Finally, we wrote the file name, year, unstemmed text, and stemmed text to a CSV file that could then be input into MALLET for topic modelling.

Figure 2.2.14 The full set of project reports after stemming and extracting the year of completion



Part III: Analysis

3.1 Summit

Topic Modelling Analysis

To explore latent themes in the MPub project reports, we ran a topic model analysis using UMass's MALLET (MAchine Learning for LanguagE Toolkit). MALLET is a natural language processing toolkit that relies on Latent Dirichlet Allocation (LDA) to map out underlying themes in large collections of texts, or corpuses (McCallum, 2002; Underwood, 2012). These themes are clusters of words that tend to co-occur within the same text. Using probabilistic methods, MALLET a) infers which themes exist in a given corpus and b) calculates to what extent each theme contributes to each individual document (Chaney & Blei, 2012). As such, we planned to use our topic model to a) determine common themes within past MPub project reports, and b) examine how each theme's "contribution" to the corpus has changed (or, perhaps, remained stable) over time.

However, although MALLET is considered the "standard implementation of LDA" (Underwood, 2012), there is surprisingly little helpful documentation available for people interested in learning it. In our research, we relied primarily on a tutorial by Shawn Graham, Scott Weingart and Ian Milligan (2012), the MALLET website (McCallum, 2002), and, of course, Juan.

First, we had to download and install MALLET, a feat in and of itself. This involved downloading and installing the Java Development Kit (Oracle.com, n.d.), then downloading and unzipping MALLET to the C:\ directory. We used MALLET 2.0.7, a slightly older version than is currently available, because there was more online support for it. Next, we had to set an environment variable (i.e. a shortcut that directs the computer to the MALLET program).



Figure 3.1.1 Setting the environment variable

To run MALLET, we had to use the command line. The first step was to import the reports and transform them into a .mallet file. There are two ways to do this in MALLET: by importing the corpus as a folder (containing each document as a separate .txt file) or as a single .txt file (where each line of text in the file represents a different document). Initially, we planned to use the folder approach, as follows:

:: Create a mallet file (from a folder)
bin\mallet import-dir --input summit --output summit.mallet --keep-sequence --remove-stopwords

In this example, summit is the name of the folder containing each of the different documents (input) and summit.mallet is the processed MALLET file (output). We first call bin\mallet, the script in the bin directory that runs MALLET's commands. The import-dir command indicates that we are inputting a folder, or directory, and not a single file. Importantly, the input folder must be saved directly in the MALLET directory in order for the analysis to work. The --keep-sequence option retains the order of the texts as they are within the summit folder, and the --remove-stopwords option removes common words such as "and," "but" and "the" (using a default list that comes with the MALLET package). These stop words occur often but don't contribute to the themes within a text; including them can thus obstruct the analysis (Graham, Weingart and Milligan, 2012).

After running all of our text manipulations, however, we were left with a single text file, not the folder we had anticipated. Luckily, the single-file import method is very similar to the folder import method:


:: Create a mallet file (from a file)
bin\mallet import-file --input summit.txt --output tutorial.mallet --keep-sequence --remove-stopwords

For this method, all options remain the same, but the input document must be a text file and the import command must read import-file rather than import-dir. That file must be organized in a specific format, with the first word of each line corresponding to the title of the document. This first word is carried through to the final output file, along with a sequential #DOC (document number) that MALLET automatically assigns to each line of text. Although we did not have titles or ID numbers for the project reports, we did have the years in which they were published. As such, we decided to use the year, rather than a document title, as the first word on each line. This allowed us to keep completion dates and topic model data together in one place, a requirement for assessing how (or perhaps whether) different topics were more popular at different points in time. Our final summit.txt (input) file looked as follows:

Figure 3.1.2 "summit.txt": our input file for the project reports

STOP WORDS
However, we realized at this stage that certain terms, such as "Simon Fraser University", "Master" and "Publishing", would occur in virtually every document. To ensure these terms would not affect our results, we decided to incorporate a list of additional stopwords in a text file titled mpubstopwords.txt (as specified by MALLET's "Data Import" page; McCallum, 2002):



Figure 3.1.3 "mpubstopwords.txt": our additional stopwords

To add these additional stopwords to the analysis, we saved this list in the MALLET directory, then added an --extra-stopwords option to the import step:

:: Create a mallet file (with extra stopwords)
bin\mallet import-file --input summit.txt --output tutorial.mallet --keep-sequence --remove-stopwords --extra-stopwords mpubstopwords.txt

Next, we ran MALLET's train-topics command on our new .mallet file to infer the latent topics in the Summit corpus (as per the Graham, Weingart and Milligan tutorial):

:: Train Topics
bin\mallet train-topics --input summit.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys summit_keys.txt --output-doc-topics summit_composition.txt

Here, the --output-state option outputs every word in the corpus, along with the topic it belongs to, into a compressed (.gz) file. The --optimize-interval option controls the Dirichlet prior, a dispersion parameter that reflects how tightly clustered a given set of data points is (Edwin, 2012). The default Dirichlet prior in MALLET is 2.5, which assumes that words are evenly distributed within a topic. By optimizing at intervals, we allow the model to infer the distribution of words within a topic, rather than assuming that each word is evenly weighted. Graham, Weingart and Milligan (2012) argue that optimizing intervals tends to result in better topics, presumably because it allows for the kind of uneven word distributions common in natural language.

(A note about our parameters: we decided to use the default 20 topics, as indicated by --num-topics 20, with 20 keywords each, but only after exploring some different combinations (e.g. 5 topics, 20 words; 10 topics, 10 words). See the Limitations section for a full discussion on setting topic modelling parameters.)

The result of the train-topics command is two outputs: summit_keys.txt (a text file containing a list of the top keywords for each topic) and summit_composition.txt (the amount, expressed as a percentage, that each topic contributed to each project report). To make sense of these files, we had to open them in OpenOffice.org's Calc, rather than the default text editor (Excel or similar would also work):

Figure 3.1.4a The project report composition file, in Notepad++

Figure 3.1.4b The project report composition file, in OpenOffice.org's Calc

To observe how topic contributions changed over time, we first needed to reformat our data. As indicated in Figures 3.1.4a and 3.1.4b, MALLET provides the topic data in descending order (starting with the topic that contributed most to each essay). As a result, the topics and their corresponding contributions are listed in a different order in each row, making the data difficult to manipulate. Instead, we wanted each topic on its own row; that way we could more easily observe how its contribution changed over time and from one report to another. To do so, we used OpenRefine (formerly Google Refine). To learn the tool, we relied on a number of resources, including the Google Refine YouTube tutorials (2011), a GitHub post by Joe Wicentowski (2015), Stack Overflow (various), and, again, Juan.



When we first imported the data into OpenRefine, it looked like this:

Figure 3.1.5 The Summit project report composition data in OpenRefine, uncleaned

First, we split Column 2 into two columns to isolate the publication year of each essay:



Figures 3.1.6a and 3.1.6b Splitting columns in OpenRefine

Next we removed Column 2 2 and renamed the first two columns ("Doc ID" and "Year", respectively):

Figure 3.1.7 Deleted and renamed

Next, we needed to move each topic into a separate row while still retaining the relevant metadata (i.e. the Doc ID and Year published). This was more complicated, but was made possible by using the Transpose function twice. First, we transposed the cells in columns (starting from Column 3) into rows in two new columns (which we named "Label" and "Value"), making sure to select Fill down in other columns:



Figures 3.1.8a, 3.1.8b, and 3.1.8c Transposing cells across columns into rows



Then we transposed every 2 cells in column “value” into separate columns:

Figures 3.1.9a, 3.1.9b, and 3.1.9c Transposing every two cells in column "value" into separate columns



Next, all we needed to do was delete the empty rows (using a Text Facet) and rename the columns:

Figures 3.1.10a, 3.1.10b, and 3.1.10c Using a Text Facet to select and then delete all blank rows

Finally, we noticed that many of the contributions were very small, less than 1%. Rather than spending time creating visualizations for so many essentially insignificant contributions, we decided to filter out all values below 0.001, using a Numeric Facet:

Figure 3.1.11 Using a Numeric Facet to filter out all insignificant contributions
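For anyone repeating this step in code rather than in OpenRefine, the same reshape can be scripted. The pandas sketch below is an illustration under assumed column positions: it reads MALLET's composition file, melts the alternating topic/proportion pairs into one row per topic, drops negligible contributions, and computes the average contribution per topic per year (the quantity plotted in the visualizations that follow).

# Illustrative pandas equivalent of the OpenRefine reshape; column layout is assumed.
import pandas as pd

# MALLET's doc-topics output is tab-separated: doc id, name (here the year),
# then alternating topic-number / proportion pairs in descending order.
comp = pd.read_csv("summit_composition.txt", sep="\t", header=None, comment="#")
comp = comp.rename(columns={0: "doc_id", 1: "year"})

records = []
for _, row in comp.iterrows():
    pairs = row.iloc[2:].dropna().tolist()
    for topic, proportion in zip(pairs[0::2], pairs[1::2]):
        if proportion >= 0.001:  # drop negligible contributions
            records.append({"doc_id": row["doc_id"], "year": int(row["year"]),
                            "topic": int(topic), "proportion": proportion})

long_form = pd.DataFrame(records)
avg_by_year = long_form.groupby(["year", "topic"])["proportion"].mean()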



Figure 3.1.12 Finally! We were ready to start exploring the data

Now that we had successfully transformed the topic composition data into a workable format, we turned to the topics themselves. The raw .txt file output by MALLET (summit_keys.txt) looked like this:

Figure 3.1.13 summit_keys.txt: the 20 topics inferred from the project reports

In the above figure (3.1.13), the numbers in the leftmost column (0-19) are automatically assigned by MALLET, one for each of the 20 topics. The next column contains the Dirichlet parameter for each topic (discussed above). Finally, the last column contains the twenty keywords related to each topic. These words tend to co-occur in the texts, suggesting that they relate to a unified topic or theme. In some cases, this was clearly so. For example, Topic 0 includes the following keywords: press, ubc, editor, univers, manuscript, scholarli, author, review, publish, market, seri, product, public, process, academ, freelanc, depart, acquisit, member. It was clear to our group that these words all related to an academic or scholarly publishing theme. But other topics, such as Topic 14 (thi, wa, ha, make, work, becaus, mani, time, publish, onli, print, product, reader, howev, veri, part, differ, ani, read), were virtually impossible to decipher. This may have been due at least in part to the stemming process we applied earlier (see Section 2.2). We discuss the implications of stemming further in our Limitations section (see 4.1, below).

LABELLING TOPICS
To be able to use the topics, we wanted intuitive names for each one. To name the topics, we went through the list as a group and discussed what we thought would be most appropriate. Labelling the topics as a team helped us make sense of some of the less intuitive topics, which we discussed in depth before assigning a label. We added these labels to our topic contribution data as a separate column (called "Label") for later use. Our final labels were:

Figure 3.1.14 Final topic labels for Summit project report data

For a full list of topics and keywords, see Appendix A. We note that two topics (Topics 14 and 15) were impossible to name; we chose to leave them unnamed and excluded them from the visualization portion of our analysis.

Summit: Key Terms Analysis

For our Summit key term research, we used the plain text versions of the PDFs collected from the Summit repository, consisting of a sample of 147 Master of Publishing project reports dating from the years 2000 to 2015. Using NLTK in Python, we were able to selectively run key term searches for any terms we thought might reveal broader developments in publishing. Similar to the stemming process, we began by tokenizing the data by word, but also by sentence. The script then counted each of those sets of tokens in order to find the word count and sentence count for each file, and divided one by the other to find the average sentence length.

We then also decided to search for the following specific terms that we thought would be interesting to track over time: publishing, social, Facebook, Twitter, digital, brand, data, ebooks, future, smartphones, traditional, print, editing, editorial, self-publish, self, magazine, internet, branding, death, crisis, self-publishing, Amazon, Chapters, monopoly, marketing, blog, author, and editor. (As the script we used looked only for exact word matches, we ran a second version that looked for every instance of a variant spelling of ebook and collated those results manually.)

Finally, we extracted the year of completion using the same code as in the stemming script, so that we could once again track trends over time. The resulting data was written to a CSV file.
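A compressed sketch of what that script does is shown below; the folder, output file, and shortened term list are placeholders, but the approach (NLTK tokenization by word and by sentence, followed by simple counting) is the one described above.

# Placeholder paths and a shortened term list; illustrates the counting approach only.
import csv
import pathlib
from nltk.tokenize import sent_tokenize, word_tokenize

terms = ["publishing", "digital", "ebooks", "facebook", "future"]  # subset for illustration

with open("summit_keyterms.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "word_count", "avg_sentence_length"] + terms)
    for txt in pathlib.Path("summit_txt").glob("*.txt"):
        text = txt.read_text(encoding="utf-8", errors="ignore")
        words = [w.lower() for w in word_tokenize(text)]
        sentences = sent_tokenize(text)
        avg_len = len(words) / max(len(sentences), 1)
        counts = [words.count(term) for term in terms]
        writer.writerow([txt.name, len(words), round(avg_len, 1)] + counts)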

Figure 3.1.15 CSV file containing Summit key terms

This file of key term results for the Summit report data was then loaded into a visualization program called Tableau. Though Tableau can create very complex visualizations, simple bar and line graphs illustrated our data sufficiently for this project.



Figure 3.1.16 A view inside Tableau’s application interface

Figure 3.1.17 A view inside Tableau's web interface

There are two components to Tableau: the desktop tool, in which visualizations are created, and the website, where visualizations can be published and shared. Our Summit data set was far from ideal. For example, many of the terms have multiple spellings, such as eBook and e-book, or appear in multiple forms, like self-publishing and self published. This was resolved for the ebook data set by combining several spellings, resulting in a group referred to in the graph below as "Ebook1."



Another limitation is that our data was sparse before 2005, with only seven records for the years 2000 to 2003. This early data was deemed insufficient for gauging the popularity of key terms. As such, all of our Summit visualizations focus on data from 2004 to 2015. The numerical values represent how often each term appeared, on average, in a project report from a given year. Four line graphs were ultimately created using the Tableau application and website:

Figure 3.1.18 The terms "Editorial" and "Publishing" decreased, on average, in Summit reports

Figure 3.1.19 The term “Digital” and various spellings of “Ebook” are increasing in the data



Figure 3.1.20 “Facebook” and “Social” peaked from 2007-2011 and then peaked even higher in 2014-2015.

Figure 3.1.21 As concern with the “Future” increases, so does concern with the “Traditional”

3.2 TKBR

Topic Modelling

Our methods for training the TKBR posts were essentially the same as for the Summit project reports. The only key difference arose with word stemming. Although we had intended to stem both the project report and the TKBR post data before topic modelling, we had significant difficulties creating a script that successfully stemmed the TKBR posts and exported them in a format that MALLET could process (i.e. as one document per line, or one folder containing one text file per document). In the interest of time, we decided to run the topic model without stemming the TKBR posts first. We discuss the implications of this decision in our Limitations section (4.1, below). The MALLET commands we used looked as follows:

:: Create a mallet file (with extra stopwords)
bin\mallet import-file --input TKBR.txt --output TKBR.mallet --keep-sequence --remove-stopwords --extra-stopwords mpubstopwords.txt

:: Train Topics (optimized, 20 topics, 20 keywords each)
bin\mallet train-topics --input TKBR.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys TKBR_keys.txt --output-doc-topics TKBR_composition.txt

Our final composition data, after cleaning in OpenRefine (as in Section 3.1, above), looked like this:



And our final topic labels (after a group decision-making process; again, see Section 3.1) were:

Figure 3.2.1 Final topic labels for TKBR post data

Again, we had one "nonsense" topic, Topic 3 (ing tion pub lish pro web con li open tal schol di tions work dig read arly tive mag). This was particularly odd, seeing as we did not stem the words in our TKBR data before running MALLET. Ideally, we would have investigated this further (perhaps by running the model again and comparing results across trials), but due to time constraints we had to move forward with our analysis; as with the project report topics, we left this topic unlabelled and excluded it from the visualization portion of our analysis.

TKBR: Key Terms

As with our Summit data, our TKBR data was organized into CSV files that were loaded into Tableau, analyzed, visualized, and eventually published (see below). The terms analyzed were identical to those in the Summit data, though different terms proved relevant for visualization. There were also several other key differences between the TKBR and Summit data. Our TKBR data did not span as many years as the Summit data, so we determined that we would not use it to track changes in key term use over time. Instead, we used the TKBR data to compare the frequency of key terms in the two core MPub courses, PUB 800 - Industry and PUB 802 - Technology. As such, rather than being divided into many individual rows, the data was divided into only two rows, one for each class.

Figure 3.2.2 Original TKBR key term data

However, this data presented one obvious problem. While the data from PUB 802 represented a little over 280,000 published words, the PUB 800 data represented closer to 180,000 words, making a direct comparison of the total usage of each key word far less useful. To solve this, basic arithmetic (essentially cross-multiplication) was applied to the PUB 800 row to normalize the data as if each data set contained the exact same number of words; the values were then rounded to the nearest whole number.
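In other words, each PUB 800 count was scaled up by the ratio of the two corpus sizes. With the approximate word totals above, that scaling looks like the following (the key term count of 90 is an invented example):

# Scale a PUB 800 term count as if that corpus were the same size as the PUB 802 corpus.
# The word totals are the approximate figures given above; the count of 90 is invented.
words_802 = 280_000
words_800 = 180_000

count_800 = 90
normalized_800 = round(count_800 * words_802 / words_800)  # about 140
print(normalized_800)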

Figure 3.2.3 Normalized TKBR key term data

After the data had been normalized, it was possible to make direct comparisons between the two sets using simple bar graphs. Once again, the data used for these visualizations was far from perfect. It included only 22 key terms, many of which had potential spellings and variations that went unaccounted for. Further, it was fairly easy to surmise, and confirmed by the data, that the output of the Technology class would feature more technical terms and that of the Industry class would feature more terms relating to the broad scope of publishing. There were, however, some interesting and somewhat surprising results as well, such as the fact that "Author" actually appears more often in 802 writing, while "Future" appears roughly equally in the writing for both classes.



Here are the three resulting bar graph visualizations from the TKBR key term data:

Figure 3.2.4 Leading “Technology” key terms in our data

Figure 3.2.5 Leading “Industry” key terms in our data



Figure 3.2.5 Some interesting and unforeseen results

TKBR: Other Measures

With the cleaned TKBR data imported into Tableau, we were able to create some additional data visualizations. One major idea was comparing the two classes (PUB 800 and PUB 802) over the three years of data. What follows are some of our comparison visualizations.

Figure 3.2.6 Highly tweetable post titles

The character length of titles (Fig 3.2.6) in both PUB 800 and PUB 802 was fairly long compared to the six words recommended by social media experts (Lee, 21 Oct 2014). However, at less than 100 characters per title, they are of highly tweetable length.



Figure 3.2.7 PUB 800 essay assignments are typically shorter

The average essay length is calculated as the average word count across all entries. The limitation of this, as detailed in the Limitations section, is that these entries are not solely essays; they also include reading responses, seminar reports, and comments. As we can see below (Fig 3.2.13), the number of records also impacts this visualization.

Figure 3.2.10 Average sentence length is generally longer in shorter essays

This scatterplot (Fig 3.2.10) compares the average sentence length with the word count for each post. We don't have an explanation or hypothesis for this, but we do find it interesting and somewhat surprising. It may be worth looking into in the future, to see if there are any specific outliers (essays or authors) that drive this pattern.



IMAGES & LINKS
Looking at the next two figures (Fig 3.2.8 & 3.2.9), we can see the general trend in the use of hyperlinks and images in posts.

Figure 3.2.8 PUB 802 generally has more links

Figure 3.2.9 Images in PUB 800 posts are increasing

We can draw some inferences regarding the increase in image use, or the decrease in links in 802. But it is more important, in our opinion, to consider what may be missing from or skewing the data. First, this is a very small sample set; looking at the average number of images, there is less than one per essay. Second, some people included hyperlinks in their bibliographies, whereas others didn't activate the links in their bibliography, or the citation style they used didn't require links. The other consideration is what each instructor requests: Juan is very vocal about requiring in-text links for PUB 802, whereas John is less adamant about this for PUB 800.

AUTHOR RECORDS
For interest's sake, we also wanted to look at the number of contributions by author (below, Fig 3.2.11 and 3.2.12).



Figure 3.2.11 Total Word Count by Author (PUB 802)

With PUB 802's word count totals (Fig 3.2.11), the course requirements correlate directly with the visualization. As noted above (Section 2.1), the data collected from TKBR includes reading responses and comments as well as essays. However, it is interesting to see where the entries group. In Figure 3.2.11, the majority of site authors contributed three posts and a total word count of 3,500–5,500. We can generalize that most students submitted two essays (1,500–2,000 words each), 1–2 essay responses (~500 words each), and 1–2 reading responses (~500 words each), which makes sense given the course syllabus.



Figure 3.2.12 Total Word Count by Author (PUB 800)

The trend of more posts resulting in a higher cumulative word count is expected. Again, the course requirements correlate directly with these visualizations. By the end of PUB 800, each student was required to submit four essays of approximately 2,000 words each (a total of roughly 8,000 words). But, as noted in the Limitations section, not all of the data consisted of essays; it also included reading responses. In addition, not all of the cohorts posted their essays on TKBR. For example, when it came time to post our third essay in February 2016, TKBR was offline and we emailed or printed our essays. (Not all of the essays from our cohort were due when we downloaded the TKBR content either, so the 2016 data is incomplete.)



Figure 3.2.13 Number of Records by Month Posted

This visualization (Fig 3.2.13) of the total number of records posted per course against the month they were posted further indicates the impact of the course syllabus. Figure 3.2.13 depicts the total number of posts over the three years of available data, and suggests that the volume of posts in PUB 802 may account for the high word count seen in Figure 3.2.7.

Visualizing our Topic Models - Summit

In order to make sense of the topic data, we imported the cleaned contribution data into Tableau. Although there are learning resources such as video tutorials available for Tableau, the software is relatively intuitive, and the following steps were guided by experimentation and our past experience with Excel charts.

After loading the data into Tableau, the first thing we wanted was a general understanding of how topic contributions had changed over time, year by year. To do this, we constructed a stacked bar chart, with the year as the independent variable (x-axis) and the average contribution of each topic as the dependent variable (y-axis). We chose the average (i.e. the total a topic contributed to the reports published that year, divided by the number of reports for that year), rather than the sum of the topic contributions, because we had more project reports for some years than for others. Contributions were color-coded and numbered by topic; we also included a legend with each topic's assigned label. Our first visualization looked like this:

Figure 3.2.14 Summit project report topic contributions, by year

Plotting our topic data in this way helped us decide which topics to investigate further. Some topics, such as Topic 9: General Publishing Terms, occurred in virtually every year with a relatively stable contribution. This makes sense intuitively; it is only natural that a certain amount of general publishing terminology would be used in any MPub project report.



Figure 3.2.15 Topic 9: General Publishing Terms over time

As we can see from the above figure, General Publishing Terms did stay relatively stable (at between 0.15 and 0.27), with one unexpected spike in 2001. (Upon further investigation, we discovered that we had only one project report from 2001 in our data set; this issue is addressed in our Limitations section.) Other topics, on the other hand, fluctuated significantly year by year. Topic 6: Digital Magazine, for example, was virtually nonexistent before 2008 but grew much more important in later years. We see this reflected in the following graph:



Figure 3.2.16 Topic 6: Digital Magazine

As a group, we discussed possible reasons for these interesting peaks and troughs, drawing on what we have learned about the publishing industry throughout the program. The rising interest in digital magazines after 2008, for example, could be linked to the economic recession of that same year. With declining advertising revenues, magazines may have been interested in diversifying revenue streams or moving to a leaner, paper-free business model. Research by Dora Santos Silva (2011) supports this hypothesis. She writes that "in 2007, MediaIDEAS had already suggested that in 2022 digital magazines would represent 30% of the magazine market and in 2032 75% of all periodicals market." As 2007 was also the year the first real digital magazines were published (Santos Silva, 2011), it makes sense that the topic doesn't feature heavily in many of the early project reports. The spike we see in 2010 could likewise be related to the launch of Apple's iPad in April of that year (Santos Silva, 2011). Of course, these kinds of post hoc explanations are largely based on conjecture and cannot be taken as fact. Still, the fact that our topic model trends map, even roughly, onto industry trends is interesting, to say the least.



In a related vein, Topic 19: Ebook Production takes off from 2011 to 2012, approximately one year after the iPad's launch:

Figure 3.2.17 Topic 19: Ebook Production

A quick Google search revealed that 2011 was, in fact, a big year for ebooks. The Guardian, for example, reported that digital book sales rose by as much as 366% in 2011 (Flood, 2012). Pew Research published a report in 2012 claiming that ebook reading increased from 17% in December 2011 to 21% in February 2012, just a few months later (Rainie et al., 2012). Clearly, ebooks were having a moment in 2012, a moment that graduating MPub students were interested in writing about.



Finally, Topic 5: Open Access also showed an interesting spike in 2006:

Figure 3.2.18 Topic 5: Open Access over time

At first, we wondered whether the sharp increase might have been related to Juan coming to SFU, but we quickly realized this occurred years later and hence could not have been the contributing factor. But 2005 and 2006 were times of change for the Open Access movement; in 2005, the Canadian Library Association formally endorsed OA (Geist, 2005), and the following year the US Senate and House of Representatives introduced the Federal Research Public Access Act (FRPAA). The new Act "would require that 11 U.S. government agencies with annual extramural research expenditures over $100 million make manuscripts of journal articles stemming from research funded by that agency publicly available" ("Federal Research Public Access Act (FRPAA)", n.d.), a monumental moment for Open Access.



Visualizing our Topic Models - TKBR Posts

While it made sense to look at changes in topic contribution over time for the Summit project report data, we had only three years (2014 to 2016) of data to draw from for our TKBR post analysis. Thus we chose instead to compare and contrast the kinds of topics written about in the two courses, PUB 800 and PUB 802. In theory, these courses differ in subject matter; however, recent complaints from John, our Industry professor, that "Juan [the Technology professor] gets all the good essays" led us to think otherwise. To assess this, we first compared the topic contributions for all topics in PUB 800 to those in PUB 802 using a simple bar chart, as follows:

Figure 3.2.19 Topic contributions for PUB 800: Industry and 802: Technology

Clearly, there is significant overlap in the kind of material discussed in 800 and 802 TKBR posts. That may, of course, be due to the fact that the same students are writing them, for completion of the same graduate program. We have no other MPub course data with which to compare these results. However, the similarities in the prevalence of topics such as Online Publishing (light pink), SEO & Discoverability (green-yellow), and Social Media Marketing (yellow-green) between the two courses do suggest that 800 and 802 may cover similar subject material. Interestingly, there were some topics for which the courses differed substantially. Data & Privacy, for example, is a much more prominent topic in 802 than 800, whereas Canadian Cultural Grants features more heavily in the 800 posts. These differences were to be expected, given the nature of the two courses.



Other topic differences, however, were less intuitive. For example, Topic 8: Copyright & Access, while more obviously linked to the 800 curriculum, actually featured more heavily in the 802 posts. This is illustrated in the graph below:

Figure 3.2.20 Topic 8: Copyright & Access in PUB 800 vs PUB 802 TKBR posts

In contrast, Topic 11: Digital Reading featured more heavily in 800 posts than in 802 ones, despite being a better fit for a Technology seminar.

Figure 3.2.21 Topic 11: Digital Reading in PUB 800 vs PUB 802 TKBR posts

While it is beyond the scope of this analysis to assess whether the 802 posts were "better" than the 800 ones, our findings do suggest that the two course titles do not meaningfully reflect the kind of content being written in those courses.


In addition to exploring the differences between the two courses, we wondered whether our topic data could help to structure the content published on the Journal of MPub website. Indeed, in our own research, we noticed that few of the existing TKBR posts have been tagged by their authors, making it difficult for prospective readers to discover them and understand what they are about without reading them first. Our topic analysis could resolve this issue. Not only could our topic labels be used as tags to aid searchability, we could also provide more detailed breakdowns of essay content using a simple visualization such as the following:

Figure 3.2.22 Topic breakdown for a sample essay

This kind of visualization would be invaluable to a student or researcher searching for an essay about online publishing. The problem with tags is that they have no weight associated with them and thus do not fully represent the content of a post. An author can assign tags, but those tags don't necessarily explain how prevalent an associated topic is; they just indicate that the topic is associated with that essay. If the author of the post in Figure 3.2.22 tagged her essay as "Information Age" and "Ebooks", for example, a prospective reader would have no way of knowing that ebooks play only a minor role in the content of this essay, whereas Information Age features heavily. If she is only interested in reading about ebooks, then this essay may not be particularly useful to her, but she has no way of knowing that without reading (or at least skimming) it first. If her search also provided a pie chart illustrating the prominence of the different topics within each essay, along with a list of keywords associated with each of those topics, she could quickly and easily decide whether to read the post or skip ahead. In an age of virtually endless online content, the power of this kind of capability cannot be overstated.


Part IV: Discussion

4.1 Limitations

Topic Modelling & Data Analysis in General

Importantly, because topic modelling is an unsupervised probabilistic analysis that relies on repeated sampling, it incorporates a certain degree of randomness; every time MALLET runs a topic model, the output will be slightly different (Graham, Weingart, Milligan, 2012). In addition, there is no "ideal" number of topics, or number of words per topic, for a given model. As Underwood (2012) explains, topic models "require you to make a series of judgment calls that deeply shape the results you get." With no clear standards, researchers are encouraged to train and compare several models on a given corpus, each with a different number of topics and topic words, and then select the one that best "fits" the corpus (i.e. leads to the most intuitive results, with the least amount of variation upon each subsequent re-sampling). But even when a model seems to fit the data well, it is always possible that a different number of topics or topic words would yield more reliable results.

The same is true of stop words. MALLET provides a default list of stop words, but many researchers choose to add or remove a few words of their own (for example, we chose to remove "publisher" and "book" from our corpus, because these terms occurred in almost every essay). These choices change from corpus to corpus, but also from researcher to researcher, such that "the resulting model ends up being tailored in difficult-to-explain ways by a researcher's preferences" (Underwood, 2012). Although we believe we have a strong rationale for the choices we made, other researchers may well feel otherwise. This subjectivity is important to keep in mind when considering the results of probabilistic analysis in general, but especially that of relatively inexperienced students like ourselves.

Another key limitation of our data was our decision to stem one corpus (the Summit project reports) and not the other (the TKBR posts). This makes it difficult to compare and contrast the kinds of topics that arose from each analysis with any degree of scientific rigour. A related question is whether or not we should have stemmed at all. Matthew Burton, a Postdoctoral Researcher in the University of Pittsburgh's Digital Scholarship Services department, explains the rationale behind stemming as follows: "Basic tokenization and term frequency is going to count 'model' and 'models' as separate tokens. This can be a problem because we want these tokens to be counted together" (2013). However, he also acknowledges that "aggressive stemming is generally not very useful for topic modeling because the topics become difficult to interpret because word's morphological roots may have different meanings." As far as we could tell, there is no clear consensus within the DH community about best practices for stemming and topic modelling.


Although, of course, the TKBR post data was nothing like the Summit project report data, the fact that one corpus was stemmed and the other was not gives us at least a glimpse into the advantages and disadvantages of stemming. Overall, the stemmed Summit data yielded slightly less intuitive topics (two of which were so hard to decipher that we labelled them "nonsense") than the TKBR data (which produced only one "nonsense" topic). This may be due in part to the aggressive nature of stemming; by removing prefixes and suffixes, we may have lost important nuances of meaning. Furthermore, many of the key "words" the topic model returned were so butchered that we had to rely on inference to make sense of them. For example, one of our topics (Topic 3) contained the following: "ent agazin workflow editor articl system edit im mother jone news softwar sm print classic ha ore editori layout". We later named this topic "Digital Magazine Workflow", but another team of researchers may well have interpreted it entirely differently.

Of course, with more time we would have compared the output for stemmed and unstemmed data for both the Summit and the TKBR corpuses, and then applied the most effective method. Based on what we were able to accomplish in the short span of time we had, we believe an unstemmed corpus may be more useful than a stemmed one, at least for the purposes of this project. We might also have tried an alternate stemmer, such as a Lancaster or WordNet stemmer (Potts, 2011); indeed, the Porter Stemmer we employed is considered one of the more aggressive methods available (Burton, 2013), though notably it is also one of the more popular ones (Potts, 2011). Further research and experimentation with the data would likely clarify which stemmer (if any) yields the best results (a small illustration of stemmer behaviour appears at the end of this subsection). We were also only able to count instances of single words, not phrases; being able to count phrases would have made our analysis richer and more meaningful.

One major limitation of our data sets for topic modelling is that both corpuses (TKBR and Summit) are very small, which means that outliers can significantly affect our results. There is also a tendency to trust data as "objective"; however, because we came into this project expecting to find certain trends, we may have imposed our own preconceived notions onto the numbers. In putting a narrative to the numbers, we recognize that many viable interpretations exist and that ours is only one of them. We also know that correlation is not causation: a correlation between two variables does not imply that one causes the other. As much as we would like to, we cannot take data as gospel.
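As promised above, the following small sketch contrasts the Porter and Lancaster stemmers with WordNet lemmatization, using NLTK's implementations. Our own pipeline may have differed in detail, and the exact outputs depend on the NLTK version; the point is only that more aggressive stemming produces topic "words" that are harder to read. (Counting phrases rather than single words could be sketched similarly, for example with nltk.bigrams and collections.Counter.)

# Requires: nltk, plus nltk.download("wordnet") for the lemmatizer.
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

words = ["magazines", "editorial", "publishers", "workflow", "articles"]
porter, lancaster, lemmatizer = PorterStemmer(), LancasterStemmer(), WordNetLemmatizer()

for w in words:
    # Print the three reductions side by side for comparison.
    print(f"{w:12} porter={porter.stem(w):12} "
          f"lancaster={lancaster.stem(w):12} "
          f"wordnet={lemmatizer.lemmatize(w)}")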

Dirty Data – TKBR & Summit Reports

We are aware that the conclusions derived from our data sets have significant limitations. The problem of dirty data exists for our analysis of both the TKBR and the Summit reports. TKBR content included essays (which is what we intended to isolate), but it also included bibliographies, reading reports, seminar notes, and comments. We would have liked to remove these extraneous elements, but for lack of time we had to make do, cleaning the data as much as the schedule allowed. We had originally hoped to compare the TKBR essays directly with the Summit reports, but because we were unable to isolate the essays from the other, shorter posts, we realized the two corpuses would be too different to make any meaningful comparison. The Summit report content was also contaminated with title pages and bibliographies. The extraneous material in both of our corpuses affected our sentence length figures and our word counts.
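The kind of heuristic cleaning we had in mind, had time allowed, might look something like the sketch below. The word-count threshold and section headings are assumptions we never validated against the corpus, not rules we actually applied; they would need tuning, and would still miss some of the dirt.

import re

MIN_ESSAY_WORDS = 800  # hypothetical cutoff separating essays from short posts
TRAILING_SECTIONS = re.compile(
    r"\n\s*(works cited|bibliography|references)\b.*$",
    re.IGNORECASE | re.DOTALL,
)

def clean_post(text):
    # Strip a trailing bibliography / works-cited section, if one is present.
    return TRAILING_SECTIONS.sub("", text).strip()

def looks_like_essay(text):
    # Crude filter: treat only sufficiently long posts as essays.
    return len(clean_post(text).split()) >= MIN_ESSAY_WORDS

posts = {}  # e.g. {post_title: post_text} exported from the WordPress database
essays = {title: clean_post(body) for title, body in posts.items()
          if looks_like_essay(body)}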

Future Considerations – TKBR & Summit Reports

Metadata is critical to both the TKBR essays and the Summit reports, and we have recommendations for both. For the TKBR posts, all content should be categorized properly from the outset. The responsibility would fall on each individual student to file their content in the correct part of the site, categorizing it as "essays," "reading responses," "comments," "seminar notes," and so on. In addition, if the PUB 800 site did not have some essays as "posts" and others as "pages," and instead had a clearer configuration, our data extraction would have been simpler. We also recommend that topic tagging of essays be made compulsory. This would aid the navigability, discoverability, and consistency of the site; a tag-cloud widget could be added to the TKBR class sites so that users could easily browse essays by tagged topic. Had this tagging been in place, it would also have made our analysis more interesting, as we could have compared the tag cloud on each site with our topic modelling results: what people thought they were writing about versus what the topic modelling tells us they were writing about in its analysis of latent themes (a rough sketch of such a comparison is given at the end of this subsection).

Metadata is just as important for the Summit repository. The archive is a mess. Somehow, around 125 archaeology theses were categorized with the same metadata as the Master of Publishing project reports, and were thus included in our original collection. After manually removing every archaeology report (multiple times), we can confidently say that proper cataloguing of these reports would be an invaluable asset to anyone undertaking these operations in future. It would also aid discoverability for anyone using the repository.

As mentioned in our limitations section, our corpuses were quite small for this type of analysis. For the TKBR content, we do not have very many essays, and we have only about three years of records. This issue will work itself out over time, as future cohorts add their work to the TKBR sites. As for the Summit reports, the repository only holds digitized reports from 2003 onward, and even those are patchy. As the Master of Publishing's first class graduated in 1995/1996, we recommend that all earlier reports be digitized.
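One simple way the tags-versus-topics comparison suggested above could be operationalized is a word-overlap score between an essay's author-assigned tags and its dominant topic's keywords. The function and example values below are hypothetical; a low score would suggest the tags under-describe what the post actually discusses.

def tag_topic_overlap(tags, topic_keywords):
    # Split multi-word tags so a tag like "information age" can match the
    # single-word keywords a topic model returns.
    tag_words = {w.lower() for t in tags for w in t.split()}
    keywords = {w.lower() for w in topic_keywords}
    union = tag_words | keywords
    return len(tag_words & keywords) / len(union) if union else 0.0

# Example with made-up inputs:
print(tag_topic_overlap(["Information Age", "Ebooks"],
                        ["information", "internet", "people", "web", "readers"]))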



Another issue we ran into with the Summit repository came when we tried to extract the text from the individual reports: we could not access the content, because the PDFs from the repository were password protected and did not allow text extraction. This seems unnecessary for student work that is openly available online. We recommend that the PDF DRM lock be removed so that more people can access and use these reports for scholarly purposes.
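For illustration, the snippet below (using the pypdf library, which was not part of our actual workflow) shows how one might detect and attempt to work around this kind of protection when extracting text; the file name is a placeholder. Many "locked" PDFs use an empty user password with restrictive permission flags, so decryption with an empty string sometimes succeeds, but extraction can still be blocked.

from pypdf import PdfReader

reader = PdfReader("mpub_project_report.pdf")  # hypothetical file name
if reader.is_encrypted:
    # Try the empty user password; this may or may not unlock extraction.
    reader.decrypt("")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(len(text.split()), "words extracted")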

4.2 Uses

Some of the results of this group data analysis project were inconclusive, owing partly to time constraints, inexperience, and faulty data. Nevertheless, we gained sufficient knowledge of the processes involved to identify several potential ways data analysis could assist in operating a Journal of MPub website, as well as several ways it might prove useful within the Master of Publishing program more generally.

Basic data analysis could benefit a student-run web journal by identifying salient terms or topics that could serve as the basis for naming post categories and for ensuring that all of the program's diverse interests are equally represented on the site. The TKBR data could also be used to determine whether the posts of each successive cohort are becoming more blog-like, which we define as featuring shorter sentences and paragraphs as well as more image and video files (a rough sketch of how such a measure might be computed appears at the end of this section). This might demonstrate that the reports are increasingly being written with eventual web publishing in mind, strengthening the argument for such a journal site.

The timeline graphs created from key terms and topics found in the Summit reports could also be of value if posted within the MPub journal. Particularly if the datasets were broadened to include more project reports, the resulting visualizations could be used by students, academics, interested parties within publishing, and even the public to roughly track the concerns of the Canadian publishing industry over time.

The results of our TKBR data analysis could also benefit the MPub program in general by giving instructors objective, visual insight into which concepts students in the Industry and Technology classes are engaging with more or less actively. This could influence the design of future syllabi or simply reinforce the logic of existing pedagogical decisions; in particular, it could be used to avoid duplication of content between the courses where possible. Summit report data could help students and professors decide which topics to write future project reports on, based on the abundance or scarcity of recent reports engaging with similar themes.
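As promised above, a minimal sketch of the "blog-likeness" measure might look like the following. The sentence splitter is deliberately naive, the (year, html) input structure is an assumption, and real posts would need more careful handling of markup and embedded media.

import re
from collections import defaultdict

def blogginess(post_html):
    text = re.sub(r"<[^>]+>", " ", post_html)           # strip HTML tags
    sentences = [s for s in re.split(r"[.!?]+\s", text) if s.strip()]
    words = text.split()
    avg_sentence_len = len(words) / len(sentences) if sentences else 0
    image_count = post_html.lower().count("<img")        # embedded images
    return avg_sentence_len, image_count

def by_cohort(posts):
    # posts: iterable of (year, html) pairs -- an assumed structure.
    per_year = defaultdict(list)
    for year, html in posts:
        per_year[year].append(blogginess(html))
    return {year: (sum(s for s, _ in vals) / len(vals),
                   sum(i for _, i in vals) / len(vals))
            for year, vals in per_year.items()}

Plotted by cohort year, falling average sentence lengths and rising image counts would be one concrete signal that the posts are drifting toward blog conventions.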



Part V: Conclusion

In conclusion, despite the numerous challenges we faced along the way, we successfully extracted, cleaned, and analyzed data from more than 300 texts written and published by SFU Master of Publishing students. The results of our analyses revealed interesting changes in the structure and content of these texts over time. Some of these changes mapped onto real-world events such as the 2008 financial crisis or the rise of Facebook and social media in 2007-2011, illustrating one way the MPub program engages with the world at large. Others, such as the differences between the 800 and 802 posts, provide insight into the MPub curricula and possibly the interests of the instructors themselves. And although we acknowledge the limitations of our analyses, we also acknowledge their strengths. As discussed above, our research revealed trends we could not otherwise have detected in the data, opened up avenues for future research, and suggested a possible way to organize content on the forthcoming Journal of MPub. Now DAT, we think, is some pretty cool analysis.



Works Consulted

Burton, M. (2013, May 21). The Joy of Topic Modeling. Retrieved from http://mcburton.net/blog/joy-of-tm/

Chen, E. (2012, March 20). Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process. Retrieved from http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/

Federal Research Public Access Act (FRPAA). (n.d.). Retrieved from http://sparcopen.org/our-work/frpaa/

Flood, A. (2012, May 2). Huge Rise in Ebook Sales Offsets Decline in Printed Titles. Retrieved from http://www.theguardian.com/books/2012/may/02/rise-ebook-sales-decline-print-titles

[Google Refine]. (2011, July 19). Google Refine 2.0 – Introduction (1 of 3) (video version 2) [Video file]. Retrieved from https://www.youtube.com/watch?v=B70J_H_zAWM

Graham, S., Weingart, S., & Milligan, I. (2012, September 2). Getting Started with Topic Modelling and MALLET. Retrieved from http://programminghistorian.org/lessons/topic-modeling-and-mallet

Lee, K. (2014, October 21). Infographic: The Optimal Length for Every Social Media Update and More. Buffer. Retrieved from https://blog.bufferapp.com/optimal-length-social-media

Manning, C., Raghavan, P., & Schutze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. Retrieved from http://www-nlp.stanford.edu/IR-book/

OpenRefine Wiki. (n.d.). Recipes. Retrieved from https://github.com/OpenRefine/OpenRefine/wiki/Recipes

Potts, C. (2011). Sentiment Symposium Tutorial: Stemming. Retrieved from http://sentiment.christopherpotts.net/stemming.html

Rainie, L., Zickuhr, K., Purcell, K., Madden, M., & Brenner, J. (2012). The Rise of E-Reading. Pew Internet & American Life Project. Retrieved from http://files.eric.ed.gov/fulltext/ED531147.pdf

RefinePro. (2012, February). Count How Often a Character Occurs in a Cell. RefinePro Knowledge Base for OpenRefine. Retrieved from http://kb.refinepro.com/2012/02/how-to-count-how-often-character-occurs.html

Santos Silva, D. (2011). The Future of Digital Magazine Publishing. Information Services and Use, 31(3-4), 301-310. Retrieved from http://onlineglobalcareer.in/Magazine/thumbnail/22072015071602.pdf

School of Data. (n.d.). Cleaning Data with Refine. Retrieved from http://schoolofdata.org/handbook/recipes/cleaning-data-with-refine/

Underwood, T. (2012, April 7). Topic Modeling Made Just Simple Enough. Retrieved from https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/

Verborgh, R., & De Wilde, M. (2013). Using OpenRefine. Birmingham, UK: Packt Publishing. Ebook.

Wicentowski, J. (2015, September 4). GREL Functions. Retrieved from https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions



Appendices

Appendix A: Summit Project Reports Topics

Topic 0 - Academic press ubc editor univers manuscript scholarli author review publish market seri product public process academ freelanc depart acquisit member

Topic 1 - Art Publishing art galleri catalogu publish wa exhibit artist tara public canada biennal print market object museum design product distribut vag

Topic 2 - Magazines magazin advertis circul space issu reader subscript canada western editori small vancouv event live sale revenu canadian ad public

Topic 3 - Digital Magazine Workflow ent agazin workflow editor articl system edit im mother jone news softwar sm print classic ha ore editori layout

Topic 4 - Educational Publishing educ student market learn school publish textbook teacher cours wa pearson instructor read technolog oxford develop seri canada canadian

Topic 5 - Open Access access journal publish open googl librari digit research scholarli author academ setlement univers model bloomsburi press onlin avail copyright

Topic 6 - Digital Magazine onlin content media market site websit web social advertis magazin articl digit page blog print reader access search user

Topic 7 - Raincoast & Environment paper publish print environment forest raincoast potter market canadian recycl harri initi shopper drug carbon canada glow mart product


Topic 8 - Literary Grants tyee writer literari program canada fund public resid author wa reader canadian write member chapter hous organ stori grant

Topic 9 - General Publishing Terms thi publish inform report wa provid project develop program includ public ha result manag current work univers support content

Topic 10 - Houses & Imprints publish brand market author fiction canada hous knopf titl imprint list consum compani random public arsen chronicl total nonfict

Topic 11 - Design & Kids comic wa publish seri titl children word reprint design translat reader prejudic orca pride page zombi read market irst

Topic 12 - Metadata & Cataloguing data level suppli industri onix chain system bibliograph canadian pexod publish text catalogu reader market inform titl booknet standard

Topic 13 - Publishing Technology digit publish content app print ebook applic reader creat format platform user read market access titl author product develop

Topic 14 thi wa ha make work becaus mani time publish onli print product reader howev veri part differ ani read

Topic 15 manag staff product depart system wa canada magazin softwar wide process work user thi data sale ricecook commun function

Topic 16 - Distribution & Retail publish sale market titl canadian sell retail compani author canada wa trade chapter bookstor distribut cost industri raincoast booksel


Topic 17 - Editorial editor publish author edit editori manuscript process wa imag work copi project handbook detail archiv titl chang task arsen

Topic 18 - Translation (Chinese) chines china publish canadian bertelsmann govern polici languag cultur translat public ethnic industri foreign canada ha bookstor club media

Topic 19 - Ebook Production ebook design text digit epub format file product access thi page content imag wa pdf print technolog respons titl

Appendix B: TKBR Post Topics

Topic 0 - Mobile Device Reading digital mobile web reading ebooks apps publishers app epub ebook electronic technology format readers developing device devices countries apple

Topic 1 - Scholarly Publishing scholarly access open academic monograph press journal journals monographs model oa impact research altmetrics libraries article humanities citation academics

Topic 2 - Kids & Digital Reading reading read children print digital readers reader online screen young text people learning research devices adult brain paper gamification

Topic 3 ing tion pub lish pro web con li open tal schol di tions work dig read arly tive mag

Topic 4 - Online Publishing online digital content world technology industry make future number publishers published years based ways order find media work works



Topic 5 - Ebooks & Subscriptions publishers amazon subscription sales readers ebook publisher services big industry oyster market business digital kindle model titles service company

Topic 6 - Canadian Cultural Grants canada canadian industry market publishers report support program government amp tyee year arts sales cultural business funding culture titles

Topic 7 - Information Age time people world article don internet reader readers create part public good information idea web fact point make back

Topic 8 - Copyright & Access copyright rights drm property free public sharing intellectual law work software piracy legal commons works creative open file laws

Topic 9 - Data & Privacy data information publishers privacy wordpress users access kobo personal user net neutrality darknet google accessed big individuals private companies

Topic 10 - Social Media Marketing media social content marketing web advertising snapchat facebook brand publishers online video brands audience ad platforms digital users mar

Topic 11 - Digital Reading & Writing medium story stories design phone images image caption text hony cell twitter photo line reading form york lines experience

Topic 12 - Magazine Publishing magazine magazines print readers digital content revenue circulation publications issue retrieved kinfolk niche brand publication editorial free food news

Topic 13 - Retailers amazon ref stream india memory small sell bridle products org flipkart indian ibid store shipping booktwo paradigm quality indigo


Topic 14 - Fiction Writing authors author fiction writers literary story writing work stories writer published readers short literature works quality publish novels mfa

Topic 15 - Journalism journalism news sports newspapers human vice content journalists atlantic local shirky write rural article computer viral times story written

Topic 16 - SEO & Discoverability content metadata context google search publishers bisac online title system print publisher seo container discoverability leary website subject information

Topic 17 - Online Communities web wikipedia online internet users social facebook community information content people february news articles authority communities january platform sites

Topic 18 - Transmedia & Feminism wattpad transmedia women feminist game youtube story stream storytelling feminism tech games pokemon community culture banned gender audience audiences

Topic 19 - Visual & Popular Media content print comics bhaskar ebook ebooks nash comic paperback memes webcomics container printed artists ephemera penguin future culture fiction


