Big Data: Concept, Methodology & Debates
Yiğitalp Ertem
New Media, Communication and Society
02-11-2017 Presentation
Outline
Fancy pictures & graphs
Definitions, Contexts & Related Phenomena
Two Rival General Perspectives
Datafication & Sense-making
Epistemological debates in Social Sciences
Critical Questions & Political Economy Perspective
Methodology
Spotify's 'Discover Weekly'
SAGE Case Studies: Case of Mad Men
The Great British Class Survey
What kind of Data?
digitized books, newspapers, music . . . (everything)
transactional data
web searches
sensor data
cell phone records
trade information
scientific instruments
social media
Big Data: Simple Definitions
Anything too big to fit onto your computer (in terms of storage and analysis)
AAPOR (American Association for Public Opinion Research) definition: “an imprecise description of a rich and complicated set of characteristics, practices, techniques, ethical issues, and outcomes all associated with data” (Foster et al., 2017: 3)
An (i) empirical phenomenon and (ii) an emergent field of practice in which claims to knowledge are made, not least about the social world (Halford & Savage, 2017: 1-2)
Big Data: General Characteristics
huge in volume, consisting of terabytes or petabytes of data
high in velocity, being created in or near real-time
diverse in variety, being structured and unstructured in nature
exhaustive in scope, striving to capture entire populations or systems
fine-grained in resolution and uniquely indexical in identification
relational in nature, containing common fields that enable the conjoining of different data sets
flexible, holding the traits of extensionality (can add new fields easily) and scalability (can expand in size rapidly)
(Kitchin, 2014)
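The "relational" and "flexible" traits in Kitchin's list can be illustrated with a minimal sketch. The two data sets, the field names (`user_id`, `item`, `city`, `device`) and the helper `join_on` are all hypothetical, chosen only to show how a common field lets separate data sets be conjoined and how a new field can be added afterwards.

```python
# Two hypothetical data sets that share a common field, "user_id".
purchases = [
    {"user_id": 1, "item": "book"},
    {"user_id": 2, "item": "album"},
    {"user_id": 1, "item": "film"},
]
profiles = [
    {"user_id": 1, "city": "Ankara"},
    {"user_id": 2, "city": "Izmir"},
]


def join_on(left, right, key):
    """Inner-join two lists of dicts on a shared key (illustrative helper)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Relational: the common field conjoins the two data sets.
linked = join_on(purchases, profiles, "user_id")

# Extensionality: a new field can be added to every record after the fact.
for row in linked:
    row["device"] = "unknown"

for row in linked:
    print(row)
```

Each linked record now carries fields from both sources, which is also what makes the privacy questions raised later in the deck concrete.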
Big Data as Technological & Scholarly Phenomenon
Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets.
Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims.
Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.
(boyd & Crawford, 2012: 663)
Big Data: Related Concepts
Data science is the extraction of useful knowledge directly from data through a process of discovery, or of hypothesis formulation and hypothesis testing.
A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes in the analytics life cycle.
Algorithms
Data Visualization
Machine Learning & Deep Learning
Cloud Computing
Internet of Things (IoT)
Quantum Computers
Two Perspectives on Tech & Society
Webster (2014: x-xi)
“It was once the Microelectronics Revolution that was said to be bringing about the Information Age (back in 1979 the then Prime Minister James Callaghan told us we had to 'wake up' to the coming of the microchip).”
Claude Shannon:
Shannon-Weaver model; analysis of source, message, destination, noise, etc.
early AI in mouse and chess experiments
resemblance between cryptography and communication: the relationship between technical advancement and communication
Dystopian & Utopian Rhetoric
Utopian
  Cancer research (biogenetics studies in general)
    Harari's thesis on worldwide health data collection to avoid epidemics
  Terrorism
    Big data & suicide attacks?
  Climate change
    Big data, power & decision-making?
(boyd & Crawford, 2012: 664)
Dystopian
  Big Brother
  Enabling invasions of privacy
  Decreased civil freedoms
  Increased state and corporate control
(boyd & Crawford, 2012: 664)
Datafication & Sense-making
Dematerialisation, liquidity & density:
Separate the informational aspect of an asset from the physical world (DVD vs. streaming)
Information can be manipulated, transferred and (re-)bundled
(Re)combination of resources, mobilized for a particular context
Gathered data (Netflix): search terms, stream queues and plays, interactions & external resources (film reviews and social data)
Deciding without a hypothesis: means to spot the patterns, trends and relationships (view data mathematically first, establish the context later)
Avenues of research: (i) conceptualization & codification, (ii) algorithmic treatment, (iii) re-presentation of the world
(Lycett, 2013: 382-4)
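"Deciding without a hypothesis" can be sketched as scanning every pair of variables for a strong relationship before any context is established. The following is only an illustration under stated assumptions: the column names and values are invented, and Pearson correlation is used as one possible way of "viewing data mathematically first"; it is not a method named by Lycett.

```python
from itertools import combinations
from math import sqrt

# Hypothetical usage log; column names and values are invented.
data = {
    "hours_streamed": [1, 2, 3, 4, 5],
    "searches": [2, 4, 6, 8, 10],
    "reviews_read": [5, 3, 4, 1, 2],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# No hypothesis is stated up front: every pair of columns is scored,
# and the strongest relationship is surfaced for later interpretation.
pairs = sorted(
    ((abs(pearson(data[a], data[b])), a, b) for a, b in combinations(data, 2)),
    reverse=True,
)
strongest = pairs[0]
print(strongest)
```

The scan finds the pattern but says nothing about why it holds, which is the gap the later epistemological and critical slides address.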
New York City Crime Heatmap
Crisis of Empirical Sociology Debates (I)
data that did not require a special effort to collect, but was the digital by-product of the routine operations of a large capitalist institution. It is also private data to which most academics have no access (Savage & Burrows, 2007: 887)
our interest in an alternative vision, where sociology seeks to define itself through a concern with research methods (interpreted very broadly), not simply as particular techniques, but as themselves an intrinsic feature of contemporary capitalist organization. This interest in the 'politics of method' involves sociologists renewing their interests in methodological innovation, and reporting critically on new digitalizations. (Savage & Burrows, 2007: 895-6)
Crisis of Empirical Sociology Debates (II)
De Freitas (2017: 29-30): five characteristics of research method (three negative, two emergent):
That embody our desire for an origin
That entail an act of exclusion
That establish regimes of work and labour
New research methods can emerge out of floods of data and information
New research methods can queer time, and subvert the slow deliberative time of the “human” subject, by plugging into a more-than-human worldly becoming
New Epistemology Debates
From “datasets limited by scope, temporality and size” to “problem of handling and analysing enormous, dynamic, and varied datasets”: new forms of data management and analytical techniques that rely on machine learning and new modes of visualization
“One potential path . . . quantitative methods and models are employed within a framework that is reflexive and acknowledges the situatedness, positionality and politics of the social science being conducted, rather than rejecting such an approach out of hand.” (Kitchin, 2014: 10)
Critical Questions to Big Data (I)
Who gets access to what data, how data analysis is deployed, and to what ends (boyd & Crawford, 2012)
The data we observe was once public; now it is personalized, which makes public discussion harder to establish.
Big Data: not raw material, let alone 'raw data', but a corporate field, and, as such, entirely organised for the extraction and accumulation of value.
“its regime of jouissance (enjoyment). This is crucial, for the imperative to enjoy, whose supreme emblems in our present include being connected and the social media, takes the form of specific and constant injunctions for the production of sociality (participate, like, share, choose and so forth), sociality which is then immediately and systematically captured, tracked and subjected to automated algorithmic devices and calculations which feed the results thus obtained back into the system, so that new searches, choices and so on are shaped by those calculations, in endlessly recurring loops.” (Frade, 2016: 869)
Critical Questions to Big Data (II)
Big Data changes the definition of knowledge
  Shape the reality they measure
  Automation of knowledge production
  Ford: limited action to worker; social media: limited access to researcher
Claims to objectivity and accuracy are misleading
  Objectivity-Quantitative / Subjectivity-Qualitative
  All researchers are interpreters of data?
  Is big data self-explanatory?
  Problem of bias
Bigger data are not always better data
  Twitter = People?
  API as a firehose
  Cleaning is subjective; hardly random & representative, sum of little data
  Using smallest data possible?
(boyd & Crawford, 2012)
Critical Questions to Big Data (III)
Taken out of context, Big Data loses its meaning
  Network connections: are all connections the same?
  Articulated & behavioral networks
Just because it is accessible does not make it ethical
  Anonymization & de-anonymization
  Accountability: to superiors, colleagues, participants & public
  Difference between being in public (i.e. sitting in a park) and being public (i.e. actively courting attention)
Limited access to Big Data creates new digital divides
  Think of an anthropologist working for Google or FB
  Access to data & needed skill set
  Manovich (2011): three classes of people in the realm of Big Data: those who create data (both consciously and by leaving digital footprints), those who have the means to collect it, and those who have expertise to analyze it.
(boyd & Crawford, 2012)
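The "anonymization & de-anonymization" point above can be made concrete with a small linkage sketch. Everything here is hypothetical: the records, the field names and the choice of quasi-identifiers (`zip`, `birth_year`, `gender`) are assumptions for illustration, and the exact-match rule is a deliberately simple stand-in for real re-identification techniques.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "06800", "birth_year": 1990, "gender": "F", "playlist": "jazz"},
    {"zip": "06800", "birth_year": 1985, "gender": "M", "playlist": "metal"},
    {"zip": "34710", "birth_year": 1990, "gender": "F", "playlist": "pop"},
]

# Hypothetical public register that lists names with the same fields.
public = [
    {"name": "Person A", "zip": "06800", "birth_year": 1990, "gender": "F"},
    {"name": "Person B", "zip": "34710", "birth_year": 1990, "gender": "F"},
    {"name": "Person C", "zip": "34710", "birth_year": 1990, "gender": "F"},
]

QUASI_IDS = ("zip", "birth_year", "gender")


def reidentify(released_rows, public_rows, keys=QUASI_IDS):
    """Return (name, row) pairs where a released row matches exactly one person."""
    by_key = {}
    for person in public_rows:
        signature = tuple(person[k] for k in keys)
        by_key.setdefault(signature, []).append(person["name"])
    hits = []
    for row in released_rows:
        names = by_key.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            hits.append((names[0], row))
    return hits


hits = reidentify(released, public)
for name, row in hits:
    print(name, "->", row["playlist"])
```

In this toy data only the first released row is re-identified: the second has no match in the register, and the third matches two people and so stays ambiguous. Removing names alone did not protect the first person.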
Political Economy Perspective (I)
(Mosco, 2014)
Big data -> global culture of knowing, digital positivism
The cloud and big data are engines that power informational capitalism even as they enable an increasingly dominant way of knowing (see Knowing Capitalism; Thrift, 2005)
Discontent of cloud: see Striphas (2010) on Kindle
Companies moving to the cloud, military plans, education system (recent teacher evaluation in Turkey), individual identities
Save capitalism or lead to carefully planned hacker attacks
Political Economy Perspective (II)
(Mosco, 2014)
A prism that reflects major issues in info-tech: ownership & control, security & privacy, work & labour, nations & global political economy, how we make sense
Computer utility as a resource: a public regulation perspective?
Concentration of power in top companies + government partnerships = military information complex (NSA)
Campaigns promoting cloud computing: not only commercials, but also social media, lobbying, trade shows
Serious problems: power, threatens privacy, not secure
Political Economy Perspective (III)
(Mosco, 2014)
No down time: cooling, energy & environment
Cyber-attacks: to/by governments/firms: surveillance
Preponderance of knowledge labor is now IT work
Reliance on quantitative, correlational analysis, free from theoretical considerations and predicting events
“every technology contains an aesthetic, a way of seeing and feeling, that is drawn from the machine's design, as well as from its discursive associations”
Weapons of Math Destruction
(Cathy O'Neil)
Arms Race: Going to College
Propaganda Machine: Online Advertising
Civilian Casualties: Justice in the Age of Big Data
Ineligible to Serve: Getting a Job
Sweating Bullets: On the Job
Collateral Damage: Landing Credit
No Safe Zone: Getting Insurance
The Targeted Citizen: Civic Life
Selling Tickets to Las Vegas (Zeynep Tufekci via TED)
Traditional marketing: 25-35 men, recently retired, upper-middle class
Currently:
  FB data: friends, likes, shares, things you write and delete, external sources
  Machine Learning classifications (we don't understand, enormous amount of data)
  Easier to sell to people: bipolar people entering the manic phase (overspenders, compulsive gamblers)
Spotify’s ‘Discover Weekly’ Case
Collaborative Filtering models (i.e. the ones that Last.fm originally used), which work by analyzing your behavior and others' behavior.
Natural Language Processing (NLP) models, which work by analyzing text.
Audio models, which work by analyzing the raw audio tracks themselves.
https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe
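The first model type, collaborative filtering, can be sketched in a few lines. This is a minimal user-based variant with cosine similarity over binary "played" data. It is an illustrative assumption, not Spotify's or Last.fm's actual implementation, and the listeners, track names and the `recommend` helper are invented.

```python
from math import sqrt

# Hypothetical listening data: user -> set of tracks they have played.
plays = {
    "ana": {"track_a", "track_b", "track_c"},
    "ben": {"track_a", "track_b", "track_d"},
    "cem": {"track_b", "track_d"},
    "you": {"track_a", "track_b"},
}


def cosine(u, v):
    """Cosine similarity between two sets treated as binary vectors."""
    if not u or not v:
        return 0.0
    return len(u & v) / sqrt(len(u) * len(v))


def recommend(target, plays, top_n=2):
    """Rank tracks the target has not heard by the similarity of their listeners."""
    heard = plays[target]
    scores = {}
    for user, tracks in plays.items():
        if user == target:
            continue
        similarity = cosine(heard, tracks)
        for track in tracks - heard:
            scores[track] = scores.get(track, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [track for track, _ in ranked[:top_n]]


recommendations = recommend("you", plays)
print(recommendations)  # ['track_d', 'track_c']
```

Here `track_d` ranks first because two listeners with overlapping taste played it. The recommendation is derived entirely from "your behavior and others' behavior", with no knowledge of the audio or of any text about the tracks.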
Selecting, Scraping, and Sampling Big Data Sets from the Internet: Fan Blogs as Exemplar (I) (Webb et al., 2015)
Audience reception: content analysis of 11 Mad Men fan-sites
how to select and scrape (transfer in its raw form) data directly from the Internet
how to employ random sampling to select representative text for qualitative analysis
how to employ grounded theory analysis to analyze 'big data'
Find websites, decide the number of websites to study
Manually download the content (bias & selection)
Simple random sampling: 59 files, 28K pages, 1M words -> 18 files, 721 pages, 269k words -> coding 180 randomly selected pages (.064% of harvested content)
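The two sampling stages listed above can be sketched with Python's standard `random` module. This is a sketch under assumptions: the file and page identifiers are placeholders, the fixed seed is there only for reproducibility, and reading the 180 coded pages as drawn from the 721 sampled pages is an interpretation of the slide rather than a detail confirmed by it.

```python
import random

# Fixed seed so the draw can be repeated; the value itself is arbitrary.
rng = random.Random(2015)

# Stage 1: draw 18 of the 59 harvested files, without replacement.
file_ids = list(range(1, 60))  # placeholder identifiers for 59 files
sampled_files = rng.sample(file_ids, 18)

# Stage 2: draw 180 pages for coding from the 721 pages in the sampled files.
page_ids = list(range(1, 722))  # placeholder identifiers for 721 pages
coded_pages = rng.sample(page_ids, 180)

print(len(sampled_files), "files sampled;", len(coded_pages), "pages to code")
```

In simple random sampling every unit has the same chance of selection at each stage. The manual download step that precedes it is where the slide flags bias and selection.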
Selecting, Scraping, and Sampling Big Data Sets from the Internet: Fan Blogs as Exemplar (II) (Webb et al., 2015)
Grounded theory approach to thematic analysis: coders read twice and take note of potential/discovered themes
Some files overlap and coders video-chat to settle disagreements
A senior scholar, without coding, adds two additional layers across the 76 prior themes: 16 categories and 5 supra-themes
The Great British Class Survey (GBCS) (Savage et al. 2013), (Savage & Burrows, 2014)
questions to develop detailed measures of economic, cultural and social capitals
web survey, with publicity across BBC television and radio, and newspaper coverage
161K (later, 325K) complete surveys with a well-educated bias (1026 face-to-face surveys by GfK), not representative
Interactive, gamified (class calculator, badges), no measures
Relatively small but challenges the sociological repertoire
The Great British Class Survey (GBCS) (Savage et al. 2013), (Savage & Burrows, 2014)
1. Impact: bounded by media circuits, data intermediaries, journalists, designers and developers: generate & disseminate
2. Stakes and tensions: critique from orthodox sociologists, battle over the 'politics of method', representation debates
3. Self-referential performativity: recursive feedback between GBCS & elites
4. Different temporal structure:
   1. After publicizing the results, attendees doubled, reactions to news
   2. Social data can be studied after sharing the results
5. New repertoires: 'real time' of survey submissions
   Not seen as validated & legitimate, used crowdsourcing, more specific, refined and granular results, larger sample size
Free (Big) Data Sources & Tools
Big Data: 33 Brilliant And Free Data Sources For 2016
https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-datasources-for-2016
15 Great Free Data Sources for 2016
https://medium.com/@Infogram/15-great-free-data-sources-for-2016-25cb455db257
U.S. Census Bureau, Data.gov, Google Public Data Explorer, Social Mention, EU Open Data Portal, Healthdata, UNICEF, WHO, Amazon Web Services Public Datasets, Facebook Graph API, DBpedia, Google Finance, etc.
22 free tools for data visualization and analysis
https://www.computerworld.com/article/2507728/enterprise-applications/enterprise-applications22-free-tools-for-data-visualization-and-analysis.html
https://netlytic.org (Social network analyzer)
https://graphcommons.com (Data visualization)
https://www.kaggle.com/ (Free data sets, competitions etc.)
Selected Bibliography (I)
Beer, D., & Burrows, R. (2013). Popular Culture, Digital Archives and the New Social Life of Data. Theory, Culture & Society, 30(4), 47-71.
boyd, d., & Crawford, K. (2012). Critical Questions for Big Data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662-79.
Burrows, R., & Savage, M. (2014). After the crisis? Big Data and the methodological challenges of empirical sociology. Big Data & Society, 1-6.
Felt, M. (2016). Social media and the social sciences: How researchers employ Big Data analytics. Big Data & Society, 3(1), 1-15.
Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F., & Lane, J. (2017). Big Data and Social Science: A Practical Guide to Methods and Tools. Boca Raton, London & New York: CRC Press, Taylor & Francis Group.
Frade, C. (2016). Social Theory and the Politics of Big Data and Method. Sociology, 50(5), 863-877.
Kitchin, R. (2014). Big Data, new epistemologies and paradigm shifts. Big Data & Society, 1-12.
Lycett, M. (2013). 'Datafication': Making Sense of (Big) Data in a Complex World. European Journal of Information Systems, 22, 381-6.
Selected Bibliography (II)
Mosco, V. (2014). To the Cloud: Big Data in a Turbulent World. Boulder & London: Paradigm Publishers.
O'Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York: Crown Publishers.
Savage, M., & Burrows, R. (2007). The Coming Crisis of Empirical Sociology. Sociology, 41(5), 885-99.
Savage, M., Devine, F., Cunningham, N., Taylor, M., Li, Y., Hjellbrekke, J., et al. (2013, April 2). A New Model of Social Class? Findings from the BBC's Great British Class Survey Experiment. Sociology, 0(0), pp. 1-32.
Striphas, T. (2010). The Abuses of Literacy: Amazon Kindle and the Right to Read. Communication and Critical/Cultural Studies, 7(3), 297-317.
Thrift, N. (2005). Knowing Capitalism. London, Thousand Oaks & New Delhi: SAGE Publications.
Timcke, S. (2017). Capital, State, Empire: The New American Way of Digital Warfare. London: University of Westminster Press.
Webb, L. M., Gibson, D. M., Wang, Y., Chang, H.-C., & Thompson-Hayes, M. (2015). Selecting, Scraping, and Sampling Big Data Sets from the Internet: Fan Blogs as Exemplar. London: Sage Research Methods Cases.
Webster, F. (2014). Theories of the Information Society (4th ed.). London & New York: Routledge.