Data mining


I. Introduction to Data Mining Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. They can answer business questions that traditionally were too time consuming to resolve, scouring databases for hidden patterns and finding predictive information that experts may miss because it lies outside their expectations.

Generally, data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Data mining is thus the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules. The goal of data mining is to allow a corporation to improve its marketing, sales, and customer support operations through a better understanding of its customers.

Data mining comes in two flavors: directed and undirected. Directed data mining attempts to explain or categorize some particular target field such as income or response. Undirected data mining attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes.

Data mining is used daily primarily by organizations with a strong consumer focus: retail, financial, communication, and marketing companies. It enables these companies to determine relationships among internal factors such as price, product positioning, or staff skills, and external factors such as economic indicators, competition, and customer demographics.



II. How does data mining work? While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought: 

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities (a minimal clustering sketch follows this list).

Associations: Data can be mined to identify associations among items. The well-known beer-and-diapers story is an example of association mining.

Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
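To make the clustering idea above concrete, here is a minimal sketch of customer segmentation with k-means. It is only an illustration: the feature columns, the sample values, the choice of scikit-learn, and the number of segments are assumptions made for the example, not details from the text.

```python
# Minimal sketch: grouping customers into market segments with k-means.
# Feature columns and values are hypothetical illustrations.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is one customer: [annual spend ($), visits per month, avg basket size ($)]
customers = np.array([
    [5200, 12, 36], [4800, 10, 40], [600, 1, 50],
    [750, 2, 31], [15000, 25, 50], [14000, 22, 53],
])

# Standardize so no single feature dominates the distance calculation
X = StandardScaler().fit_transform(customers)

# Ask for three segments; in practice the number of clusters is tuned
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # segment assigned to each customer
print(kmeans.cluster_centers_)  # segment profiles in standardized units
```

Each resulting segment can then be profiled and targeted separately, which is the sense in which clusters reveal market segments or consumer affinities.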

Data mining consists of five major elements: 

Extract, transform, and load transaction data onto the data warehouse system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and information technology professionals.

Analyze the data by application software.

Present the data in a useful format, such as a graph or table.

Different levels of analysis are available: 

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.


Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.

Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique. A minimal sketch appears after this list.

Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
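As a companion to the nearest neighbor description above, the following is a minimal sketch of k-nearest neighbor classification with scikit-learn; the features, values, and the choice of k are illustrative assumptions.

```python
# Minimal sketch: classify a new record by the classes of the k most similar
# records in a historical dataset. Feature values here are hypothetical.
from sklearn.neighbors import KNeighborsClassifier

# Historical records: [age, income in $000s] with a known outcome (1 = responded)
X_hist = [[25, 40], [30, 55], [45, 90], [50, 110], [23, 35], [48, 95]]
y_hist = [0, 0, 1, 1, 0, 1]

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3
knn.fit(X_hist, y_hist)

# The new record is assigned the majority class of its 3 nearest neighbors
print(knn.predict([[47, 100]]))
```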

III. Relation between Data Mining and Customer Services Although data mining is still in its infancy, companies in a wide range of industries - including retail, finance, health care, manufacturing, transportation, and aerospace - are already using data mining tools and techniques to take advantage of historical data. By using pattern recognition technologies and statistical and mathematical techniques to sift through warehoused information, data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions, and anomalies that might otherwise go unnoticed.

In business, data mining is the analysis of historical business activities, stored as static data in data warehouse databases. The goal is to reveal hidden patterns and trends. Data mining software uses advanced pattern recognition algorithms to sift through large amounts of data to assist in discovering previously unknown strategic business information. Examples of how businesses use data mining include performing market analysis to identify new product bundles, finding the root cause of manufacturing problems, preventing customer attrition, acquiring new customers, cross-selling to existing customers, and profiling customers with greater accuracy.

In addition, data mining is used to discover patterns and relationships in the data in order to help make better business decisions. Data mining can help spot sales trends, develop smarter marketing campaigns, and accurately predict customer loyalty. Specific uses of data mining include:



Market segmentation - Identify the common characteristics of customers who buy the same products from your company.

Customer churn - Predict which customers are likely to leave your company and go to a competitor.

Fraud detection - Identify which transactions are most likely to be fraudulent.

Direct marketing - Identify which prospects should be included in a mailing list to obtain the highest response rate.

Interactive marketing - Predict what each individual accessing a Web site is most likely interested in seeing.

Market basket analysis - Understand what products or services are commonly purchased together; e.g., beer and diapers (see the market basket sketch below).

Trend analysis - Reveal the difference between a typical customer this month and last.

Data mining technology can generate new business opportunities by:

Automated prediction of trends and behaviors: Data mining automates the process of finding predictive information in a large database. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns: Data mining tools sweep through databases and identify previously hidden patterns. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Using massively parallel computers, companies dig through volumes of data to discover patterns about their customers and products. For example, grocery chains have found that when men go to a supermarket to buy diapers, they sometimes walk out with a six-pack of beer as well. Using that information, it is possible to lay out a store so that these items are closer together. AT&T, A.C. Nielsen, and American Express are among the growing ranks of companies implementing data mining techniques for sales and marketing. These systems crunch through terabytes of point-of-sale data to aid analysts in understanding consumer behavior and promotional strategies. Why? To gain a competitive advantage and increase profitability. Similarly, financial analysts plow through vast sets of financial records, data feeds, and other information sources in order to make investment decisions. Health-care organizations are examining medical records to understand trends of the past so they can reduce costs in the future.
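The market basket idea above can be made concrete with a few lines of code. This is a sketch over invented baskets; support and confidence are the two standard measures behind association rules such as "customers who buy diapers also buy beer."

```python
# Minimal sketch: support and confidence of an association rule over
# a handful of hypothetical point-of-sale baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Rule: diapers -> beer
sup = support({"diapers", "beer"})   # how often the two items appear together
conf = sup / support({"diapers"})    # of baskets with diapers, how many also contain beer
print(f"support = {sup:.2f}, confidence = {conf:.2f}")  # support = 0.40, confidence = 0.67
```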



As the marketing director you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model.

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.

Figure 1 - Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access. An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow users to analyze the data as they want to view their business, summarizing by product line, region, and other key perspectives of their business.

The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users' business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.

IV. Real-World Examples Details about who calls whom, how long they are on the phone, and whether a line is used for fax as well as voice can be invaluable in targeting sales of services and equipment to specific customers. But these tidbits are buried in masses of numbers in the database. By delving into its extensive customer-call database to manage its communications network, a regional telephone company identified new types of unmet customer needs. Using its data mining system, it discovered how to pinpoint prospects for additional services by measuring daily household usage for selected periods. For example, households that make many lengthy calls between 3 p.m. and 6 p.m. are likely to include teenagers who are prime candidates for their own phones and lines. When the company used target marketing that emphasized convenience and value for adults - "Is the phone always tied up?" - hidden demand surfaced. Extensive telephone use between 9 a.m. and 5 p.m. characterized by patterns related to voice, fax, and modem usage suggests a customer has business activity. Target marketing offering those customers "business communications capabilities for small budgets" resulted in sales of additional lines, functions, and equipment.

The ability to accurately gauge customer response to changes in business rules is a powerful competitive advantage. A bank searching for new ways to increase revenues from its credit card operations tested a nonintuitive possibility: Would credit card usage and interest earned increase significantly if the bank halved its minimum required payment? With hundreds of gigabytes of data representing two years of average credit card balances, payment amounts, payment timeliness, credit limit usage, and other key parameters, the bank used a powerful data mining system to model the impact of the proposed policy change on specific customer categories, such as customers consistently near or at their credit limits who make timely minimum or small payments. The bank discovered that cutting minimum payment requirements for small, targeted customer categories could increase average balances and extend indebtedness periods, generating more than $25 million in additional interest earned.


Merck-Medco Managed Care is a mail-order business which sells drugs to the country's largest health care providers: Blue Cross and Blue Shield state organizations, large HMOs, U.S. corporations, state governments, etc. Merck-Medco is mining its one terabyte data warehouse to uncover hidden links between illnesses and known drug treatments, and spot trends that help pinpoint which drugs are the most effective for what types of patients. The results are more effective treatments that are also less costly. Merck-Medco's data mining project has helped customers save an average of 10-15% on prescription costs.

V. Using Data Mining in Marketing The simplest definition of a good prospect - and the one used by many companies - is simply someone who might at least express interest in becoming a customer. Data mining is applied to this problem by first defining what it means to be a good prospect and then finding rules that allow people with those characteristics to be targeted. Prospecting requires communication; broadly speaking, companies intentionally communicate with prospects in several ways. One way of targeting prospects is to look for people who resemble current customers. For instance, through surveys, one nationwide publication determined that its readers have the following characteristics:

■ 59 percent of readers are college educated.
■ 46 percent have professional or executive occupations.
■ 21 percent have household income in excess of $75,000/year.
■ 7 percent have household income in excess of $100,000/year.

One way of determining whether a customer fits a profile is to measure the similarity, which we also call distance, between the customer and the profile. Several data mining techniques use this idea of measuring similarity as a distance. Consider two survey participants. Amy is college educated, earns $80,000/year, and is a professional. Bob is a high-school graduate earning $50,000/year. Which one is a better match to the readership profile? The answer depends on how the comparison is made. Table 1 shows one way to develop a score using only the profile and a simple distance metric. This table calculates a score based on the proportion of the audience that agrees with each characteristic. For instance, because 58 percent of the readership is college educated, Amy gets a score of 0.58 for this characteristic. Bob, who did not graduate from college, gets a score of 0.42 because the other 42 percent of the readership presumably did not graduate from college. This is continued for each characteristic, and the scores are added together. Amy ends with a score of 2.18 and Bob with the higher score of 2.68. His higher score reflects the fact that he is more similar to the profile of current readers than is Amy.
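The Table 1 calculation can be reproduced in a few lines. The sketch below uses the proportions from the worked example (a match contributes the proportion of readers who share the characteristic, a mismatch contributes the complement) and recovers the scores of 2.18 for Amy and 2.68 for Bob.

```python
# Reproducing the Table 1 scoring: a match on a characteristic contributes the
# proportion of readers who share it; a mismatch contributes one minus that proportion.
profile = {                       # proportion of the readership with each characteristic
    "college_educated": 0.58,
    "professional_or_executive": 0.46,
    "income_over_75k": 0.21,
    "income_over_100k": 0.07,
}

amy = {"college_educated": True, "professional_or_executive": True,
       "income_over_75k": True, "income_over_100k": False}
bob = {"college_educated": False, "professional_or_executive": False,
       "income_over_75k": False, "income_over_100k": False}

def score(person):
    return sum(p if person[c] else 1 - p for c, p in profile.items())

print(round(score(amy), 2), round(score(bob), 2))  # 2.18 2.68
```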



5.1 How Data Mining Helps in Improving Direct Marketing Campaigns Advertising can be used to reach prospects about whom nothing is known as individuals. Direct marketing requires at least a tiny bit of additional information, such as a name and address, a phone number, or an email address. Where there is more information, there are also more opportunities for data mining. At the most basic level, data mining can be used to improve targeting by selecting which people to contact; the first level of targeting requires only data, not data mining. Direct marketing campaigns typically have response rates measured in the single digits. Response models are used to improve response rates by identifying prospects who are more likely to respond to a direct solicitation.

With existing customers, a major focus of customer relationship management is increasing customer profitability through cross-selling and up-selling. Data mining is used for figuring out what to offer to whom and when to offer it. Charles Schwab, the investment company, discovered that customers generally open accounts with a few thousand dollars even if they have considerably more stashed away in savings and investment accounts. Naturally, Schwab would like to attract some of those other balances. By analyzing historical data, it discovered that customers who transferred large balances into investment accounts usually did so during the first few months after they opened their first account. After a few months, there was little return on trying to get customers to move in large balances; the window had closed. As a result, Schwab shifted its strategy from sending a constant stream of solicitations throughout the customer life cycle to concentrated efforts during the first few months.

Customer attrition is an important issue for any company, and it is especially important in mature industries where the initial period of exponential growth has been left behind. Not surprisingly, churn (or, to look on the bright side, retention) is a major application of data mining. One of the first challenges in modeling churn is deciding what it is and recognizing when it has occurred.



When a once-loyal customer deserts his regular coffee bar for another down the block, the barista who knew the customer's order by heart may notice, but the fact will not be recorded in any corporate database. Even in cases where the customer is identified by name, it may be hard to tell the difference between a customer who has churned and one who just hasn't been around for a while. If a loyal Ford customer who buys a new F150 pickup every 5 years hasn't bought one for 6 years, can we conclude that he has defected to another brand? Churn is important because lost customers must be replaced by new customers, and new customers are expensive to acquire and generally generate less revenue in the near term than established customers. There are two basic approaches to modeling churn. The first treats churn as a binary outcome and predicts which customers will leave and which will stay. The second tries to estimate the customers' remaining lifetime.
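A minimal sketch of the first approach, treating churn as a binary outcome, follows; the features, sample values, and the choice of a random forest are assumptions made for illustration, not a method named in the text.

```python
# Minimal sketch: churn as a binary outcome. Feature values are hypothetical.
from sklearn.ensemble import RandomForestClassifier

# Historical customers: [months of tenure, monthly spend ($), support calls in last 90 days]
X_train = [[2, 30, 4], [36, 80, 0], [5, 25, 6], [48, 95, 1], [3, 20, 5], [60, 120, 0]]
y_train = [1, 0, 1, 0, 1, 0]   # 1 = churned, 0 = stayed

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Rank current customers by their estimated probability of leaving
current = [[4, 28, 3], [40, 90, 0]]
print(model.predict_proba(current)[:, 1])   # churn probability for each customer
```

The second approach, estimating remaining lifetime, calls for a different kind of model (for example, survival analysis) and is not shown here.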

5.2 Data Mining Using Familiar Tools The null hypothesis is the assumption that differences among observations are due simply to chance. The null hypothesis is not only an approach to analysis; it can also be quantified. The p-value is the probability that the null hypothesis is true. Remember, when the null hypothesis is true, nothing is really happening, because differences are due to chance. A statistic refers to a measure taken on a sample of data. Statistics is the study of these measures and the samples they are measured on. A good place to start, then, is with such useful measures, and how to look at data. The most basic descriptive statistic about discrete fields is the number of times different values occur. A histogram shows how often each value occurs in the data and can show either absolute quantities (204 times) or percentages (14.6 percent). Histograms are quite useful and easily made with Excel or any statistics package. However, a histogram describes only a single moment in time.

Figure 2: Histogram Diagram
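A histogram like the one in Figure 2 takes only a few lines to produce; the field name and values below are hypothetical.

```python
# Minimal sketch: a histogram of a discrete field, as counts and as percentages.
import pandas as pd
import matplotlib.pyplot as plt

channel = pd.Series(["web", "phone", "web", "store", "web", "phone", "store", "web"])

counts = channel.value_counts()                            # absolute quantities
percentages = channel.value_counts(normalize=True) * 100   # percentages

counts.plot(kind="bar", title="Orders by channel")
plt.ylabel("Count")
plt.show()
print(percentages.round(1))
```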



Data mining is often concerned with what is happening over time. Time series analysis requires choosing an appropriate time frame for the data; this includes not only the units of time, but also when we start counting from. Some different time frames are the beginning of a customer relationship, when a customer requests a stop, the actual stop date, and so on. The next step is to plot the time series. A time series chart provides useful information; however, it does not indicate whether the changes over time are expected or unexpected. For this, we need some tools from statistics.

One way of looking at a time series is as a partition of all the data, with a little bit on each day. There is a basic theorem in statistics, called the Central Limit Theorem: "As more and more samples are taken from a population, the distribution of the averages of the samples (or a similar statistic) follows the normal distribution. The average (what statisticians call the mean) of the samples comes arbitrarily close to the average of the entire population." The normal distribution is described by two parameters, the mean and the standard deviation. The mean is the average count for each day. The standard deviation is a measure of the extent to which values tend to cluster around the mean.

Assuming that the standardized value follows the normal distribution makes it possible to calculate the probability that the value would have occurred by chance. Actually, the approach is to calculate the probability that something further from the mean would have occurred - the p-value. The probability of an exact value is not worth asking for, because any given z-value has an arbitrarily small probability; probabilities are defined on ranges of z-values, as the area under the normal curve between two points. "Something further from the mean" can mean either of two things:

1. The probability of being more than z standard deviations from the mean.
2. The probability of being z standard deviations greater than the mean.

The first is called a two-tailed distribution and the second is called a one-tailed distribution.

Figure 3: A time chart can also be used for continuous values.
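The z-score and p-value calculations described above can be sketched as follows; the mean, standard deviation, and observed count are hypothetical numbers, and the normal distribution comes from scipy.

```python
# Minimal sketch: converting one day's count into a z-score and p-values,
# assuming the standardized value follows the normal distribution.
from scipy.stats import norm

mean, std = 120.0, 15.0   # hypothetical mean and standard deviation of daily counts
observed = 155.0          # today's count

z = (observed - mean) / std            # distance from the mean in standard deviations
p_one_tailed = norm.sf(z)              # probability of being z std devs greater than the mean
p_two_tailed = 2 * norm.sf(abs(z))     # probability of being more than z std devs from the mean

print(f"z = {z:.2f}, one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```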



Time series are an example of cross-tabulation - looking at the values of two or more variables at one time. For time series, the second variable is the time something occurred. The most commonly used statistic is the mean or average value (the sum of all the values divided by the number of them). Some other important things to look at are:

Range. The range is the difference between the smallest and largest observation in the sample. The range is often looked at along with the minimum and maximum values themselves.

Mean. This is what is called an average in everyday speech.

Median. The median value is the one that splits the observations into two equally sized groups, one having observations smaller than the median and the other containing observations larger than the median.

Mode. This is the value that occurs most often.

Variance is a measure of the dispersion of a sample, or how closely the observations cluster around the average. The range is not a good measure of dispersion because it takes only two values into account - the extremes. Removing one extreme can sometimes dramatically change the range. The variance, on the other hand, takes every value into account. The difference between a given observation and the mean of the sample is called its deviation. The variance is defined as the average of the squares of the deviations.

Standard deviation, the square root of the variance, is the most frequently used measure of dispersion. It is more convenient than variance because it is expressed in the same units as the observations rather than in terms of those units squared. This allows the standard deviation itself to be used as a unit of measurement. The z-score is an observation's distance from the mean measured in standard deviations. Using the normal distribution, the z-score can be converted to a probability or confidence level.

Correlation is a measure of the extent to which a change in one variable is related to a change in another. Correlation ranges from -1 to 1. A correlation of 0 means that the two variables are not related. A correlation of 1 means that as the first variable changes, the second is guaranteed to change in the same direction, though not necessarily by the same amount. Another measure of correlation is the R² value, which is the correlation squared; it goes from 0 (no relationship) to 1 (complete relationship).
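The measures listed above can be computed directly; the sketch below uses invented daily order counts and a second invented variable to show correlation and R².

```python
# Minimal sketch: the descriptive statistics discussed above, on hypothetical data.
from collections import Counter
import numpy as np
from scipy import stats

orders = np.array([12, 15, 15, 18, 22, 25, 30, 75])   # hypothetical daily order counts

print("range   :", orders.max() - orders.min())
print("mean    :", orders.mean())
print("median  :", np.median(orders))
print("mode    :", Counter(orders.tolist()).most_common(1)[0][0])
print("variance:", orders.var())        # average of the squared deviations
print("std dev :", orders.std())        # square root of the variance
print("z-scores:", stats.zscore(orders).round(2))

# Correlation between two variables, and the R-squared value
spend = np.array([100, 130, 120, 150, 180, 210, 260, 600])
r = np.corrcoef(orders, spend)[0, 1]
print("correlation:", round(r, 3), "R^2:", round(r ** 2, 3))
```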



5.3 Decision Tree A decision tree is a structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules. With each successive division, the members of the resulting sets become more and more similar to one another. A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable. Decision trees can also be used to estimate the value of a continuous variable, although there are other techniques more suitable to that task.

A record enters the tree at the root node. The root node applies a test to determine which child node the record will encounter next. There are different algorithms for choosing the initial test, but the goal is always the same: to choose the test that best discriminates among the target classes. This process is repeated until the record arrives at a leaf node. All the records that end up at a given leaf of the tree are classified the same way. There is a unique path from the root to each leaf, and that path is an expression of the rule used to classify the records. Different leaves may make the same classification, although each leaf makes that classification for a different reason.

A decision tree can also be used to estimate a numeric quantity such as order size. Assuming that order amount is one of the variables available in the preclassified model set, the average order size in each leaf can be used as the estimated order size for any unclassified record that meets the criteria for that leaf. It is even possible to use a numeric target variable to build the tree; such a tree is called a regression tree. Instead of increasing the purity of a categorical variable, each split in the tree is chosen to decrease the variance in the values of the target variable within each child node.

Although there are many variations on the core decision tree algorithm, all of them share the same basic procedure: repeatedly split the data into smaller and smaller groups in such a way that each new generation of nodes has greater purity than its ancestors with respect to the target variable. At the start of the process, there is a training set consisting of preclassified records - that is, the value of the target variable is known for all cases. The goal is to build a tree that assigns a class (or a likelihood of membership in each class) to the target field of a new record based on the values of the input variables. The tree is built by splitting the records at each node according to a function of a single input field. The first task, therefore, is to decide which of the input fields makes the best split. The best split is defined as one that does the best job of separating the records into groups where a single class predominates in each group. The measure used to evaluate a potential split is purity.
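A minimal sketch of the procedure just described, using scikit-learn's decision tree (a CART-style learner), is shown below; the input fields, target, and sample records are assumptions made for illustration.

```python
# Minimal sketch: growing a small classification tree and reading off its rules.
# The input fields, target, and sample records are hypothetical.
from sklearn.tree import DecisionTreeClassifier, export_text

# Preclassified training records: [age, income in $000s] -> 1 = responded, 0 = did not
X = [[25, 40], [30, 52], [45, 88], [52, 110], [23, 35], [48, 95], [60, 120], [28, 45]]
y = [0, 0, 1, 1, 0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each root-to-leaf path is a rule; every record reaching a leaf gets the same class
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[50, 100]]))   # classify a new, unclassified record
```

The purity measure used here is scikit-learn's default (Gini impurity); choosing the depth, splitting criterion, and pruning strategy is where most of the real work lies.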



VI. Definition of Social Media: Social media, broadly defined, consists of any online platform or channel for user-generated content. By this definition, for example, WordPress, Sharepoint, and Lithium qualify as social media, as do YouTube, Facebook, and Twitter. Social media more narrowly defined includes only channels for user-generated content, as distinguished from platforms, which are referred to as social technologies. By this definition, for example, YouTube, Facebook, and Twitter are social media, while WordPress, Sharepoint, and Lithium are social technologies. Social media is the collective of online communications channels dedicated to community-based input, interaction, content sharing, and collaboration. Websites and applications dedicated to forums, microblogging, social networking, social bookmarking, social curation, and wikis are among the different types of social media.

6.1 How to Use Data Mining in Social Media Social media shatters the boundaries between the real world and the virtual world. We can now integrate social theories with computational methods to study how individuals (also known as social atoms) interact and how communities (i.e., social molecules) form. The uniqueness of social media data calls for novel data mining techniques that can effectively handle user-generated content with rich social relations. The study and development of these new techniques fall under the purview of social media mining, an emerging discipline under the umbrella of data mining. Social media mining is the process of representing, analyzing, and extracting actionable patterns from social media data.

Social media mining introduces basic concepts and principal algorithms suitable for investigating massive social media data; it draws on theories and methodologies from different disciplines such as computer science, data mining, machine learning, social network analysis, network science, sociology, ethnography, statistics, optimization, and mathematics. It encompasses the tools to formally represent, measure, model, and mine meaningful patterns from large-scale social media data. Social media mining cultivates a new kind of data scientist who is well versed in social and computational theories, specialized to analyze recalcitrant social media data, and skilled at helping bridge the gap from what we know (social and computational theories) to what we want to know about the vast social media world with computational tools.


Figure 4: Social Media

6.1.1 Facebook Facebook has all our information because it has found ingenious ways to collect data as people socialize. Users fill out profiles with their age, gender, and e-mail address; some people also give additional details, such as their relationship status and mobile-phone number. A redesign last fall introduced profile pages in the form of timelines that invite people to add historical information such as places they have lived and worked. Messages and photos shared on the site are often tagged with a precise location, and in the last two years Facebook has begun to track activity elsewhere on the Internet, using an addictive invention called the "Like" button. It appears on apps and websites outside Facebook and allows people to indicate with a click that they are interested in a brand, product, or piece of digital content.

Since last fall, Facebook has also been able to collect data on users' online lives beyond its borders automatically: in certain apps or websites, when users listen to a song or read a news article, the information is passed along to Facebook, even if no one clicks "Like." Within the feature's first five months, Facebook catalogued more than five billion instances of people listening to songs online. Combine that kind of information with a map of the social connections Facebook's users make on the site, and you have an incredibly rich record of their lives and interactions. "This is the first time the world has seen this scale and quality of data about human communication," Marlow says with a characteristically serious gaze before breaking into a smile at the thought of what he can do with the data. For one thing, Marlow is confident that exploring this resource will revolutionize the scientific understanding of why people behave as they do. His team can also help Facebook influence our social behavior for its own benefit and that of its advertisers. This work may even help Facebook invent entirely new ways to make money.

6.1.2 Twitter Researchers have shown that data mining techniques can be used to understand when Twitter users start displaying supportive behaviour towards radical terror groups such as ISIS. Analysis of 154,000 Europe-based Twitter accounts and more than 104 million tweets (in English and Arabic) relating to Syria shows that users of the social media platform are more likely to adopt pro-ISIS language - and therefore display potential signs of radicalisation - when connected to other Twitter users who are linked to many of the same accounts and share and retweet similar information. The research, a close collaboration between Lancaster University and the Open University, is explained in the paper 'Mining pro-ISIS radicalisation signals from social media users'.

The research provides evidence that when users begin either sharing tweets from known pro-ISIS accounts or using extremist language - such as anti-western or pro-ISIS statements - they quickly display a large change in the language they use, tweeting new words and terms and indicating a clear shift in online behaviour. Often, before a user shows signals of having become radicalized, they discuss topics such as politics, using words such as Syria, Israel and Egypt in a negative context and highly frequently. However, once they display signals of radicalization, their language changes to use religious words more frequently, such as Allah, Muslims and Quran, it was found.

Dr Matthew Rowe, Lecturer at Lancaster University's School of Computing and Communications, said: "We found that social dynamics play a strong role where Twitter users are more likely to adopt pro-ISIS language from other users with whom they have a lot of shared connections. Prior to sharing or using radical content or language users go through a period where they display a significant increase in communicating with new users or adopting new terms. This clear change suggests that users are rejecting their prior behavior and escalating their new behavior until displaying radicalized signals."

Researchers determined whether a Twitter user was using pro-ISIS language by identifying a lexicon of pro-ISIS terms and checking whether the user used these words more than five times. They also identified known pro-ISIS Twitter accounts, or accounts suspended for supporting ISIS, and used these as reference points for where a user shared incitement content from. Analysis also shed light on the sentiment of each term within the context of tweets. The word ISIS itself was discovered to be used in a negative and likely derogatory context by Twitter users; researchers believe pro-ISIS users are more likely to use the term 'Islamic State'.

However, the researchers recognize that more work is needed to check the robustness of their data mining methods, as only a relatively small sample of 727 Twitter users out of the 154,000 accounts analyzed showed signs of pro-ISIS behavior. Most of these displayed radical behaviour during the summer of 2014, when there was significant media and social media attention given to the execution of ISIS hostages. "There does appear to be an association between information, such as of executions, appearing in the public domain and the sharing of ISIS content or adopting pro-ISIS language," said Dr Rowe.


VII. Advantages and Disadvantages of Data Mining Today, it's all about the data - and as a marketer, you know the key to improving marketing performance lies within your data. But data is also everywhere today: in flux, from different sources, and in different formats, and data preparation is laborious and time-consuming. Data mining is an important part of the knowledge discovery process: it lets us analyze enormous sets of data and extract hidden, useful knowledge. Data mining is applied effectively not only in the business environment but also in other fields, such as security agencies and government. Data mining has many advantages when used in a specific industry. Besides those advantages, it also has its own disadvantages, e.g., privacy, security, and misuse of information.

7.1 Advantages of Data Mining A. Marketing Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail or online campaigns. Through the results, marketers can take an appropriate approach to selling profitable products to targeted customers. Data mining brings similar benefits to retail companies. Through market basket analysis, a store can arrange its products so that customers find frequently purchased items placed together conveniently. In addition, it helps retail companies offer discounts on particular products that will attract more customers.

B. Governments Data mining helps government agencies by digging into and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activities.

7.2 Disadvantages of Data Mining A. Privacy Issues Concerns about personal privacy have been increasing enormously in recent years, especially as the internet booms with social networks, e-commerce, forums, and blogs. Because of privacy issues, people are afraid that their personal information will be collected and used in unethical ways, potentially causing them a lot of trouble. Businesses collect information about their customers in many ways in order to understand their purchasing behavior and trends. However, businesses do not last forever; some day they may be acquired or go out of business, and at that point the personal information they own may be sold or leaked.

B. Security Issues Security is a big issue. Businesses own information about their employees and customers, including social security numbers, birthdays, and payroll data. However, how well this information is protected is still in question. There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, stolen credit cards and identity theft have become a big problem.

C. Misuse of Information / Inaccurate Information Information collected through data mining for ethical purposes can be misused. It may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people. In addition, data mining techniques are not perfectly accurate; if inaccurate information is used for decision-making, it can cause serious consequences.

VIII. Techniques That Can Be Used by the FBI, NSA and CIA 8.1 Computer The vast majority of computer surveillance involves the monitoring of data and traffic on the Internet. In the United States, for example, under the Communications Assistance For Law Enforcement Act, all phone calls and broadband Internet traffic (emails, web traffic, instant messaging, etc.) are required to be available for unimpeded real-time monitoring by Federal law enforcement agencies. There is far too much data on the Internet for human investigators to manually search through all of it, so automated Internet surveillance computers sift through the vast amount of intercepted Internet traffic and identify and report to human investigators traffic considered interesting because it uses certain "trigger" words or phrases, visits certain types of web sites, or involves email or chat with suspicious individuals or groups. Billions of dollars per year are spent by agencies such as the Information Awareness Office, NSA, and the FBI to develop, purchase, implement, and operate systems such as Carnivore, NarusInsight, and ECHELON to intercept and analyze all of this data and extract only the information which is useful to law enforcement and intelligence agencies.

Computers can be a surveillance target because of the personal data stored on them. If someone is able to install software, such as the FBI's Magic Lantern and CIPAV, on a computer system, they can easily gain unauthorized access to this data. Such software could be installed physically or remotely. Another form of computer surveillance, known as van Eck phreaking, involves reading electromagnetic emanations from computing devices in order to extract data from them at distances of hundreds of meters. The NSA runs a database known as "Pinwale", which stores and indexes large numbers of emails of both American citizens and foreigners.

8.2 Social Network Analysis One common form of surveillance is to create maps of social networks based on data from social networking sites such as Facebook, MySpace, and Twitter, as well as from traffic analysis of phone call records such as those in the NSA call database, and others. These social network "maps" are then data mined to extract useful information such as personal interests, friendships and affiliations, wants, beliefs, thoughts, and activities. Many U.S. government agencies such as the Defense Advanced Research Projects Agency (DARPA), the National Security Agency (NSA), and the Department of Homeland Security (DHS) are investing heavily in research involving social network analysis. The intelligence community believes that the biggest threat to U.S. power comes from decentralized, leaderless, geographically dispersed groups of terrorists, subversives, extremists, and dissidents. These types of threats are most easily countered by finding important nodes in the network and removing them; to do this requires a detailed map of the network.

AT&T developed a programming language called "Hancock", which is able to sift through enormous databases of phone call and Internet traffic records, such as the NSA call database, and extract "communities of interest" - groups of people who call each other regularly, or groups that regularly visit certain sites on the Internet. AT&T originally built the system to develop "marketing leads", but the FBI has regularly requested such information from phone companies such as AT&T without a warrant and, after using the data, stores all information received in its own databases, regardless of whether or not the information was ever useful in an investigation.

Some people believe that the use of social networking sites is a form of "participatory surveillance", where users of these sites are essentially performing surveillance on themselves, putting detailed personal information on public websites where it can be viewed by corporations and governments. In 2008, about 20% of employers reported using social networking sites to collect personal data on prospective or current employees.
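The "communities of interest" idea can be illustrated with a small graph sketch; the call records below are invented, and the community-detection routine (greedy modularity maximization from networkx) is one of several standard choices, not necessarily what any agency or carrier actually uses.

```python
# Minimal sketch: build a call graph and extract "communities of interest",
# i.e. groups of people who call each other regularly. The call records are invented.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

calls = [  # (caller, callee) pairs from hypothetical call records
    ("alice", "bob"), ("bob", "carol"), ("alice", "carol"),
    ("dave", "erin"), ("erin", "frank"), ("dave", "frank"),
    ("carol", "dave"),
]

G = nx.Graph()
G.add_edges_from(calls)

# Greedy modularity maximization groups densely connected callers together
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")
```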

8.3 Data Mining and Profiling Data mining is the application of statistical techniques and programmatic algorithms to discover previously unnoticed relationships within the data. Data profiling in this context is the process of assembling information about a particular individual or group in order to generate a profile - that is, a picture of their patterns and behavior. Data profiling can be an extremely powerful tool for psychological and social network analysis. A skilled analyst can discover facts about a person that they might not even be consciously aware of themselves.

IX. How Do the FBI, CIA and NSA Use Data Mining? These agencies want to monitor Facebook, Twitter, and other sites for real-time information that could help investigations.

9.1 NSA Most people were introduced to the arcane world of data mining when National Security Agency contractor Edward Snowden allegedly leaked classified documents that detail how the U.S. government uses the technique to track terrorists. The security breach revealed that the government gathers billions of pieces of data - phone calls, emails, photos, and videos - from Google, Facebook, Microsoft, and other communications giants, then combs through the information for leads on national security threats. The disclosure caused a global uproar over the sanctity of privacy, the need for security, and the perils of government secrecy. People rightfully have been concerned about where the government gets the data - from all of us - but equal attention has not been paid to what it actually does with it.

A recent study by IBM estimates that humanity creates 2.5 quintillion bytes of data every day. (If these data bytes were pennies laid out flat, they would blanket the earth five times.) That total includes stored information - photos, videos, social-media posts, word-processing files, phone-call records, financial records, and results from science experiments - and data that normally exists for mere moments, such as phone-call content and Skype chats. The concept behind the NSA's data-mining operation is that this digital information can be analyzed to establish connections between people, and these links can generate investigative leads. But in order to examine data, it has to be collected - from everyone. As the data-mining saying goes: to find a needle in a haystack, you first need to build a haystack.

In addition, data mining relies on metadata tags that enable algorithms to identify connections. Metadata is data about data - for example, the names and sizes of files on your computer. In the digital world, the label placed on data is called a tag. Tagging data is a necessary first step to data mining because it enables analysts (or the software they use) to classify and organize the information so it can be searched and processed. Tagging also enables analysts to parse the information without examining the contents. This is an important legal point in NSA data mining because the communications of U.S. citizens and lawful permanent resident aliens cannot be examined without a warrant. Metadata on a tag has no such protection, so analysts can use it to identify suspicious behavior without fear of breaking the law.


The data-analysis firm IDC estimates that only 3 percent of the information in the digital universe is tagged when it's created, so the NSA has a sophisticated software program that puts billions of metadata markers on the info it collects. These tags are the backbone of any system that makes links among different kinds of data - such as video, documents, and phone records. For example, data mining could call attention to a suspect on a watch list who downloads terrorist propaganda, visits bomb-making websites, and buys a pressure cooker. (This pattern matches the behavior of the Tsarnaev brothers, who are accused of planting bombs at the Boston Marathon.) This tactic assumes terrorists have well-defined data profiles - something many security experts doubt.

The NSA has been a big promoter of software that can manage vast databases. One of these programs is called Accumulo, and while there is no direct evidence that it is being used in the effort to monitor global communications, it was designed precisely for tagging billions of pieces of unorganized, disparate data. The secretive agency's custom tool, which is based on Google programming, is actually open-source. This year a company called Sqrrl commercialized it and hopes the healthcare and finance industries will use it to manage their own big-data sets.

The NSA, home to the federal government's code makers and code breakers, is authorized to snoop on foreign communications. The agency also collects a vast amount of data - trillions of pieces of communication generated by people across the globe. The NSA does not chase the crooks, terrorists, and spies it identifies; it sifts information on behalf of other government players such as the Pentagon, CIA, and FBI. Here are the basic steps: To start, one of 11 judges on the secret Foreign Intelligence Surveillance (FISA) Court accepts an application from a government agency to authorize a search of data collected by the NSA. Once authorized - and most applications are - data-mining requests first go to the FBI's Electronic Communications Surveillance Unit (ECSU), according to PowerPoint slides taken by Snowden. This is a legal safeguard: FBI agents review the request to ensure no U.S. citizens are targets. The ECSU passes appropriate requests to the FBI Data Intercept Technology Unit, which obtains the information from Internet company servers and then passes it to the NSA to be examined with data-mining programs. (Many communications companies have denied they open their servers to the NSA; federal officials claim they cooperate. As of press time, it's not clear who is correct.) The NSA then passes relevant information to the government agency that requested it.

9.1.1 What the NSA Is Up To A. Phone-Metadata Mining Dragged Into the Light The NSA controversy began when Snowden revealed that the U.S. government was collecting the phone-metadata records of every Verizon customer - including millions of Americans. At the request of the FBI, FISA Court judge Roger Vinson issued an order compelling the company to hand over its phone records. The content of the calls was not collected, but national security officials call it "an early warning system" for detecting terror plots (see "Connecting the Dots: Phone-Metadata Tracking").

B. PRISM Goes Public On the heels of the metadata-mining leak, Snowden exposed another NSA surveillance effort, called US-984XN. Every collection platform or source of raw intelligence is given a name, called a Signals Intelligence Activity Designator (SIGAD), and a code name. SIGAD US-984XN is better known by its code name: PRISM. PRISM involves the collection of digital photos, stored data, file transfers, emails, chats, videos, and video conferencing from nine Internet companies. U.S. officials say this tactic helped snare Khalid Ouazzani, a naturalized U.S. citizen who the FBI claimed was plotting to blow up the New York Stock Exchange. Ouazzani was in contact with a known extremist in Yemen, which brought him to the attention of the NSA. It identified Ouazzani as a possible conspirator and gave the information to the FBI, which "went up on the electronic surveillance and identified his coconspirators," according to congressional testimony by FBI deputy director Sean Joyce. (Details of how the agency identified the others have not been disclosed.) The NYSE plot fizzled long before the FBI intervened, but Ouazzani and two others pleaded guilty to laundering money to support al-Qaida. They were never charged with anything related to the bomb plot.

C. Mining Data as It's Created NSA analysts can receive "real-time notification of an email event such as a login or sent message" and "real-time notification of a chat login," the slides say. That's pretty straightforward use, but whether real-time information can stop unprecedented attacks is subject to debate. Alerting a credit-card holder of sketchy purchases in real time is easy; building a reliable model of an impending attack in real time is infinitely harder.



Figure 5: WorldWide Data

D. What is XKeyscore? In late July Snowden released a 32-page, top-secret PowerPoint presentation that describes software that can search hundreds of databases for leads. Snowden claims this program enables low-level analysts to access communications without oversight, circumventing the checks and balances of the FISA court. The NSA and White House vehemently deny this, and the documents don't indicate any misuse. The slides do describe a powerful tool that NSA analysts can use to find hidden links inside troves of information. "My target speaks German but is in Pakistan— how can I find him?" one slide reads. Another asks: "My target uses Google Maps to scope target locations—can I use this information to determine his email address?" This program enables analysts to submit one query to search 700 servers around the world at once, combing disparate sources to find the answers to these questions.

E. How Far Can the Data Stretch? Bomb-sniffing dogs sometimes bark at explosives that are not there. This kind of mistake is called a false positive. In data mining, the equivalent is a computer program sniffing around a data set and coming up with the wrong conclusion. This is when having a massive data set may be a liability. When a program examines trillions of connections between potential targets, even a very small false-positive rate equals tens of thousands of dead-end leads that agents must chase down—not to mention the unneeded incursions into innocent people's lives.
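The arithmetic behind that concern is simple; with hypothetical numbers it looks like this.

```python
# Back-of-the-envelope arithmetic with hypothetical numbers: at very large scale,
# even a tiny false-positive rate produces a flood of dead-end leads.
connections_examined = 1_000_000_000_000   # one trillion connections scanned
false_positive_rate = 1e-8                 # program is wrong one time in 100 million

false_leads = connections_examined * false_positive_rate
print(f"{false_leads:,.0f} dead-end leads to chase down")   # 10,000
```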



F. Analytics to See the Future Ever wonder where those Netflix recommendations in your email inbox or suggested reading lists on Amazon come from? Your previous interests directed an algorithm to pitch those products to you. Big companies believe more of this kind of targeted marketing will boost sales and reduce costs. For example, this year Walmart bought a predictive analytics startup called Inkiru. The company makes software that crunches data to help retailers develop marketing campaigns that target shoppers when they are most likely to buy certain products.

G. Pattern Recognition or Prophecy? In 2011 British researchers created a game that simulated a van-bomb plot, and 60 percent of the "terrorist" players were spotted by a program called DScent, based on their "purchases" and "visits" to the target site. The ability of a computer to automatically match security-camera footage with records of purchases may seem like a dream to law-enforcement agents trying to save lives, but it's the kind of ubiquitous tracking that alarms civil libertarians. Although neither the NSA nor any other agency has been accused of misusing the data it collects, the public's fear over its collection remains. The question becomes, how much do you trust the people sitting at the keyboards to use this information responsibly? Your answer largely determines how you feel about NSA data mining.

9.2 CIA The biggest problem for Palantir's business may be just how well its software works: it helps its customers see too much. In the wake of NSA leaker Edward Snowden's revelations of the agency's mass surveillance, Palantir's tools have come to represent privacy advocates' greatest fears of data-mining technology - Google-level engineering applied directly to government spying. That combination of Big Brother and Big Data has come into focus just as Palantir is emerging as one of the fastest-growing startups in the Valley, threatening to contaminate its first public impressions and render the firm toxic in the eyes of customers and investors just when it needs those most. "They're in a scary business," says Electronic Frontier Foundation attorney Lee Tien. ACLU analyst Jay Stanley has written that Palantir's software could enable a "true totalitarian nightmare, monitoring the activities of innocent Americans on a mass scale."

It appears that the Central Intelligence Agency has been taking advantage of a legal loophole to avoid submitting reports on cyber surveillance, based on a 2007 definition of "data mining" established during the last Bush administration. According to the Huffington Post, which began to look into a Congressionally mandated report on data mining submitted by agencies such as the Department of Homeland Security (DHS), the CIA itself does not present such information because it does not consider its electronic surveillance activities to be data mining at all.


Under the current law, the 2007 Federal Agency Data Mining Reporting Act calls for annual agency reports on activities that collect data involving pattern detection within electronic databases. That definition does not, however, cover information retrieved by targeting a single individual, even though such surveillance can obviously yield a trove of data about any number of additional people. An investigation by Wired in 2009, for example, revealed that the CIA’s investment arm, In-Q-Tel, was funding a software firm that specialized in scraping mounds of data posted to blogs, forums, and social networking websites such as Twitter. The firm, Visible Technologies, was said to crawl and archive over half a million websites per day and to produce customized reporting based on real-time keyword searches. At the time of the report, Visible chief executive officer Dan Vetras called the CIA an “end customer” for its product.

Pattern-based searches of the sort that must currently be reported to Congress could include detection of suspicious behavior by DHS; a good example cited by the Huffington Post is a passenger who departs the US with no baggage and returns with a suspicious quantity of suitcases. But why would the CIA invest in software such as Visible’s, which can compile mass amounts of information on thousands of individuals, if not to mine that database? The CIA’s chief technology officer raised eyebrows last month after outlining the agency’s attempt to “collect everything and hang on to it forever,” in reference to the overwhelming amount of information being transmitted via cell phone texts or via social media platforms such as Twitter. Those comments came only a few days after Federal Computer Week reported on the agency’s $600 million deal with Amazon for cloud computing services. Again, one might wonder why the agency would need such an enormous amount of cloud storage to compile databases it purports to have no interest in mining.

According to Sharon Bradford Franklin, senior legal counsel for the Constitution Project, the CIA’s interpretation of the Data Mining Reporting Act may well comply with the current law, though the growth in the CIA’s capabilities would seem to merit a re-evaluation of that act: “The definition is overly narrow, and so the act cannot fully serve its purpose of providing greater transparency, accountability and oversight.” Even Mary Ellen Callahan, the former chief privacy officer for DHS, who herself oversaw the agency’s data mining reports, seems unconvinced that the 2007 legislation has aged well. “It is inconsistent with common understandings of data mining,” Callahan said. “Congress hasn’t changed it, so Congress seems to think that the pattern-based data mining report is more important,” she added.
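The baggage example makes the legal distinction concrete: a pattern-based search scans everyone's records for a behavioral signature, while a subject-based search starts from one named individual. The Python sketch below illustrates the difference using invented travel records and field names.

    # Pattern-based vs. subject-based queries over the same (invented) travel records.
    records = [
        {"traveler": "A. Jones", "direction": "outbound", "bags": 0},
        {"traveler": "A. Jones", "direction": "inbound",  "bags": 6},
        {"traveler": "B. Smith", "direction": "outbound", "bags": 2},
        {"traveler": "B. Smith", "direction": "inbound",  "bags": 2},
    ]

    def pattern_based(rows):
        """Scan everyone for the signature: departed with no bags, returned with many."""
        by_person = {}
        for r in rows:
            by_person.setdefault(r["traveler"], {})[r["direction"]] = r["bags"]
        return [name for name, trips in by_person.items()
                if trips.get("outbound") == 0 and trips.get("inbound", 0) >= 4]

    def subject_based(rows, name):
        """Pull every record for one named individual."""
        return [r for r in rows if r["traveler"] == name]

    print(pattern_based(records))                   # ['A. Jones'] - reportable pattern search
    print(len(subject_based(records, "A. Jones")))  # 2 - individual-targeted query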



Whether or not Congress moves to update the government’s definition of data mining remains to be seen. According to a CIA spokesperson who responded to the Huffington Post’s report, under the current act the agency “did not have any reportable activities.”

Soft robots that can grasp delicate objects, computer algorithms designed to spot an “insider threat,” and artificial intelligence that will sift through large data sets: these are just a few of the technologies being pursued by companies with investment from In-Q-Tel, the CIA’s venture capital firm, according to a document obtained by The Intercept. Yet among the 38 previously undisclosed companies receiving In-Q-Tel funding, the research focus that stands out is social media mining and surveillance; the portfolio document lists several tech companies pursuing work in this area, including Dataminr, Geofeedia, PATHAR, and TransVoyant.

Figure 6: In-Q-Tel’s investment process.

A. Screen grab from In-Q-Tel’s website. Those four firms, which provide unique tools to mine data from platforms such as Twitter, presented at a February “CEO Summit” in San Jose sponsored by the fund, along with other In-Q-Tel portfolio companies. The investments appear to reflect the CIA’s increasing focus on monitoring social media. Last September, David Cohen, the CIA’s second-highest ranking official, spoke at length at Cornell University about a litany of challenges stemming from the new media landscape. The Islamic State’s “sophisticated use of Twitter and other social media platforms is a perfect example of the malign use of these technologies,” he said.


Social media also offers a wealth of potential intelligence; Cohen noted that Twitter messages from the Islamic State, sometimes called ISIL, have provided useful information. “ISIL’s tweets and other social media messages publicizing their activities often produce information that, especially in the aggregate, provides real intelligence value,” he said. The latest round of In-Q-Tel investments comes as the CIA has revamped its outreach to Silicon Valley, establishing a new wing, the Directorate of Digital Innovation, which is tasked with developing and deploying cutting-edge solutions by directly engaging the private sector. The directorate is working closely with In-Q-Tel to integrate the latest technology into agency-wide intelligence capabilities.

Figure 7: Dataminr

Dataminr directly licenses a stream of data from Twitter to spot trends and detect emerging threats.

B. Screen grab from Dataminr’s website. Dataminr directly licenses a stream of data from Twitter to visualize and quickly spot trends on behalf of law enforcement agencies and hedge funds, among other clients.



Figure 8: Geofeedia

Geofeedia collects geotagged social media messages to monitor breaking news events in real time.

C. Screen grab from Geofeedia’s website. Geofeedia specializes in collecting geotagged social media messages, from platforms such as Twitter and Instagram, to monitor breaking news events in real time. The company, which counts dozens of local law enforcement agencies as clients, markets its ability to track activist protests on behalf of both corporate interests and police departments.
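Geofeedia's core function, pulling in geotagged posts that fall inside a geographic area of interest, can be approximated with a simple bounding-box filter. The coordinates and messages in the Python sketch below are made up, and the code does not reflect Geofeedia's actual system.

    # Bounding-box filter for geotagged posts (toy data; not Geofeedia's API).
    posts = [
        {"text": "Crowd gathering downtown", "lat": 41.881, "lon": -87.630},
        {"text": "Quiet afternoon at the lake", "lat": 42.390, "lon": -87.820},
    ]

    # Hypothetical area of interest: a rough box around a downtown district.
    AREA = {"lat_min": 41.85, "lat_max": 41.91, "lon_min": -87.66, "lon_max": -87.60}

    def in_area(post, box):
        return (box["lat_min"] <= post["lat"] <= box["lat_max"]
                and box["lon_min"] <= post["lon"] <= box["lon_max"])

    breaking = [p["text"] for p in posts if in_area(p, AREA)]
    print(breaking)  # ['Crowd gathering downtown']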

Figure 9: PATHAR


PATHAR mines social media to determine networks of association.

D. Screen grab from PATHAR’s website. PATHAR’s product, Dunami, is used by the Federal Bureau of Investigation to “mine Twitter, Facebook, Instagram and other social media to determine networks of association, centers of influence and potential signs of radicalization,” according to an investigation by Reveal.

Figure 10: TransVoyant

TransVoyant analyzes data points to deliver insights and predictions about global events.

E. Screen grab from TransVoyant’s website. TransVoyant, founded by former Lockheed Martin Vice President Dennis Groseclose, provides a similar service by analyzing multiple data points for so-called decision-makers. The firm touts its ability to monitor Twitter to spot “gang incidents” and threats to journalists. A team from TransVoyant has worked with the U.S. military in Afghanistan to integrate data from satellites, radar, reconnaissance aircraft, and drones.

Dataminr, Geofeedia, and PATHAR did not respond to repeated requests for comment. Heather Crotty, the director of marketing at TransVoyant, acknowledged an investment from In-Q-Tel, but could not discuss the scope of the relationship. In-Q-Tel “does not disclose the financial terms of its investments,” Crotty said. Carrie A. Sessine, the vice president for external affairs at In-Q-Tel, also declined an interview because the fund “does not participate in media interviews or opportunities.”


Over the last decade, In-Q-Tel has made a number of public investments in companies that specialize in scanning large sets of online data. In 2009, the fund partnered with Visible Technologies, which specializes in reputation management over the internet by identifying the influence of “positive” and “negative” authors on a range of platforms for a given subject. And six years ago, In-Q-Tel formed partnerships with NetBase, another social media analysis firm that touts its ability to scan “billions of sources in public and private online information,” and Recorded Future, a firm that monitors the web to predict events in the future.
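Sorting authors into "positive" and "negative" voices on a topic, as Visible Technologies was described as doing, reduces at its simplest to aggregating per-author sentiment. The Python sketch below uses a crude invented lexicon and sample posts; real systems use far richer language models.

    # Crude lexicon-based author sentiment sketch (toy lexicon and posts).
    POSITIVE = {"great", "love", "excellent", "good"}
    NEGATIVE = {"terrible", "hate", "awful", "bad"}

    posts = [
        {"author": "alice", "text": "Love the new release, excellent work"},
        {"author": "alice", "text": "Really good support experience"},
        {"author": "bob",   "text": "Terrible update, I hate the new layout"},
    ]

    def post_score(text):
        words = [w.strip(".,!?").lower() for w in text.split()]
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    author_scores = {}
    for p in posts:
        author_scores[p["author"]] = author_scores.get(p["author"], 0) + post_score(p["text"])

    for author, score in author_scores.items():
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        print(author, label)  # alice positive / bob negative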

9.3 FBI The FBI has become the latest federal agency interested in mining social media for intelligence information. The agency is looking for ideas for developing a social media application that can search for significant data from social networking activity to be used for intelligence purposes, according to a request for information (RFI) posted on FedBizOpps.gov. The FBI wants a “geospatial alert and analysis mapping application” that will allow its Strategic Information and Operations Center (SIOC) to “quickly vet, identify and geo-locate breaking events, incidents and emerging threats,” according to the RFI.

From 2009 through 2011, according to data provided by the FBI, the bureau spent a significant amount of its limited time and resources conducting almost 43,000 assessments related to either counterterrorism or counterintelligence. Fewer than 5 percent of them turned up any suspicion of criminal wrongdoing. And what does the FBI do with all of the information it has gathered on innocent Americans? The bureau maintains it for decades, just in case it may be useful in the future. The official guidelines governing the agency’s activities are explicit: all information it collects is kept and sometimes shared, “regardless of whether it furthers investigative objectives,” because it may “eventually serve a variety of valid analytic purposes,” even if that means keeping the information in an FBI database for as long as 30 years. The policy is similar for information gathered through “national security letters,” the secretive legal procedure that allows the FBI to collect specific information on Americans if the bureau completes paperwork saying the information may be “relevant” to a terrorism investigation. That data, which include many of the same kinds of telephone records the NSA is acquiring, can also be stored for up to 30 years if it has even potential investigative value.

The federal government’s use of “suspicious activity reports” tells a similar story. Local, state and federal law-enforcement officials use them to file alerts about a wide range of “suspicious activity.” The activity reports that are deemed to have some connection to terrorism are widely shared throughout the government. Yet there doesn’t even need to be “reasonable suspicion” of a terrorist connection for a report to be filed. As the Department of Homeland Security has acknowledged, this practically ensures that these alerts will sweep up information about innocent Americans. Again, one would think a suspicious-activity report that provided no evidence of possible terrorist threats would be discarded immediately. To the contrary, even a report without any link to terrorism is kept in a widely available FBI database for six months, in a separate classified database for five years, and in yet another FBI database for at least 25 more years.

The agency wants the tool to be in the form of a “secure, lightweight web application portal, using mashup technology,” and plans to use it to share information with intelligence partners to coordinate and synchronize awareness of events across operations. Moreover, the application must be “infinitely flexible” to adapt to changing threats, and those using it must have access to a common operating dashboard from which they can view unclassified open-source information feeds and use tools to analyze social media during a crisis as it happens. Other features the FBI hopes its data-mining tool will have include the ability to automatically “search and scrape” social-networking and open-source news websites for information about breaking world events. It also wants to give users the ability to run relevant keyword searches on sites such as Facebook, CNN, Fox News, and other popular information outlets on the Internet.

The FBI is certainly not the first federal agency to recognize the value of information shared via social media. In addition to its own aim of building a data-mining tool, the FBI will also likely benefit from the fruits of IARPA’s research efforts in this area. IARPA is seeking to create technology that will continuously analyze and mine data from websites, blogs, social media, and other public information to help it better forecast global events. The FBI, too, has said it monitors Twitter, Facebook, and other popular websites to help it maintain situational awareness and perform its necessary duties in support of international crises and events.
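The "search and scrape" plus keyword-search capability described in the RFI boils down, at its simplest, to matching a stream of public posts against a set of watch terms and pushing the hits to a dashboard or map. The keywords, posts, and alert format in the Python sketch below are invented for illustration.

    # Keyword watch-list alerting over a stream of public posts (illustrative only).
    WATCH_TERMS = {"explosion", "evacuation", "protest"}

    def alerts(stream):
        """Yield an alert for every post that mentions a watched term."""
        for post in stream:
            words = {w.strip(".,!?").lower() for w in post["text"].split()}
            hits = WATCH_TERMS & words
            if hits:
                yield {"source": post["source"], "terms": sorted(hits), "text": post["text"]}

    sample_stream = [
        {"source": "twitter", "text": "Evacuation underway near the stadium"},
        {"source": "news",    "text": "City council approves new budget"},
    ]

    for a in alerts(sample_stream):
        print(a)  # one alert, for the evacuation post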



X. Case Study Twitter Inc. cut off U.S. intelligence agencies from access to a service that sifts through the entire output of its social-media postings, the latest example of tension between Silicon Valley and the federal government over terrorism and privacy. The move, which hasn’t been publicly announced, was confirmed by a senior U.S. intelligence official and other people familiar with the matter. The service, which sends out alerts of unfolding terror attacks, political unrest and other potentially important events, isn’t directly provided by Twitter, but instead by Dataminr Inc., a private company that mines public Twitter feeds for clients. Twitter owns about a 5% stake in Dataminr, the only company it authorizes both to access its entire real-time stream of public tweets and to sell it to clients.

Dataminr executives recently told intelligence agencies that Twitter didn’t want the company to continue providing the service to them, according to a person familiar with the matter. The senior intelligence official said Twitter appeared to be worried about the “optics” of seeming too close to American intelligence services. Twitter said it has a long-standing policy barring third parties, including Dataminr, from selling its data to a government agency for surveillance purposes. The company wouldn’t comment on how Dataminr, a close business partner, was able to provide its service to the government for two years, or why that arrangement came to an end. In a statement, Twitter said its “data is largely public and the U.S. government may review public accounts on its own, like any user could.” The move doesn’t affect Dataminr’s service to the financial industry, news media or other clients outside the intelligence community. The Wall Street Journal is involved in a trial of Dataminr’s news product.

Dataminr’s software detects patterns in hundreds of millions of daily tweets, traffic data, news wires and other sources. It matches the data with market information and geographic data, among other things, to determine what information is credible or potentially actionable. For instance, Dataminr gave the U.S. intelligence community an alert about the Paris terror attacks shortly after they began to unfold last November. That type of information makes it “an extremely valuable tool” to detect events in real time, the intelligence official said. In March, the company says, it first notified clients about the Brussels attacks 10 minutes ahead of the news media, and it has provided alerts on ISIS attacks on the Libyan oil sector, the Brazilian political crisis, and other sudden upheavals around the world.
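Dataminr has not published its algorithms, but one standard way to flag an unfolding event in a tweet stream is burst detection: compare the mention count of a term in the current time window with its recent baseline and alert on large spikes. The Python sketch below works under that assumption, with invented counts and an arbitrary threshold.

    # Simple burst detection on per-minute mention counts (illustrative numbers).
    from statistics import mean, stdev

    # Hypothetical mentions per minute of one keyword; the last value is the current window.
    counts = [4, 5, 3, 6, 4, 5, 4, 6, 5, 4, 52]

    baseline, current = counts[:-1], counts[-1]
    mu, sigma = mean(baseline), stdev(baseline)
    z = (current - mu) / sigma if sigma else float("inf")

    ALERT_THRESHOLD = 4.0  # arbitrary cut-off for this sketch
    if z > ALERT_THRESHOLD:
        print(f"Burst detected: {current} mentions vs. baseline {mu:.1f} (z = {z:.1f})")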



U.S. government agencies that used the Dataminr service are unhappy about the decision and are hoping the companies will reconsider, according to the intelligence official. “If Twitter continues to sell this [data] to the private sector, but denies the government, that’s hypocritical,” said John C. Inglis, a former deputy director of the National Security Agency who left in 2014. “I think it’s a bad sign of a lack of appropriate cooperation between a private-sector organization and the government.”

Analysis of Twitter and other social-media services has become increasingly important to intelligence and law-enforcement agencies tracking terror groups. Islamic State posts everything from battlefield positions to propaganda and threats over Twitter. San Francisco-based Twitter deletes thousands of accounts a month for violating its antiterror policies, but Islamic State supporters create new accounts almost as quickly. “The volume of the group’s activity on Twitter yields a vast amount of data that is a crucial tool for counterterrorism practitioners working to manage threats,” said Michael S. Smith II, chief operating officer of the security consulting firm Kronos Advisory. “Twitter’s decision could have grave consequences.” In a speech last September, David S. Cohen, a deputy director of the Central Intelligence Agency, discussed the importance of “open source” social-media data gathered by the CIA, saying Islamic State’s “tweets and other social-media messages publicizing their activities often produce information that, especially in the aggregate, provides real intelligence value.”

Silicon Valley and the U.S. government have been locked in intensifying conflicts over cooperation since the revelations by former National Security Agency contractor Edward Snowden about government surveillance of electronic communication. Most recently, Apple Inc. and the Justice Department were embroiled in a legal showdown over demands by the Federal Bureau of Investigation to unlock an iPhone used by one of the killers in the San Bernardino, Calif., attack in December. That fight, which unlike the Dataminr product involved the release of private data, ended in March when the FBI found another way to access the phone.

In-Q-Tel, a venture-capital arm of the U.S. intelligence community, has been investing in data-mining companies to beef up the government’s ability to sort through massive amounts of information. It has, for example, invested in the data-mining firms Palantir Technologies Inc. and Recorded Future Inc. U.S. intelligence agencies gained access to Dataminr’s service after an In-Q-Tel investment in the firm, according to a person familiar with the matter. When a pilot program arranged by In-Q-Tel ended recently, Twitter told Dataminr it didn’t want to continue the relationship with intelligence agencies, this person said.



“Post-Snowden, American-based information technology companies don’t want to be seen as an arm of the U.S. intelligence community,” said Peter Swire, a Georgia Institute of Technology law professor and expert on data privacy. Dataminr, based in New York, was launched seven years ago by three former Yale University roommates. A financing round early last year valued it at $700 million, according to Dow Jones VentureSource. Its product goes beyond what a typical Twitter user could find in the jumble of daily tweets, employing sophisticated algorithms and geolocation tools to unearth relevant patterns. Dataminr has a separate $255,000 contract, still in force, to provide its breaking-news alert service to the Department of Homeland Security.



XI. Conclusion With the increase of economic globalization and the evolution of information technology, financial data are being generated and accumulated at an unprecedented pace. As a result, there is a critical need for automated approaches to the effective and efficient use of massive amounts of financial data to support companies and individuals in strategic planning and investment decision making. Data mining techniques have been used to uncover hidden patterns and predict future trends and behaviors in financial markets. The competitive advantages achieved by data mining include increased revenue, reduced cost, and much improved marketplace responsiveness and awareness.

Social media mining is the process of representing, analyzing, and extracting actionable patterns from social media data. It applies basic concepts and principal algorithms suitable for investigating massive social media data, drawing on theories and methodologies from different disciplines such as computer science, data mining, machine learning, social network analysis, network science, sociology, ethnography, statistics, optimization, and mathematics. It encompasses the tools to formally represent, measure, model, and mine meaningful patterns from large-scale social media data.

Data mining brings many benefits to businesses, society, and governments, as well as to individuals. However, privacy, security, and misuse of information become serious problems if they are not addressed and resolved properly. Adopt the mindset of giving out only the personal data that you absolutely must, for example at checkout or when signing up for an online account, to significantly reduce your digital footprint, and avoid companies that do not respect your privacy. Just as one bad actor can set off a privacy scare, one good actor, like Edward Snowden or you, can take the necessary steps to reduce your exposure and strengthen your sense of privacy. Most of the recent stories about big data collection and breaches have a central theme: the little guy matters and can do something. Whether that individual is a Facebook user who refuses to give the site her real name, an NSA whistleblower who tells the world when it’s being watched, or a person using a tool to block companies from tracking him online, each person has the power to move privacy forward or diminish it.



XII. The Future of Data Mining In the short term, the results of data mining will be in profitable, if mundane, business-related areas. Micro-marketing campaigns will explore new niches, and advertising will target potential customers with new precision. In the medium term, data mining may become as common and easy to use as e-mail. We may use these tools to find the best airfare to New York, dig up the phone number of a long-lost classmate, or find the best prices on lawn mowers. The long-term prospects are truly exciting. Imagine intelligent agents turned loose on medical research data or on sub-atomic particle data. Computers may reveal new treatments for diseases or new insights into the nature of the universe. There are potential dangers, though, as the privacy and misuse concerns discussed above make clear.


