Sentimental Analysis: April 2017

Monday, April 10, 2017

3-1. Assessment of the Sentimental Analysis Tools

Before conducting the actual sentimental analysis on the news headlines, it is preferable to test the reliability of each sentimental analysis tool used in this research. We used three different tools: “CoreNLP”, developed at Stanford University; “Alchemy API” by IBM; and “Text Analytics” by Microsoft.

To gauge how well the software performs, we ran an experiment on how accurately those applications can evaluate the sentiment of short texts such as news headlines. We also looked for a method to make the calculated results plausible enough to support a logical conclusion. For this preliminary survey, we collected 50 positive and 50 negative news headlines from “Newsnow”, a news site with sections called “Bad news” and “Good news” where likely “positive” and “negative” headlines can be found. All headlines were carefully examined to check that the sorting was correct.

Then we ran the tools to find out how precisely they could classify “positive headlines” as “positive” and “negative headlines” as “negative”. One thing to note is that the applications differ in style. “CoreNLP” is a Java-based program, so it can be run from the command prompt. In contrast, the other two are provided as web-based tools, which makes them relatively easier for researchers to use. As for the classification scheme, three sentiments, “Positive”, “Negative” and “Neutral”, were adopted as indicators for the experiment. Although “CoreNLP” and “IBM’s Alchemy API” assign each text directly to one of the three categories, “Microsoft’s Text Analytics” only reports sentiment as a numerical value. Therefore, the results from “Text Analytics” were converted to the three categories manually according to the value. If the value was below 40, the text's sentiment was regarded as "negative". If it was 40 to 60, the outcome was treated as "neutral". If it was more than 60, it was set as "positive".
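The manual conversion of “Text Analytics” values described above can be sketched as a small function. This is only an illustration: the function name `label_from_score` is hypothetical, and the sketch assumes the value is expressed on a 0 to 100 scale, with the boundary values 40 and 60 themselves counted as "neutral".

```python
def label_from_score(score: float) -> str:
    """Map a numerical sentiment value to one of three labels.

    Assumption: the value is on a 0-100 scale, and the boundaries
    40 and 60 themselves fall into the "neutral" band.
    """
    if score < 40:
        return "negative"
    if score <= 60:
        return "neutral"
    return "positive"


# Hypothetical values, not results from the experiment.
for value in (12.5, 40, 55, 60, 87.3):
    print(value, label_from_score(value))
```

Keeping the thresholds in one function makes it easy to rerun the comparison with different cut-off values if 40 and 60 turn out to be too strict or too loose.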

The results are shown below in Chart.1 to Chart.3. First, the low recognition rate for “positive headlines” is noticeable (Chart.1). This disappointing outcome might suggest that the three tools are not good at detecting “positive” sentiment in short texts. In contrast, they responded very well to “negative headlines”: all of them classified “negative headlines” as “negative” quite reliably. Notably, the accuracy of IBM’s Alchemy API for “negative headlines” reached almost 100% if we ignore the “Neutral” results (Chart.2). In addition, the mean accuracy of every tool was below 80%, which is relatively low compared with the figures reported in various papers (Table.1).

Table.1

Thesis	Classification	Accuracy
Sentiment Analysis: Capturing Favorability Using Natural Language Processing	Negative or Positive	75 – 95%
Thumbs up? Sentiment Classification using Machine Learning Techniques	Negative or Positive	72.8 - 82.9%
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank	Negative or Positive	79.4 - 85.4%
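The accuracy figures discussed in this section, with and without the “Neutral” results, can be computed along the following lines. This is a minimal sketch: the function `accuracy` and the sample outputs are hypothetical and are not the data from the experiment.

```python
def accuracy(predicted, expected, ignore_neutral=False):
    """Share of headlines whose predicted label equals the expected label.

    With ignore_neutral=True, headlines predicted as "neutral" are
    dropped from both the numerator and the denominator.
    """
    if ignore_neutral:
        predicted = [p for p in predicted if p != "neutral"]
    if not predicted:
        return 0.0
    hits = sum(1 for p in predicted if p == expected)
    return hits / len(predicted)


# Hypothetical tool outputs for five negative headlines.
outputs = ["negative", "negative", "neutral", "negative", "positive"]
print(accuracy(outputs, "negative"))                       # 0.6
print(accuracy(outputs, "negative", ignore_neutral=True))  # 0.75
```

The example shows why ignoring “Neutral” raises the reported accuracy: the neutral headline leaves the denominator, so three hits are divided by four instead of five.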

Saturday, April 8, 2017

2-3. Previous Research

The studies above clearly suggest that “Sentimental Analysis” draws on many aspects that have been developed in various fields. It would therefore be beneficial to systematize all the elements introduced in the previous part.

It seems safe to say that the direct ancestors of “Sentimental Analysis” are “Content Analysis”, “Statistics” and “Computer Science”, because the new approach uses components that those preceding fields established over a long time. The relationship is illustrated in Figure.5 below. The figure shows how different areas of linguistic research synergize with each other and how Sentimental Analysis is constructed academically.

In “Foundations of behavioural research (1986)”, Kerlinger characterized “Content Analysis” as “a method of studying and analyzing communication in a systematic, objective, and quantitative manner for the purpose of measuring variables (Content Analysis A method, p.2).” “Quantitative Content Analysis”, in turn, is a part of “Content Analysis” that leans more heavily on “Statistics” to attain its objective. Zhang and Wildemuth contrast it with the qualitative approach, for which they cite three definitions:

• “a research method for the subjective interpretation of the content of text data through the systematic classification process of coding and identifying themes or patterns”

• “an approach of empirical, methodological controlled analysis of texts within their context of communication, following content analytic rules and step by step models, without rash quantification”

• “any qualitative data reduction and sense-making effort that takes a volume of qualitative material and attempts to identify core consistencies and meanings” (Zhang & Wildemuth, p. 1)

Moreover, they indicate the strong tie between the two approaches, saying: “The quantitative approach produces numbers that can be manipulated with various statistical methods (Zhang & Wildemuth, p. 2).” It is therefore a reasonable inference that the combination of “Content Analysis” and “Statistics” creates “Quantitative Content Analysis”. The same can be said of the relationship between “Content Analysis” and “Computer Science”, which creates “Computer Content Analysis”. “Computer Content Analysis” is a computer-assisted evaluation method that treats text or its contents (Alexa, 1997, p. 5). This modern technique has evolved largely thanks to the abundant electronic text and databases, from a large variety of sources, that are available to researchers who dedicate their time and effort to mining text for its structures. One notable trait of machine-readable text is that applications can process it, so the data in the target materials can easily be collected and sorted for analysis. As stated in Alexa’s 1997 report, computer systems such as "General Inquirer", an IBM 7090 program system developed at Harvard in the spring of 1961 by Philip J. Stone and his colleagues, and "WORDS", proposed in 1969 by H.P. Iker and N.I. Harway, contributed to the establishment of “Computer Content Analysis” (Alexa, 1997, p. 5). In addition to this software for large-scale “Content Analysis”, Alexa pointed out that the proposals and techniques DeWeese presented in two papers in the late 1970s (DeWeese, 1976 & 1977) may be regarded as path-breaking works at the inception of “Computer Content Analysis”. The progress above clearly suggests that “Computer Content Analysis” consists of “Content Analysis” and “Computer Science”. Though “Statistics” may be involved in the assessment to some degree, it is reasonable to omit “Statistics” from the components of “Computer Content Analysis” because the computerized approach focuses mainly on exploring text structures rather than numerical facts or data.

The third academic field born of this theoretical fusion is “Statistical Machine Learning” (a.k.a. statistical learning theory), which is defined as a unification of statistics and the computational sciences. “Statistical Machine Learning” generally refers to self-training computer systems for “the automated detection of meaningful patterns in data (Shai, 2014, p. 7).” Recently, this kind of technology has been widely used to create various applications we use every day. For example, search engines such as Google, Yahoo and MSN, and similar portal sites, are built on this methodical approach. Furthermore, “Statistical Machine Learning” has the distinctive feature that it can carry out complicated tasks through its ability to “learn” and “adapt”. Such capacity can be achieved only by “Statistical Machine Learning”, because it is impossible for human engineers to program a system by hand for every task that would need an explicit and detailed specification. Thus, the mechanism also plays a crucial role in Artificial Intelligence, which performs operations analogous to learning and decision making in humans. As the name suggests, “Statistical Machine Learning” has a deep relationship with “Statistics”. In the book “An Introduction to Statistical Learning”, the authors say that “statistical learning (“Statistical Machine Learning”) has emerged as a new subfield in statistics, focused on supervised and unsupervised modeling and prediction (James, 2013).”

As shown above, “Sentimental Analysis” is a three-part composite built on three basic elements: “Content Analysis”, “Statistics” and “Computer Science”. Moreover, it is strongly influenced by derivatives such as “Quantitative Content Analysis”, “Computer Content Analysis” and “Statistical Machine Learning”. In other words, “Sentimental Analysis” represents the fruit of long-term research in various fields.

In summary, “Content Analysis”, “Statistics” and “Computer Science” are each distinct, stand-alone areas. Nevertheless, by combining with each other, they have given rise to different genres. As a result, three subsidiaries and one next-generation technology have emerged: “Quantitative Content Analysis”, “Computer Content Analysis”, “Statistical Machine Learning” and “Sentimental Analysis”. As for the three subsidiaries, it might be difficult to draw strict boundaries between them because all of them are influenced to some extent by the three basic parent elements. However, in this thesis, it is convenient to treat them as independent entities in order to grasp the latent functions of “Sentimental Analysis” systematically.

Wednesday, April 5, 2017

2-2. Previous Research

Likewise, many researchers confirm that the 2003 paper by Nasukawa and Yi (Nasukawa & Yi) was the first to use the term “Sentimental analysis”.[1] In the paper, they show how to extract sentiments and divide them into “Positive” and “Negative”. Unlike the previous two examples, the study treats sentences from various domains, from news articles to camera reviews on web pages, in order to pave the way for a new linguistic approach. In the last part of the paper, they concluded that the more difficult the target documents become, as with news articles and descriptions on official organizational web pages, the lower the precision rate goes. The main method the paper used for sentimental analysis of texts is a part-of-speech tagger (POS tagger). A POS tagger “is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc.”[2] It seems that this technology has been widely adopted by cognitive services and applications.

It is also noteworthy that both authors of this paper were IBM staff. Intriguingly, “Shivakumar Vaithyanathan”, one of the authors of the previously mentioned paper, was also on the staff of an IBM research center. The untiring efforts of such people may be one of the reasons that IBM has recently been regarded as a leading company in cognitive systems, owing to the creation of the revolutionary computer system “Watson”, which offers many APIs, including machine learning techniques, to companies and interested individuals. Among its features, IBM Watson provides “AlchemyLanguage”, which enables us to analyze text and helps us understand its sentiment, keywords, entities, high-level concepts and more. The work by Nasukawa and Yi and the feats of other pioneers probably contributed greatly to the progress of “AlchemyLanguage”.

The title is a crucial part of news articles and books, which need to attract potential readers’ interest. This short piece of text has also been an active field to which many researchers have devoted themselves. According to Charles Grivel, title functions can be divided into three categories: “to identify the work”, “to designate its topic” and “to make the book stand out.” There seems to be general acceptance of this classification of title functions (Genette, 1997, p. 69). On the other hand, some have objected to his theory. For instance, Gerard Genette, a French literary theorist, insisted that Grivel’s title functions are “not necessarily all fulfilled at the same time”. Furthermore, he added that while the first one is obligatory, the other two are “optional or supplementary (Genette, 1997, p. 69).” In addition, Genette asserted that the role of the title should be divided into “thematic” and “formal”. These days, the combination of Text analysis and Sentimental analysis has led to new discoveries. For one thing, it has been found that negative news headlines attract much more reader attention, which corresponds to the third title function Grivel identified, than positive ones (“Psychology: Why bad”, 2014). This surprising fact, hidden in plain sight, was revealed thanks to the technical collaboration. The same partnership has also been applied to assess media trends on particular social issues. The paper led by Erik Bleich (Bleich, Nisar, and Abdelhamid, 2015) uncovered a distinctive media tendency in news coverage of Islam and Muslims by conducting a sentimental investigation of New York Times headlines related to those topics. That study contributes greatly to this research. In fact, some of the methods this paper uses are drawn from Bleich’s study.

The rapid increase in Internet penetration has enabled researchers to collect the massive data required to measure social phenomena. The technique is now used not only by scholars who hope to verify their theories, but also by companies that want to monitor customers’ feelings or complaints about their products more accurately. In recent years, the heated demand for analysis of people’s latent thoughts, together with the accumulated legacy of earlier work, has greatly accelerated technological innovation. As a result, the use of Sentimental analysis has broadened hugely, from finance to politics.

Financial themes have always been a popular subject in “Sentimental analysis”. So far, numerous economic papers connected to emotional evaluation have been proposed. It seems that the field of finance quickly accepted this new approach because the markets have a long history of valuing miscellaneous data for predicting stock movements. Such data are called “Fundamentals”[3], and “Sentimental analysis” has blended smoothly with them. In fact, as this thesis mentioned earlier, the leading paper on “Sentimental analysis” was about stock prediction using this method (Yahoo! for Amazon, 2004). Since then, many theories have been proposed concerning the evaluation of market trends.

The 2010 paper by Johan Bollen and his team (Bollen & Huina & Xiao-Jun, 2010) is widely cited: according to “Google Scholar”, it has been cited more than 2,000 times by other studies. In their research, they showed that the sentiment of posts on Twitter is clearly related to the value of the Dow Jones Industrial Average (DJIA). They used two mood-tracking tools for their survey: one is “OpinionFinder”, which measures positive versus negative mood, and the other is “Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy)”. Ultimately, they reported an accuracy as high as 86.7% in predicting the daily up and down movements of the DJIA.

The applications of “Sentimental Analysis” extend to tangled political matters such as opinion polling, which is one of the topics of this thesis. Nowadays, social networking services are the primary venues for estimating political sentiment among the public. Among the many communication services, Twitter is presumably chosen most often for analysis because of its distinctive 140-character limit on every post. This limit enabled researchers to collect and inspect the target materials more easily and quickly, because those texts were much shorter and clearer than the text of conventional blogs or homepages. (However, Twitter announced in May 2016 that it would relax this limit.) What is more, according to research conducted by Tom Jackson and Martin Sykora, a professor and a lecturer at Loughborough University, analyzing emotions on Twitter pointed to Donald Trump’s victory over Hillary Clinton in the 2016 United States presidential election (Jackson & Sykora, 2016). This striking finding hints not only at a new dawn for the outmoded classic polling system but also at the possibility of innovation in other long-established customs. Traditional theories of voter psychology such as the “bandwagon effect” and the “underdog effect” might be refined further by combining them with “Sentimental Analysis”.

Monday, April 3, 2017

2-1. Previous Research

In principle, Sentimental Analysis can be categorized as an offshoot of the Text Analysis field. Similarly, Text Analysis belongs to Content Analysis. Thus, seeking the origin of Sentimental analysis requires going back to its root theories. Philipp Mayring, Professor of Psychology at Alps Adria University Klagenfurt, lists Bible interpretation as one of the precursors of Content Analysis (Mayring & Klagenfurt, 2008, para. 6). For many centuries, every word in both the Old and New Testaments has been scrutinized thoroughly, not only by scholars but also by ordinary individuals. Though the motivation ranges widely from protecting the faith to attacking it, there have surely never been other books in human history so fully examined for such various purposes and from such diverse angles. For instance, the comparison between Luke and Acts has been a popular subject in efforts to confirm or deny the connections between them (Walters, 2009). Moreover, some critics have pointed out that the preface of Luke resembles the writing style of traditional Greek-speaking historians, such as the opening statement of Josephus' "Against Apion" (Allison A. Trites, William J. Larkin, 2016). As an aside, this custom was presumably established out of fear of the demagogues who agitated the people of ancient Greece and led to its collapse. These studies clearly show how text analysis has been used in the theological world.

As for more recent text analysis, John Wesley, the 18th-century Anglican cleric, leader of Methodism and spiritual founder of Aoyama Gakuin, proposed a primitive method for understanding the true meaning of the Holy Book, saying:

Many biblical texts are intertexts, composed with other biblical texts in mind and heart, and still other texts, unknown or unintended by the author, that come to the interpreter’s mind in canonical context. The talented interpreter listens for echoes of other biblical texts, however low their volume, and looks for allusions, however dim their reflection, that link biblical texts together, the one glossing and thickening the meaning of the other. (How to Read, p.43)

This explanation can be understood as a form of Content analysis applied to the Bible’s stories, which contain numerous intricate passages that require careful exegesis.

However, it took a long time for Content analysis to be systematically compiled. In the mid-20th century, Bernard Berelson was the first to publish a book on Content analysis, “Content analysis in Communication Research” (1952), which was presumably influenced by the earlier work of Paul F. Lazarsfeld and Harold D. Lasswell in the 1920s and 1930s, and by efforts such as the U.S. government-sponsored project under the directorship of Harold Lasswell to evaluate enemy propaganda during World War II (Content Analysis A method, p.1). After that, the technique began to be employed by many scholars for assessment in various fields from politics to finance.

As for “Sentimental analysis” and its essential function, “Text mining”, their emergence is much more recent than that of those ancestors. According to Bing Liu, a professor at the University of Illinois at Chicago (UIC), few investigations using this method were conducted before 2000, because the immature network environment yielded poor data mining results. However, since 2000, thanks to the rise of information technology, this field has become one of the major research areas, with rapidly developing computer software (Sentiment Analysis and, p.10). Various reports on this approach have appeared since then. For example, the 2001 paper by Sanjiv Das and Mike Chen (Yahoo! for Amazon, 2004) is regarded by researchers in the field as one of the pioneering works in this category (Lee, 2008, para. 1). In the paper, they collected investors’ postings from stock message boards and assessed how sentiment is affected by management announcements, press releases, third-party news, regulatory changes, and the like. The study notably employs five algorithms as classifiers for each message: Naive Classifier, Vector Distance Classifier, Discriminant-Based Classifier, Adjective-Adverb Phrase Classifier and Bayesian Classifier. Moreover, they created additional programs to collect data and help those classifiers evaluate the text. It is remarkable that even the earliest stage of “Sentimental analysis” research already relied on computer-based applications. Their division of opinions into three groups, bullish (optimistic), bearish (pessimistic) and neutral (comprising either spam or messages that are neither bullish nor bearish), is also impressive, though these days papers on “Sentimental analysis” seem to prefer the terms “Positive”, “Negative” and “Neutral” instead. Not only their methods but also their research results are quite interesting.

Figures 1 and 2 from the paper show a close correlation between the sentiment indexes calculated from messages on the “Yahoo” boards and the actual stock prices of Apple on 18 October 2000 and 20 October 2000. Surprisingly, Figure 2 suggests that the sentiment index signals the imminent rise of the stock price slightly before it actually happens. In contrast, Figure 3, which shows the same comparison for Amazon on 11 December 2000, reveals the opposite outcome, in which the two trends move inversely to each other. Interestingly, the authors defend this unfavorable result by saying “On other days, such as for Amazon on 11th December, 2000 (Figure 3), there appears to be almost no relationship between sentiment and stock price, was a precursor to stock price change (Yahoo! for Amazon, 2004, P. 7).”
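One common way to quantify how closely a sentiment index tracks a stock price is the Pearson correlation coefficient. The sketch below is not the procedure used by Das and Chen; the function `pearson` and all the numbers are hypothetical and serve only to show what "moving together" and "moving inversely" look like numerically.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical intraday readings (not data from the paper).
sentiment = [0.1, 0.3, 0.2, 0.5, 0.6]
price_same_direction = [20.0, 21.0, 20.5, 22.0, 22.5]
price_opposite_direction = [22.5, 21.5, 22.0, 20.5, 20.0]

print(pearson(sentiment, price_same_direction))      # close to +1
print(pearson(sentiment, price_opposite_direction))  # close to -1
```

A coefficient near +1 corresponds to the pattern described for the first two figures, while a coefficient near -1 corresponds to two trends that move in opposite directions.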

The 2002 paper by Bo Pang, Lillian Lee and Shivakumar Vaithyanathan (“Thumbs up?” 2002) is also viewed as an early sentimental analysis work (Structured Models). In the paper, they take movie reviews and decide whether each is “Positive” or “Negative”. In addition, they compare humans and algorithms on accuracy in distinguishing positive from negative words. One of the distinctive features of the paper is that it employs “machine learning techniques”. According to “WhatIs.com”, machine learning means “a type of artificial intelligence that provides computers with the ability to learn without being explicitly programmed”.[1] Such machine learning algorithms can train themselves on given data and specialize in the analysis of any field, depending on the type of information. Three different machine learning methods are used in this research: “Naive Bayes classification”, “maximum entropy classification” and “support vector machines”. Each of them is a traditional analysis method that has long been studied in other fields. Another interesting point is that, contrary to general expectation, the paper shows that human intuition is much inferior to these standard machine learning techniques at analyzing a text’s sentimental elements, such as positive or negative words. Figure 4 from the paper clearly suggests that the algorithms’ accuracy exceeds the humans’.
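To give a feel for the first of the three methods, the following is a minimal sketch of a Naive Bayes text classifier with add-one (Laplace) smoothing. It is not the implementation used in the paper: the class `NaiveBayes`, the whitespace tokenization and the four toy training reviews are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior: how common this label is in the training data.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # skip words never seen during training
                # Log likelihood with add-one smoothing.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training reviews, invented for this example.
reviews = [
    "a wonderful and moving film",
    "brilliant acting and a great story",
    "a dull and boring plot",
    "terrible acting and a weak story",
]
labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes()
model.fit(reviews, labels)
print(model.predict("a great and moving story"))
print(model.predict("a boring story with weak acting"))
```

The classifier "learns" only by counting words per label, which is why its behavior depends entirely on the training data it is given.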

[1] http://whatis.techtarget.com/definition/machine-learning

[2] http://www.kimschouten.com/papers/paper-discussion-nasukawa-yi-2003/

http://www.ijcaonline.org/archives/volume150/number6/alhojely-2016-ijca-911545.pdf

Saturday, April 1, 2017

1. Introduction

In the 2016 U.S. presidential election, Donald Trump was, surprisingly, elected president. This result astonished many people because earlier polls had predicted Trump’s loss (Kirk & Scott, 2016). These days, gauging public sentiment by conventional means such as telephone interviews or face-to-face surveys seems to be much more difficult than before, owing to the growing complexity of social factors such as the widespread use of mobile phones, hidden distrust of media coverage and concerns about privacy violation. One media report attributed the unexpected victory largely to hidden supporters who were unwilling to express their leaning toward Trump (“Trump campaign,” 2016). By contrast, analyzing social networking services may have significant potential to reveal people’s true sentiments on particular topics, which might change the style of surveys in the near future.

Nowadays, not only the U.S. but many other countries are forced to deal with unprecedented and sometimes unstable developments. For example, America’s old friend, the U.K., also experienced a painful separation from the EU called “Brexit”, whose realization almost nobody had anticipated until the results came in. It is said that even some of its avid supporters could not have imagined the result, and their subsequent regret has been called “Bregret” (Dearden, 2016).

It is not too much to say that the world is now facing the rapid arrival of a new order. What, if anything, is the main cause of these unpredictable events? Generally, big changes stem from big impulses. If so, there must be some emerging force that created the current unstable situation. This thesis measures how the Syrian refugee crisis, from early 2015 onward, has affected U.K. society, in order to glimpse the overall influence of the calamity.

When thinking about displaced people, it is important to note that media coverage has played a major role in spreading awareness of those events around the world. This means the media have the power not only to report the problem, but also to shape public images and opinions, which sometimes change the course of political decisions on particular matters.

The main objective of this paper is to find out how media trends have changed in response to critical events and what kind of influence the shift seems to have on public reactions toward refugees and politics. To appraise the media trend, articles on the BBC News website were analyzed with three sentimental analysis tools. To verify those effects, the “Syrian refugee crisis” was chosen as the primary topic for evaluation because the tragic event has several notable features. One is that recently, especially in European countries, there have been serious concerns about immigrants from Muslim nations. What is more, some people there feel that those immigrants might entirely change their cultures on the basis of sharia law. On the other hand, there is always heartfelt support from local citizens for the newcomers. The coexistence of these two groups seems to create further problems in their countries that are worth scrutinizing.

The motivation for choosing these subjects comes from the fact that many people these days have begun to think that media reports are not reliable enough (“Further decline,” 2012). Therefore, the “Syrian refugee crisis” could be a perfect example of how public reactions are affected by media coverage. Moreover, this historic tragedy involves many issues that the world is facing today, for example multiculturalism, Islamization, anti-immigrant sentiment and similar social matters. The aim of this study is chiefly to show how public emotions toward the refugee crisis have changed over time and what influence media coverage has had on them. In addition, the reliability of the sentimental analysis tools, which were developed to analyze various social tendencies including media trends, was evaluated as well.
