A note on extracting 'sentiments' in financial news in English, Arabic ...
A note on extracting 'sentiments' in financial news in English, Arabic ...
A note on extracting 'sentiments' in financial news in English, Arabic ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu<br />
Yousif Almas<br />
Department of Comput<strong>in</strong>g<br />
University of Surrey<br />
Guildford, Surrey, GU2 7XH, UK<br />
y.almas@surrey.ac.uk<br />
Khurshid Ahmad<br />
Department of Computer Science<br />
Tr<strong>in</strong>ity College,<br />
Dubl<strong>in</strong>-2. Ireland<br />
kahmad@cs.tcd.ie<br />
Abstract<br />
The growth of specialist texts <strong>in</strong> languages<br />
used by hundreds of milli<strong>on</strong>s of<br />
people, say, <strong>in</strong> the Middle East (<strong>Arabic</strong>)<br />
and <strong>in</strong> Pakistan (Urdu), provides benchmarks<br />
for the Anglo-centric <strong>in</strong>formati<strong>on</strong><br />
extracti<strong>on</strong> and polarity analysis systems.<br />
Public sentiment, openly expressed or<br />
surreptitiously recorded, is be<strong>in</strong>g analysed<br />
by us<strong>in</strong>g methods and techniques of<br />
computati<strong>on</strong>al l<strong>in</strong>guistics <strong>on</strong> documents<br />
compris<strong>in</strong>g everyday usage of general<br />
language, especially teleph<strong>on</strong>e c<strong>on</strong>versati<strong>on</strong>s<br />
and <strong>news</strong> reports. It is important to<br />
evaluate whether studies <strong>in</strong> the use of<br />
language <strong>in</strong> special registers, say f<strong>in</strong>ancial<br />
trad<strong>in</strong>g, may benefit sentiment analysis:<br />
sentiment analysis or c<strong>on</strong>sumer/<strong>in</strong>stituti<strong>on</strong>al<br />
c<strong>on</strong>fidence is measured<br />
through op<strong>in</strong>i<strong>on</strong> polls <strong>on</strong> the basis<br />
that <strong>in</strong>vestors occasi<strong>on</strong>ally <strong>in</strong>dulge <strong>in</strong><br />
exuberant behaviour and make decisi<strong>on</strong>s<br />
that are irrati<strong>on</strong>al or have bounded rati<strong>on</strong>ality.<br />
We show that the noti<strong>on</strong> of<br />
lexical signature of a specialism together<br />
with the existence of local grammars<br />
with<strong>in</strong> a specialist register, rooted <strong>in</strong> collocati<strong>on</strong>al<br />
patterns, will help <strong>in</strong> identify<strong>in</strong>g<br />
and deploy<strong>in</strong>g the signature and patterns<br />
for detect<strong>in</strong>g sentiment that is<br />
openly expressed <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong>. Our<br />
study orig<strong>in</strong>ates <strong>in</strong> the analysis of (British)<br />
<strong>English</strong> f<strong>in</strong>ancial texts (c. 3.5 milli<strong>on</strong><br />
tokens) and the sentiment-bear<strong>in</strong>g patterns<br />
found <strong>in</strong> these texts can help to extract<br />
sentiment bear<strong>in</strong>g patterns <strong>in</strong> <strong>Arabic</strong><br />
(c. 3.5 milli<strong>on</strong>) and <strong>in</strong> Urdu (c. 1 milli<strong>on</strong>).<br />
1 Introducti<strong>on</strong><br />
There are two senses of the term sentiment<br />
analysis. One which is used <strong>in</strong> computati<strong>on</strong>al<br />
l<strong>in</strong>guistics: ‘sentiment analysis is the task of<br />
identify<strong>in</strong>g positive and negative op<strong>in</strong>i<strong>on</strong>s, emoti<strong>on</strong>s,<br />
and evaluati<strong>on</strong>s’ (Wils<strong>on</strong> et al., 2005). The<br />
sec<strong>on</strong>d sense relates to the percepti<strong>on</strong> of the behaviour<br />
of human be<strong>in</strong>gs by other human be<strong>in</strong>gs<br />
either with<strong>in</strong> an office envir<strong>on</strong>ment (Sim<strong>on</strong>,<br />
1979), or with<strong>in</strong> a f<strong>in</strong>ancial market (Kahneman,<br />
2003), or <strong>in</strong>deed the percepti<strong>on</strong> of the worth of a<br />
film or c<strong>on</strong>sumer goods (Turney, 2002). It is<br />
well known that the otherwise rati<strong>on</strong>al traders<br />
and <strong>in</strong>vestors <strong>in</strong> a f<strong>in</strong>ancial market do <strong>in</strong>dulge <strong>in</strong><br />
irrati<strong>on</strong>al behaviour (Shiller, 2000) lead<strong>in</strong>g to<br />
the metaphoric and literal boom and bust c<strong>on</strong>diti<strong>on</strong>s<br />
<strong>in</strong> such markets. The analysis of the prices<br />
and trad<strong>in</strong>g volumes of shares and commodities,<br />
or that of composite <strong>in</strong>dices, is typically carried<br />
out by c<strong>on</strong>duct<strong>in</strong>g fundamental and/or technical<br />
analysis – both based <strong>on</strong> quantitative data,<br />
mathematical models and statistical techniques.<br />
Over the last ten years or so a third form of<br />
analysis is tak<strong>in</strong>g root – sentiment analysis.<br />
Fundamental analysis is ‘a method of security<br />
valuati<strong>on</strong> which <strong>in</strong>volves exam<strong>in</strong><strong>in</strong>g the company's<br />
f<strong>in</strong>ancials and operati<strong>on</strong>s, especially sales,<br />
earn<strong>in</strong>gs, growth potential, assets, debt, management,<br />
products, and competiti<strong>on</strong>.’ Technical<br />
analysis is ‘a method of evaluat<strong>in</strong>g securities by<br />
rely<strong>in</strong>g <strong>on</strong> the assumpti<strong>on</strong> that market data, such<br />
as charts of price, volume, [..which], can help<br />
predict future (usually short-term) market trends.<br />
[..] [It is assumed that] market psychology <strong>in</strong>fluences<br />
trad<strong>in</strong>g <strong>in</strong> a way that enables predict<strong>in</strong>g<br />
when a stock will rise or fall.’<br />
Sentiment analysis is a related branch of technical<br />
analysis and sentiment is def<strong>in</strong>ed as ‘a<br />
measurement of the mood of a given <strong>in</strong>vestor or<br />
the overall <strong>in</strong>vest<strong>in</strong>g public, either bullish or<br />
bearish.’(Def<strong>in</strong>iti<strong>on</strong>s from www.<strong>in</strong>vestorwords.com)<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007, L<strong>in</strong>guistic<br />
Society of America, 2007, pp 1 – 12<br />
1
Ever s<strong>in</strong>ce Maynard Keynes suggested that<br />
there are ‘animal spirits’ <strong>in</strong> the markets, ‘ec<strong>on</strong>omists<br />
have devoted substantial attenti<strong>on</strong> to try<strong>in</strong>g<br />
to understand the determ<strong>in</strong>ants of wild movements<br />
<strong>in</strong> stock market prices that are seem<strong>in</strong>gly<br />
unjustified by fundamentals’ (Tetlock, 2007).<br />
Keynes’s sentiment (sic) were echoed by the<br />
Nobel Laureates Herbert Sim<strong>on</strong> (Prize awarded<br />
1978) and Daniel Kahneman (Prize awarded<br />
2002), when they dem<strong>on</strong>strated that limited rati<strong>on</strong>ality<br />
behaviour, that is mak<strong>in</strong>g decisi<strong>on</strong>s that<br />
are c<strong>on</strong>tradicted by facts <strong>on</strong> the ground, characterises<br />
the behaviour of <strong>in</strong>dividual and <strong>in</strong>stituti<strong>on</strong>al<br />
<strong>in</strong>vestors. This irrati<strong>on</strong>al exuberance is<br />
be<strong>in</strong>g studied systematically to produce a ‘sentiment<br />
<strong>in</strong>dex’ at Yale 1 for US and Japanese markets,<br />
and is used <strong>in</strong> produc<strong>in</strong>g ‘c<strong>on</strong>sumer c<strong>on</strong>fidence’<br />
<strong>in</strong>dices across the world.<br />
Robert Engle, another Nobel Laureate (Prize<br />
awarded 2003), has argued that the impact of<br />
<strong>news</strong> <strong>on</strong> f<strong>in</strong>ancial markets can be substantial and<br />
asymmetric <strong>in</strong> that ‘bad <strong>news</strong>’ has a l<strong>on</strong>ger last<strong>in</strong>g<br />
effect <strong>on</strong> prices, and particularly <strong>on</strong> volumes<br />
of shares traded, when compared to the effect of<br />
‘good <strong>news</strong>’ (Engle, 2004). Ec<strong>on</strong>omists typically<br />
use proxies for the <strong>news</strong> c<strong>on</strong>tent – change<br />
<strong>in</strong> the values of currencies, b<strong>on</strong>ds, or aggregated<br />
<strong>in</strong>dices like the Dow J<strong>on</strong>es and NASDAQ, especially<br />
changes that are co<strong>in</strong>cident with the various<br />
fiscal and m<strong>on</strong>etary announcement by nati<strong>on</strong>al<br />
and trans-nati<strong>on</strong>al f<strong>in</strong>ancial and m<strong>on</strong>etary<br />
authorities. Over the last 15 years or so, however,<br />
there is an <strong>in</strong>creas<strong>in</strong>g trend for actually analys<strong>in</strong>g<br />
the c<strong>on</strong>tents of f<strong>in</strong>ancial <strong>news</strong> items, and<br />
f<strong>in</strong>ancial traders’ blogs and e-mails (Antweiler<br />
and Frank, 2004; Hardie and Mackenzie 2007<br />
and references there<strong>in</strong>) with the help of a security-specific<br />
lexic<strong>on</strong> of positive and negative<br />
words or phrases together with a list of the<br />
names of securities (see for example, Surdenau et<br />
al, 2003; Chan and Wai 2005; Lan et al., 2005;<br />
Das and Chen, 2006). The lexic<strong>on</strong> c<strong>on</strong>ta<strong>in</strong>s<br />
metaphorical expressi<strong>on</strong>s – spatial metaphors of<br />
movement like up/down and rise/fall, animal<br />
metaphors of aggressi<strong>on</strong> and submissi<strong>on</strong>, like<br />
bears/bulls, health metaphors of vitality/sickness.<br />
In many studies (e.g. Tetlock et al., 2005; Tetlock,<br />
2007), the analysis of <strong>news</strong> c<strong>on</strong>tents shows<br />
that there is a degree of correlati<strong>on</strong> between the<br />
use of negative sentiment bear<strong>in</strong>g terms and the<br />
downward trends <strong>on</strong> or displayed by the prices of<br />
securities. The security-specific a priori lexic<strong>on</strong><br />
and the pre-selected choice of patterns means<br />
1 available at http://icf.som.yale.edu/c<strong>on</strong>fidence.<strong>in</strong>dex<br />
that the techniques used by the authors cannot be<br />
transferred to other securities and are <strong>in</strong> a sense<br />
analyst-dependent. As almost all the studies of<br />
sentiment analysis focus <strong>on</strong> <strong>news</strong> reports <strong>in</strong> <strong>English</strong><br />
language – so the transference of techniques<br />
developed thus far to other typologically similar<br />
languages <strong>in</strong> some degree (say, Urdu) and typologically<br />
dist<strong>in</strong>ct languages (say <strong>Arabic</strong>) may not<br />
be straightforward if at all possible.<br />
Perhaps, even if these patterns were generalisable<br />
over the f<strong>in</strong>ancial doma<strong>in</strong>, the metaphors<br />
may not transfer as well across languages. Furthermore,<br />
if the pre-dom<strong>in</strong>ant trad<strong>in</strong>g changes<br />
from f<strong>in</strong>ancial <strong>in</strong>struments, e.g. shares, currencies,<br />
b<strong>on</strong>ds, to commodities, will the patterns<br />
then survive? The overarch<strong>in</strong>g questi<strong>on</strong>s we ask<br />
are as follows: First, how to c<strong>on</strong>struct the lexic<strong>on</strong><br />
for a given security (shares for <strong>in</strong>stance) or a<br />
given commodity (oil or rice for example). One<br />
corollary to this questi<strong>on</strong> is whether a corpus of<br />
specialist text will help <strong>in</strong> creati<strong>on</strong> of such a lexic<strong>on</strong>.<br />
Another corollary is whether the noti<strong>on</strong> of<br />
collocati<strong>on</strong> and local grammar – c<strong>on</strong>structs <strong>on</strong>ly<br />
found <strong>in</strong> <strong>on</strong>e or few registers (Gross, 1993 and<br />
1997) will provide us with extended phrasal patterns<br />
that comprise sentiment words <strong>in</strong> the lexic<strong>on</strong>,<br />
Sec<strong>on</strong>d, we wish to exam<strong>in</strong>e the veracity of<br />
observati<strong>on</strong>s <strong>in</strong> the language-for-specialpurposes,<br />
and <strong>in</strong> the literature <strong>on</strong> corpus-based<br />
lexicography and term<strong>in</strong>ology, that doma<strong>in</strong>specific<br />
term<strong>in</strong>ology is to a large extent language<br />
<strong>in</strong>dependent. Both these questi<strong>on</strong>s may help<br />
workers <strong>in</strong> computati<strong>on</strong>al l<strong>in</strong>guistics who are<br />
otherwise focussed <strong>on</strong> general language texts, the<br />
Wordnet, the <strong>English</strong> language, and methods and<br />
techniques of ‘universal’ grammars of language.<br />
In order to answer the questi<strong>on</strong>s above, we follow<br />
the methodology <strong>in</strong> theoretical and computati<strong>on</strong>al<br />
l<strong>in</strong>guistics. Specifically, when a l<strong>in</strong>guist<br />
proposes and defends his or her theory about an<br />
aspect of language <strong>in</strong> general, he or she studies a<br />
specific language. These proposals are then<br />
tested <strong>on</strong> other languages by others. This c<strong>on</strong>trastive<br />
approach to the study of language has an<br />
important advantage for us: f<strong>in</strong>ancial <strong>news</strong>, and<br />
the c<strong>on</strong>comitant political <strong>news</strong>, is now published<br />
and broadcast widely not <strong>on</strong>ly <strong>in</strong> <strong>English</strong> but <strong>in</strong><br />
other languages as well. So, if <strong>on</strong>e could c<strong>on</strong>struct<br />
a lexic<strong>on</strong> of sentiment words, and a grammar<br />
that governs the pattern formati<strong>on</strong> of such<br />
words, <strong>in</strong> <strong>on</strong>e language and dem<strong>on</strong>strate that the<br />
method works <strong>in</strong> another language, then we have<br />
a validati<strong>on</strong> <strong>on</strong> the <strong>on</strong>e hand and an applicati<strong>on</strong><br />
<strong>on</strong> the other. We beg<strong>in</strong> by giv<strong>in</strong>g the reas<strong>on</strong>s for<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
2
why we have chosen three languages – <strong>English</strong>,<br />
<strong>Arabic</strong> and Urdu. Then we describe a method<br />
for automatically <strong>extract<strong>in</strong>g</strong> specialist terms, both<br />
s<strong>in</strong>gle terms and compound terms together with a<br />
method for isolat<strong>in</strong>g patterns governed <strong>in</strong>corporat<strong>in</strong>g<br />
these words that are idiosyncratic of the<br />
specialism – the so called local grammar. It was<br />
perhaps Zellig Harris who had implicitly described<br />
a local grammar when he talked about<br />
‘selecti<strong>on</strong>al restricti<strong>on</strong>s’ <strong>on</strong> doma<strong>in</strong> specific<br />
terms. We show how another c<strong>on</strong>trastive approach,<br />
where we compare the behaviour of s<strong>in</strong>gle<br />
and compound tokens <strong>in</strong> specialist and general<br />
language corpora to collect the evidence as<br />
to whether a token is behav<strong>in</strong>g like a term or not.<br />
Here the provisi<strong>on</strong> of a representative sample of<br />
general language texts is of key import, as, say,<br />
is the case for British <strong>English</strong> where we have the<br />
British Nati<strong>on</strong>al Corpus; the specialist corpora<br />
can be c<strong>on</strong>structed easily, say through the use of<br />
texts produced by major f<strong>in</strong>ancial <strong>in</strong>formati<strong>on</strong><br />
vendors like Reuters, Bloomberg or Dow J<strong>on</strong>es.<br />
We describe the local grammar patterns for identify<strong>in</strong>g<br />
sentiment-bear<strong>in</strong>g <strong>in</strong>formati<strong>on</strong> <strong>in</strong> the<br />
three languages. We c<strong>on</strong>clude by not<strong>in</strong>g that<br />
when applied to an unseen text our method has a<br />
high precisi<strong>on</strong> but low recall. The low recall can<br />
be improved by user feedback.<br />
2 F<strong>in</strong>ancial Language?<br />
2.1 F<strong>in</strong>ancial News <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> &<br />
Urdu<br />
The use of <strong>English</strong> as an ‘<strong>in</strong>ternati<strong>on</strong>al language’<br />
of bus<strong>in</strong>ess, science and technology is <strong>in</strong>disputable:<br />
major f<strong>in</strong>ancial <strong>news</strong> vendors, Reuters, AP,<br />
BBC, Bloomberg, are focussed <strong>on</strong> <strong>English</strong> and it<br />
has been argued that their ‘other languages’ output<br />
is ma<strong>in</strong>ly translati<strong>on</strong> from <strong>English</strong>; publishers<br />
<strong>in</strong> science do produce the bulk of their output <strong>in</strong><br />
<strong>English</strong>. The advent of new players, like the<br />
New Ch<strong>in</strong>a News Agency, is currently Anglocentric<br />
but has a c<strong>on</strong>siderable output <strong>in</strong> Ch<strong>in</strong>ese<br />
and <strong>Arabic</strong> as well.<br />
The lexical resources for text analysis (e.g.<br />
electr<strong>on</strong>ic dicti<strong>on</strong>aries and thesauri) and for sentiment/polarity<br />
analysis (‘sentiment dicti<strong>on</strong>aries’<br />
used <strong>in</strong> qualitative text analysis programs like<br />
Harvard University’s General Inquirer (St<strong>on</strong>e et<br />
al., 1966) and the WordNet (Fellbaum, 1998)),<br />
together with a host of <strong>in</strong>formati<strong>on</strong> extracti<strong>on</strong><br />
systems and resources, reported <strong>in</strong> the TREC<br />
proceed<strong>in</strong>gs and elsewhere, and syntactic and<br />
morphological analysis, provides a str<strong>on</strong>g basis<br />
for c<strong>on</strong>duct<strong>in</strong>g sentiment analysis <strong>in</strong> <strong>English</strong>.<br />
There is no need for segmentati<strong>on</strong> – white space<br />
separates lexical units neatly- and morphology of<br />
the <strong>English</strong> language does not have the complexity<br />
of Semitic and Indo-Persian languages. Furthermore,<br />
<strong>English</strong> has been standardised to a<br />
greater or lesser extent and there are ‘representative’<br />
general language corpora for the language<br />
that are easily accessible.<br />
The rise of the electr<strong>on</strong>ic media has led to the<br />
availability of f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>Arabic</strong> and <strong>in</strong>creas<strong>in</strong>gly<br />
so <strong>in</strong> Urdu. This is especially true of<br />
the reportage <strong>in</strong> nati<strong>on</strong>al and regi<strong>on</strong>al media.<br />
Given the strategic importance of energy exports<br />
from the Middle East –c.60% of world oil reserves<br />
and c<strong>on</strong>tribut<strong>in</strong>g to an estimated c.$280<br />
billi<strong>on</strong> worth of current-account surplus <strong>in</strong> 2006<br />
trade balance of oil-export<strong>in</strong>g middle-east<br />
emerg<strong>in</strong>g ec<strong>on</strong>omies - <strong>news</strong> about oil tends to<br />
dom<strong>in</strong>ate f<strong>in</strong>ancial <strong>news</strong> output <strong>in</strong> <strong>Arabic</strong> across<br />
the <strong>Arabic</strong> speak<strong>in</strong>g world (estimates between<br />
186 and 422 milli<strong>on</strong> speakers accord<strong>in</strong>g to<br />
Wikipedia). As the f<strong>in</strong>ancial markets <strong>in</strong> the regi<strong>on</strong><br />
c<strong>on</strong>t<strong>in</strong>ue to mature, there is c<strong>on</strong>comitant<br />
<strong>news</strong> about securities and currencies <strong>in</strong> <strong>Arabic</strong>.<br />
Currently, market sentiment report<strong>in</strong>g <strong>in</strong> the<br />
<strong>Arabic</strong> speak<strong>in</strong>g world is c<strong>on</strong>ducted and reported<br />
by f<strong>in</strong>ancial <strong>in</strong>stituti<strong>on</strong>s and there are references<br />
to c<strong>on</strong>sumer and <strong>in</strong>vestor sentiment <strong>in</strong> the reports<br />
of regulatory authorities like Reserve or State<br />
Banks; the report<strong>in</strong>g is <strong>in</strong> <strong>English</strong> but the data is<br />
collected <strong>in</strong> <strong>Arabic</strong> and <strong>English</strong>.<br />
The sentiment analysis will benefit from the<br />
named entity (NE) extracti<strong>on</strong> systems reported <strong>in</strong><br />
the literature (see, for example, Florian et al.,<br />
2004; Abuleil, 2004 and Zitouni et al., 2005 for<br />
<strong>Arabic</strong> NE extracti<strong>on</strong>). Despite the best efforts<br />
of the L<strong>in</strong>guistic Data C<strong>on</strong>sortium and the European<br />
Languages Research Associati<strong>on</strong>, and <strong>in</strong> the<br />
absence of Pan Arab <strong>in</strong>itiatives, there is an urgent<br />
need of lexical resources, especially dicti<strong>on</strong>aries/thesauri<br />
of sentiment words, and software<br />
systems for analys<strong>in</strong>g <strong>Arabic</strong> texts. Moreover,<br />
there is <strong>on</strong>e emerg<strong>in</strong>g standardised versi<strong>on</strong><br />
of <strong>Arabic</strong> (Modern Standard <strong>Arabic</strong>) used by the<br />
media and the compliance is not as universal as<br />
is, say, the case for <strong>English</strong>.<br />
Urdu is a language spoken by around 100 milli<strong>on</strong><br />
people, and if <strong>on</strong>e is look<strong>in</strong>g at spoken language,<br />
then the <strong>in</strong>ter-<strong>in</strong>telligibility of H<strong>in</strong>di and<br />
Urdu br<strong>in</strong>gs the total to around 300-milli<strong>on</strong> people.<br />
The language is written us<strong>in</strong>g a modified<br />
<strong>Arabic</strong> script (Perso-<strong>Arabic</strong> script) and borrows a<br />
sizeable vocabulary from <strong>Arabic</strong>. Pakistan (nati<strong>on</strong>al<br />
language: Urdu) is <strong>on</strong>e of the emergent<br />
f<strong>in</strong>ancial markets, where the stock markets go<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
3
through amaz<strong>in</strong>g peaks and troughs, and as it is a<br />
language of a large Diaspora, liv<strong>in</strong>g across the<br />
world, there is a fairly sophisticated audience for<br />
political and f<strong>in</strong>ancial <strong>news</strong> and analysis. Commodity<br />
trad<strong>in</strong>g, especially cott<strong>on</strong>, wheat and rice,<br />
together with foreign exchange remittances (c.3-<br />
4 Billi<strong>on</strong> dollars yearly) from the Diaspora,<br />
dom<strong>in</strong>ate the f<strong>in</strong>ancial <strong>news</strong> published, <strong>in</strong> pr<strong>in</strong>t<br />
and electr<strong>on</strong>ically, ma<strong>in</strong>ly by quality Urdu <strong>news</strong>papers<br />
and the BBC and Voice of America electr<strong>on</strong>ically.<br />
Urdu has received even less attenti<strong>on</strong> from the<br />
computati<strong>on</strong>al l<strong>in</strong>guistics, corpus l<strong>in</strong>guistics and<br />
<strong>in</strong>formati<strong>on</strong> extracti<strong>on</strong> communities than is the<br />
case with <strong>Arabic</strong>. The key excepti<strong>on</strong> is the<br />
EU/Xerox-sp<strong>on</strong>sored ParGram (Parallel Grammar)<br />
project (Butt et al., 2002). The key objective<br />
of ParGram was to build ‘a s<strong>in</strong>gle grammar<br />
development platform and a unified methodology<br />
of grammar writ<strong>in</strong>g to develop large-scale<br />
grammars for typologically different languages’.<br />
Initially, the researchers built grammars for three<br />
typologically similar European grammars and<br />
subsequently built grammars for Urdu and Japanese<br />
with a degree of success. The basic framework<br />
of Butt et al. is a lexical-functi<strong>on</strong>al grammar<br />
and they report <strong>on</strong> idiosyncratic phenomen<strong>on</strong><br />
<strong>in</strong> Urdu: ‘complex predicati<strong>on</strong> and rampant<br />
pro-drop’ (Butt and K<strong>in</strong>g, 2002:4) – ‘pro-drop <strong>in</strong><br />
Urdu correlates with discourse strategies: c<strong>on</strong>t<strong>in</strong>u<strong>in</strong>g<br />
topics and known background <strong>in</strong>formati<strong>on</strong><br />
tend to be dropped’ (ibid). The prodropp<strong>in</strong>g<br />
will cause problems <strong>in</strong> semantic analysis<br />
<strong>in</strong> general and <strong>in</strong> automatic sentiment analysis<br />
<strong>in</strong> particular.<br />
David Graddol (1997) published ‘The Future<br />
of <strong>English</strong>’ study which forecasts that by the<br />
middle of the twenty-first century the <strong>English</strong><br />
language will not hold its ‘m<strong>on</strong>opoly’ positi<strong>on</strong> <strong>in</strong><br />
the world. Instead, there will be an ‘oligopoly’ of<br />
several world major languages. Namely Ch<strong>in</strong>ese,<br />
H<strong>in</strong>di/Urdu, Spanish and <strong>Arabic</strong> each with a<br />
degree of <strong>in</strong>fluence <strong>in</strong> additi<strong>on</strong> to <strong>English</strong> which<br />
will rema<strong>in</strong> as a l<strong>in</strong>gua franca and the language<br />
of <strong>in</strong>ternati<strong>on</strong>al communicati<strong>on</strong> and science.<br />
2.2 Special Language of F<strong>in</strong>ance<br />
About 65 years ago, Stanley Gerr, a philosopher<br />
of science, argued that an evolv<strong>in</strong>g scientific language<br />
is characterised by <strong>in</strong>creas<strong>in</strong>gly larger and<br />
‘complex vocabulary’ and ‘by a progressive reducti<strong>on</strong><br />
<strong>in</strong> syntactic complexity’ (Gerr,<br />
1943:151). Zellig Harris <strong>in</strong>troduced to theoretical<br />
l<strong>in</strong>guists the new branch of ‘sublanguages’ because<br />
he was curious about the use of language<br />
<strong>in</strong> proof and argumentati<strong>on</strong> (Harris, 1991). Halliday<br />
and Mart<strong>in</strong> have <str<strong>on</strong>g>note</str<strong>on</strong>g>d that when a specialist<br />
creates a discourse <strong>in</strong>, say <strong>English</strong> or <strong>Arabic</strong>,<br />
then he or she ‘borrow[s] the resources that already<br />
existed <strong>in</strong> <strong>English</strong> [<strong>Arabic</strong>]’, especially<br />
nom<strong>in</strong>al elements, <strong>in</strong>clud<strong>in</strong>g technical tax<strong>on</strong>omies,<br />
and verbal elements that relate to the<br />
nom<strong>in</strong>alised processes (Halliday and Mart<strong>in</strong>,<br />
1993:63). A special language is a ‘l<strong>in</strong>guistic subsystem<br />
<strong>in</strong>tended for unambiguous communicati<strong>on</strong><br />
<strong>in</strong> a particular subject field us<strong>in</strong>g a term<strong>in</strong>ology<br />
and other l<strong>in</strong>guistic means.’ (ISO, 1990).<br />
M<strong>in</strong>dful of the negative c<strong>on</strong>notati<strong>on</strong> of excessive<br />
use of specialist terms, major <strong>news</strong> vendors<br />
have style books that emphasise clarity and simplicity:<br />
Reuters journalists are advised they ‘must<br />
use simple and straightforward <strong>English</strong> both for<br />
subscribers read<strong>in</strong>g ec<strong>on</strong>omic <strong>news</strong> <strong>on</strong> screen<br />
term<strong>in</strong>als under great pressure and for media<br />
subscribers of which a majority translate our service<br />
<strong>in</strong>to a language other than <strong>English</strong>.’ (Mac-<br />
Dowall, 1995). Furthermore, timel<strong>in</strong>ess is<br />
stressed because <strong>news</strong> is a ‘perishable’ resource<br />
(F<strong>in</strong>k, 2000).<br />
The shared c<strong>on</strong>cepts and objects <strong>in</strong> the f<strong>in</strong>ancial<br />
doma<strong>in</strong> facilitate shar<strong>in</strong>g of <strong>in</strong>formati<strong>on</strong><br />
across languages. ‘Media <strong>Arabic</strong>’ and ‘Media<br />
Urdu’ are <strong>in</strong>fluenced by external causes, namely<br />
by the global dom<strong>in</strong>ance of western media particularly<br />
<strong>in</strong> <strong>English</strong> where many <strong>news</strong> stories <strong>in</strong><br />
<strong>Arabic</strong> and Urdu are translated quickly from<br />
<strong>English</strong>. Other <strong>in</strong>fluences are lexical and phraseological.<br />
There are loan translati<strong>on</strong>s of phrases <strong>in</strong><br />
ec<strong>on</strong>omics from European languages to <strong>Arabic</strong><br />
العملة al-sa’ba, like hard currency (al-‘umla<br />
and (سيولة نقدية naqdiya, cash flow (seyula ,(الصعبة<br />
(تعويم الجنيه al-junayh, float<strong>in</strong>g the pound (ta’wim<br />
(Holes, 2004). In the case of Urdu there are<br />
many loan transliterati<strong>on</strong>s from <strong>English</strong> like<br />
.(اسٹاک) and stock (کمپنی) company<br />
2.3 Local Grammar of F<strong>in</strong>ancial Language<br />
Maurice Gross, follow<strong>in</strong>g the footsteps of Zellig<br />
Harris has used aspects of transformati<strong>on</strong>al<br />
grammar to analyse some idiosyncratic use of<br />
languages; e.g. <strong>in</strong> idiomatic language, and has<br />
argued that there exist selecti<strong>on</strong>al restricti<strong>on</strong>s<br />
which exist locally without reference to a higher<br />
level and more universal grammar rules and can<br />
therefore be regarded as local-grammars (Gross,<br />
1997). Corpus l<strong>in</strong>guists use a simple and scientific<br />
approach to language that stems from the<br />
lexic<strong>on</strong> <strong>in</strong> describ<strong>in</strong>g syntax <strong>in</strong>stead of abstracti<strong>on</strong>s<br />
and formal representati<strong>on</strong>s, local grammars<br />
can describe lexico-grammatical phenomena that<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
4
can not be abstracted <strong>in</strong> general syntactic rules<br />
(Laporte, 2004).<br />
It has been argued that local grammar patterns<br />
can describe syntactic structures as l<strong>in</strong>ear sequences<br />
compris<strong>in</strong>g lexical and abstract elements,<br />
<strong>in</strong> each local grammar an element can be<br />
fixed, variable or opti<strong>on</strong>al. A local grammar can<br />
capture the c<strong>on</strong>textual properties of lexical items<br />
exhaustively regardless of the wider material<br />
form<strong>in</strong>g the sentences and describe l<strong>in</strong>guistic<br />
c<strong>on</strong>stituents that look like idiomatic phrases<br />
more than general sentences, hence they can be<br />
useful for process<strong>in</strong>g c<strong>on</strong>stra<strong>in</strong>ed texts from specialised<br />
doma<strong>in</strong>s and tasks that depend <strong>on</strong> term<strong>in</strong>ology<br />
process<strong>in</strong>g (Balvet, 2002).<br />
The table below shows the ‘selecti<strong>on</strong>al restricti<strong>on</strong>s’<br />
placed by percent <strong>on</strong> ‘rose by’ <strong>in</strong> much the<br />
same way as the selecti<strong>on</strong>al restricti<strong>on</strong>s are<br />
placed by the appearance of the name of the day<br />
<strong>in</strong> the way we tell dates <strong>in</strong> day/m<strong>on</strong>th/year for<br />
example (Gross, 1993):<br />
News Reports<br />
General Language<br />
Average weekly earn<strong>in</strong>gs rose by 0.1 Dual Loyalties. A Rose By Another<br />
percent <strong>in</strong> January to $577.64 Other Name.<br />
(www.bls.gov/<strong>news</strong>.release/cpi.toc.htm) (www.counterpunch.org)<br />
D<strong>on</strong>ati<strong>on</strong>s to human-services organizati<strong>on</strong>s<br />
rose by 28 percent, to $25.4- Another Name.<br />
“Learn<strong>in</strong>g Disability”: A Rose by<br />
billi<strong>on</strong> (philanthropy.com/free/update) (naturalchild.org/jan_hunt/learn<strong>in</strong>g.html)<br />
Poverty levels rose by 43 Percent <strong>in</strong><br />
Last Decade<br />
(allafrica.com/stories/200405210072.html)<br />
The value of positi<strong>on</strong>ed fund […],<br />
rose by 29.6 percent<br />
(stats.gov.cn/english/<strong>news</strong>andcom<strong>in</strong>gevents)<br />
“A rose by any other name…” Macworld<br />
Expo has come and g<strong>on</strong>e.<br />
(www.atpm.com)<br />
A Rose By Any Other Color Photo of<br />
dozen red roses<br />
(www.shopp<strong>in</strong>gblog.com/cgi-b<strong>in</strong>/sblog)<br />
Table 1. The use of ‘rose by’ <strong>in</strong> <strong>news</strong> reports and<br />
general language.<br />
The use of ‘percent’ tends to c<strong>on</strong>stra<strong>in</strong> the use<br />
of ‘rose by’ <strong>in</strong> <strong>news</strong> stories <strong>in</strong> a way that is not<br />
seen <strong>in</strong> general language.<br />
2.4 Metaphors and Sentiment<br />
Metaphor essentially means to move from <strong>on</strong>e<br />
place to the other. The physical noti<strong>on</strong>s of spatial<br />
locati<strong>on</strong>, and that of velocity and accelerati<strong>on</strong>,<br />
related to tangible objects are used to describe<br />
entities as <strong>in</strong>tangible and as complex as ec<strong>on</strong>omies,<br />
commodities and f<strong>in</strong>ancial <strong>in</strong>struments (see<br />
Table 2).<br />
However, culture/envir<strong>on</strong>ment specific metaphors,<br />
bear and bull markets, used extensively <strong>in</strong><br />
the <strong>English</strong> special language of f<strong>in</strong>ance, are virtually<br />
not used <strong>in</strong> <strong>Arabic</strong> special language of f<strong>in</strong>ance;<br />
similar evidence exists for Urdu. The<br />
word (hamoor, (هامور which is a large fish found<br />
<strong>in</strong> the shores of the Gulf States is used metaphorically<br />
<strong>in</strong> <strong>Arabic</strong> (especially editorials) to refer<br />
to ‘greedy’ big <strong>in</strong>vestors who are perceived<br />
negatively as be<strong>in</strong>g able to move the markets up<br />
or down by buy<strong>in</strong>g or sell<strong>in</strong>g large quantities of<br />
stocks affect<strong>in</strong>g smaller <strong>in</strong>vestors.<br />
Source<br />
Physics/<br />
Movement<br />
Positive<br />
Negative<br />
<strong>English</strong> <strong>Arabic</strong> <strong>English</strong> <strong>Arabic</strong><br />
accelerate<br />
تسارع tasaro‘, slowdown تباطؤ tabatoa, up<br />
فوق fawq, down تحت tahat, rise<br />
ارتفاع ertifa‘, fall انخفاض enkifaad, ascent<br />
صعودso’od descent هبوط , hoboot healthy<br />
صحي sihay, sick مريض mareed, str<strong>on</strong>g<br />
قوي qawi, Weak ضعيف dh‘eef, grow<br />
نمو nomow, stagnate رآود rookood, vital<br />
حيوي hayawi, puny هزيل hazeel, Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
5<br />
Biological<br />
Other<br />
فقاعة foqa‘a, bubble آرة ثلج thalj, snowball korat<br />
انهيار enhiyar, bust طفرة tafra, boom<br />
Table 2. Metaphors used <strong>in</strong> <strong>English</strong> and <strong>Arabic</strong><br />
f<strong>in</strong>ancial report<strong>in</strong>g (<strong>Arabic</strong> is written right to left).<br />
3 Method<br />
Our method relies (a) <strong>on</strong> the Quirkian noti<strong>on</strong> that<br />
the acceptability of a l<strong>in</strong>guistic unit can be typically<br />
correlated with the frequency of the use of<br />
that unit (Quirk et al, 1985), and (b) <strong>on</strong> the observati<strong>on</strong>s<br />
of John Firth, ref<strong>in</strong>ed substantially by<br />
the late John S<strong>in</strong>clair, that the collocati<strong>on</strong> of a<br />
l<strong>in</strong>guistic unit will <strong>in</strong>form us about its properties.<br />
A special language vocabulary item, especially<br />
the name of a key c<strong>on</strong>cept or the name of an important<br />
process or object, will dom<strong>in</strong>ate the writ<strong>in</strong>gs<br />
with<strong>in</strong> a specialism. This frequent item is<br />
seldom used <strong>on</strong> its own and is found either as a<br />
part of a compound or with<strong>in</strong> syntactic c<strong>on</strong>structs<br />
not found elsewhere <strong>in</strong> the general language. It<br />
turns out that these items place selecti<strong>on</strong>al restricti<strong>on</strong>s<br />
with<strong>in</strong> a doma<strong>in</strong> special texts.<br />
We derive the local-grammar us<strong>in</strong>g a bottomup<br />
approach which is based <strong>on</strong> and expands/modifies<br />
the work of Ahmad et al. (2006)<br />
and Almas and Ahmad (2006) al<strong>on</strong>g with new<br />
experiments and evaluati<strong>on</strong> where the distributi<strong>on</strong>al<br />
differences of a word <strong>in</strong> special and general<br />
language corpora is taken as a measure of its<br />
significance for a given doma<strong>in</strong>. The differences<br />
are measured at the level of s<strong>in</strong>gle word frequencies<br />
first and s<strong>in</strong>gle keywords are selected. Subsequently<br />
collocati<strong>on</strong>s of the keywords are selected<br />
us<strong>in</strong>g frequency-based criteri<strong>on</strong>. The<br />
wider collocati<strong>on</strong>al patterns are then written as<br />
regular expressi<strong>on</strong>s and can be used to parse unseen<br />
texts us<strong>in</strong>g pattern match<strong>in</strong>g (see Figure 1).<br />
To illustrate how the algorithm work we show<br />
the results obta<strong>in</strong>ed from runn<strong>in</strong>g the algorithm<br />
<strong>on</strong> two comparable <strong>English</strong> and <strong>Arabic</strong> f<strong>in</strong>ancial<br />
corpora of 2.75 milli<strong>on</strong> tokens and a smaller 1.03<br />
milli<strong>on</strong> tokens Urdu f<strong>in</strong>ancial corpus,
Analysis<br />
1. ANALYSE a special language corpus (SL, compris<strong>in</strong>g Nspecial tokens) as<br />
<strong>in</strong>put.<br />
i. USE a frequency list of s<strong>in</strong>gle words from a corpus of texts used <strong>in</strong><br />
day-to-day communicati<strong>on</strong>s (S G compris<strong>in</strong>g N general tokens):<br />
( N g ) w1<br />
w2<br />
w3<br />
w<br />
F<br />
general<br />
= { fg<br />
, fg<br />
, fg<br />
... fg<br />
n }<br />
ii. CREATE a frequency ordered list of words <strong>in</strong> SL texts:<br />
N s ) w1<br />
w2<br />
w3<br />
F = { f , f , f<br />
... f<br />
( wn<br />
special s s s s<br />
iii. COMPUTE the differences <strong>in</strong> the distributi<strong>on</strong> of the same words (weirdness)<br />
<strong>in</strong> SG and SL:<br />
g<br />
⎛<br />
=<br />
⎜<br />
f<br />
W<br />
s<br />
s<br />
( wi<br />
)<br />
⎝<br />
wi<br />
⎞ ⎛<br />
∗ N<br />
w<br />
⎟ ⎜<br />
i<br />
f<br />
g ⎠ ⎝<br />
iv. CALCULATE z-score for the Fspeical and W:<br />
general<br />
wi<br />
g<br />
g<br />
w f<br />
s<br />
− f<br />
s<br />
g<br />
ws<br />
( wi<br />
) − w<br />
i<br />
s<br />
z(<br />
f<br />
s<br />
) =<br />
z(<br />
w<br />
f<br />
s<br />
( wi<br />
) =<br />
w<br />
σ<br />
s<br />
σ<br />
s<br />
Identificati<strong>on</strong><br />
2. IDENTIFY keywords and their c<strong>on</strong>texts<br />
i. CREATE KEY a set of N key keywords ordered accord<strong>in</strong>g to the magnitude<br />
of the above two z-scores:<br />
For i = 1to<br />
N<br />
j = j + 1<br />
key(<br />
j)<br />
= w<br />
N<br />
wi<br />
g<br />
if [ z(<br />
f ) > t ⊗ z(<br />
w ( w ) > t ] then<br />
s<br />
s<br />
1<br />
i<br />
}<br />
special<br />
end if<br />
Next i<br />
ii. EXTRACT collocates of each keyword <strong>in</strong> S L over a w<strong>in</strong>dow of M word<br />
neighbourhood:<br />
collocate(<br />
w ) = { f<br />
i<br />
−M<br />
wi<br />
... f<br />
s<br />
, f<br />
i<br />
, f<br />
2<br />
, f<br />
, f<br />
⎞<br />
⎟<br />
⎠<br />
, f<br />
... f<br />
−3 −2<br />
−1<br />
1 2 3 M<br />
wi<br />
wi<br />
wi<br />
wi<br />
wi<br />
wi<br />
wi<br />
iii. COMPUTE the strength of each collocate <strong>in</strong> each positi<strong>on</strong> us<strong>in</strong>g three<br />
measures due to Smadja (1994):U-score, k-score and p-strength<br />
iv. EXTRACT N-grams <strong>in</strong> the corpus that comprise highly collocat<strong>in</strong>g keywords<br />
(U,k,p) ≥ (10,1,1) with weirdness z-score > t 3<br />
v. FORM sub corpus<br />
key<br />
s<br />
j<br />
L '<br />
for each keyj<br />
Representati<strong>on</strong><br />
3. REPRESENT the discovered patterns of each key j as regular expressi<strong>on</strong>s<br />
i.<br />
key j<br />
COMPUTE the frequency of every word <strong>in</strong> s<br />
L '<br />
ii. REPLACE words with frequency z-score less than a threshold t4 <strong>in</strong><br />
key<br />
s<br />
j<br />
L '<br />
by a place marker , merge c<strong>on</strong>tiguous place markers<br />
iii. DISCOVER l<strong>on</strong>gest possible patterns where each pattern =<br />
-N -1<br />
1 N<br />
{ w …,<br />
w , key<br />
j<br />
, w ,…<br />
w } nucleat<strong>in</strong>g around keyj with frequency<br />
threshold > t5<br />
iv. PRUNE patterns by remov<strong>in</strong>g any preced<strong>in</strong>g or trail<strong>in</strong>g place markers.<br />
Figure 1. Pattern discovery algorithm.<br />
Table 3 shows the properties of the keyword<br />
percent <strong>in</strong> the three corpora, percent is the first<br />
candidate keyword selected from the <strong>English</strong><br />
corpus, the third from the <strong>Arabic</strong> corpus and the<br />
10 th <strong>in</strong> the case of Urdu when us<strong>in</strong>g the British<br />
Nati<strong>on</strong>al Corpus (BNC), the Modern Standard<br />
<strong>Arabic</strong> Corpus (MSAC) (Almas and Ahmad,<br />
2007) and Urdu <strong>news</strong> corpus as general language<br />
(reference) corpora.<br />
Candidate Accepted Rejecti<strong>on</strong><br />
Language f Z f Z w<br />
Collocates Collocates Ratio<br />
<strong>English</strong> 25,622 20.48 0.80 6,162 93 98.5%<br />
<strong>Arabic</strong> 17,809 27.06 2.87 6,566 40 99.4%<br />
Urdu 2,469 4.01 0.10 2,071 55 97.3%<br />
Table 3. Properties of the keyword percent <strong>in</strong> the<br />
<strong>English</strong>, <strong>Arabic</strong> and Urdu corpora (f = frequency, Z<br />
= Z-Score, W = Weirdness).<br />
}<br />
The ratio of accepted collocates over a w<strong>in</strong>dow<br />
of five words <strong>in</strong> Table 3 shows that <strong>Arabic</strong><br />
has the highest degree of rejecti<strong>on</strong> (99.4%). The<br />
total number of <strong>English</strong> collocates are 93 compared<br />
to <strong>on</strong>ly 40 <strong>in</strong> <strong>Arabic</strong>, this is due to the<br />
complex morphology <strong>in</strong> <strong>Arabic</strong> where lexical<br />
items have more surface forms compared to <strong>English</strong>.<br />
The smaller Urdu corpus shows the lowest<br />
rejecti<strong>on</strong> ratio (97.3%).<br />
Metaphorical movement words that are related<br />
to sentiments <strong>in</strong> f<strong>in</strong>ancial texts appear <strong>in</strong> the<br />
three languages <strong>in</strong> similar c<strong>on</strong>texts around the<br />
keyword percent. Instruments menti<strong>on</strong>ed are<br />
ma<strong>in</strong>ly shares and currencies. A comparis<strong>on</strong> between<br />
the collocates of percent <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong><br />
and Urdu is presented <strong>in</strong> Table 4, it categorises<br />
the str<strong>on</strong>g collocates based <strong>on</strong> their mean<strong>in</strong>g<br />
and shows the difference <strong>in</strong> their distributi<strong>on</strong><br />
across the three corpora.<br />
Category Language<br />
Instruments<br />
Metaphoric<br />
movement<br />
Collocate<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
6<br />
words<br />
<strong>English</strong><br />
<strong>Arabic</strong><br />
Shares<br />
$<br />
pounds<br />
Yen<br />
(دولار (dollar, dollar<br />
(سهم (saham, share<br />
(جنيه (jeneah, pound<br />
(النفط (al-neft, the-oil<br />
Left<br />
f<br />
1,796<br />
98<br />
182<br />
15<br />
421<br />
126<br />
379<br />
140<br />
Right<br />
f<br />
188<br />
880<br />
667<br />
607<br />
256<br />
425<br />
157<br />
221<br />
U-<br />
Score<br />
79,674<br />
21,012<br />
24,239<br />
27,993<br />
7,141<br />
6,424<br />
5,573<br />
1,380<br />
K-<br />
Score<br />
8.61<br />
4.18<br />
3.61<br />
2.70<br />
2.50<br />
1.20<br />
1.17<br />
1.11<br />
41 24 66.25 1.02 (حصص (hassas), Urdu shares<br />
<strong>English</strong><br />
<strong>Arabic</strong><br />
Urdu<br />
up<br />
rose<br />
rise<br />
fell<br />
Down<br />
(مرتفعا (mortafi‘n, rise<br />
(ارتفع,‘ertif) rose<br />
(منخفضا (m<strong>on</strong>kafidan, fall<br />
(اضافہ (izaafaa, <strong>in</strong>crease<br />
(کمی (kami, decrease/fell<br />
(زائد (zaied, exceed<br />
(بڑه,barh) <strong>in</strong>crease<br />
(زيادہ (ziyada, more<br />
3,143<br />
2,146<br />
283<br />
1,252<br />
1,413<br />
20<br />
54<br />
14<br />
288<br />
106<br />
77<br />
65<br />
57<br />
201 516,487<br />
72 248,725<br />
941 65,063<br />
60 85,773<br />
133 116,197<br />
502<br />
284<br />
321<br />
15<br />
9<br />
11<br />
18<br />
6<br />
16,213<br />
2,460<br />
6,445<br />
3,464<br />
443<br />
174<br />
150<br />
153<br />
6.15<br />
4.05<br />
2.20<br />
2.37<br />
2.80<br />
1.13<br />
1.04<br />
1.03<br />
6.37<br />
2.23<br />
1.70<br />
1.60<br />
1.16<br />
Table 4. Str<strong>on</strong>g and similar collocates of percent<br />
<strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> and Urdu (analysis under a<br />
w<strong>in</strong>dow of 5 words and exclud<strong>in</strong>g punctuati<strong>on</strong><br />
and numbers).<br />
Many of the <strong>in</strong>stances of metaphorical words<br />
occur around percent (almeaa, (المئة 2 <strong>in</strong> <strong>Arabic</strong><br />
with<strong>in</strong> distances more than the typical w<strong>in</strong>dow of<br />
five words used <strong>in</strong> <strong>English</strong> collocati<strong>on</strong> analysis<br />
and so they do not appear as str<strong>on</strong>g collocates, <strong>in</strong><br />
<strong>Arabic</strong> we may need a wider w<strong>in</strong>dow because of<br />
l<strong>on</strong>ger distance co-occurrences between words.<br />
<strong>Arabic</strong> is predom<strong>in</strong>ately a VSO language unlike<br />
2 The mean<strong>in</strong>g of percent <strong>in</strong> <strong>Arabic</strong> is created by<br />
preced<strong>in</strong>g the word ‘hundredth’ (al-meaa, (المئة with<br />
the word ‘<strong>in</strong>’ (phee, (في or the proclitic ‘with/by’ (be,<br />
.(ب
<strong>English</strong> which usually follows an SVO word order.<br />
For example, the proper noun phrases of<br />
company names (subjects) <strong>in</strong> lead sentences frequently<br />
<strong>in</strong>tervene between the occurrences of the<br />
token percent and metaphorical words:<br />
138 في <strong>Arabic</strong>:<br />
Gloss: percent by-ratio Gulf Cement profit rise<br />
Translati<strong>on</strong>: Gulf Cement profit up 138 percent<br />
أرباح ارتفاع<br />
اسمنت الخليج<br />
بنسبة<br />
المئة<br />
News items titles <strong>in</strong> <strong>Arabic</strong> frequently follow<br />
an SVO word order as <strong>in</strong> <strong>English</strong> <strong>in</strong> order to emphasise<br />
the subject. Urdu is usually written us<strong>in</strong>g<br />
a SOV word order and hence there exist more<br />
collocati<strong>on</strong>al patterns that are similar to <strong>English</strong><br />
but <strong>in</strong> the reverse order where the object occurs<br />
adjacent or close to the verb.<br />
After normalis<strong>in</strong>g the N-grams (step 3-ii <strong>in</strong><br />
Figure 1), the metaphorical words appear more<br />
proximate to percent, and the reduced N-grams<br />
are the pr<strong>in</strong>ciple <strong>in</strong>put to the representati<strong>on</strong> phase<br />
were patterns are discovered. Table 5 shows examples<br />
of the discovered patterns <strong>in</strong> <strong>Arabic</strong> and<br />
Urdu.<br />
Pattern f Example Sentence<br />
*<br />
percent<br />
<strong>in</strong>crease<br />
* <strong>in</strong>crease percent<br />
توقع زيادة إنفاق الطاقة بأمريكا في<br />
المئة عام<br />
زيادة<br />
ارتفاع<br />
في<br />
المئة في<br />
*<br />
percent<br />
* rise percent<br />
المئة<br />
rise<br />
#<br />
<strong>in</strong> percent rose<br />
rose # percent <strong>in</strong><br />
<br />
324 It is expected that the US c<strong>on</strong>sumpti<strong>on</strong><br />
of energy will <strong>in</strong>crease <strong>in</strong> the<br />
year <br />
وقالت أديداس انها تتوقع ارتفاع أرباحها<br />
السنوية بنسبة لا تقل عن في المئة<br />
عن أرباح العام الماضي التي بلغت <br />
مليون يورو<br />
ارتفعت<br />
في المئة في<br />
<br />
318 And Adidas said they expect that<br />
their yearly profit will rise but not<br />
less than percent compared to<br />
last year profits that were <br />
milli<strong>on</strong> euros<br />
60<br />
وآان البنك أعلن أن أرباحه ارتفعت بنسبة<br />
في المئة في الاشهر التسعة الاولى من<br />
العام الماضي<br />
فيصد<br />
اضافہ<br />
<strong>in</strong>crease percent<br />
<strong>in</strong>creased percent<br />
<br />
fell percent<br />
fell percent<br />
<br />
And the bank announced that their<br />
profits rose by percent <strong>in</strong> the<br />
first n<strong>in</strong>e m<strong>on</strong>ths of the last year<br />
ٹيکسٹائل مصنوعات کی برآمدات ميں<br />
فيصد اضافہ ہو گيا<br />
فيصد<br />
کمی<br />
…<br />
181<br />
[The] export of textile goods <strong>in</strong>creased<br />
percent<br />
68<br />
…لائٹ ڈيزل کی پيداوار ميں فيصد<br />
کمی ہوئی…<br />
producti<strong>on</strong> of light diesel fell by<br />
percent<br />
Table 5. Examples of discovered patterns with<br />
the keyword percent <strong>in</strong> <strong>Arabic</strong> and Urdu (* = multi<br />
word slot, # = s<strong>in</strong>gle word slot).<br />
Many of the extracted patterns <strong>in</strong> <strong>English</strong> had<br />
similar equivalent patterns <strong>in</strong> <strong>Arabic</strong> and usually<br />
are related to the rise and fall of shares, <strong>in</strong>dices<br />
and other <strong>in</strong>struments:<br />
<strong>English</strong> f <strong>Arabic</strong> f<br />
the * <strong>in</strong>dex was up <br />
percent<br />
shares <strong>in</strong> * rose <br />
percent<br />
was down percent<br />
at <br />
the * <strong>in</strong>dex was down<br />
percent<br />
shares <strong>in</strong> # were up<br />
percent at<br />
81<br />
93<br />
118<br />
*<br />
percent <strong>in</strong>dex <strong>in</strong>creased 355<br />
ارتفع<br />
زاد<br />
مؤشر<br />
في المئة<br />
سهم * الى <br />
to percent share rose<br />
منخفضا<br />
في المئة<br />
في المئة الى نقطة<br />
po<strong>in</strong>t to percent fall<strong>in</strong>g<br />
مؤشر * منخفضا #<br />
64<br />
58<br />
percent po<strong>in</strong>t fall<strong>in</strong>g <strong>in</strong>dex<br />
زاد<br />
نقطة<br />
في المئة<br />
في المئة<br />
سهم * الى <br />
40<br />
55<br />
to percent share <strong>in</strong>creased<br />
Table 6. Examples of similar patterns <strong>in</strong> <strong>English</strong><br />
and <strong>Arabic</strong>.<br />
4 Experiments<br />
We use LoLo ( ) (Almas and Ahmad, 2007), a<br />
system which c<strong>on</strong>ta<strong>in</strong>s comp<strong>on</strong>ents for analys<strong>in</strong>g<br />
specialist texts, particularly <strong>Arabic</strong> script-based<br />
languages and <strong>English</strong>. The Analyser was used<br />
for discover<strong>in</strong>g doma<strong>in</strong> specific patterns from the<br />
f<strong>in</strong>ancial corpora, the Editor for validat<strong>in</strong>g and<br />
tagg<strong>in</strong>g the patterns as either positive or negative,<br />
the Extractor for automatically <strong>extract<strong>in</strong>g</strong><br />
and labell<strong>in</strong>g sentiment bear<strong>in</strong>g sentences <strong>in</strong> unseen<br />
corpora.<br />
4.1 Data<br />
لؤلؤ<br />
Two 3.52 milli<strong>on</strong> token corpora of f<strong>in</strong>ancial<br />
<strong>news</strong>wire published by Reuters UK and Reuters<br />
<strong>Arabic</strong> services between 2005 and 2006 were<br />
compiled. Reuters covers the Middle East well<br />
and we have surveyed many <strong>Arabic</strong> <strong>news</strong>papers<br />
and web portals and found Reuters to be a major<br />
provider of local and <strong>in</strong>ternati<strong>on</strong>al f<strong>in</strong>ancial<br />
<strong>news</strong>. For Urdu, we have so far compiled a 1.03<br />
milli<strong>on</strong> token corpus compris<strong>in</strong>g f<strong>in</strong>ancial <strong>news</strong><br />
published between 2006 and 2007 by a major<br />
daily <strong>news</strong>paper <strong>in</strong> Pakistan. For <strong>extract<strong>in</strong>g</strong> terms<br />
<strong>in</strong> the three languages, three comparable corpora<br />
compris<strong>in</strong>g 2.9 milli<strong>on</strong> tokens of top n<strong>on</strong>f<strong>in</strong>ancial<br />
<strong>news</strong> were compiled.<br />
Table 7 shows the relati<strong>on</strong>ship between types<br />
and tokens; we understand this is a vexed issue.<br />
But for the purpose of compil<strong>in</strong>g corpora, it turns<br />
out that <strong>in</strong> similar size n<strong>on</strong> f<strong>in</strong>ancial <strong>news</strong> corpora<br />
<strong>Arabic</strong> has the highest type/token ratio<br />
(3.31%) whereas <strong>English</strong> has the lowest (1.38%)<br />
and Urdu closer to <strong>English</strong> with a ratio of<br />
(1.93%).<br />
64<br />
58<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
7
Corpus Language Texts Tokens Types Ratio<br />
F<strong>in</strong>ancial<br />
Tra<strong>in</strong><strong>in</strong>g<br />
Test<br />
News<br />
<strong>English</strong> 9,917 3.52 M 40,232 1.14<br />
<strong>Arabic</strong> 20,642 3.52 M 85,716 2.43<br />
Urdu 4,700 1.03 M 21,360 2.07<br />
<strong>English</strong> 7,710 2.75 M 35,903 1.30<br />
<strong>Arabic</strong> 16,374 2.75 M 75,277 2.73<br />
Urdu 4,700 1.03 M 21,360 2.07<br />
<strong>English</strong> 2,207 0.77 M 20,029 2.60<br />
<strong>Arabic</strong> 4,268 0.77 M 40,705 5.28<br />
Urdu - - - -<br />
<strong>English</strong> 6,495 2.95 M 40,963 1.38<br />
<strong>Arabic</strong> 9,750 2.94 M 97,388 3.31<br />
Urdu 11,004 2,95 M 5,7081 1.93<br />
Table 7. Corpora used <strong>in</strong> the experiments (ratio =<br />
types/tokens × 100).<br />
4.2 Tra<strong>in</strong><strong>in</strong>g and Test<strong>in</strong>g 3<br />
Before calculat<strong>in</strong>g the z-scores of the tokens, for<br />
pre-select<strong>in</strong>g keywords, we have excluded the<br />
words with frequency of <strong>on</strong>e (Hapax Legomena)<br />
from the frequency lists of the f<strong>in</strong>ancial corpora,<br />
the exclusi<strong>on</strong> of the Hapax reduced the vocabulary<br />
<strong>in</strong> the tra<strong>in</strong><strong>in</strong>g corpora frequency list by<br />
28.16% (10,112 words) for <strong>English</strong> and by<br />
38.35% (28,873 words) for <strong>Arabic</strong>, the difference<br />
<strong>in</strong> the Hapax Legomena could be another<br />
sign of the higher lexical diversity found <strong>in</strong> the<br />
<strong>Arabic</strong> corpus compared to <strong>English</strong>. The Urdu<br />
corpus had the highest number of s<strong>in</strong>gle occurr<strong>in</strong>g<br />
words compris<strong>in</strong>g 43.29% (9247 words) of<br />
all the words found <strong>in</strong> the corpus, but 7% of<br />
those s<strong>in</strong>gle occurr<strong>in</strong>g words were <strong>English</strong> words<br />
written <strong>in</strong> Lat<strong>in</strong> letters and ma<strong>in</strong>ly denot<strong>in</strong>g<br />
company names or f<strong>in</strong>ancial terms accompany<strong>in</strong>g<br />
Urdu translati<strong>on</strong>s.<br />
First we have compared the keywords selected<br />
<strong>in</strong> the tra<strong>in</strong><strong>in</strong>g corpora us<strong>in</strong>g the BNC and the<br />
MSAC with Reuters <strong>English</strong> and Reuters <strong>Arabic</strong><br />
top <strong>news</strong> corpora. We observe that us<strong>in</strong>g Reuters<br />
top <strong>news</strong> for weirdness analysis promotes more<br />
doma<strong>in</strong> specific and productive words, especially<br />
for <strong>Arabic</strong>.<br />
The MSAC is not as representative for <strong>Arabic</strong><br />
as the BNC is for <strong>English</strong>: the balance of genre<br />
and the other attributes <strong>in</strong> the MSAC is not as<br />
f<strong>in</strong>ely tuned perhaps as it is the case <strong>in</strong> the BNC.<br />
Words that are not doma<strong>in</strong> specific like ‘U.S.’,<br />
‘Reuters’, and ‘<strong>in</strong>ternet’ appear as candidate<br />
words when us<strong>in</strong>g the BNC but this is not the<br />
case when us<strong>in</strong>g Reuters UK top <strong>news</strong> which<br />
comprises newer words and words that are used<br />
frequently <strong>in</strong> <strong>news</strong> report<strong>in</strong>g and with the same<br />
spell<strong>in</strong>g. In <strong>Arabic</strong>, general words like ‘day’<br />
3 The compilati<strong>on</strong> of the Urdu f<strong>in</strong>ancial corpus is not<br />
yet complete and so Urdu text is not fully analysed<br />
and evaluated.<br />
(yawm, ,(يوم ‘Reuters’ (royterz, (رويترز and ‘to’<br />
(ela, (الى appear as top candidates when us<strong>in</strong>g the<br />
MSAC.<br />
The identificati<strong>on</strong> of keywords was based <strong>on</strong><br />
select<strong>in</strong>g a z-score for frequency and for weirdness<br />
and us<strong>in</strong>g them together to create a threshold<br />
for term hood. For weirdness z-score, we<br />
choose all words as candidate terms whose<br />
weirdness was equal or greater than the average<br />
for a given corpus of texts: Z w ≥ 0.<br />
For identify<strong>in</strong>g keywords <strong>on</strong> the basis of frequency,<br />
we choose a value of frequency which<br />
was <strong>on</strong>e standard deviati<strong>on</strong> above the average<br />
frequency <strong>in</strong>itially for all three languages: Z f ≥ 1.<br />
However, due to a high degree of clitisati<strong>on</strong> <strong>in</strong><br />
<strong>Arabic</strong>, sort<strong>in</strong>g the frequency distributi<strong>on</strong> descend<strong>in</strong>gly<br />
shows a slower decrease <strong>in</strong> the values<br />
compared to <strong>English</strong> and Urdu. Our system identified<br />
a number of candidate terms which were<br />
different forms and spell<strong>in</strong>gs of each other. Nevertheless,<br />
there generally is a form that dom<strong>in</strong>ates<br />
all others. This led us to choose a higher Z f<br />
for <strong>Arabic</strong>. By repeated experimentati<strong>on</strong> we<br />
found that the value Z f ≥ 3 for <strong>Arabic</strong> is a better<br />
filter for candidate terms. These thresholds selected<br />
60 keywords <strong>in</strong> <strong>English</strong>, 54 <strong>in</strong> <strong>Arabic</strong> and<br />
58 <strong>in</strong> Urdu:<br />
<strong>English</strong> <strong>Arabic</strong> Urdu<br />
No<br />
Keyword Z f Z w Keyword Z f Z w Keyword Z f Z w<br />
1 percent 17.3 0.2 dollar<br />
37.1 0.5 rupees 13.0 0.9<br />
(دولار (dollar,<br />
(روپے,rupi)<br />
2 billi<strong>on</strong> 7.1 0.4 percent<br />
(المئة (al-meaa,<br />
3 $ 7.1 0.1 the-oil<br />
(النفط (al-neft,<br />
4 pounds 6.6 0.7 milli<strong>on</strong><br />
(مليون (milyo<strong>on</strong>,<br />
5 market 5.6 0.3 prices<br />
) أسعار,as‘ar)<br />
6 shares 5.3 4.5 company<br />
(شرآة (sharika,<br />
21.3 0.4 bank<br />
(بينک,b<strong>in</strong>k)<br />
20.0 0.5 <strong>in</strong>crease<br />
(بڑه,barh)<br />
17.2 0.8 less<br />
(کم,kam)<br />
10.8 3.7 milli<strong>on</strong><br />
(ملين,mlen)<br />
10.7 0.8 <strong>in</strong>crease<br />
(اضافہ,izafa)<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
8<br />
4.6 1.7<br />
3.7 0.7<br />
3.7 0.1<br />
3.3 0.6<br />
3.2 0.2<br />
7 company 5.3 0.9 the-dollar<br />
capital<br />
10.4 7.8 3.1 0.1<br />
(سرمايہ,sarmaya) (الدولار (al-dolar,<br />
8 share (مليار (milyar, 3.6 1.0 billi<strong>on</strong><br />
9 bid 3.5 0.3 barrel<br />
(برميل (bermeal,<br />
(سعر (si‘r, 10 growth 3.3 0.7 price<br />
10.4 0.5 percent<br />
(فيصد,feesad)<br />
10 milli<strong>on</strong><br />
9.9 3.0<br />
(کروڑ,karor)<br />
9.9 8.1 100,000<br />
(لاکه,lakh)<br />
Table 8. Top 10 keywords <strong>in</strong> tra<strong>in</strong><strong>in</strong>g corpora.<br />
3.0 0.1<br />
2.9 0.5<br />
2.9 0.2<br />
In all our experiments, we have <strong>in</strong>cluded up to<br />
10 words <strong>on</strong> each side of the occurrences of the<br />
keywords <strong>in</strong> the sub-corpora and replaced words<br />
with Z f less than 1 with place markers. Patterns<br />
with frequency less than 20 were also filtered<br />
out. First, we extracted patterns from the tra<strong>in</strong><strong>in</strong>g<br />
corpora us<strong>in</strong>g the discover local-grammar algorithm.<br />
(و (w, We did remove the c<strong>on</strong>juncti<strong>on</strong> ‘and’<br />
attached to the first word of the patterns dur<strong>in</strong>g<br />
the prun<strong>in</strong>g step of the algorithm, this improves<br />
the recall given that <strong>Arabic</strong> favours coord<strong>in</strong>ati<strong>on</strong>
ecall given that <strong>Arabic</strong> favours coord<strong>in</strong>ati<strong>on</strong><br />
over subord<strong>in</strong>ati<strong>on</strong>, and does not change the<br />
mean<strong>in</strong>g of the pattern. Then dur<strong>in</strong>g the extracti<strong>on</strong><br />
phase, a regular expressi<strong>on</strong> representati<strong>on</strong> of<br />
the pattern that allows prefixes (soft match<strong>in</strong>g)<br />
can match occurrences with and without the<br />
‘and’ (w, (و or other proclitics us<strong>in</strong>g <strong>on</strong>ly regular<br />
expressi<strong>on</strong>s without segment<strong>in</strong>g the test corpus.<br />
Examples of the productive patterns found for<br />
the top 10 keywords extracted from the <strong>English</strong><br />
and <strong>Arabic</strong> corpora are shown <strong>in</strong> Table 9 and<br />
Table 10. The tables also show how often these<br />
discovered patterns occur <strong>in</strong> the future <strong>in</strong> the test<br />
corpora (us<strong>in</strong>g soft match<strong>in</strong>g <strong>in</strong> the case of <strong>Arabic</strong>).<br />
Keyword<br />
Example Pattern f tra<strong>in</strong><strong>in</strong>g f test<br />
percent shares were up percent at 247 48<br />
billi<strong>on</strong> revenue rose percent to billi<strong>on</strong> pounds 16 10<br />
$ up $ at 41 11<br />
pounds profit of milli<strong>on</strong> pounds 149 47<br />
market <strong>in</strong> # with market expectati<strong>on</strong>s 32 10<br />
shares shares <strong>in</strong> * fell percent 71 8<br />
company a # offer for the company 20 2<br />
share dividend of pence per share 127 32<br />
Bid a hostile bid for 20 9<br />
growth percent growth <strong>in</strong> 40 14<br />
Table 9. Examples of productive patterns found<br />
<strong>in</strong> <strong>English</strong> tra<strong>in</strong><strong>in</strong>g corpus.<br />
Keyword Example Pattern f tra<strong>in</strong><strong>in</strong>g f test<br />
dollar<br />
(دولار (dollar,<br />
percent<br />
(المئة (al-meaa,<br />
the-oil<br />
(النفط (al-neft,<br />
milli<strong>on</strong><br />
(مليون (milyo<strong>on</strong>,<br />
prices<br />
) أسعار,as‘ar)<br />
company<br />
(شرآة (sharika,<br />
the-dollar<br />
(الدولار (al-dolar,<br />
Billi<strong>on</strong><br />
) مليار (milyar,<br />
barrel<br />
(برميل (bermeal,<br />
price<br />
(سعر (si‘r,<br />
الى <br />
per-barrel dollar to a-cent rise<br />
ris<strong>in</strong>g cents to dollars per barrel<br />
مرتفعا<br />
زاد<br />
سنتا<br />
مؤشر<br />
دولار<br />
للبرميل<br />
*<br />
percent <strong>in</strong>dex <strong>in</strong>creased<br />
* <strong>in</strong>dex <strong>in</strong>creased by percent<br />
نمو<br />
الطلب<br />
العالمي<br />
في<br />
على<br />
المئة<br />
النفط<br />
oil <strong>on</strong> <strong>in</strong>ternati<strong>on</strong>al demand growth<br />
<strong>in</strong>ternati<strong>on</strong>al demand for oil has grown<br />
# #<br />
milli<strong>on</strong> to <strong>in</strong>crease<br />
an # <strong>in</strong>crease to # milli<strong>on</strong><br />
زيادة<br />
ارتفاع<br />
الى<br />
النفط أسعار<br />
مليون<br />
oil prices rise<br />
oil prices rose<br />
مئة<br />
شرآة<br />
بريطانية آبرى<br />
مرتفعا نقطة<br />
po<strong>in</strong>t rise big British company hundred<br />
FTSE100 ris<strong>in</strong>g po<strong>in</strong>ts<br />
ارتفع<br />
الدولار<br />
the-dollar rose<br />
the dollar rose<br />
*<br />
dollar billi<strong>on</strong> the-deficit<br />
the deficit is * billi<strong>on</strong> dollars<br />
العجز<br />
زيادة<br />
مليار<br />
دولار<br />
*<br />
daily barrel thousand <strong>in</strong>crease<br />
<strong>in</strong>crease * thousand barrels per day<br />
ارتفع<br />
ألف<br />
سعر<br />
برميل<br />
البنزين<br />
يوميا<br />
benzene price rose<br />
benzene prices rose<br />
40 9<br />
355 109<br />
75 12<br />
26 9<br />
652 91<br />
43 2<br />
273 58<br />
161 36<br />
171 6<br />
103 22<br />
Table 10. Examples of productive patterns found<br />
<strong>in</strong> <strong>Arabic</strong> tra<strong>in</strong><strong>in</strong>g corpus.<br />
The recall is improved when us<strong>in</strong>g soft match<strong>in</strong>g<br />
for extracti<strong>on</strong> <strong>in</strong> the <strong>Arabic</strong> test corpus, Table<br />
11 shows some <strong>in</strong>stances where soft match<strong>in</strong>g<br />
was utilised.<br />
Pattern f exact f soft Example Sentence<br />
*<br />
barrel milli<strong>on</strong> <strong>in</strong>crease<br />
* <strong>in</strong>creased by milli<strong>on</strong><br />
barrel<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
9<br />
زيادة<br />
مليون<br />
برميل<br />
* بنسبة <br />
percent by-ratio rise<br />
* rose by percent<br />
70 111<br />
وزادت مخزونات الخام المحلية مليون<br />
برميل الاسبوع الماضي متخطية التنبؤات<br />
بزيادة قدرها مليون برميل<br />
ارتفاع<br />
المئة في<br />
37 43<br />
And the local crude reserves <strong>in</strong>creased<br />
by milli<strong>on</strong> barrel last<br />
week exceed<strong>in</strong>g forecasts by <br />
milli<strong>on</strong> barrels<br />
وبلغ سعر برنت في عقود التسليم لشهر<br />
دولار للبرميل بارتفاع <br />
دولار اي بنسبة في المئة <br />
ارتفع<br />
الدولار<br />
dollar rose<br />
the dollar rose<br />
36 58<br />
<br />
مايو <br />
And the price of Brent <strong>in</strong> May<br />
c<strong>on</strong>tracts reached dollars a<br />
barrel with a dollar <strong>in</strong>crease<br />
that is percent<br />
وارتفع الدولار بالمئة الى ين<br />
سعر ارتفع<br />
البنزين<br />
rose price benzene<br />
benzene prices rose<br />
3 16<br />
And the dollar rose percent<br />
to yens<br />
وارتفع سعر البنزين سنت الى<br />
دولار للجالون<br />
And the price of benzene rose<br />
cents to dollar per<br />
gall<strong>on</strong><br />
Table 11. Soft extracti<strong>on</strong> <strong>in</strong> <strong>Arabic</strong>.<br />
5 Evaluati<strong>on</strong> and C<strong>on</strong>clusi<strong>on</strong><br />
One important discovery for us was the availability<br />
of texts <strong>in</strong> languages other than <strong>English</strong>, and<br />
perhaps, more importantly, texts <strong>in</strong> specialist<br />
subjects <strong>in</strong> other languages. There is a dist<strong>in</strong>ct<br />
lack of public-doma<strong>in</strong> resources, like lexica and<br />
grammars, and, <strong>in</strong>deed, a dist<strong>in</strong>ct lack of theoretical<br />
observati<strong>on</strong>s about the most popular, and<br />
politically sensitive, languages of our times.<br />
There are observati<strong>on</strong>s about morphology and<br />
grammar scattered <strong>in</strong> the literature – there are<br />
excepti<strong>on</strong>s (Badawi et al., 2004; Schmidt, 1999),<br />
but they are characterised more by the s<strong>in</strong>gularity<br />
of their presence. For corpus based work, the<br />
existence of texts is critical and we are satisfied<br />
by the size of our specialist corpora <strong>in</strong> the two<br />
languages; more work is needed <strong>on</strong> the general<br />
language corpora.<br />
The <strong>in</strong>gress of <strong>English</strong> is noticeable <strong>in</strong> Urdu<br />
and to a lesser extent <strong>in</strong> <strong>Arabic</strong>. However, a<br />
closer look at the apparent use of transliterated<br />
,(فنڈ) and fund (اسٹاک) specialist terms, like stock<br />
<strong>in</strong> Urdu shows that the predom<strong>in</strong>ant use of these<br />
‘terms’ was <strong>in</strong> proper nouns – like <strong>in</strong> Karachi<br />
Stock Exchange, and <strong>in</strong> Livestock Authority;<br />
similarly the term fund was used as <strong>in</strong> the names<br />
of the various mutual funds. Shares were typically<br />
referred to as (hassas, (حصص and all the<br />
commodities were described <strong>in</strong> Urdu words. The<br />
corpora show that there are much less transliterated<br />
terms <strong>in</strong> <strong>Arabic</strong> compared to Urdu.
Our work appears to facilitate sentiment analysis<br />
<strong>in</strong> that we have captured the essence of f<strong>in</strong>ancial<br />
<strong>news</strong>: it is typically <strong>news</strong> about change<br />
and here, for example, the token percent plays a<br />
key role <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> and Urdu (Our earlier<br />
papers have a similar c<strong>on</strong>clusi<strong>on</strong> about Ch<strong>in</strong>ese).<br />
Our method can generate value judgement<br />
about whether a sentence c<strong>on</strong>ta<strong>in</strong>s a positive sentiment<br />
or a negative sentiment or both. In a similar<br />
ve<strong>in</strong> our method does not aggregate the sentiment<br />
over many sentences that comprise a <strong>news</strong><br />
report to generate a sentiment <strong>in</strong>dex of the report.<br />
But our method shows that the pr<strong>in</strong>cipal key<br />
word collocates <strong>in</strong> a statistically significant manner<br />
with metaphorical and literal words for the<br />
directi<strong>on</strong> of change. And, that <strong>on</strong>e can generate<br />
an aggregated (directi<strong>on</strong> of change) <strong>in</strong>dex for a<br />
<strong>news</strong> story and <strong>in</strong>deed for a collecti<strong>on</strong> of texts.<br />
Such an approach is comm<strong>on</strong> and used most recently<br />
by Tetlock (2007) but by us<strong>in</strong>g a prec<strong>on</strong>structed<br />
lexic<strong>on</strong>.<br />
The fact that the tokens used to describe the<br />
change follow a local grammar is particularly<br />
gratify<strong>in</strong>g <strong>in</strong> that, given the subjectivity related to<br />
the whole noti<strong>on</strong> of change, <strong>on</strong>e would expect<br />
many a patterns here. Even more gratify<strong>in</strong>g for<br />
us is the fact that our method tends to extract<br />
most of the keywords based purely <strong>on</strong> statistical<br />
measures <strong>in</strong> the three languages and that the local<br />
grammar patterns are similar despite the key differences<br />
<strong>in</strong> the languages – <strong>English</strong> is predom<strong>in</strong>antly<br />
SVO, Urdu SOV and <strong>Arabic</strong> VSO.<br />
We randomly selected 30 Reuters <strong>Arabic</strong><br />
(7,373 tokens) and 20 Reuters <strong>English</strong> (UK)<br />
(7,561 tokens) <strong>news</strong> stories from the test corpora<br />
for an evaluati<strong>on</strong>. Two knowledgeable humans<br />
were asked to label each sentence as positive,<br />
negative, mixed 4 (positive/negative), or neutral<br />
(unknown). The <strong>English</strong> evaluator was asked to<br />
label the sentences based <strong>on</strong> what is positive or<br />
negative to the UK ec<strong>on</strong>omy while the <strong>Arabic</strong><br />
evaluator was asked to annotate with respect to<br />
Middle Eastern ec<strong>on</strong>omies. The breakdown of<br />
the labelled sentences <strong>in</strong> the two corpora is<br />
shown <strong>in</strong> Table 12.<br />
The patterns discovered us<strong>in</strong>g the pattern discovery<br />
algorithm were manually validated and<br />
tagged as positive or negative by check<strong>in</strong>g their<br />
predom<strong>in</strong>ate orientati<strong>on</strong> <strong>in</strong> the tra<strong>in</strong><strong>in</strong>g corpus.<br />
4 For example the sentence ‘Drax promises special<br />
dividend but shares slip’ c<strong>on</strong>ta<strong>in</strong>s both positive and<br />
negative <strong>in</strong>formati<strong>on</strong>.<br />
Language Positive Negative Mixed Neutral Total<br />
<strong>English</strong> 88 142 8 90 328<br />
<strong>Arabic</strong> 135 80 6 135 356<br />
Table 12. Breakdown of sentences <strong>in</strong> the evaluati<strong>on</strong><br />
corpora as labelled by humans.<br />
The results of the prelim<strong>in</strong>ary evaluati<strong>on</strong> are<br />
shown <strong>in</strong> Table 13 for each category of the sentences<br />
and the overall system, the figures show<br />
high precisi<strong>on</strong> and low recall with the <strong>Arabic</strong> F-<br />
measure be<strong>in</strong>g slightly better than <strong>English</strong> due to<br />
a higher recall, which suggest that the <strong>Arabic</strong><br />
evaluati<strong>on</strong> corpus has more idiosyncrasies<br />
masked beh<strong>in</strong>d a complex morphology, further<br />
<strong>in</strong>vestigati<strong>on</strong> is required with larger corpora. A<br />
high precisi<strong>on</strong> is essential <strong>in</strong> a doma<strong>in</strong> like f<strong>in</strong>ance.<br />
There are two probable reas<strong>on</strong>s for the<br />
low recall: First, the values of the thresholds<br />
used lead to the loss of valuable patterns and<br />
there are complex structures which the algorithm<br />
can’t detect based <strong>on</strong> frequency bases; sec<strong>on</strong>d,<br />
the small evaluati<strong>on</strong> corpus which had a lot of<br />
m<strong>in</strong>ority articles with sentences that do not c<strong>on</strong>ta<strong>in</strong><br />
frequent patterns.<br />
Positives Negatives Mixed Overall<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
10<br />
Lang.<br />
<strong>English</strong> 94.6%<br />
(11/13)<br />
Prec. Recall Prec. Recall Prec. Recall Prec. Recall<br />
12.5%<br />
(11/88)<br />
93.7%<br />
(15/16)<br />
<strong>Arabic</strong> 88.2% 22.2% 100%<br />
(30/34) (30/135) (7/7)<br />
10.6% 100% 12.5%<br />
(15/142) (1/1) (1/8)<br />
8.6%<br />
(7/80)<br />
0%<br />
(0/1)<br />
0%<br />
(0/6)<br />
90%<br />
(27/30)<br />
88.1%<br />
(37/42)<br />
F<br />
(β=1)<br />
11.3%<br />
(27/238) 20.1%<br />
17.2%<br />
(38/221) 28.8%<br />
Table 13. Evaluati<strong>on</strong> results for <strong>extract<strong>in</strong>g</strong> sentiment<br />
bear<strong>in</strong>g phrases.<br />
We are <strong>in</strong>vestigat<strong>in</strong>g the effect of preprocess<strong>in</strong>g<br />
the <strong>Arabic</strong> corpora and particularly<br />
collaps<strong>in</strong>g clitics, and how our method can utilise<br />
other properties <strong>in</strong> the distributi<strong>on</strong> of words <strong>in</strong><br />
the f<strong>in</strong>ancial corpora for <strong>extract<strong>in</strong>g</strong> features for<br />
the automatic classificati<strong>on</strong> (cluster<strong>in</strong>g) of words<br />
and patterns as positive and negative, particularly:<br />
(a) the word order of each language (e.g.<br />
<strong>Arabic</strong> lead sentences start with a verb that is<br />
predom<strong>in</strong>antly a report<strong>in</strong>g or a movement verb<br />
and weirdness analysis can filter the movement<br />
verbs) and the relati<strong>on</strong>ship between words <strong>in</strong> the<br />
titles and lead sentences (b) the prelim<strong>in</strong>ary observati<strong>on</strong><br />
that positive <strong>news</strong> is more abundant<br />
than negative <strong>news</strong> (asymmetry) across the three<br />
languages.<br />
So what of the future for our research: how<br />
safely we can c<strong>on</strong>clude whether or not a sentence,<br />
c<strong>on</strong>ta<strong>in</strong><strong>in</strong>g a subjectively perceived positive/negative,<br />
as whole can be assigned positive/negative<br />
polarity. For example, it is perhaps
true to say that decl<strong>in</strong>e <strong>in</strong> the stock markets <strong>in</strong> the<br />
USA will be treated as negative <strong>news</strong> <strong>in</strong> the<br />
Middle East and <strong>in</strong> Pakistan. But does the rise <strong>in</strong><br />
the price of oil, that is predom<strong>in</strong>antly produced<br />
<strong>in</strong> the Middle East, or that of rice or cott<strong>on</strong><br />
prices, the ma<strong>in</strong>stay of Pakistani ec<strong>on</strong>omy, is<br />
perceived similarly <strong>in</strong> the USA? We have c<strong>on</strong>ducted<br />
some experiments to <strong>in</strong>vestigate this cross<br />
polarity.<br />
References<br />
Abuleil, S. (2004) ‘Extract<strong>in</strong>g Names from <strong>Arabic</strong><br />
Text for Questi<strong>on</strong>-Answer<strong>in</strong>g Systems’. In Proc. of<br />
RIAO 2004, Avign<strong>on</strong>, France. (Available at:<br />
http://www.riao.org/sites/RIAO-2004/Proceed<strong>in</strong>gs-<br />
2004/papers/0640.pdf)<br />
Ahmad, K., Cheng, D. and Almas, Y. (2006) ‘Multil<strong>in</strong>gual<br />
Sentiment Analysis of F<strong>in</strong>ancial News<br />
Streams’. In Proc. of the 1st Internati<strong>on</strong>al C<strong>on</strong>ference<br />
<strong>on</strong> Grid <strong>in</strong> F<strong>in</strong>ance, Palermo. (Available at:<br />
http://pos.sissa.it/archive/c<strong>on</strong>ferences/026/001/GRI<br />
D2006_001.pdf)<br />
Almas, Y. and Ahmad, K. (2006) ‘LoLo: A System<br />
based <strong>on</strong> Term<strong>in</strong>ology for Multil<strong>in</strong>gual Extracti<strong>on</strong>’.<br />
In Proc. of COLING/ACL’06 Workshop <strong>on</strong> Informati<strong>on</strong><br />
Extracti<strong>on</strong> Bey<strong>on</strong>d a Document, Sydney:<br />
Associati<strong>on</strong> for Computati<strong>on</strong>al L<strong>in</strong>guistics, pp. 56-<br />
65.<br />
Almas, Y. and Ahmad, K. (2007) ‘LoLo: A Tool for<br />
Extract<strong>in</strong>g Statistical Informati<strong>on</strong> and Lexical Resources<br />
from <strong>Arabic</strong> Corpora’. In Proc. of workshop<br />
<strong>on</strong> <strong>Arabic</strong> Natural Language Process<strong>in</strong>g (2nd<br />
Informati<strong>on</strong> & Communicati<strong>on</strong> Technology Internati<strong>on</strong>al<br />
Symposium). IEEE Morocco: Fez.<br />
Antweiler, W., and Frank, M. Z. (2004) ‘Is all that<br />
talk just noise: The <strong>in</strong>formati<strong>on</strong> c<strong>on</strong>tent of Internet<br />
Stock Message Boards’. Journal of F<strong>in</strong>ance,<br />
59(3), pp. 1259-1295.<br />
Badawi, E., Carter, M.G. and Gully, A. (2004) Modern<br />
Written <strong>Arabic</strong>: A Comprehensive Grammar,<br />
L<strong>on</strong>d<strong>on</strong>: Routledge.<br />
Balvet, A. (2002) 'Design<strong>in</strong>g Text Filter<strong>in</strong>g Rules:<br />
Interacti<strong>on</strong> Between General and Specific Lexical<br />
Resources'. In Proc. of LREC Workshop <strong>on</strong> Us<strong>in</strong>g<br />
Semantics for Informati<strong>on</strong> Retrieval, Las Palmas,<br />
Spa<strong>in</strong>. (Available at:<br />
http://<strong>in</strong>folang.uparis10.fr/modyco/textes/balvet/Lrec_su<br />
bmissi<strong>on</strong>_paper_revised.PDF)<br />
Bazerman, C. (1988) Shap<strong>in</strong>g Written Knowledge: the<br />
Genre and Activity of the Experimental Article <strong>in</strong><br />
Science. Madis<strong>on</strong>: Wisc<strong>on</strong>s<strong>in</strong> Univ. Press.<br />
Butt, M., Dyvik, H., K<strong>in</strong>g, T., Masuichi, H., and<br />
Rohrer, C. (2002) ‘The Parallel Grammar Project’.<br />
In Proc. of COLING-2002 Workshop <strong>on</strong><br />
Grammar Eng<strong>in</strong>eer<strong>in</strong>g and Evaluati<strong>on</strong>, Taipei: Internati<strong>on</strong>al<br />
Committee <strong>on</strong> Computati<strong>on</strong>al L<strong>in</strong>guistics,<br />
pp 1-7.<br />
Butt, M. and K<strong>in</strong>g, T. (2002) ‘Urdu and the Parallel<br />
Grammar Project’. In Proc. of the COLING-2002<br />
Workshop <strong>on</strong> Asian Language Resources and Internati<strong>on</strong>al<br />
Standardizati<strong>on</strong>. Taipei: Internati<strong>on</strong>al<br />
Committee <strong>on</strong> Computati<strong>on</strong>al L<strong>in</strong>guistics, pp 9-13.<br />
Chan, K. and Lam, W. (2005) ‘Extract<strong>in</strong>g Causati<strong>on</strong><br />
Knowledge from Natural Language Texts’. Internati<strong>on</strong>al<br />
Journal of Intelligent Systems, 20(3), pp.<br />
327-358.<br />
Das, S. and Chen, M. (2006) Yahoo! for Amaz<strong>on</strong>:<br />
Sentiment Extracti<strong>on</strong> From Small Talk <strong>on</strong> the Web.<br />
Work<strong>in</strong>g Paper, Leavey School of Bus<strong>in</strong>ess, Santa<br />
Clara University, California. (Available at:<br />
http://scumis.scu.edu/~srdas/chat.pdf).<br />
Engle, R. (2004) ‘Risk and volatility: Ec<strong>on</strong>ometric<br />
models and f<strong>in</strong>ancial practice’ (revisi<strong>on</strong> of a 2003<br />
Nobel Prize Lecture), American Ec<strong>on</strong>omic Review,<br />
94(3), pp 405-420.<br />
F<strong>in</strong>k, C. (2000) Bottom L<strong>in</strong>e Writ<strong>in</strong>g: Report<strong>in</strong>g the<br />
Dense of Dollars. Iowa: Iowa State University<br />
Press.<br />
Florian, R., Hassan, H., Ittycheriah, A., J<strong>in</strong>g, H.,<br />
Kambhatla, N., Luo, X., Nicolov, N. and Roukos,<br />
S. (2004) ‘A Statistical Model for Multil<strong>in</strong>gual Entity<br />
Detecti<strong>on</strong> and Track<strong>in</strong>g’. In Proc. of HLT-<br />
NAACL 2004, Bost<strong>on</strong>, USA, pp. 1-8.<br />
Fellbaum, C. (1998) WordNet: An Electr<strong>on</strong>ic Lexical<br />
Database. Cambridge, MA: The MIT Press.<br />
Gerr, S. (1943) ‘Language and Science: the Rati<strong>on</strong>al,<br />
Functi<strong>on</strong>al Language of Science and Technology’.<br />
Philosophy of Science. Vol 9 (No.2), pp 146-161.<br />
Graddol, D. (1997) The Future of <strong>English</strong>?, L<strong>on</strong>d<strong>on</strong>:<br />
The British Council.<br />
Gross, M. (1993) ‘Local Grammars and their Representati<strong>on</strong><br />
by F<strong>in</strong>ite Automata’. In (Eds.) Hoey, M.,<br />
Data, Descripti<strong>on</strong>, Discourse: Papers <strong>on</strong> the <strong>English</strong><br />
Language <strong>in</strong> H<strong>on</strong>our of John McH S<strong>in</strong>clair.<br />
L<strong>on</strong>d<strong>on</strong>: HarperColl<strong>in</strong>s, pp. 26-38.<br />
Gross, M. (1997) ‘The C<strong>on</strong>structi<strong>on</strong> of Local Grammars’.<br />
In (Eds.) Roche, E. and Schabès, Y., F<strong>in</strong>ite-<br />
State Language Process<strong>in</strong>g, Language, Speech,<br />
and Communicati<strong>on</strong>. Cambridge, MA.: MIT Press,<br />
pp. 329-354.<br />
Halliday, M.A.K., & Mart<strong>in</strong>, J. (1993). Writ<strong>in</strong>g Science:<br />
Literary and Discursive Power. L<strong>on</strong>d<strong>on</strong> and<br />
Wash<strong>in</strong>gt<strong>on</strong>, DC: The Falmer Press.<br />
Harris, Z. (1991) A Theory of Language and Informati<strong>on</strong>:<br />
A Mathematical Approach. Oxford: Clarend<strong>on</strong><br />
Press.<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
11
Hardie, I. and Mackenzie, D. (2007) ‘Assembl<strong>in</strong>g an<br />
Ec<strong>on</strong>omic Actor: The Agencement of a Hedge<br />
Fund’. Sociological Review. 55 (1), pp 57-80.<br />
Holes, C. (2004) Modern <strong>Arabic</strong>: Structure, Functi<strong>on</strong>s,<br />
and Varieties. Wash<strong>in</strong>gt<strong>on</strong>: Georgetown<br />
University Press.<br />
ISO (1990) Internati<strong>on</strong>al Organizati<strong>on</strong> for Standardizati<strong>on</strong>:<br />
Term<strong>in</strong>ology - Vocabulary (ISO 1087). Geneva:<br />
ISO.<br />
Kahneman, D. (2003) ‘Maps of Bounded Rati<strong>on</strong>ality:<br />
A Perspective <strong>on</strong> Intuitive Judgment and Choice’<br />
(revisi<strong>on</strong> of a 2002 Nobel Prize Lecture), American<br />
Ec<strong>on</strong>omic Review, 93(5), pp. 1449-1475.<br />
Lan, C., Ho, K., Luk, R. and Yeung, D. (2005)<br />
‘FNDS: a Dialogue-based System for Access<strong>in</strong>g<br />
Digested F<strong>in</strong>ancial News’. Journal of Systems and<br />
Software, 78(2), pp. 180-193.<br />
Laporte, E. (2004) 'Foreward'. In (Eds.) Leclère, C.,<br />
Laporte, E., Piot, M. and Silberzte<strong>in</strong>, M. Syntax,<br />
Lexis & Lexic<strong>on</strong>-Grammar: Papers <strong>in</strong> h<strong>on</strong>our of<br />
Maurice Gross. Philadelphia: John Benjam<strong>in</strong>s, pp.<br />
xi-xxi.<br />
MacDowall, I. (1995) Reuters Style Guide, 2nd Editi<strong>on</strong>.<br />
L<strong>on</strong>d<strong>on</strong>: Reuters Ltd.<br />
Poibeau, T. and Kosseim, L. (2001) ‘Proper Name<br />
Extracti<strong>on</strong> from N<strong>on</strong>-Journalistic Texts’. In (Eds.)<br />
Daelemans, W., Sima'an, K., Veenstra, J. and<br />
Zavrel, J., Computati<strong>on</strong>al L<strong>in</strong>guistics <strong>in</strong> the Netherlands<br />
2000: Selected Papers from the Eleventh<br />
CLIN Meet<strong>in</strong>g. Amsterdam: Rodopi,. pp. 144-157.<br />
Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J.<br />
(1985) A Comprehensive Grammar of the <strong>English</strong><br />
Language. L<strong>on</strong>d<strong>on</strong>: L<strong>on</strong>gman.<br />
Schmidt, R. L. (1999) Urdu: An Essential Grammar,<br />
L<strong>on</strong>d<strong>on</strong>: Routledge.<br />
Shiller, R. J. (2000) Irrati<strong>on</strong>al Exuberance, Pr<strong>in</strong>cet<strong>on</strong>,<br />
NJ: Pr<strong>in</strong>cet<strong>on</strong> University Press.<br />
Sim<strong>on</strong>, H. A. (1979) ‘Rati<strong>on</strong>al decisi<strong>on</strong> mak<strong>in</strong>g <strong>in</strong><br />
bus<strong>in</strong>ess organizati<strong>on</strong>s’ (revisi<strong>on</strong> of a 1978 Nobel<br />
Prize Lecture), American Ec<strong>on</strong>omic Review, 69(4),<br />
pp.493-513.<br />
Smadja, F. (1994) ‘Retriev<strong>in</strong>g Collocati<strong>on</strong>s fromText:<br />
Xtract’. In (Eds.) Armstr<strong>on</strong>g, S., Us<strong>in</strong>g Large Corpora.<br />
L<strong>on</strong>d<strong>on</strong>: MIT Press.<br />
St<strong>on</strong>e, P. J., Dunphy, D. C., Smith, M. S., Oglivie, D.<br />
M. and associates (1996) The General Inquirer: A<br />
Computer Approach to C<strong>on</strong>tent Analysis. Cambridge,<br />
MA: The MIT Press, 1966.<br />
Surdeanu, M., Harabagiu, S., Williams, J. and<br />
Aarseth, P. (2003) ‘Us<strong>in</strong>g Predicate-Argument<br />
Structures for Informati<strong>on</strong> Extracti<strong>on</strong>’. In Proc. of<br />
the 41st Annual C<strong>on</strong>ference of the Associati<strong>on</strong> for<br />
Computati<strong>on</strong>al L<strong>in</strong>guistics (ACL’03), Saporro: Associati<strong>on</strong><br />
for Computati<strong>on</strong>al L<strong>in</strong>guistics, pp. 8–15.<br />
Wils<strong>on</strong>, T., Wiebe, J. and Hoffmann, P. (2005) ‘Recogniz<strong>in</strong>g<br />
C<strong>on</strong>textual Polarity <strong>in</strong> Phrase-Level Sentiment<br />
Analysis’. In Proc of HLT-EMNLP-2005.<br />
Vancouver: Associati<strong>on</strong> for Computati<strong>on</strong>al L<strong>in</strong>guistics,<br />
pp. 347-354.<br />
Tetlock, P. C. (2007) Giv<strong>in</strong>g C<strong>on</strong>tent to Investor Sentiment:<br />
The Role of Media <strong>in</strong> the Stock Market.<br />
Journal of F<strong>in</strong>ance, 62(3), pp. 1139-1168.<br />
Tetlock, P. C., Saar-Tsechansky, M., and Mackskassy,<br />
S. (2005) ‘More Than Words: Quantify<strong>in</strong>g<br />
Language to Measure Firms’ Fundamentals.<br />
(http://www.mccombs.utexas.edu/faculty/Paul.Tetl<br />
ock/papers/TSM_More_Than_Words_09_06.pdf)<br />
Turney, P.D. (2002) ‘Thumbs Up or Thumbs Down?<br />
Semantic Orientati<strong>on</strong> Applied to Unsupervised<br />
Classificati<strong>on</strong> of Reviews’. In Proc of the 40th Annual<br />
Meet<strong>in</strong>g of the Associati<strong>on</strong> for Computati<strong>on</strong>al<br />
L<strong>in</strong>guistics (ACL'02), Philadelphia: Associati<strong>on</strong> for<br />
Computati<strong>on</strong>al L<strong>in</strong>guistics, pp. 417-424.<br />
Zitouni, I., Sorensen, J., Luo, X. and Florian, R.<br />
(2005) ‘The Impact of Morphological Stemm<strong>in</strong>g<br />
<strong>on</strong> <strong>Arabic</strong> Menti<strong>on</strong> Detecti<strong>on</strong> and Coreference<br />
Resoluti<strong>on</strong>’. In Proc of the ACL Workshop <strong>on</strong><br />
Computati<strong>on</strong>al Approaches to Semitic Languages.<br />
Ann Arbor, Michigan: Associati<strong>on</strong> for Computati<strong>on</strong>al<br />
L<strong>in</strong>guistics, pp. 63-70.<br />
Yousif Almas and Khurshid Ahmad, A <str<strong>on</strong>g>note</str<strong>on</strong>g> <strong>on</strong> <strong>extract<strong>in</strong>g</strong> ‘sentiments’ <strong>in</strong> f<strong>in</strong>ancial <strong>news</strong> <strong>in</strong> <strong>English</strong>, <strong>Arabic</strong> & Urdu, The 2 nd Workshop <strong>on</strong> Computati<strong>on</strong>al Approaches<br />
to <strong>Arabic</strong> Script-based Languages, L<strong>in</strong>guistic Soc America 2007 L<strong>in</strong>guistic Institute, Stanford University, Stanford, California., July 21-22, 2007,<br />
L<strong>in</strong>guistic Society of America, 2007, pp 1 – 12<br />
12