The Hong Kong Polytechnic University ENGL316: Computer-mediated Communication
Assignment III: A corpus-based study on the characteristics of texts generated in Computer-mediated Communication
Student: Lee Sin Ming (09369162D) Date: 6th May, 2011
With advanced technology and easy access to the Internet, computer-mediated communication (CMC), which reduces the constraints of time and geography, has become a common form of interaction in today’s information era. The blog in particular has gained worldwide attention in recent years as a new and important online genre (Ooi et al. 2007). The resulting linguistic features merit a corpus-based study of how people draw on their language resources to communicate in this specific genre.
In this study, a target corpus of 30,103 words is built by collecting texts from five personal blogs on a popular theme, beauty. The blogs originate in various places (Figure 1), which makes the resulting analysis more representative.
Blogs | Description
Beaut.ie | Established in 2006 by a pair of Irish sisters, Kirstie and Aisling, freelance beauty writers in Ireland
Jack and Hill: A Beauty Blog | Founded in 2005 by two beauty-obsessed women: Hillary, an author in LA, and Jankie, a PR pro in London and the US
Beauty Addict | Founded in 2005 by Kristen (a beauty addict who grew up in New York and lived in Manhattan) as a way to share her product obsessions with the world
Skin Care and Beauty | Has published Galina’s reviews of beauty and skin care products, and articles from recent fashion magazines (English, German and Russian issues), since 2007
My Women Stuff | Operated by Paris B since 2007 and based in Kuala Lumpur, Malaysia; writes about all things that make women beautiful
Figure 1: Details of the collected beauty blogs
The beauty-corpus is compared with both the spoken and written samplers of the British National Corpus (BNC), a 2-million-word representative sample of Standard British English, to bring out its distinctive linguistic features. The analysis is conducted with Wmatrix, a leading corpus linguistics tool which offers word frequency profiles and concordances. With Wmatrix’s part-of-speech (POS) tagging by CLAWS and semantic tagging (Semtag) by USAS, it is possible to identify the linguistic characteristics that occur with unexpectedly high frequency in the target corpus, as indicated by relatively high log-likelihood (LL) values, in comparison with more general texts. The results are discussed below.
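The LL values referred to above can be illustrated with a short calculation. The sketch below uses the two-corpus log-likelihood formula commonly used in corpus comparison; it is an illustration under stated assumptions, not Wmatrix's own code. The function names are hypothetical, and apart from the beauty-corpus size of 30,103 words, all counts are invented for the example.

```python
import math


def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Log-likelihood (G2) for one item across two corpora."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the item were equally likely in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def is_overused(freq_target, freq_ref, size_target, size_ref):
    """True when the item's relative frequency is higher in the target corpus."""
    return freq_target / size_target > freq_ref / size_ref


# Hypothetical counts for illustration only: 30,103 is the beauty-corpus size
# reported in this study; the other three numbers are invented.
ll = log_likelihood(freq_target=150, freq_ref=200,
                    size_target=30_103, size_ref=1_000_000)
print(f"LL = {ll:.2f}, overused = {is_overused(150, 200, 30_103, 1_000_000)}")
```

An LL value above 3.84 is conventionally treated as significant at p < 0.05 (one degree of freedom). The LL value alone does not show the direction of the difference, which is why the relative frequencies are compared separately.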
The aboutness of the corpus is revealed in its word frequency list (Appendix I) and collocation list (Appendix II). The first two content words in the word frequency profile are ‘skin’ (18th) and ‘hair’ (24th), while 80% of the top 60 collocations are also related to the topic, such as the names of products (e.g. ‘cleansing oils’, ‘BB cream’) and brands (e.g. ‘Esmeria Organics’, ‘Clarins White’). The remaining 20% are phrases like ‘I think’, ‘you can’ and ‘have tried’, indicating that bloggers like to share comments and advise others on the use of beauty products with reference to their own experience. Hence, the two most common POS are NN1 (singular common noun, 14.5%) and JJ (general adjective, 8.0%), as bloggers often discuss ‘skin’, ‘hair’, ‘product’, ‘cream’ and ‘colour’ (Appendix III) and evaluate them with adjectives like ‘good’, ‘new’, ‘dry’, ‘oily’ and ‘nice’ (Appendix IV).
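A word frequency profile and a rough collocation list of the kind described above can be approximated in a few lines. This is a minimal sketch, not a reproduction of Wmatrix: the tokenizer is deliberately simple, raw counts of adjacent word pairs stand in for a proper collocation statistic, and the sample sentence is invented rather than taken from the corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_profile(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)


def adjacent_pairs(tokens):
    """Count pairs of neighbouring words as a rough stand-in for collocations."""
    return Counter(zip(tokens, tokens[1:]))


# Invented sample text, used only to show the shape of the output.
sample = "I think this BB cream is good. I think the BB cream suits dry skin."
tokens = tokenize(sample)
print(frequency_profile(tokens).most_common(3))
print(adjacent_pairs(tokens).most_common(3))
```

In the sample, pairs such as ('bb', 'cream') and ('i', 'think') each occur twice, mirroring the two kinds of collocation noted above: product names and phrases of personal comment.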
The semantic frequency profile also reflects the field of the corpus. 46.4% of the corpus, including the names of the bloggers and beauty products, is categorized as Names and Grammatical Words. The second largest proportion, 6.49%, falls under Substances, Materials, Objects and Equipment, since bloggers focus on discussing the ‘products’ in terms of Objects Generally (O2), Colour Patterns (O4.3) and Judgement of Appearance (O4.2). A further 5.80% belongs to Numbers and Measurement, as bloggers comment on the Linear Order (N4), Quantities (N5) and Frequency (N6) of their experience of using beauty products and services. The topic of the corpus is also salient in the 4.71% devoted to the Body and the Individual, such as Physiology (B1), Medical Treatment (B3), and Cleaning and Personal Care (B4). People express their Thoughts (X2.1), Knowledge (X2.2), Wanting (X7) and Trying (X8) freely in their blogs. Thus, Psychological Actions, States and Processes is another major component (3.99%) of the corpus (Appendix V).
According to the statistics, 17.12% of the beauty-corpus consists of semantically positive items but only 5.21% of negative ones (Figure 2). Bloggers tend to share more about what they love.
Positive | Negative
Best (A5.1+++) | Dry (O1.2-)
Love (E2+) | Disappointed (E4.2-)
Pretty (O4.2+) | Few (N5-)
Able (X9.1+) | Darker (W2--)
Permanent (T2+++) | Problem (A12-)
Figure 2: Examples of semantically positive and negative items with their Semtag
However, the tagging may seem strange, especially when the contextual meaning of the whole sentence rather than just the term is taken into consideration. For instance, ‘ageing’, which people try hard to fight, is tagged T3++ (more examples in Figure 3). On the other hand, most people prefer ‘new’ and ‘inexpensive’ beauty products and services, but these are tagged T3- and I1.3- respectively. Wmatrix tags an item as positive when it has a magnifying effect or denotes an increase in sense, and as negative in the opposite case. So ‘new’ and ‘inexpensive’ are tagged as semantically negative because they are ‘short in history’ and ‘not highly priced’.
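The plus and minus convention described above can be read off a tag mechanically. The sketch below assumes that polarity is marked by a run of ‘+’ or ‘-’ signs directly after the category code, as in the tags quoted in this study; the function names and the sample tag list are hypothetical, and more complex tags (for example, those with several category codes) are not handled.

```python
import re

# Captures the run of '+' or '-' signs that follows the category code,
# e.g. 'A5.1+++' -> '+++', 'T3-' -> '-'.
TAG_PATTERN = re.compile(r"^[A-Z]\d+(?:\.\d+)*(\++|-+)?")


def polarity(tag):
    """Return a signed strength: +3 for 'A5.1+++', -1 for 'T3-', 0 for 'O2'."""
    match = TAG_PATTERN.match(tag)
    if match is None or match.group(1) is None:
        return 0
    signs = match.group(1)
    return len(signs) if signs[0] == "+" else -len(signs)


def polarity_shares(tags):
    """Percentage of tags that carry a positive or a negative marker."""
    if not tags:
        return 0.0, 0.0
    positive = sum(1 for tag in tags if polarity(tag) > 0)
    negative = sum(1 for tag in tags if polarity(tag) < 0)
    return 100 * positive / len(tags), 100 * negative / len(tags)


# Invented tag sequence for illustration; not the actual corpus output.
tags = ["A5.1+++", "E2+", "T3-", "O2", "T3++", "N5+", "B4", "Z5"]
print(polarity_shares(tags))
```

Because the sign reflects only an increase or decrease in the sense of the tagged word, ‘T3++’ (ageing) counts as positive and ‘T3-’ (new) as negative, regardless of how the blogger feels about them.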
Item | Example
a_lot (N5+) | I do nt like this as much as I do the other Kanebo cleansing oils Ive used. There is a scent and the texture feels thick. It also does not emulsify as well and takes a lot of work to clean off.
too (N5++) | Additionally, the brush is a little too soft. With liquid liners, a good brush is CRITICAL If you’re new to the medium, this is n't a great one to start with; the soft brush makes it too easy to make the line too thick or lose control of it.
again (N6+) | Cleared up skin, used it for another week and skin died again. Tossed it and chalked it up to an expensive mistake I recently tried a sample sachet I obtained randomly
Figure 3: Examples of positively-tagged items conveying negative messages
More features can be identified from the comparison between the target corpus and the BNC. Regarding multi-word expressions, phrases related to the beauty field account for 52% and 44% of the 50 expressions with the highest LL values when compared to the spoken and written BNC respectively (Appendix VI). Compared to the written BNC, the beauty-corpus contains more conversational phrases, like ‘take a look’, ‘as you can see’, ‘a little bit’ and ‘I say’, which are not often used in writing. Compared with the spoken BNC, two phrases that appear frequently in daily conversation, ‘you know’ and ‘I mean’, are clearly underused. This implies that the online blog genre is less speech-like than conversation but more speech-like than writing.
Turning to the POS tagging, as mentioned before, NN1 and JJ are the two most frequent POS in the beauty-corpus because bloggers often discuss the names, ingredients, functions and usage of beauty products, and add their own opinions. The lexical density of spoken texts, by contrast, is lower (with fewer content words) to enable easy verbal communication. As a result, when the beauty-corpus is compared with the spoken BNC, JJ, NN1, FO (e.g. ‘spa 36’, ‘pa+++’), NN2 (e.g. ‘products’, ‘hands’, ‘lips’) and NNU (e.g. ‘20% off’, ‘30ml’) are at the top of the overused POS list (Appendix VII). APPGE also obtains relatively high LL values because bloggers often use possessive pronouns when mentioning their own experience. Several POS, however, are underused. The most salient one is the interjection. Words like ‘yes’, ‘eh’ and ‘ah’ occur less frequently in the target corpus, as online discussion is not real face-to-face interaction in which people interrupt each other or show their emotions with various interjections. With NN1 and NN2 as the dominant POS, personal pronouns, including PPHS1, PPHS2, PPIS2 and PPY, are used less frequently than in general speech (Appendix VII).
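Lexical density, mentioned above, is the share of content words among all tokens. The sketch below estimates it from POS-tagged tokens. Treating tags that begin with N, VV, J or R (nouns, lexical verbs, adjectives, adverbs) as content words is a simplifying assumption made for this illustration, and the two hand-tagged examples are invented, not drawn from the beauty-corpus or the BNC.

```python
# Assumption: tags with these prefixes mark content words
# (nouns, lexical verbs, adjectives, adverbs).
CONTENT_TAG_PREFIXES = ("N", "VV", "J", "R")


def lexical_density(tagged_tokens):
    """Share of tokens whose POS tag marks a content word."""
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, tag in tagged_tokens
                  if tag.startswith(CONTENT_TAG_PREFIXES))
    return content / len(tagged_tokens)


# Hand-tagged toy examples (hypothetical, not taken from the corpora).
blog_like = [("this", "DD1"), ("cream", "NN1"), ("suits", "VVZ"),
             ("dry", "JJ"), ("skin", "NN1")]
speech_like = [("yes", "UH"), ("I", "PPIS1"), ("mean", "VV0"),
               ("it", "PPH1"), ("is", "VBZ"), ("good", "JJ")]
print(lexical_density(blog_like), lexical_density(speech_like))
```

In these toy examples the blog-like sentence has the higher density, which is the pattern the POS comparison with the spoken BNC suggests; the real figures would have to be computed from the tagged corpora.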
The results of comparing the beauty-corpus with the written BNC differ greatly from the above findings. In general written texts, personal pronouns are usually avoided to maintain a formal register. However, bloggers focus on sharing personal views and experience, and hence PPIS1, PPH1, PPY and PPIO1 are used more often than in the written BNC (Appendix VIII). Meanwhile, the beauty-corpus contains more adverbs (RR and RG), like ‘very’, ‘too’ and ‘more’, to emphasize bloggers’ views. The Germanic genitive marker (’ or ’s) is often omitted for a quicker pace of communication, while fewer proper nouns and preceding nouns of title are found, as blog discussions are less formal than the general written genre (Appendix VIII).
As for the Semtag aspect, items related to the beauty field, such as those under the semantic categories B, O, W and N, again occupy over half of the top 30 overused semantic categories in the comparisons with both the spoken and written BNC, revealing the aboutness of this specific corpus. On the other hand, items about other topics, including Politics, Government, Telecommunications and Architecture, are underused when compared with the BNC (Appendix IX and X).
Bloggers tend to share what they find good with their online friends. Therefore, E1 (Emotional Actions), E2+ (Like) and A5.1+ (Evaluation: Good) are used more commonly than in general texts. To be more persuasive and make their messages more interesting to readers, bloggers use more degree boosters to emphasize their viewpoints. The difference is more salient in the comparison with the written BNC (with more A13 tags appearing in the overused Semtag list), as writers tend to keep their personal views out of their writing to make it more objective and formal. As a less formal genre, the beauty-corpus includes not only more degree boosters but also more pronouns (Z8) to convey bloggers’ own ideas and experience when compared with the written BNC (Appendix IX and X).
Another major difference lies in the Unmatched items (Z99), of which the target corpus contains more than the BNC, especially the spoken sampler. Z99 ranks 1st in the overused Semtag list in the comparison with the spoken BNC but only 11th in the comparison with the written one. People usually use simple wording in conversation for easier communication, so Unmatched items are relatively rare in the spoken BNC, leading to a high LL value in the comparison (Appendix IX and X).
In fact, all of the Unmatched items are either the names and brands of beauty products or contractions with the apostrophe missing, such as ‘im’ and ‘theyre’ (Appendix XI). Bloggers like to skip the punctuation for more convenient and faster communication. This kind of minor ‘mistake’ is acceptable in CMC, which is normally regarded as an informal genre. This also explains the underuse of the Grammatical and Discourse bins when the beauty-corpus is compared with the written and spoken BNC respectively (Appendix IX and X).
In conclusion, the blog provides an online platform for Internet users with common interests to exchange information and opinions freely. When compared to standard spoken and written texts, the target corpus contains more items related to the field of beauty. The other linguistic features identified reveal that blog discussion is a genre that is more speech-like than standard writing but less speech-like than standard spoken language.
Total: 1535 words (excluding the supplementary Figures 1-3)
Reference list
Danicki, J. & Johnson, H. (2005). Jack & Hill: A Beauty Blog. Retrieved April 20, 2011 from http://www.jackandhill.net/
Galina (2007). Skin Care and Beauty. Retrieved April 20, 2011 from http://www.skincarebeautyproduct.blogspot.com/
Kelly, K. (2005). Beauty Addict: A Little Obsessed With Makeup. Retrieved April 20, 2011 from http://beautyaddict.blogspot.com/
Mcdermott, K. & Mcdermott, A. (2006). Beaut.ie. Retrieved April 20, 2011 from http://beaut.ie/blog/2011/body-shop-honey-bronze-overview-swatches/
Ooi, Vincent B.Y., Tan, Peter K.W. & Chiang, Andy K.L. (2007). Analyzing personal weblogs in Singapore English: the Wmatrix approach. Studies in Variation, Contacts and Change in English 2: Towards Multimedia in Corpus Studies.
Paris, B. (2007). My Women Stuff. Retrieved April 20, 2011 from http://www.mywomenstuff.com/