Information Organization & Retrieval | Essay Collection.

Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 2.1.2007 Metadata, Dublin Core and OAI

This week’s readings explain how metadata, the Dublin Core system for metadata creation and organization, and the OAI interoperability framework create a structure for organizing and accessing information from institutional repositories. Metadata is most simply defined as data about data. It is a system of built data, much akin to the manner in which a letter relates to a word, which in turn relates to a sentence. Each piece of data can be envisioned as a brick, with the metadata being a virtual wall. Metadata is the information we hold that allows us to find the “who, what, when, where and why” of a specific information source. The Dublin Core is trumpeted as a simplified standard for creating metadata organization and retrieval records for network resources. It works on the basis of a number of identifiers (title, author, source, etc.) that link to a particular piece of information and allow for its retrieval through a number of access points. What makes DC so simple to use is that it does not require all fields to be filled and regards a complete set of fields for a document as expansive rather than as the norm. Its simplified form makes it accessible to users of varying technological and academic backgrounds. Its use of common language, along with translated systems for defining terms, lends a hand to its universality with respect to both an individual’s grammatical background and their native tongue. It is expected that, because the system is so simple to access and use, it can be ever expansive as metadata sets are contributed to the collection from numerous sources. 
While all of this may be of great benefit to the academic world and may allow for new extensibility in interoperability frameworks, for the system to reach peak performance as much data as possible would need to be filled in for every field. For a search to be effective both in locating desired materials and in weeding out undesirable sources, it is a matter of mathematical chance: the more identifiers that can be deployed, the greater the chance of homing in on the information. One thing of interest was that certain identifiers were less commonly used by IRs than others. While this may indicate that some IRs find this information futile for their metadata needs, it does keep the system from operating at optimal functionality. One of the prized aspects of DC and the OAI is how the system is designed to help both nonspecialized and specialized searchers come across the information they desire. One point of contention I have is that if fields such as source are left blank, an aspect of the system that is integral to its desirability is removed. For example, if a casual user has read something in a particular periodical and would like to find what other essays or contributions have been published in that series, a blank source identifier would cause the search to turn up less than accurate results (this assumes that the essay name is filed under title and the periodical under source). DC’s method of organizing and accessing metadata has both great power and ease of use embedded in its structure. The only issue with the system seems to be the human factor. As these blank identifiers are the work of people at IRs, it seems that how much we are willing to put into the system defines how much we will be able to get out of it. 
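The blank-source problem described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `DCRecord` class, the `search_by_field` helper, and the sample records are hypothetical, and only three fields are modelled (Dublin Core names the author element `creator`). It shows that a record whose source field was never filled in cannot be found by a search on that field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DCRecord:
    """A simplified record with three optional Dublin Core-style fields."""
    title: Optional[str] = None
    creator: Optional[str] = None
    source: Optional[str] = None


def search_by_field(records, field, value):
    """Return records whose named field equals value (case-insensitive).

    Records that leave the field blank can never match.
    """
    wanted = value.lower()
    hits = []
    for record in records:
        stored = getattr(record, field)
        if stored is not None and stored.lower() == wanted:
            hits.append(record)
    return hits


# Hypothetical records: two essays from the same periodical,
# but only the first has its source field filled in.
records = [
    DCRecord(title="Essay A", creator="Author One", source="Example Quarterly"),
    DCRecord(title="Essay B", creator="Author Two"),  # source left blank
]

found = search_by_field(records, "source", "Example Quarterly")
print([r.title for r in found])
```

Only "Essay A" is returned, even though both essays appeared in the same periodical; the second record is still reachable through its title or creator, which is the sense in which each filled field acts as an additional access point.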
In addition, the note that a number of XML files were malformed when analyzed illustrates that materials submitted to the system must be “quality controlled,” as they would otherwise be of little use. In the end, DC and the OAI represent a step toward a system for metadata organization and retrieval that has both usability and great power in terms of interoperability. The system caters to a broad range of users and tries its best to accommodate their particular abilities when it comes to information-seeking skills. In effect, this project seems to be a proper thrust in the direction of providing access to, and maintaining organizational control over, a vast and growing number of resources.


Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 2.22.2007 Classification and the NSDL

One thing that interested me greatly in this week’s readings was the concept of the Colon Classification scheme. Until now, I did not know this form of classification existed and was familiar only with the Dewey Decimal and Library of Congress systems of classification. What interested me most about this organizational scheme was the way in which it allows each element to be searched independently of the others or in conjunction with them for more refined search results. I chose to look up more information on this classification system on Wikipedia and was able to have it explained to me in a little more depth. When I looked at Colon Classification, I could not help but be reminded of what we talked about earlier in the course in regard to the Open Archives Initiative and the Dublin Core. It seems to me (and I hope I am correct in assuming this) that these systems all work in a similar manner, allowing classification elements to be searched wholly independently of each other or in conjunction for a more fine-tuned retrieval of information and data. If the National Science Digital Library has its metadata repository filled through the Open Archives Initiative and its metadata harvesting is based on the Dublin Core, it seems to me that searching for documents through the National Science Digital Library would be akin to searching separate colon elements in a Colon Classification system. I searched through the NSDL to get a feel for how the system worked and found that this seems to be the case. The only field I was required to fill in was my subject heading. While not all elements are necessary for cataloging information with OAI and Dublin Core, it does make sense that one element would be necessary to create a search query. 
I tried to search without a subject and just clicked “Graduate” as my grade level to see if I would be given everything that was considered a graduate-level information source regardless of its field. This did not seem to be the case, and I was required to put a subject in the search box. This is just something I found of interest in exploring the NSDL, as I was under the assumption that I would not be required to fill in this field. Another interesting aspect of the NSDL to me was the “NSDL at a Glance” function under “browse”. Apparently, the system of classification is laid out in a visual format that allows a user of pretty much any skill level to wander through the NSDL collection based upon subject classifications. Just as XML creates a tree of a document, the “NSDL at a Glance” page offers us a visual web of the collection’s organizational structure. We can browse through the entire collection based upon a main subject (Mathematics, for instance), then a sub-subject (Applied Mathematics), and onward to a very specific topic (Black Holes, based upon my personal interest). What we have here is a very intricate collection of metadata harvested from a number of sources and accessible through a centralized access point. It seems that the job of the NSDL is to take as much information as it can from an expansive landscape of resource providers and to give users of diverse skill levels and research backgrounds numerous ways to access the information they are searching for, through a variety of interfaces that each help to convey the classification structure of the system itself.


Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 3.1.2007 Fallacies in Classification

As I went further in depth with Colon Classification in my previous essay, I would like to discuss Farradane’s writing on the fallacies of classification in this assignment. One thing I wonder about while reading writing like this is whether arguing the semantics of language usage helps us come to a better understanding of how to classify information or just makes the task more difficult. When reading Farradane I can’t help but think of the linguistic philosophy of Ludwig Wittgenstein, which asserted that language is fundamentally ambiguous and that words do not say what things actually are but only create an approximation of their concepts. According to Wittgenstein, language is nonsensical. As he wrote in the Tractatus, “...he who understands me finally recognizes them [sentences] as senseless”. I can see in the Farradane reading a tie to this philosophical view when he writes, “Language is at best a poor tool for exact expression...”. Farradane seems to take deep umbrage with the idea of “Philosophical Classification” because he believes that this method assumes a “totality of knowledge” which is expressed through main classes that are then subdivided. However, a main class, in his view, is fallacious, as it is only a contemporary focus of interest. He makes the claim that classification is a theory of the structure of knowledge and that by using language to classify, for lack of a better word, “things”, we are doing our job pretty poorly. Farradane does not say that finding a better method of classification, one that works with semantic factoring and unique definitions, is an easy task. What he is calling for is a system for the expression of relations that is stable yet flexible. The question put forth is this: Do we not already have that? 
The Dewey Decimal, Library of Congress, and even search engine systems have held up for quite some time now. They work. People understand them almost implicitly, based upon common sense. Are these systems not already stable enough, and do they not already allow for flexibility in how they are designed? Would creating a new, notation-based system only make information retrieval more difficult, as it would force users to adapt to a method that is wholly foreign to them? Are we not already doing this with Boolean searching and multiple-field searches such as the type we have found on OAIster? I do not know whether I grasp his argument, though I have an inclination that I may in fact understand what he is saying (again, Wittgenstein pops up). Is Farradane proposing a language for human beings or for machines? He writes in closing, “New machines will, however, also have to be devised to perform the tasks adapted to the type of general structure envisaged.” I do understand his argument when it comes to the ambiguity of language (e.g. “Light” as in “Lamp” versus “Light” as in “Electromagnetic Radiation”) and that there is a need to create a system for structuring knowledge which removes as much language distortion as possible. However, I do not understand how such a system can be translated between human (ambiguous) language and machine (formal) language on such a grand scale with maximal stability and flexibility. I feel as if I understand what Farradane is arguing for, and I acknowledge and agree with his ideas, but I fail to understand just how we can form such a system. In the end, does not everything always come back to words and, if so, is everything then fundamentally nonsensical? Can we formalize knowledge when, according to Wittgenstein, this is the case: “The limits of my language are the limits of my mind. All I know is what I have words for”?


Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 3.22.2007 Statistical Classification and User Indexing

I feel as if I may be getting ahead of where we are in the course by bringing up indexing in this paper, but it is something of interest to me and lends a hand to the statistical classification model we are currently examining. Before coming to Austin I used Pandora to find new music for myself (I subsequently ditched Pandora and moved to LastFM due to it having a more user-friendly interface). I must say, after reading the statistical classification materials for this week’s class, I am rather befuddled by the mathematical side of it all (though I find it fascinating). My question is, what is the difference between statistical classification and user indexing (i.e. tagging) in the case of Pandora? While tagging is mostly conceptualized as a user defining his or her own indexing terms for a document, is not what we are doing at Pandora the same thing? When I am given a song and choose between “yes” and “no” as to my desire to hear it again, is this not the same as indexing the song as “love” or “hate”? Also, when Pandora suggests a song to me, does it not rely upon indexing from other users of its system to decide whether or not I may like it? Perhaps I am mistaken in how Pandora works, but I am under the assumption that if I say “yes” to Black Flag’s “Rise Above” it will cycle through its instances of users saying “yes” to this same song and, from similar “yes” instances, choose another song that has also been tagged as “yes” by the majority of these other “yes” users, perhaps something like Minor Threat’s “In My Eyes”. As Pandora has been chosen as the case study for statistical classification, I assume it to be a fine example of such. 
However, I feel as if I may be looking at Pandora from another perspective (perhaps that of a user) and seeing another facet of its organization and retrieval framework. Is it wrong of me to assume that my “indexing” of a song as “yes” or “no” is the first step in the information organization and retrieval process at work in Pandora? As I see it, I must first assign a class to the song (by tagging, indexing, classifying or whichever term one would prefer to call it), and then, based upon my assignment of that class, Pandora “does the work” for me, filtering through the other materials in the class and offering up suggestions. It seems to me that statistical classification is a more efficient or automated form of “social tagging”, as it removes the burden of browsing through the collection from the user and, based upon statistical evaluation, offers up pieces of information that seem to fit the user’s schema. While I must say that this is a very ingenious and fitting way to offer up music to users, it would of course not be a good model for offering up such things as academic materials for paper writing. This form of classification only works well under two conditions. First, statistical classification is helpful when the information the user is searching for leaves room for some degree of chance or luck in locating it (for instance, giving me Black Flag, Minor Threat, The Gorilla Biscuits, The Circle Jerks, and Dead Kennedys as possible music choices would work for me, though other connoisseurs of punk rock, which is a broadly defined term, might differ in their opinions and say “no” to some of my “yes” indexes and “yes” to some of my “no” indexes; in this case we are “taking a chance” on the information that is being passed to us). Second, statistical classification works best and most precisely only the more times you use it and the more times you give feedback. 
I have recently stopped using both Pandora and LastFM simply because I got tired of saying “no” or “yes” to choices; I was waiting for the point when the two would virtually “read my mind” about what I wanted to hear (now I just do the legwork myself and browse through Amazon). While statistical classification has many positive aspects, just like any other classification system it has its drawbacks.
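The process assumed in this essay (say “yes” to one song, look up the other users who said “yes” to it, and suggest whatever else most of them said “yes” to) can be sketched in a few lines. This is only a sketch of that assumption, not a description of how Pandora or LastFM actually work; the `recommend` function, the user names, and the feedback data are all hypothetical.

```python
from collections import Counter


def recommend(feedback, user, liked_song):
    """Suggest the song most often liked by other users who also liked liked_song.

    feedback maps each user name to a dict of {song: True/False} votes.
    Returns None when no other user's likes yield a candidate.
    """
    already_rated = feedback.get(user, {})
    votes = Counter()
    for other, ratings in feedback.items():
        if other == user or not ratings.get(liked_song, False):
            continue  # skip the user themselves and anyone who did not say "yes"
        for song, liked in ratings.items():
            if liked and song != liked_song and song not in already_rated:
                votes[song] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical feedback; True means "yes", False means "no".
feedback = {
    "me": {"Rise Above": True},
    "user_a": {"Rise Above": True, "In My Eyes": True},
    "user_b": {"Rise Above": True, "In My Eyes": True, "Other Song": True},
    "user_c": {"Rise Above": False, "Other Song": True},
}

print(recommend(feedback, "me", "Rise Above"))
```

With this data the suggestion is "In My Eyes", since two of the users who liked "Rise Above" also liked it. The sketch also shows why feedback volume matters: with only one or two votes per song, a single user's taste decides the outcome.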


Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 3.29.2007 Sentiment Classification

It seems that in sentiment classification we are trying to find a way of understanding language statements as being “negative” or “positive” for purposes of classification. In this week’s article we were shown that a great deal of research has been done on sentiment classification of movie reviews by Pang, Lee, and Vaithyanathan. As the team stated, movie reviews were chosen as the sample set for such experiments because there was such a large collection of them to be found online and they tended to have their own user-supplied rating indicators. The difficulty with sentiment classification lies in the ambiguity of words and the fact that sentiment can be expressed without an explicit word expressing positive or negative feelings toward a document. The article used the example, “How could anyone sit through this movie?” to illustrate that, despite the lack of a negative word such as bad, horrible, or sucks, the review was still unfavorable toward the movie but would require human knowledge of language games to understand. In sentiment classification it seems as if we have the problem of language as characterized by Wittgenstein’s language philosophy coming up all over again. Before, we found it to be an issue in regard to topical classification; here we are seeing that the ambiguity of language is also a problem in terms of sentiment classification. In their experiments, it was crucial that word ambiguity be minimized, and such things as adding tags (i.e. for the word “not”, so that “good” and “not good” would not both show up as positive attributes in Naïve Bayes attempts at classification) added an extra degree of work to creating their sentiment classification model. What is most interesting is the outcome. 
If you look at Figure 3 in the article (Average three-fold cross-validation accuracies, in percent), you will notice that Support Vector Machine models outperformed both Naïve Bayes and Maximum Entropy models. However, despite the fact that this form of sentiment classification is, for the most part, more accurate, its level of accuracy still leaves something to be desired. At best, Support Vector Machines were only 82.9% accurate in properly classifying reviews based upon their sentiment. While this number is well above average in terms of its accuracy, there is still enough margin of error to leave a person wary of how well the system works. If we think of this percentage in terms of 1000 records, we find that about 171 records will be misclassified. In classification, the balance we need to find is between how many correct results will make us happy with a model and how many incorrect results we are willing to accept.
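The tagging step mentioned earlier in this essay, which keeps “good” and “not good” from counting as the same feature, can be sketched as follows. The rule implemented here marks every word between a negation word and the next punctuation mark; the `tag_negation` function, the list of negation words, and the sample sentence are illustrative assumptions rather than the article’s own code or data.

```python
import re

# Assumed, illustrative list of negation words; not taken from the article.
NEGATIONS = {"not", "no", "never", "isn't", "didn't", "don't", "wasn't"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}


def tag_negation(text):
    """Prefix NOT_ to each word between a negation word and the next punctuation mark."""
    tokens = re.findall(r"[\w']+|[.,!?;:]", text.lower())
    tagged = []
    negating = False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False  # punctuation ends the negated span
            tagged.append(token)
        elif negating:
            tagged.append("NOT_" + token)
        else:
            tagged.append(token)
            if token in NEGATIONS:
                negating = True
    return tagged


print(tag_negation("The plot was good. The acting was not good at all, sadly."))
```

After tagging, the first sentence contributes the feature `good` and the second contributes `NOT_good`, so a word-counting classifier such as Naïve Bayes no longer sees the two as the same piece of evidence.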


Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 4.5.2007 On del.icio.us

As a number of my recent papers have been on the readings, and this week’s set of readings has already been discussed in previous essays, I would like to take the chance to put some focus on one of our practical models of information organization. I have, since the beginning of this course, been using del.icio.us as my primary homepage. I do this because it gives me a personalized way of organizing and retrieving information that I find pertinent to my life. What I need can be placed where I wish to place it, and this privilege allows me much ease and efficiency in retrieving information with a minimal expenditure of time. While this is my personal case, I am very interested in the notions of posited use and actual use in terms of del.icio.us. For example, del.icio.us is trumpeted as a social bookmarking tool, a way in which individuals can share information and disseminate it amongst a broad audience. It appears to me that the creators of del.icio.us have a set frame of mind in regard to how their service will be utilized by the public. My question is, is it naïve of us, as information organization and retrieval professionals, to assume that a system we create will be used as we explicitly intended? While this may seem a broad and philosophical question that has little to do with Information Studies, I feel it is of critical importance in the contemporary era of technological advancement. Personally, I do not wander through the digital labyrinth that is del.icio.us. I use it for what appear to be, even in my own opinion, quite selfish ends. I care little about what others are looking at and do not care for them to partake in my interests. I use this technology solely for the purpose of simplifying my life. 
When you take this perspective and try to mesh it with what appears to be the idealistic dream of the del.icio.us regime, you find a great disparity between what was supposed to happen and what is actually happening. While this may seem trivial, I think such a statement gets at the heart of the ethical dilemma of what we are doing here in the Information school. Whenever we create a model or a system for providing individuals with information, the second that model is turned over to the user group for utilization, it is no longer ours. The early dreams of the internet consisted of fanciful notions of a place where individuals from diverse backgrounds living in separate worlds would come together to communicate and share information. Today, when looking at the current form of the internet, we see something quite different. We have been overrun with spam, advertisements, junk, trivial blogs about nothingness and cyber smut. My point is this: despite any notions we may have concerning what we are doing in organizing and providing access to information, we must always be prepared for the fact that the idealistic dreams we have of how our creation will be used by others may take a turn for the worse, down the dark alley we wished to avoid instead of that country road we so greatly desired.


Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 4.12.2007 Information Retrieval and the General Public

“In reality, almost no data is truly ‘unstructured’. This is definitely true of all text data if one considers the latent linguistic structure of human languages.” (Manning and Schütze, p. 1). Monday through Friday I work at the Texas State Law Library. I help, for the most part, users from the general public, would-be lawyers, and partners at small firms who are looking for legal information on a range of subjects. The plain and simple fact is that I myself have little understanding of how one goes about finding such information. In Manning and Schütze’s yet-to-be-published manuscript they call attention to the Westlaw information retrieval system and its use of both Boolean and Natural Language searching. I will argue that, while they say there is a degree of success with these information retrieval systems at Westlaw (the term “professional” is used all too often in this article when referring to users), the system and the dominant forms of information retrieval in the public (and even lawyer) sphere do not coalesce as well as may be thought. My issue with information retrieval, particularly in terms of law documents, has to do with the matter of context and procedure. Westlaw is a very complicated information retrieval system. For the most part, it is best used by law librarians and trained professionals and requires intricate knowledge of the ways in which law documents interact with one another. I feel that, in their analysis, Manning and Schütze do not take into account the nature of the beast, user needs, and social reality when discussing the success of search methods on Westlaw. 
Documents regarding law and policy are unlike most other documents in that they are cross-platform and require a great deal of networking between multiple documents to find the pertinent and proper information being sought. Certain procedures, such as shepardizing, are particular to the law profession and add a spin to the task of information retrieval. Westlaw does use a stoplight icon to inform users of the status of a case, but it does little to make the intricacies of law information retrieval less of a mystery to novice users. On many occasions I have searched for information (for instance, §42.12 of the Texas Code of Criminal Procedure from 1976) on Westlaw and other law information repositories and have come up with nothing pertinent. The problem is actually quite simple: the nature of law is such that it changes and grows and takes on a life of its own, and trying to pin it down and find a particular piece of information about it is a daunting task. While Boolean and Natural Language searches may work for the more professional users of Westlaw, it seems that such methods will bring hardly any results, if not negative ones, for the member of the general public who is not accustomed to the intricacies of the system. If this issue is pertinent to law, something that has been in existence for ages, what will happen in the digital realm of the World Wide Web in the future? Won’t this living library of metamorphosing documents also pose problems as information changes? While Manning and Schütze assert that human language is structured, it is not hard to see that the fundamental ambiguity of it all (and here I think of a patron who was searching for cases involving Panasonic VCRs with the term “v.c.r.”) is causing a change in linguistic structure. 
While we may believe there are “rules” to language in terms of grammar, we will notice that, with the advent of regional dialects, shorthand, slang and the creation of new words in everyday conversation, these rules are starting to go out the window rapidly when we look at the way in which language is used in actuality. For information retrieval specialists it is imperative that the changing climate of human language (how many times have I heard that “cataloging” is dying since coming to the iSchool?) be taken into account when systems for retrieval are being developed for use by the world’s population.


Pasquale J. Festa INF384C – Organizing and Providing Access to Information Prof. Efron 4.26.2007 Smart Mobs, Social Computing and Net Neutrality

I recently gave a presentation on my paper topic for INF390N (Communication, Law and Power) with Professor Stein from the Radio, Television, and Film School. I took for my topic the current dilemma being raised around the issue of Net Neutrality. After reading over this week’s assignment, I feel that this paper may be a pertinent place for me to let my interests mingle. It seems that the Institute for the Future really does take into account the fact that what is happening in contemporary society will have consequences for the years and societies that come after the present day. In reading over this paper, I found it interesting that the predominant focus was the capacity or possibility for social progress as characterized by emerging technological trends. Since the advent of the Internet, the rise of mobile telephony and the current merging of phone, video, image, text, and information resources, social groups have had a degree of power in terms of organization and action that we had not seen before. If the civil rights movement of Martin Luther King, Jr. were happening today, would it reach its ends any faster due to these socio-technological advancements? In Rheingold’s essay it is written, “It is likely that these early instances of collective action are signs of a larger future social and organizational upheaval” (p. 18). It is noted that humans are competitive creatures, but through the use of new technologies there has been a rise in so-called “Smart Mobs” that points to a level of cooperation between individuals that is catalyzed by mobile social computing. With the rise of mobile social computing also comes the rise of knowledge collectives. 
Like-minded individuals, regardless of physical and geographic distance, can now come together to share individual and collective experiences in a way that allows for the emergence and solidification of new sub-cultures. It seems, due to the new technologies and their inherent social facets, that we are bound to find ourselves in a new Renaissance era. Or are we? Currently, the battle over Net Neutrality is being waged in corporate offices and on the legislative floor. The possibility of the Internet being turned into a glorified boob tube is on the horizon. Net Neutrality advocates such as Lawrence Lessig and Timothy Wu are making arguments pointing to the fact that, at some point in the future, we may very well find ourselves roaming around a completely different Internet. Issues such as blockage, data management and proprietary content are of great concern to those advocating Internet freedom. Already in Canada, a society usually regarded as socially liberal by American standards, we find the case of Telus Telecommunications blocking its customers’ access to the sites of labor organizations that were in dispute with the company. Is it the case, perhaps, that the days of the much-trumpeted social network utopia of the Internet have already reached their zenith? Are we to go nowhere but down from here? The fear is real and the crisis is imminent. If we are to achieve the level of socio-technological advancement that the Institute for the Future strives to grasp, it is imperative that society as a whole cooperate on a level never before known for the buttressing and solidification of information access and retrieval in the digital world.

