TY - CONF TI - Improving Web Search Ranking by Incorporating User Behavior Information AU - Agichtein, Eugene AU - Brill, Eric AU - Dumais, Susan T3 - SIGIR '06 AB - We show that incorporating user behavior data can significantly improve the ordering of top results in a real web search setting. We examine alternatives for incorporating feedback into the ranking process and explore the contributions of user feedback compared to other common web search features. We report results of a large-scale evaluation over 3,000 queries and 12 million user interactions with a popular web search engine. We show that incorporating implicit feedback can augment other features, improving the accuracy of a competitive web search ranking algorithm by as much as 31% relative to the original performance. C1 - New York, NY, USA C3 - Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval DA - 2006/// PY - 2006 DO - 10.1145/1148170.1148177 DP - ACM Digital Library SP - 19 EP - 26 LA - en PB - ACM SN - 978-1-59593-369-0 UR - http://doi.acm.org/10.1145/1148170.1148177 Y2 - 2019/01/18/18:14:12 ER - TY - CONF TI - Diversifying Search Results AU - Agrawal, Rakesh AU - Gollapudi, Sreenivas AU - Halverson, Alan AU - Ieong, Samuel T3 - WSDM '09 AB - We study the problem of answering ambiguous web queries in a setting where there exists a taxonomy of information, and where both queries and documents may belong to more than one category according to this taxonomy. We present a systematic approach to diversifying results that aims to minimize the risk of dissatisfaction of the average user. We propose an algorithm that well approximates this objective in general, and is provably optimal for a natural special case. Furthermore, we generalize several classical IR metrics, including NDCG, MRR, and MAP, to explicitly account for the value of diversification. We demonstrate empirically that our algorithm scores higher in these generalized metrics compared to results produced by commercial search engines. C1 - New York, NY, USA C3 - Proceedings of the Second ACM International Conference on Web Search and Data Mining DA - 2009/// PY - 2009 DO - 10.1145/1498759.1498766 DP - ACM Digital Library SP - 5 EP - 14 LA - en PB - ACM SN - 978-1-60558-390-7 UR - http://doi.acm.org/10.1145/1498759.1498766 Y2 - 2019/01/27/21:41:12 ER - TY - BOOK TI - Modern Information Retrieval AU - Baeza-Yates, Ricardo AU - Ribeiro-Neto, Berthier AB - This is a rigorous and complete textbook for a first course on information retrieval from the computer science perspective. It provides an up-to-date, student-oriented treatment of information retrieval including extensive coverage of new topics such as web retrieval, web crawling, open source search engines and user interfaces. From parsing to indexing, clustering to classification, retrieval to ranking, and user feedback to retrieval evaluation, all of the most important concepts are carefully introduced and exemplified. The contents and structure of the book have been carefully designed by the two main authors, with individual contributions coming from leading international authorities in the field, including Yoelle Maarek, Senior Director of Yahoo! Research Israel; Dulce Ponceleón, IBM Research; and Malcolm Slaney, Yahoo Research USA.
This completely reorganized, revised and enlarged second edition of Modern Information Retrieval contains many new chapters and double the number of pages and bibliographic references of the first edition, and a companion website www.mir2ed.org with teaching material. It will prove invaluable to students, professors, researchers, practitioners, and scholars of this fascinating field of information retrieval. DA - 1999/// PY - 1999 DP - Google Books SP - 548 LA - en PB - ACM Press SN - 978-0-201-39829-8 ER - TY - BOOK TI - Information Retrieval: Implementing and Evaluating Search Engines AU - Büttcher, Stefan AU - Clarke, Charles L. A. AU - Cormack, Gordon V. AB - Information retrieval is the foundation for modern search engines. This textbook offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Wumpus -- a multiuser open-source information retrieval system developed by one of the authors and available online -- provides model implementations and a basis for student work. The modular structure of the book allows instructors to use it in a variety of graduate-level courses, including courses taught from a database systems perspective, traditional information retrieval courses with a focus on IR theory, and courses covering the basics of Web retrieval. In addition to its classroom use, Information Retrieval will be a valuable reference for professionals in computer science, computer engineering, and software engineering. DA - 2010/// PY - 2010 SP - 633 LA - en PB - MIT Press SN - 978-0-262-52887-0 ST - Information Retrieval ER - TY - CONF TI - A Comparative Analysis of Cascade Measures for Novelty and Diversity AU - Clarke, Charles L.A. AU - Craswell, Nick AU - Soboroff, Ian AU - Ashkan, Azin T3 - WSDM '11 AB - Traditional editorial effectiveness measures, such as nDCG, remain standard for Web search evaluation. Unfortunately, these traditional measures can inappropriately reward redundant information and can fail to reflect the broad range of user needs that can underlie a Web query. To address these deficiencies, several researchers have recently proposed effectiveness measures for novelty and diversity. Many of these measures are based on simple cascade models of user behavior, which operate by considering the relationship between successive elements of a result list. The properties of these measures are still poorly understood, and it is not clear from prior research that they work as intended. In this paper we examine the properties and performance of cascade measures with the goal of validating them as tools for measuring effectiveness. We explore their commonalities and differences, placing them in a unified framework; we discuss their theoretical difficulties and limitations, and compare the measures experimentally, contrasting them against traditional measures and against other approaches to measuring novelty. Data collected by the TREC 2009 Web Track is used as the basis for our experimental comparison. Our results indicate that these measures reward systems that achieve a balance between novelty and overall precision in their result lists, as intended. Nonetheless, other measures provide insights not captured by the cascade measures, and we suggest that future evaluation efforts continue to report a variety of measures.
C1 - New York, NY, USA C3 - Proceedings of the Fourth ACM International Conference on Web Search and Data Mining DA - 2011/// PY - 2011 DO - 10.1145/1935826.1935847 DP - ACM Digital Library SP - 75 EP - 84 LA - en PB - ACM SN - 978-1-4503-0493-1 UR - http://doi.acm.org/10.1145/1935826.1935847 Y2 - 2019/01/27/21:34:38 ER - TY - CONF TI - Novelty and Diversity in Information Retrieval Evaluation AU - Clarke, Charles L.A. AU - Kolla, Maheedhar AU - Cormack, Gordon V. AU - Vechtomova, Olga AU - Ashkan, Azin AU - Büttcher, Stefan AU - MacKinnon, Ian T3 - SIGIR '08 AB - Evaluation measures act as objective functions to be optimized by information retrieval systems. Such objective functions must accurately reflect user requirements, particularly when tuning IR systems and learning ranking functions. Ambiguity in queries and redundancy in retrieved documents are poorly reflected by current evaluation measures. In this paper, we present a framework for evaluation that systematically rewards novelty and diversity. We develop this framework into a specific evaluation measure, based on cumulative gain. We demonstrate the feasibility of our approach using a test collection based on the TREC question answering track. C1 - New York, NY, USA C3 - Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval DA - 2008/// PY - 2008 DO - 10.1145/1390334.1390446 DP - ACM Digital Library SP - 659 EP - 666 LA - en PB - ACM SN - 978-1-60558-164-4 UR - http://doi.acm.org/10.1145/1390334.1390446 Y2 - 2019/01/27/19:15:02 ER - TY - JOUR TI - Evaluating Implicit Measures to Improve Web Search AU - Fox, Steve AU - Karnawat, Kuldeep AU - Mydland, Mark AU - Dumais, Susan AU - White, Thomas T2 - ACM Trans. Inf. Syst. AB - Of growing interest in the area of improving the search experience is the collection of implicit user behavior measures (implicit measures) as indications of user interest and user satisfaction. Rather than having to submit explicit user feedback, which can be costly in time and resources and alter the pattern of use within the search experience, some research has explored the collection of implicit measures as an efficient and useful alternative to collecting explicit measures of interest from users. This research article describes a recent study with two main objectives. The first was to test whether there is an association between explicit ratings of user satisfaction and implicit measures of user interest. The second was to understand what implicit measures were most strongly associated with user satisfaction. The domain of interest was Web search. We developed an instrumented browser to collect a variety of measures of user activity and also to ask for explicit judgments of the relevance of individual pages visited and entire search sessions. The data was collected in a workplace setting to improve the generalizability of the results. Results were analyzed using traditional methods (e.g., Bayesian modeling and decision trees) as well as a new usage behavior pattern analysis (“gene analysis”). We found that there was an association between implicit measures of user activity and the user's explicit satisfaction ratings. The best models for individual pages combined clickthrough, time spent on the search result page, and how a user exited a result or ended a search session (exit type/end action). Behavioral patterns (through the gene analysis) can also be used to predict user satisfaction for search sessions.
DA - 2005/04// PY - 2005 DO - 10.1145/1059981.1059982 DP - ACM Digital Library VL - 23 IS - 2 SP - 147 EP - 168 LA - en SN - 1046-8188 UR - http://doi.acm.org/10.1145/1059981.1059982 Y2 - 2019/01/18/19:48:10 ER - TY - CONF TI - An Axiomatic Approach for Result Diversification AU - Gollapudi, Sreenivas AU - Sharma, Aneesh T3 - WWW '09 AB - Understanding user intent is key to designing an effective ranking system in a search engine. In the absence of any explicit knowledge of user intent, search engines want to diversify results to improve user satisfaction. In such a setting, the probability ranking principle-based approach of presenting the most relevant results on top can be sub-optimal, and hence the search engine would like to trade-off relevance for diversity in the results. In analogy to prior work on ranking and clustering systems, we use the axiomatic approach to characterize and design diversification systems. We develop a set of natural axioms that a diversification system is expected to satisfy, and show that no diversification function can satisfy all the axioms simultaneously. We illustrate the use of the axiomatic framework by providing three example diversification objectives that satisfy different subsets of the axioms. We also uncover a rich link to the facility dispersion problem that results in algorithms for a number of diversification objectives. Finally, we propose an evaluation methodology to characterize the objectives and the underlying axioms. We conduct a large scale evaluation of our objectives based on two data sets: a data set derived from the Wikipedia disambiguation pages and a product database. C1 - New York, NY, USA C3 - Proceedings of the 18th International Conference on World Wide Web DA - 2009/// PY - 2009 DO - 10.1145/1526709.1526761 DP - ACM Digital Library SP - 381 EP - 390 LA - en PB - ACM SN - 978-1-60558-487-4 UR - http://doi.acm.org/10.1145/1526709.1526761 Y2 - 2019/01/27/22:06:28 ER - TY - JOUR TI - Search log analysis: What it is, what's been done, how to do it AU - Jansen, Bernard J. T2 - Library & Information Science Research AB - The use of data stored in transaction logs of Web search engines, Intranets, and Web sites can provide valuable insight into understanding the information-searching process of online searchers. This understanding can enlighten information system design, interface development, and devising the information architecture for content collections. This article presents a review and foundation for conducting Web search transaction log analysis. A methodology is outlined consisting of three stages, which are collection, preparation, and analysis. The three stages of the methodology are presented in detail with discussions of goals, metrics, and processes at each stage. Critical terms in transaction log analysis for Web searching are defined. The strengths and limitations of transaction log analysis as a research method are presented. An application to log client-side interactions that supplements transaction logs is reported on, and the application is made available for use by the research community. Suggestions are provided on ways to leverage the strengths of, while addressing the limitations of, transaction log analysis for Web-searching research. Finally, a complete flat text transaction log from a commercial search engine is available as supplementary material with this manuscript. 
DA - 2006/09/01/ PY - 2006 DO - 10.1016/j.lisr.2006.06.005 DP - ScienceDirect VL - 28 IS - 3 SP - 407 EP - 432 J2 - Library & Information Science Research LA - en SN - 0740-8188 ST - Search log analysis UR - http://www.sciencedirect.com/science/article/pii/S0740818806000673 Y2 - 2018/03/20/23:18:18 ER - TY - JOUR TI - Determining the informational, navigational, and transactional intent of Web queries AU - Jansen, Bernard J. AU - Booth, Danielle L. AU - Spink, Amanda T2 - Information Processing & Management AB - In this paper, we define and present a comprehensive classification of user intent for Web searching. The classification consists of three hierarchical levels of informational, navigational, and transactional intent. After deriving attributes of each, we then developed a software application that automatically classified queries using a Web search engine log of over a million and a half queries submitted by several hundred thousand users. Our findings show that more than 80% of Web queries are informational in nature, with about 10% each being navigational and transactional. In order to validate the accuracy of our algorithm, we manually coded 400 queries and compared the results from this manual classification to the results determined by the automated method. This comparison showed that the automatic classification has an accuracy of 74%. Of the remaining 25% of the queries, the user intent is vague or multi-faceted, pointing to the need for probabilistic classification. We discuss how search engines can use knowledge of user intent to provide more targeted and relevant results in Web searching. DA - 2008/05/01/ PY - 2008 DO - 10.1016/j.ipm.2007.07.015 DP - ScienceDirect VL - 44 IS - 3 SP - 1251 EP - 1266 J2 - Information Processing & Management LA - en SN - 0306-4573 UR - http://www.sciencedirect.com/science/article/pii/S030645730700163X Y2 - 2018/03/28/23:33:46 ER - TY - JOUR TI - Real life, real users, and real needs: a study and analysis of user queries on the web AU - Jansen, Bernard J. AU - Spink, Amanda AU - Saracevic, Tefko T2 - Information Processing & Management AB - We analyzed transaction logs containing 51,473 queries posed by 18,113 users of Excite, a major Internet search service. We provide data on: (i) sessions — changes in queries during a session, number of pages viewed, and use of relevance feedback; (ii) queries — the number of search terms, and the use of logic and modifiers; and (iii) terms — their rank/frequency distribution and the most highly used search terms. We then shift the focus of analysis from the query to the user to gain insight into the characteristics of the Web user. With these characteristics as a basis, we then conducted a failure analysis, identifying trends among user mistakes. We conclude with a summary of findings and a discussion of the implications of these findings. DA - 2000/03/01/ PY - 2000 DO - 10.1016/S0306-4573(99)00056-4 DP - ScienceDirect VL - 36 IS - 2 SP - 207 EP - 227 J2 - Information Processing & Management LA - en SN - 0306-4573 ST - Real life, real users, and real needs UR - http://www.sciencedirect.com/science/article/pii/S0306457399000564 Y2 - 2019/01/27/22:52:32 ER - TY - CONF TI - Optimizing search engines using clickthrough data AU - Joachims, Thorsten T3 - KDD '02 AB - This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data.
Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples. C1 - Edmonton, Alberta, Canada C3 - Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining DA - 2002/07/23/ PY - 2002 DO - 10.1145/775047.775067 DP - dl.acm.org SP - 133 EP - 142 LA - en PB - ACM SN - 978-1-58113-567-1 UR - http://dl.acm.org/citation.cfm?id=775047.775067 Y2 - 2019/01/18/20:54:23 ER - TY - CONF TI - Accurately Interpreting Clickthrough Data As Implicit Feedback AU - Joachims, Thorsten AU - Granka, Laura AU - Pan, Bing AU - Hembrooke, Helene AU - Gay, Geri T3 - SIGIR '05 AB - This paper examines the reliability of implicit feedback generated from clickthrough data in WWW search. Analyzing the users' decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased. While this makes the interpretation of clicks as absolute relevance judgments difficult, we show that relative preferences derived from clicks are reasonably accurate on average. C3 - Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval DA - 2005/// PY - 2005 DP - ACM Digital Library SP - 154 EP - 161 LA - en Y2 - 2019/01/18/20:45:44 ER - TY - JOUR TI - Learning to Rank for Information Retrieval AU - Liu, Tie-Yan T2 - Foundations and Trends® in Information Retrieval AB - Learning to rank for Information Retrieval (IR) is a task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance. Many IR problems are by nature ranking problems, and many IR technologies can be potentially enhanced by using learning-to-rank techniques. The objective of this tutorial is to give an introduction to this research direction. Specifically, the existing learning-to-rank algorithms are reviewed and categorized into three approaches: the pointwise, pairwise, and listwise approaches. The advantages and disadvantages with each approach are analyzed, and the relationships between the loss functions used in these approaches and IR evaluation measures are discussed.
Then the empirical evaluations on typical learning-to-rank methods are shown, with the LETOR collection as a benchmark dataset, which seems to suggest that the listwise approach is the most effective one among all the approaches. After that, a statistical ranking theory is introduced, which can describe different learning-to-rank algorithms, and be used to analyze their query-level generalization abilities. At the end of the tutorial, we provide a summary and discuss potential future work on learning to rank. DA - 2009/06/27/ PY - 2009 DO - 10.1561/1500000016 DP - www.nowpublishers.com VL - 3 IS - 3 SP - 225 EP - 331 J2 - INR LA - en SN - 1554-0669, 1554-0677 UR - https://www.nowpublishers.com/article/Details/INR-016 Y2 - 2019/01/18/20:05:20 ER - TY - BOOK TI - An Introduction to Information Retrieval AU - Manning, Christopher D. AU - Raghavan, Prabhakar AU - Schütze, Hinrich CY - Cambridge, England DA - 2009/// PY - 2009 LA - en PB - Cambridge University Press UR - http://www.informationretrieval.org/ ER - TY - JOUR TI - On Relevance, Probabilistic Indexing and Information Retrieval AU - Maron, M. E. AU - Kuhns, J. L. T2 - Journal of the ACM AB - This paper reports on a novel technique for literature indexing and searching in a mechanized library system. The notion of relevance is taken as the key concept in the theory of information retrieval and a comparative concept of relevance is explicated in terms of the theory of probability. The resulting technique called “Probabilistic Indexing,” allows a computing machine, given a request for information, to make a statistical inference and derive a number (called the “relevance number”) for each document, which is a measure of the probability that the document will satisfy the given request. The result of a search is an ordered list of those documents which satisfy the request ranked according to their probable relevance. The paper goes on to show that whereas in a conventional library system the cross-referencing (“see” and “see also”) is based solely on the “semantical closeness” between index terms, statistical measures of closeness between index terms can be defined and computed. Thus, given an arbitrary request consisting of one (or many) index term(s), a machine can elaborate on it to increase the probability of selecting relevant documents that would not otherwise have been selected. Finally, the paper suggests an interpretation of the whole library problem as one where the request is considered as a clue on the basis of which the library system makes a concatenated statistical inference in order to provide as an output an ordered list of those documents which most probably satisfy the information needs of the user. DA - 1960/07// PY - 1960 DO - 10.1145/321033.321035 DP - ACM Digital Library VL - 7 IS - 3 SP - 216 EP - 244 LA - en SN - 0004-5411 UR - http://doi.acm.org/10.1145/321033.321035 Y2 - 2019/01/27/23:02:51 ER - TY - JOUR TI - Redundancy, diversity and interdependent document relevance AU - Radlinski, Filip AU - Bennett, Paul N. AU - Carterette, Ben AU - Joachims, Thorsten T2 - ACM SIGIR Forum AB - The goal of the Redundancy, Diversity, and Interdependent Document Relevance workshop was to explore how ranking, performance assessment and learning to rank can move beyond the assumption that the relevance of a document is independent of other documents.
In particular, the workshop focussed on three themes: the effect of redundancy on information retrieval utility (for example, minimizing the wasted effort of users who must skip redundant information), the role of diversity (for example, for mitigating the risk of misinterpreting ambiguous queries), and algorithms for set-level optimization (where the quality of a set of retrieved documents is not simply the sum of its parts). This workshop built directly upon the Beyond Binary Relevance: Preferences, Diversity and Set-Level Judgments workshop at SIGIR 2008 [3], shifting focus to address the questions left open by the discussions and results from that workshop. As such, it was the first workshop to explicitly focus on the related research challenges of redundancy, diversity, and interdependent relevance – all of which require novel performance measures, learning methods, and evaluation techniques. The workshop program committee consisted of 15 researchers from academia and industry, with experience in IR evaluation, machine learning, and IR algorithmic design. Over 40 people attended the workshop. This report aims to summarize the workshop, and also to systematize common themes and key concepts so as to encourage research in the three workshop themes. It contains our attempt to summarize and organize the topics that came up in presentations as well as in discussions, pulling out common elements. Many audience members contributed, yet due to the free-flowing discussion, attributing all the observations to particular audience members is unfortunately impossible. Not all audience members would necessarily agree with the views presented, but we do attempt to present a consensus view as far as possible. DA - 2009/12/14/ PY - 2009 DO - 10.1145/1670564.1670572 DP - dl.acm.org VL - 43 IS - 2 SP - 46 EP - 52 LA - en SN - 0163-5840 UR - http://dl.acm.org/citation.cfm?id=1670564.1670572 Y2 - 2019/01/27/19:48:40 ER - TY - JOUR TI - The Probabilistic Relevance Framework: BM25 and Beyond AU - Robertson, Stephen AU - Zaragoza, Hugo T2 - Foundations and Trends® in Information Retrieval AB - The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970–1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters. DA - 2009/12/17/ PY - 2009 DO - 10.1561/1500000019 DP - www.nowpublishers.com VL - 3 IS - 4 SP - 333 EP - 389 J2 - INR LA - en SN - 1554-0669, 1554-0677 ST - The Probabilistic Relevance Framework UR - https://www.nowpublishers.com/article/Details/INR-019 Y2 - 2019/01/18/20:09:44 ER - TY - JOUR TI - A Vector Space Model for Automatic Indexing AU - Salton, G. AU - Wong, A. AU - Yang, C. S. T2 - Commun. 
ACM AB - In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model. DA - 1975/11// PY - 1975 DO - 10.1145/361219.361220 DP - ACM Digital Library VL - 18 IS - 11 SP - 613 EP - 620 LA - en SN - 0001-0782 UR - http://doi.acm.org/10.1145/361219.361220 Y2 - 2017/11/08/22:43:01 ER - TY - JOUR TI - Term-weighting approaches in automatic text retrieval AU - Salton, Gerard AU - Buckley, Christopher T2 - Information Processing & Management AB - The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term-weighting systems. This article summarizes the insights gained in automatic term weighting, and provides baseline single-term-indexing models with which other more elaborate content analysis procedures can be compared. DA - 1988/01/01/ PY - 1988 DO - 10.1016/0306-4573(88)90021-0 DP - ScienceDirect VL - 24 IS - 5 SP - 513 EP - 523 J2 - Information Processing & Management LA - en SN - 0306-4573 UR - http://www.sciencedirect.com/science/article/pii/0306457388900210 Y2 - 2016/10/15/20:58:32 ER - TY - JOUR TI - Analysis of a Very Large Web Search Engine Query Log AU - Silverstein, Craig AU - Marais, Hannes AU - Henzinger, Monika AU - Moricz, Michael T2 - SIGIR Forum AB - In this paper we present an analysis of an AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents almost 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. We also present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques may not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such. DA - 1999/09// PY - 1999 DO - 10.1145/331403.331405 DP - ACM Digital Library VL - 33 IS - 1 SP - 6 EP - 12 LA - en SN - 0163-5840 UR - http://doi.acm.org/10.1145/331403.331405 Y2 - 2018/03/29/00:38:59 ER - TY - CONF TI - On Query Result Diversification AU - Vieira, Marcos R. AU - Razente, Humberto L. AU - Barioni, Maria C. N.
AU - Hadjieleftheriou, Marios AU - Srivastava, Divesh AU - Traina, Caetano AU - Tsotras, Vassilis J. T3 - ICDE '11 AB - In this paper we describe a general framework for evaluation and optimization of methods for diversifying query results. In these methods, an initial ranking candidate set produced by a query is used to construct a result set, where elements are ranked with respect to relevance and diversity features, i.e., the retrieved elements should be as relevant as possible to the query, and, at the same time, the result set should be as diverse as possible. While addressing relevance is relatively simple and has been heavily studied, diversity is a harder problem to solve. One major contribution of this paper is that, using the above framework, we adapt, implement and evaluate several existing methods for diversifying query results. We also propose two new approaches, namely the Greedy with Marginal Contribution (GMC) and the Greedy Randomized with Neighborhood Expansion (GNE) methods. Another major contribution of this paper is that we present the first thorough experimental evaluation of the various diversification techniques implemented in a common framework. We examine the methods' performance with respect to precision, running time and quality of the result. Our experimental results show that while the proposed methods have higher running times, they achieve precision very close to the optimal, while also providing the best result quality. While GMC is deterministic, the randomized approach (GNE) can achieve better result quality if the user is willing to trade off running time. C1 - Washington, DC, USA C3 - Proceedings of the 2011 IEEE 27th International Conference on Data Engineering DA - 2011/// PY - 2011 DO - 10.1109/ICDE.2011.5767846 DP - ACM Digital Library SP - 1163 EP - 1174 LA - en PB - IEEE Computer Society SN - 978-1-4244-8959-6 UR - http://dx.doi.org/10.1109/ICDE.2011.5767846 Y2 - 2019/01/27/22:10:26 ER - TY - CONF TI - Faceted Metadata for Image Search and Browsing AU - Yee, Ka-Ping AU - Swearingen, Kirsten AU - Li, Kevin AU - Hearst, Marti T3 - CHI '03 AB - There are currently two dominant interface types for searching and browsing large image collections: keyword-based search, and searching by overall similarity to sample images. We present an alternative based on enabling users to navigate along conceptual dimensions that describe the images. The interface makes use of hierarchical faceted metadata and dynamically generated query previews. A usability study, in which 32 art history students explored a collection of 35,000 fine arts images, compares this approach to a standard image search interface. Despite the unfamiliarity and power of the interface (attributes that often lead to rejection of new search interfaces), the study results show that 90% of the participants preferred the metadata approach overall, 97% said that it helped them learn more about the collection, 75% found it more flexible, and 72% found it easier to use than a standard baseline system. These results indicate that a category-based approach is a successful way to provide access to image collections. C1 - New York, NY, USA C3 - Proceedings of the SIGCHI Conference on Human Factors in Computing Systems DA - 2003/// PY - 2003 DO - 10.1145/642611.642681 DP - ACM Digital Library SP - 401 EP - 408 LA - en PB - ACM SN - 978-1-58113-630-2 UR - http://doi.acm.org/10.1145/642611.642681 Y2 - 2018/08/09/19:17:02 ER -