Information Retrieval
Books
- Introduction to Information Retrieval
C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008. (First book for getting started with Information Retrieval).
- Search Engines: Information Retrieval in Practice
Bruce Croft, Don Metzler, and Trevor Strohman. 2009. (Great book for readers interested in knowing how Search Engines work. The book is very detailed).
- Modern Information Retrieval
R. Baeza-Yates, B. Ribeiro-Neto. Addison-Wesley, 1999.
- Information Retrieval in Practice
B. Croft, D. Metzler, T. Strohman. Pearson Education, 2009.
- Mining the Web: Analysis of Hypertext and Semi Structured Data
S. Chakrabarti. Morgan Kaufmann, 2002.
- Language Modeling for Information Retrieval
W.B. Croft, J. Lafferty. Springer, 2003. (Covers the language modeling approach to Information Retrieval and details the probabilistic perspective on the field extensively).
- Information Retrieval: A Survey
Ed Greengrass, 2000. (Comprehensive survey of conventional Information Retrieval, before the deep learning era).
- Introduction to Modern Information Retrieval
G.G. Chowdhury. Neal-Schuman, 2003. (Intended for students of library and information studies).
- Text Information Retrieval Systems
C.T. Meadow, B.R. Boyce, D.H. Kraft, C.L. Barry. Academic Press, 2007. (Library and information science perspective).
Courses
- INF384H / CS395T / INF350E: Concepts of Information Retrieval (and Web Search)
Matthew Lease (University of Texas at Austin).
- CS 276 / LING 286: Information Retrieval and Web Search
Chris Manning and Pandu Nayak (Stanford University).
- CS 371R: Information Retrieval and Web Search
Raymond J. Mooney (University of Texas at Austin).
- CS 172: Introduction to Information Retrieval
Vagelis Hristidis (University of California - Riverside).
- SIMS 240: Principles of Information Retrieval
Ray R. Larson (UC Berkeley).
- 11-442 / 11-642: Search Engines
Jamie Callan (CMU).
- 600.466: Information Retrieval and Web Agents
David Yarowsky (Johns Hopkins University).
- CS 435: Information Retrieval, Discovery, and Delivery
Andrea LaPaugh (Princeton University).
- Information Retrieval and Data Mining
Dr. Jilles Vreeken, Prof. Dr. Gerhard Weikum (MPI).
- Coursera - Text Retrieval and Search Engines
Prof. ChengXiang Zhai (University of Illinois at Urbana-Champaign).
Software
- Apache Lucene
Open source search engine library that can be used to test Information Retrieval algorithms. Twitter uses it as the core of its real-time search.
- The Lemur Project
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.
  - Indri Search Engine - Another open source search engine, a competitor of Apache Lucene.
  - Lemur Toolkit - Open source toolkit for research in language modeling, filtering, and categorization.
Standard IR Collections
- DBPedia
Linked data web.
- Cranfield Collections
This is one of the first collections in the IR domain. The dataset is too small for any statistical significance analysis, but it is nevertheless suitable for pilot runs.
- TREC Collections
TREC is the benchmark dataset used by most IR and Web search algorithms. It has several tracks, each of which consists of a dataset for testing a specific task. The tracks, along with suggested use cases, are:
  - Blog - Explore information seeking behavior in the blogosphere.
  - Chemical IR - Address challenges in building large chemical testbeds for chemical IR.
  - [Clinical Decision Support](http://trec.nist.gov/data/clinical.htm…
- GOV2 Test Collection
This is one of the largest Web collections of documents, obtained from a crawl of government websites by Charlie Clarke and Ian Soboroff using NIST hardware and network, then formatted by Nick Craswell.
- NTCIR Test Collection
This is a collection of a wide variety of datasets, ranging from ad-hoc collections and Chinese IR collections to mobile clickthrough and medical collections. The focus of this collection is mostly on East Asian languages and cross-language information retrieval.
  - CLIR Test Collections - This dataset can be used for cross-lingual IR between CJKE (Chinese-Japanese-Korean-English) languages. It is suitable for the following tasks…
- Conference and Labs of the Evaluation Forum (CLEF) dataset
It contains a multi-lingual document collection. The test suite includes:
  - AdHoc - News test suite.
  - Domain Specific Test Suite - On collections of scientific articles.
  - Question Answering Test Suite.
- Reuters Corpora
The corpora are now available through NIST. They include the following:
  - RCV1 (Reuters Corpus Volume 1) - Consists of only English language news stories.
  - RCV2 (Reuters Corpus Volume 2) - Consists of stories in 13 languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish). Note that the stories are not parallel.
  - TRC (Thomson Reuters Text Research Collection) - This is a fairly recent corpus consisting of 1,…
- 20 Newsgroup dataset
This data set consists of 20,000 newsgroup posts taken from 20 newsgroup topics.
- English Gigaword Fifth Edition
This data set is a comprehensive archive of English newswire text data including headlines, datelines and articles.
- Document Understanding Conference (DUC) datasets
Past newswire/paper datasets (DUC 2001 - DUC 2007) are available upon request.
External Curation Links
Technical Talks
- Extreme Classification: A New Paradigm for Ranking & Recommendation
Manik Varma (Microsoft Research).
- The next web
Tim Berners-Lee (TED Talk) [Tim Berners-Lee invented the World Wide Web. He leads the World Wide Web Consortium (W3C), overseeing the Web's standards and development].
- Is Pivot a turning point for web exploration?
Gary Flake, Technical Fellow at Microsoft (TED Talks).
- Challenges in Building Large-Scale Information Retrieval Systems
Jeff Dean (WSDM Conference, 2009).
- Knowledge-based Information Retrieval with Wikipedia
David Milne (The University of Waikato, 2008).
- Music Information Retrieval Using Locality Sensitive Hashing
Steve Tjoa (Rackspace Developers) [This talk shows that IR is not just text and images].
- The Functional Web -- The Future of Apps and the Web
Liron Shapira (Box Tech Talk).
- Information Experience - Solution to Information Overload on Web
Doug Imbruce (TechCrunch Disrupt) [Doug Imbruce is the founder of Qwiki, Inc., a technology startup in New York, NY, acquired by Yahoo! in 2013].
- Internet Privacy
Dr. Alma Whitten (Google Brussels Tech Talk).
Philosophical Talks
- The moral bias behind your search results
Andreas Ekström (Swedish Author & Journalist, TED Talk).
- Beware online "filter bubbles"
Eli Pariser (Author of The Filter Bubble, TED Talk).
- Think your email's private? Think again
Andy Yen (CERN, TED Talk) [This talk discusses privacy, which search engines intrude upon, and how people can protect it].
- Do we have the right to be forgotten?
Michael Douglas [TEDx SouthBank].
- The case for anonymity online
Christopher "moot" Poole (TED Talk) [Christopher "moot" Poole is the founder of 4chan, an online imageboard whose anonymous denizens have spawned the web's most bewildering and influential subculture].
Blogs
- Information Retrieval and the Web
Google Research.
- IR Thoughts
Dr. Edel Garcia.
Interesting Reads
- Deep Neural Network Learns to Judge Books by Their Covers
Information Extraction.
- Can Deep Learning help solve Lip Reading
Information Retrieval from Lip Reading.
- Whoa, Google’s AI Is Really Good at Pictionary
Sketch-based search.
- Neural Network Learns to Identify Criminals by Their Faces
Information Extraction.