Archive for the ‘Research’ Category
Architecture of Search Systems and Measuring the Search Effectiveness

A look at European Conference on Information Retrieval (ECIR) 2012
The best paper award went to Guido Zuccon, Leif Azzopardi, Dell Zhang and Jun Wang for their work entitled “Top-k Retrieval using Facility Location Analysis” and presented by Leif Azzopardi during the retrieval models session. The authors propose using facility location analysis taken from the discipline of operations research to address the top-k retrieval problem of finding “the optimal set of k documents from a number of relevant documents given the user’s query”.
Meanwhile, “Predicting IMDB Movie Ratings using Social Media” by Andrei Oghina, Mathias Breuss, Manos Tsagkias and Maarten de Rijke won the best poster award. With a different goal from the best paper, the authors of the poster experiment with a prediction model for rating movies using a set of qualitative and quantitative features extracted from the stream of two social media channels, YouTube and Twitter. Their findings show that the highest predictive performance is obtained by combining features from both channels, and they propose including other social media channels as future work.
The conference was preceded by a full day of workshops and tutorials running in parallel. I attended two workshops: Information Retrieval Over Query Sessions (SIR) during the morning and Task-Based and Aggregated Search (TBAS) in the afternoon. The second workshop ended with an interactive discussion. A third, full-day workshop was Searching 4 Fun!.
The last day was the Industry Day, with only 2 papers, 5 oral contributions, and around 50 attendees. The talks had a strong focus on opinion mining: four of the six participating companies/institutions presented work on sentiment analysis and opinion mining from social media streams. Jussi Karlgren, from Gavagai, argued that companies can use sentiment analysis of social media to, for example, find reviews or comments about their products or services, analyse their market position, and predict price movements. Rianne Kaptein, from Oxyme, backed this up by adding that businesses are interested in what consumers say about their brands, products or campaigns on social media streams. Furthermore, Hugo Zaragoza from Websays identified two basic needs inside a company: a need for help in reading so that someone can act, and a need for help in explaining so that it can convince. A very interesting topic indeed, and research in this direction will advance as companies become more aware of the business gains from opinion mining of social media.
Overall, ECIR 2012 was a very inspiring conference. It also seemed a very friendly conference, offering many opportunities to network with fellow attendees. Even so, several participants said that the number of attendees at this year’s conference had decreased in comparison with previous years. The workshops and the core conference gave me the impression that ECIR has a strong focus on young researchers, as many of the accepted contributions had a student as first author and presenter at the conference. The fact that there was only one session running at a time was a good decision in my opinion, as the attendees were not forced to miss presentations. The workshops and tutorials, however, were running in parallel, and although the proceedings of the workshops will be made freely available, I still feel that I missed something that day. The industry day was very exciting, offering the opportunity to share ideas between academia and industry. However, there were not many presentations, and the topics were not very diverse. I propose that next year Findwise will be among the speakers at the Industry track!
ECIR 2013 will be held in Moscow, Russia, from 24 to 28 March. See you there!

Searching for Zebras: Doing More with Less
There is a very controversial and highly cited 2006 British Medical Journal (BMJ) article called “Googling for a diagnosis – use of Google as a diagnostic aid: internet based study” which concludes that, for difficult medical diagnostic cases, it is often useful to use Google Search as a tool for finding a diagnosis. Difficult medical cases are often represented by rare diseases, which are diseases with a very low prevalence.
The authors use 26 diagnostic cases published in the New England Journal of Medicine (NEJM) in order to compile a short list of symptoms describing each patient case, and use those keywords as queries for Google. The authors, blinded to the correct disease (a rare disease in 85% of the cases), select the most ‘prominent’ diagnosis that fits each case. In 58% of the cases they succeed in finding the correct diagnosis.
Several other articles also point to Google as a tool often used by clinicians when searching for medical diagnoses.
But is that so convenient, is that enough, or can this process be easily improved? Indeed, two major advantages for Google are the clinicians’ familiarity with it, and its fresh and extensive index. But how would a vertical search engine with focused and curated content compare to Google when given the task of finding the correct diagnosis for a difficult case?
Well, take an open-source search engine such as Indri, index around 30,000 freely available medical articles describing rare or genetic diseases, use an off-the-shelf retrieval model, and there you have Zebra. In medicine, the term “zebra” is slang for a surprising diagnosis. In comparison with a search on Google, which often returns results that point to unverified content from blogs or content aggregators, the documents in this vertical search engine are crawled from 10 web resources that contain only rare and genetic disease articles and are mostly maintained by medical professionals or patient organizations.
Evaluating on a set of 56 queries extracted in a similar manner to the one described above, Zebra easily beats Google. Zebra finds the correct diagnosis in the top 20 results in 68% of the cases, while Google succeeds in 32% of them. And this is only the performance of Zebra with the baseline relevance model. Imagine how much more could be done (for example, displaying results as a network of diseases, clustering or even ranking by diseases, or automatic extraction and translation of electronic health record data).
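The measure used in the comparison above (is the correct diagnosis among the top 20 results?) can be sketched in a few lines of Python. This is only an illustration: the function names, the exact string matching of diagnoses and the toy data are assumptions of mine, not the evaluation code behind Zebra.

```python
def success_at_k(ranked_diagnoses, correct_diagnosis, k=20):
    """Return True if the correct diagnosis is among the top k results."""
    return correct_diagnosis in ranked_diagnoses[:k]


def success_rate(runs, k=20):
    """Fraction of queries for which the correct diagnosis is in the top k.

    `runs` is a list of (ranked_diagnoses, correct_diagnosis) pairs,
    one pair per query.
    """
    hits = sum(success_at_k(ranked, correct, k) for ranked, correct in runs)
    return hits / len(runs)


# Hypothetical toy data: two queries, one of which is answered correctly.
toy_runs = [
    (["disease a", "disease b"], "disease b"),
    (["disease c"], "disease d"),
]
print(success_rate(toy_runs, k=20))  # prints 0.5
```

In a real evaluation, deciding whether a returned document "contains" the correct diagnosis is harder than a string comparison and usually involves manual judgement or synonym handling.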

Enterprise search – market overview 2011
A few weeks ago Forrester Research released a report with an overview of the 12 leading Enterprise search vendors on the global market (Attivio, Autonomy, Coveo, Endeca, Exalead, Fabasoft, Google, IBM, ISYS Search, Microsoft, Sinequa and Vivisimo).
When I wrote about the Gartner report, readers commented on the fact that open source solutions were not part of the scope, even though their market share is increasing rapidly. The Forrester report has the same approach, except it includes vendors offering their products stand-alone as well as those with products integrated in portal/ECM solutions.
So why the exclusion of open source? Well, it appears difficult to decide on how to evaluate open source, especially when it comes to more advanced appliances.
Looking at the Forrester report, it includes some familiar conclusions but also a few new insights. Leslie Owen from Forrester concludes that “Google, Autonomy, and Microsoft are the most well-known names; they own a large portion of the existing market”. Hence, these vendors are still standing strong, even though they are challenged in various areas.
More surprisingly, some niche players get higher scores than the giants in core areas such as “Indexing and connectivity”, “Interface flexibility” and “Social and collaborative features”.
Vivisimo is seen as somewhat of a leader (with a slightly lower score on Mobile support and Semantics/text analysis). In the Gartner report, Vivisimo was excluded from the information access evaluation due to the fact that they were “focusing on specialized application categories, such as customer service”.
An interesting reflection from Forrester is that “in the next few years, we expect prices to rise as specialized vendors wax poetic on the transformative power of search in order to distinguish their products from Google and Microsoft FAST Search for SharePoint”. On the Nordic market, we have not seen a shift to such a strategy, but rather the opposite, since open source (with zero license fees) is becoming accepted in an Enterprise environment to a larger extent.
The vendors that provide integrated solutions (to CMS/WCM etc.) still remain strong, whereas the stand-alone solutions are becoming exposed to competition in new ways. It will be interesting to follow the US and Nordic markets to see how this evolves within the next year. It might be that the markets differ when it comes to open source adoption.
If you wish to read the full report it can be downloaded from Vivisimo through a simple registration.
To get a complete overview of vendors, I recommend reading both the Gartner and Forrester reports.

Delivering information where it’s needed
I recently started working at Findwise after having finished my thesis on location-based information delivery on a mobile phone. The purpose of my thesis was to:
- Investigate how location-based information (as opposed to fixed locations) could be connected to search results
- Improve quality of location-based information by considering the course and velocity of the user
To start with, I created an iPhone application with a location-based reminder system. The reminders described location constraints and users could create reminders with single locations (at home) or groups of locations (at any pharmacy). To find these groups of locations, the system searched for locations with associated information (like nearby pharmacies) and delivered this information without users having to click Search repeatedly.
This is an unusual approach to search, as the user is passive and the system performs searches on the user’s behalf. However, to make search results relevant one has to add contextual constraints to describe when, where and to whom a piece of information is relevant. When all constraints are met, information should be relevant. If not, the system lacks some crucial contextual constraints.
When search is automated, the importance of relevant search results increases, and the more you know of the user’s world, the better you can adjust the results. However, traditional search can also benefit from contextual information. It can be used as a filter where search results that are irrelevant in the current context are removed. Alternatively, it could be part of the relevance model, improving search results by reordering them according to context. Hence, whereas automatic information delivery is probably undesirable for many types of information, contextual constraints can still be of good use!
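The two uses of context described above, as a filter and as part of the relevance model, can be sketched as follows. This is a minimal illustration and not the code from my thesis: the `Result` fields, the distance threshold and the penalty weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    score: float        # relevance score from the underlying search engine
    open_now: bool      # hypothetical contextual attribute
    distance_km: float  # hypothetical distance from the user's position


def filter_by_context(results, max_distance_km):
    """Filter approach: remove results that are irrelevant in the current context."""
    return [r for r in results if r.open_now and r.distance_km <= max_distance_km]


def rerank_by_context(results, distance_weight=0.1, closed_penalty=1.0):
    """Re-ranking approach: keep every result, but penalise distant or closed places."""
    def contextual_score(r):
        penalty = distance_weight * r.distance_km
        if not r.open_now:
            penalty += closed_penalty
        return r.score - penalty
    return sorted(results, key=contextual_score, reverse=True)


# Hypothetical example data.
results = [
    Result("Pharmacy A", score=0.9, open_now=False, distance_km=0.5),
    Result("Pharmacy B", score=0.7, open_now=True, distance_km=1.0),
    Result("Pharmacy C", score=0.8, open_now=True, distance_km=12.0),
]
print([r.title for r in filter_by_context(results, max_distance_km=5)])
print([r.title for r in rerank_by_context(results)])
```

The filter is the stricter choice and suits automatic delivery, where an irrelevant result is an interruption. Re-ranking is more forgiving and fits traditional search, where the user can still scroll past a poor match.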
The people who tested my application created 25% of their reminders as groups of locations and found it useful as it helped them find places they weren’t aware of, facilitating opportunistic behavior. The course and velocity information reduced the number of false-positive information deliveries. Overall, the system worked well as a niche product.
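One way course and velocity could be used to suppress false positives is to deliver a reminder only when the user is heading roughly towards the place and is moving slowly enough to stop there. The sketch below shows that idea only. The function names, the 60 degree tolerance and the 15 m/s speed limit are assumptions chosen for illustration, not the rules used in the application.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in degrees in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360


def should_deliver(user_lat, user_lon, course_deg, speed_mps,
                   place_lat, place_lon,
                   max_angle_deg=60, max_speed_mps=15):
    """Deliver only if the user is slow enough and heading towards the place."""
    if speed_mps > max_speed_mps:
        return False  # probably passing by in a vehicle
    target = bearing_deg(user_lat, user_lon, place_lat, place_lon)
    # Smallest absolute difference between the two compass directions (0 to 180).
    diff = abs((target - course_deg + 180) % 360 - 180)
    return diff <= max_angle_deg


# Hypothetical example: the place lies due north of the user.
print(should_deliver(57.70, 11.97, course_deg=10, speed_mps=1.5,
                     place_lat=57.71, place_lon=11.97))   # walking towards it
print(should_deliver(57.70, 11.97, course_deg=180, speed_mps=1.5,
                     place_lat=57.71, place_lon=11.97))   # walking away from it
```

A real implementation would also need to handle a stationary user, for whom the course reported by the phone is not meaningful.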

Gartner and the magic quadrants – crowning the leaders of Enterprise Search
For years Gartner, the research and advisory company, has been publishing its magic quadrants, and its verdicts on everything from ECM systems to data warehouses and e-commerce play a big role in many companies’ decisions when choosing the right tools.
Simply put, the vendors are presented in a matrix measuring the different players by ability to execute (product, overall viability, customer experience etc.) and the completeness of their vision (offering strategy, innovation etc.). The vendors are then positioned as niche players (a rather crowded spot), visionaries, challengers and leaders.
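The positioning logic can be expressed as a small function. Note that the 0 to 1 score scale and the 0.5 midpoint are assumptions made for this illustration. How Gartner actually computes and weights the two scores is not covered here.

```python
def quadrant(ability_to_execute, completeness_of_vision, midpoint=0.5):
    """Map two scores (assumed to lie between 0 and 1) to a quadrant name."""
    high_execution = ability_to_execute >= midpoint
    high_vision = completeness_of_vision >= midpoint
    if high_execution and high_vision:
        return "leader"
    if high_execution:
        return "challenger"
    if high_vision:
        return "visionary"
    return "niche player"


# Hypothetical scores for an unnamed vendor.
print(quadrant(ability_to_execute=0.8, completeness_of_vision=0.3))  # prints challenger
```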
At the end of last year Gartner decided to retire their old “Information Access Quadrant” and introduce “Enterprise Search MarketScope” due to a more mature market. A number of vendors (such as Vivisimo and Recommind) were removed, in order to exclude those whose businesses were not entirely search driven.
The evaluation criteria for MarketScope cover: offering (product) strategy, innovation, overall viability (business unit, financial, strategy, and organization), customer experience, market understanding and business model.
To summarize: the criteria are to a large extent the same, but the two areas “overall viability” and “customer experience” are weighted higher than the rest. This is most likely a result of recent years’ discussion around user-friendly interfaces, easier administration and the fact that some customers have suffered quite badly when vendors do not survive (one example in Northern Europe is the Danish vendor that was bankrupt for some time).
The yearly fight between the three leaders, Microsoft, Endeca and Autonomy, has been somewhat disrupted, and Microsoft, Endeca and Google are now seen as the leaders.
Microsoft has a very broad product line, which stretches from low-price products with less functionality to Enterprise Search built on the former FAST technology. Endeca follows the same trend. As Gartner puts it, their “products (are) intended to serve organizations seeking to develop general search installations..(..) broadly applicable for a variety of different search challenges”.
In the old quadrant, Google remained a “challenger” for quite some time but never made it to the “leaders” corner. Ease of administration and “user friendly” are two phrases that keep being repeated. That, in combination with a profit of $7.29 billion during the last quarter of 2010, makes Google a player that can easily continue to develop its Enterprise business.
Autonomy should still not be disregarded. The main reason for it falling a bit behind the three others seems to be conquerable problems with support and pricing transparency. It will be interesting to see how Autonomy chooses to handle these issues during 2011.
In short: the new MarketScope is good reading with quite few surprises. If you wish to get a better understanding of the development going on at the different vendors, start with Gartner and continue to search among our blog posts.

Better search engines and other stuff about information practices in workplaces
During this year I have worked on a research project that aims to facilitate the development and implementation of an enterprise search engine. By understanding the use and value of information at the workplace, we hope to create even better preconditions for optimizing a search engine to the requirements of a specific organization.
We use a work-task based research approach where we study information practices, that is, the normalized ways in which we recognize information needs, look for information, and value and use it. By studying such practices in real-life work tasks, we can outline the role that a search engine plays in relation to other work tasks as well as to other ways of finding information. In short, being engaged in a creativity-oriented work task initiates different types of information practices compared to the practices we use in everyday, routine-based work tasks …
The creativity-oriented work tasks involve a dimension of innovation, and concepts such as learning and development are often used to describe these activities. Uncertainty is something that is associated with curiosity and may be seen as a driving force behind information seeking. Information that is rich in nuances and that offers different, even contradictory explanations or descriptions is usually appreciated, and the task outcome is only vaguely discerned at first. Routine-oriented tasks, on the other hand, are focused on increasing effectiveness and reducing uncertainty as quickly as possible in the task outcome, which itself may be sketched out relatively clearly from the beginning. Information seeking is often directed to readily available facts. All this means that a search engine must support a variety of information practices at any given workplace!
The “we” in this project is me together with my Findwise colleague Henrik Strindberg. The project is financially supported by the Swedish Foundation for Strategic Research, and when I am not working on the present project I am employed by the University of Borås.
Just now I am finalizing a presentation of the project for the ICKM conference in Pittsburgh, PA, USA, next week. The presentation is entitled “Interrelated use and value of information sources”, and will be available through the conference proceedings in due time.
Very exciting … and while there I will also attend the board meetings of the ASIS&T’s Board of Directors as a newly appointed Director-at-Large. Very exciting, too!
The 73rd Annual Meeting of ASIS&T focuses on “Navigation Streams in an Information Ecosystem”.

Why is search easy and hard?
Last year my colleague Lina and I went to the Workshop on Human Computer Interaction and Information Retrieval (HCIR) in Washington DC. This year we were not able to attend, but since all the material is available online I took part remotely anyway. I wanted to share with you what I found most interesting this year. (Daniel Tunkelang, who was one of the organizers, also posted a good overview of the event on his blog.)
This year’s keynote speaker was Dan Russell, a researcher from Google. He talked about Search Quality and user happiness; Why search is easy and hard. The point I found most interesting in his presentation was that improvement is needed not only in tools and data but also in the users’ search skills. My own experience from various search projects is similar: users are not good at searching. Even though they are looking for a specific version of a technical document for a specific product, they might just enter the name of the product, or even the product family. (It’s a bit like searching for ‘camera’ when you expect to find support documentation on your Dioptric lens for your Canon EOS 60D.) So I agree that users need better search skills. In his presentation Russell also presented some ideas on how a search application can help users improve their search skills.
Search is both easy and hard. Perhaps this is one of the reasons for the introduction of the HCIR Challenge as a new part of the workshop. From the HCIR website:
The aims of the challenge are to encourage researchers and practitioners to build and demonstrate information access systems satisfying at least one of the following:
- Not only deliver relevant documents, but provide facilities for making meaning with those documents.
- Increase user responsibility as well as control; that is, the systems require and reward human effort.
- Offer the flexibility to adapt to user knowledge / sophistication / information need.
- Are engaging and fun to use.
The winner of the challenge was a team of researchers from Yahoo Labs who presented Searching Through Time in the New York Times. Their Time Explorer features a results page with an interactive timeline that illustrates how the volume of articles (results) has changed over time. I recommend that you read the article in tech review to learn more about the project, or try out the Time Explorer demo yourself. You can also learn more about the challenge in this blog post by Gene Golovchinsky.
All the papers and posters from the workshop can be found on the new website.




