Archive for the ‘Search’ Category

Paula Petcu

Searching for Zebras: Doing More with Less

February 15 - 2012 | Paula Petcu

There is a very controversial and highly cited 2006 British Medical Journal (BMJ) article called “Googling for a diagnosis – use of Google as a diagnostic aid: internet based study” which concludes that, for difficult medical diagnostic cases, it is often useful to use Google Search as a tool for finding a diagnosis. Difficult medical cases are often represented by rare diseases, which are diseases with a very low prevalence.

The authors use 26 diagnostic cases published in the New England Journal of Medicine (NEJM) to compile a short list of symptoms describing each patient case, and use those keywords as queries for Google. The authors, blinded to the correct disease (a rare disease in 85% of the cases), select the most ‘prominent’ diagnosis that fits each case. In 58% of the cases they succeed in finding the correct diagnosis.

Several other articles also point to Google as a tool often used by clinicians when searching for medical diagnoses.

But is that so convenient, is that enough, or can this process be easily improved? Indeed, two major advantages for Google are the clinicians’ familiarity with it, and its fresh and extensive index. But how would a vertical search engine with focused and curated content compare to Google when given the task of finding the correct diagnosis for a difficult case?

Well, take an open-source search engine such as Indri, index around 30,000 freely available medical articles describing rare or genetic diseases, use an off-the-shelf retrieval model, and there you have Zebra. In medicine, the term “zebra” is slang for a surprising diagnosis. In contrast to a search on Google, which often returns results that point to unverified content from blogs or content aggregators, the documents in this vertical search engine are crawled from 10 web resources that contain only rare and genetic disease articles and are mostly maintained by medical professionals or patient organizations.

Evaluated on a set of 56 queries extracted in a manner similar to the one described above, Zebra easily beats Google: it finds the correct diagnosis in the top 20 results in 68% of the cases, while Google succeeds in 32% of them. And this is only the performance of Zebra with the baseline relevance model; imagine how much more could be done (for example, displaying results as a network of diseases, clustering or even ranking by diseases, or automatic extraction and translation of electronic health record data).
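To make the "correct diagnosis in the top 20" measure concrete, here is a minimal sketch of how such a success rate could be computed. The function name, the query ids and the disease labels are made up for illustration; this is not the evaluation code or the data behind the numbers above.

```python
def success_at_k(ranked_results, correct, k=20):
    """Fraction of queries whose correct diagnosis appears in the top k results."""
    hits = 0
    for query_id, results in ranked_results.items():
        if correct[query_id] in results[:k]:
            hits += 1
    return hits / len(ranked_results)


# Toy data (hypothetical): ranked disease names returned for each query.
ranked = {
    "q1": ["disease A", "disease B", "disease C"],
    "q2": ["disease D", "disease E"],
    "q3": ["disease F"],
    "q4": ["disease G", "disease H"],
}
# Toy data (hypothetical): the known correct diagnosis for each query.
truth = {"q1": "disease B", "q2": "disease X", "q3": "disease F", "q4": "disease H"}

score = success_at_k(ranked, truth, k=20)
print(f"success@20 = {score:.0%}")
```

Lowering `k` makes the measure stricter, which is a quick way to see how far down the result list the correct diagnoses tend to sit.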

Caroline Abrahamsson

Search in the Digital Workplace

February 9 - 2012 | Caroline Abrahamsson

Last week we (Caroline Abrahamsson and Kristian Norling) had the opportunity to act as moderators for a conference on the Digital Workplace in Stockholm. Among the many good presentations, the keynote by Jane McConnell was a gem. Jane’s Digital Workplace Trends report gives many insights into the intranet world, or as Jane and many others prefer to call it, the Digital Workplace (participants in the survey receive a free copy of the report, highly recommended!). One of the most interesting parts for us was the four future scenarios that Jane described during her session and that the survey participants had voted on (on a scale of low, medium or high business value):

  • “My apps” – The intranet is a set of highly customized apps. People select what they need to do their jobs and build their own “intranet” like on an iPad.
  • “Smart systems” – The user experience is efficient and relevant because information is delivered in meaningful ways based on past behavior and context.
  • “People-centric” – Social networking, social tagging, location awareness, presence indicators and other technologies are integrated into processes and how people work daily.
  • “Super search” – Various search technologies come together to offer people greater relevance and control over vast amounts of information from inside and outside the enterprise.

p. 19 Digital Workplace Trends 2012

When Jane asked the audience at the conference if they thought Super Search had “high potential value”, a whopping 100% answered yes! In the Digital Workplace Trends report 70% of the participants considered Super Search to have a “high potential value”, and 20% of the leadership group has started implementing it.

The Digital Workplace: Redefining Productivity in the Information Age by Infocentric Research is another excellent (and free) source on the current state of the Digital Workplace. This report also mentions good search as very important for getting work done in the digital workplace:

“Imagine that each and every employee in your organization would spend 1 to 2 full working hours per day surfing the web and social media sites (such as Facebook, YouTube and Twitter) purely for private pleasure. Would that be acceptable for you? And even more important: would it leave your bottom line results unaffected?

The answer to both questions of course is clear “No”. But the bad news is that your employees spend just that amount of time for something even worse. And they do so with full allowance by management and in accordance to accepted work practices in your organization. What they do, what you do as well, is looking for information they need to do their job and ineffectively working with that information.”

p. 4 The Digital Workplace: Redefining Productivity in the Information Age

When reading these excellent reports it is quite obvious to us that a “Super Search”, i.e. an Enterprise Search solution that can reach all types of information, is very much in demand. Many organizations that have worked extensively with search for many years understand that this is actually a never-ending task. But search is still a very cost-effective and hands-on solution for many information- and knowledge-intensive tasks.

“Information based work is driven and determined by having the right information to perform the task at hand. For this, the information has to be there when needed. Looking for the right information to do something therefore constitutes one of the most relevant of all tasks. In fact, “searching” in all its forms is the most ubiquitous activity that information workers perform in their jobs”.

p. 15 The Digital Workplace: Redefining Productivity in the Information Age

To conclude, the new digital workplace is transforming the way we work, interact and communicate. The discussions during the conference showed that almost all organizations were in a transformation phase where the traditional intranet (with static pages updated by editors) is being complemented (and in some cases replaced altogether) with collaboration areas and flexible work tools. We look forward to this year’s developments and hope to share some good cases with you, especially with regard to search, collaboration and mobility.

More reading on the Digital Workplace

Intranet Pioneer Mark Morell

Connaxions / Martin Risgaard 

The Intranet Benchmarking Forum

Pawel Wroblewski

Search Stuffed up with GIS

February 3 - 2012 | Pawel Wroblewski

When I browsed through marketing brochures of GIS (Geographic Information System) vendors, I noticed that the message is quite similar to that of search analytics. In general it refers to the integration of various separate sources into analysis based on geo-visualizations. I have recently seen a quite nice and powerful combination of search and GIS technologies, so I would like to describe it a little. Let us start with the basics.

Search result visualization

It is quite natural to use a map instead of a simple list of results to visualize what was returned for an entered query. This technique is frequently used in online search applications, especially in directory services like yellow pages or real estate web sites. The list of things required to do this is pretty short:

- geolocalization of items – this means assigning accurate geo-coordinates to location names, addresses, zip codes or whatever is expected to be shown on the map; geolocalization services are offered more or less for free by Google or Bing Maps.

- background map – this is a necessity and is also provided by Google or Bing; there are also plenty of vendors for more specialized mapping applications

- returned results with geo-coordinates as metadata – so that they can be placed on the map

Normally this kind of basic GIS visualisation delivers basic map operations like zooming, panning, different views and additionally some more data like traffic, parks, shops etc. Results are usually pins [Bing] or drops [Google].

Querying / filtering with the map

A further step in integrating search and GIS is to use the map as a tool for defining the search query. One way is to define an area of interest, drawn on the map as a circle, rectangle or polygon. In the simplest case, the area of the query is just the current map view. With this approach, the full-text query is refined to include only results that fall within the defined area.

Apart from the map, all other query refinement tools should be available as well, such as date-time sliders or any kind of navigation and fielded queries.
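As a rough illustration of filtering by the current map view, the sketch below keeps only the hits whose coordinates fall inside a rectangular view. The `Hit` type, the function names and the sample places are assumptions made up for this example, and the check does not handle views that cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A search result with geo-coordinates as metadata (hypothetical shape)."""
    title: str
    lat: float
    lon: float


def in_view(hit, south, west, north, east):
    """True if the hit lies inside the rectangular map view."""
    return south <= hit.lat <= north and west <= hit.lon <= east


def filter_by_view(hits, south, west, north, east):
    """Refine an already-matched result list to the visible map area."""
    return [h for h in hits if in_view(h, south, west, north, east)]


# Results of some full-text query (toy data).
hits = [
    Hit("Office in Stockholm", 59.33, 18.07),
    Hit("Office in Gothenburg", 57.71, 11.97),
    Hit("Office in London", 51.51, -0.13),
]

# The map currently shows roughly southern Sweden.
visible = filter_by_view(hits, south=55.0, west=10.0, north=60.0, east=20.0)
print([h.title for h in visible])
```

In a real system the same restriction would more likely be pushed down to the search engine as a filter, so that only matching documents are ranked and returned.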

Simple geo-spatial analysis

Sometimes it is important to sort query results by distance from a reference point, for example to see the nearest Chinese restaurants in the neighborhood. I would also count as simple geo-spatial analysis the grouping of search results into GIS layers, e.g. density heatmaps or hot spots, using geographical and other information stored in the results’ metadata.
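Sorting by distance from a reference point can be sketched with the haversine formula for great-circle distance. The result dictionaries, the restaurant names and the coordinates below are invented for illustration, and the Earth is treated as a sphere with a radius of 6371 km.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points (spherical Earth)."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def sort_by_distance(results, ref_lat, ref_lon):
    """Order results by distance from the reference point, nearest first."""
    return sorted(
        results,
        key=lambda r: haversine_km(ref_lat, ref_lon, r["lat"], r["lon"]),
    )


# Toy results for a query such as "chinese restaurant" (hypothetical data).
restaurants = [
    {"name": "C", "lat": 59.2700, "lon": 18.0000},
    {"name": "A", "lat": 59.3326, "lon": 18.0649},
    {"name": "B", "lat": 59.3500, "lon": 18.1000},
]

# Reference point: somewhere in central Stockholm.
nearest_first = sort_by_distance(restaurants, 59.3293, 18.0686)
print([r["name"] for r in nearest_first])
```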

Advanced geo-spatial analysis

More advanced query definition and refinement would involve geo-spatial computations. Depending on real needs, it could be possible, for example, to refine search results by the area within line of sight of a chosen reference point, or to filter by areas such as those inside the borders of specific cities, districts or countries.

So the idea is to use relevant output from advanced GIS analysis as an input for query refinement. In this way all the power of GIS can be used to get to the unstructured data through a search process.
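One simple way to picture this: the GIS analysis produces a polygon (a district border, a line-of-sight area), and the search results are then filtered by whether they fall inside it. The sketch below uses the classic ray-casting test and treats latitude and longitude as flat coordinates, which is only a reasonable approximation for small areas. The function name and the sample polygon are assumptions for illustration.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray-casting test: is (lat, lon) inside the polygon?

    polygon is a list of (lat, lon) vertices in order; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


# Hypothetical area produced by a GIS analysis (a simple square here).
area = [(0.0, 0.0), (0.0, 10.0), (10.0, 10.0), (10.0, 0.0)]

results = [
    {"title": "inside the area", "lat": 5.0, "lon": 5.0},
    {"title": "north of the area", "lat": 15.0, "lon": 5.0},
    {"title": "east of the area", "lat": 5.0, "lon": 15.0},
]

refined = [r for r in results if point_in_polygon(r["lat"], r["lon"], area)]
print([r["title"] for r in refined])
```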

What kind of applications do you think could take advantage of search stuffed with really advanced GIS? Looking forward to your comments on this post.

Christian Ubbesen

Inspiration from the Enterprise Search Europe conference

November 11 - 2011 | Christian Ubbesen

A couple of weeks ago, some of my colleagues and I attended the Enterprise Search Europe conference in London. We’re very grateful to the organizer, Martin White at IntranetFocus, for arranging the event and for having us as one of the gold sponsors.

For me it was the first time in years I attended a conference like this, and while it was “same old, same old” for many of the attendees, for me it was enlightening to meet up with the industry and have a discussion on where we are as an industry.

There were mainly software vendors and professional services/consultants there, as well as a few customers or actual users of enterprise search… and I think the consensus of the two days was that we in the industry STILL haven’t really figured out what we should do with the enterprise search concept, and how to make it valuable for our customers. We at Findwise are not alone in this; it is an industry challenge. There are some vendors who seem to be doing good work delivering real value to customers, and there are also a few colleagues of ours in the industry who do good professional services/consultant work. At first it was a bit of a downer to realize that we haven’t progressed more during the 10 years I’ve been in the business, but at the same time it was very inspirational to see that we at Findwise, together with a few other players, seem to be on the right track with our hard work, and that we are in a position to solve some of the real industry challenges we’re facing.

As I see it, if we gather our forces and make a focused “push forward” together now, we will be able to take the industry to a new maturity level where we better solve real business challenges with enterprise search (or search-driven Findability solutions, as we like to call them).

My simple analysis of all the discussions at the conference is that we need to do two things:

  1. Manage the whole “full picture” of enterprise search – from strategy to organizational governance, involving necessary competencies to cover all aspects of a successful Findability solution.
  2. Break down the customer challenge into manageable chunks, and solve actual business problems, not just solving the traditional “finding stuff when needed” challenge.

I think we are on the right track, and it’s going to be a very interesting journey from here on!

Björn Klockljung Johansson

Book Review: Search Analytics for Your Site

September 14 - 2011 | Björn Klockljung Johansson

Lou Rosenfeld is the founder and publisher of Rosenfeld Media and also the co-author (with Peter Morville) of the best-selling book Information Architecture for the World Wide Web, which is considered one of the best books about information management.

In Lou Rosenfeld’s latest book he lets us know how to successfully work with Site Search Analytics (SSA). With SSA you analyse the saved search logs of what your users are searching for to try to find emerging patterns. This information can be a great help to figure out what users want and need from your site.  The search terms used on your site will offer more clues to why the user is on your site compared to search queries from Google (which reveal how they get to your site).

So what’s in the book?

Part I – Introducing Site Search Analytics

In part one the reader gets a great example of why to use SSA and an introduction to what SSA is. In the first chapters you follow John Ferrara, who worked at a company called Vanguard, and see how he analysed search logs to prove that a newly bought search engine performed poorly, whilst using the same statistics to improve it. This is a great real-world example of how to use SSA for measuring quality of search AND to set up goals for improvement.

a word cloud is one way to play with the data

Part II – Analysing the data

In this part Lou gets hands-on with user logs and shows you how to analyse the data. He makes it fun and emphasizes the need to play with user data. Without the emphasis on playing, the task of analysing user data may seem daunting. Also, with real-world examples from different companies and institutions, it is easy to understand the different methods for analysis. Personally, I feel the use of real data in the book makes the subject easier (and more interesting) to understand.

From which pages do users search?

Part III – Improving your site

In the third part of the book, Rosenfeld shows how to apply the findings from your analysis. If you’ve worked with SSA before, most of it will be familiar (improving best bets, zero hits, query completion and synonyms), but even for experienced professionals there is good information about how to improve everything from site navigation to site content, and even how to connect your SSA to your site KPIs.
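To give a feel for how little is needed to start playing with search logs, here is a small sketch that counts the most frequent queries and the queries that returned zero hits. It is not taken from the book; the log format (pairs of query text and hit count) and the sample queries are assumptions made up for this example.

```python
from collections import Counter


def normalize(query):
    """Lower-case the query and collapse repeated whitespace."""
    return " ".join(query.lower().split())


def summarize(log):
    """Count all queries and, separately, the queries that returned no hits.

    log is an iterable of (query, hit_count) pairs.
    """
    counts = Counter()
    zero_hits = Counter()
    for query, hits in log:
        q = normalize(query)
        counts[q] += 1
        if hits == 0:
            zero_hits[q] += 1
    return counts, zero_hits


# Toy search log (hypothetical data).
log = [
    ("Vacation policy", 12),
    ("vacation  policy", 12),
    ("vacation policy", 12),
    ("expense report", 4),
    ("holliday", 0),
    ("holliday", 0),
    ("VPN", 7),
]

counts, zero_hits = summarize(log)
print("Top queries:", counts.most_common(3))
print("Zero-hit queries:", zero_hits.most_common())
```

A frequent zero-hit query such as the misspelled "holliday" above is the kind of finding that points to a missing synonym or a candidate for spelling suggestions.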

Conclusion

Search Analytics for Your Site shows how easy it is to get started with SSA, but also its depth and usefulness. The book is easy to read and also quite funny. It is quite short, which in this day and age isn’t a negative. It reminded me of the importance of search analytics, and I really hope more companies and sites take the lessons in this book to heart and focus on search analytics.

Tobias Berg

Google Search Appliance 6.10 released

May 4 - 2011 | Tobias Berg

Last week, Google released version 6.10 of the software for their Google Search Appliance (GSA).

This is a minor update, and the focus of the Google teams has been on bug fixes and increased stability. Looking at the release notes, there are indeed plenty of bugs that have been fixed.

However, there are also some new features in this release. Some of the more interesting, in my opinion, are:

Multiple front-end configuration for Dynamic Navigation
Since the 6.8 release, the GSA has been able to provide facets, or Dynamic Navigation as Google calls it. However, the facets have been global, so you couldn’t have two front ends with different facets. This is now possible.

More feeds statistics and Adjust PageRank in feeds
More statistics on what’s happening with the feeds you push into the GSA is a very welcome feature. The possibility to adjust PageRank allows for some more control over relevancy in feeds.

Indexing Crawl time Kerberos support and Indexing large files
Google is working hard on security, and every release since 6.0 has included some security improvements. Nice to see that it continues. Since the beginning, the GSA has simply dropped files bigger than 30 MB. Now it will index larger files (you can configure how large), but still only the first 2.5 MB of the content will be indexed.

Stopword lists for different languages

Scalability Centralized configuration
For a multi-node GSA setup, you can now specify the configuration on the master, and it is propagated to the slaves.

For a complete list of new features, see the New and Changed Features page in the documentation

Delivering information where it’s needed

April 7 - 2011 | David Ronnqvist

I recently started working at Findwise after having finished my thesis on location-based information delivery on a mobile phone. The purpose of my thesis was to:

  • Investigate how location-based information (as opposed to fixed locations) could be connected to search results
  • Improve quality of location-based information by considering the course and velocity of the user

To start with, I created an iPhone application with a location-based reminder system. The reminders described location constraints and users could create reminders with single locations (at home) or groups of locations (at any pharmacy). To find these groups of locations, the system searched for locations with associated information (like nearby pharmacies) and delivered this information without users having to click Search repeatedly.

This is an unusual approach to search, as the user is passive; instead, the system performs searches for the user. However, to make search results relevant, one has to add contextual constraints that describe when, where and to whom a piece of information is relevant. When all constraints are met, the information should be relevant. If it is not, the system lacks some crucial contextual constraints.

When search is automated, the importance of relevant search results increases, and the more you know about the user’s world, the better you can adjust the results. However, traditional search can also benefit from contextual information. It can be used as a filter, where search results that are irrelevant in the current context are removed. Alternatively, it could be part of the relevance model, improving search results by reordering them according to context. Hence, even though automatic information delivery is probably undesirable for many types of information, contextual constraints can still be of good use!
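The two uses of context mentioned above, as a filter and as part of the relevance model, can be sketched roughly as follows. This is an illustration with made-up fields (`open_from`, `open_to`, `category`, `interests`) and an arbitrary boost weight, not the code from the thesis.

```python
def filter_by_context(results, context):
    """Context as a filter: drop results that are irrelevant right now.

    Here "irrelevant" simply means the place is closed at the current hour.
    """
    return [r for r in results if r["open_from"] <= context["hour"] < r["open_to"]]


def rerank_by_context(results, context, weight=0.5):
    """Context as part of the relevance model: boost results matching the user's interests."""
    def score(r):
        bonus = weight if r["category"] in context["interests"] else 0.0
        return r["score"] + bonus
    return sorted(results, key=score, reverse=True)


# Toy search results with a base relevance score (hypothetical data).
results = [
    {"title": "Pharmacy A", "score": 0.9, "category": "pharmacy", "open_from": 8, "open_to": 18},
    {"title": "Cafe C", "score": 0.8, "category": "cafe", "open_from": 7, "open_to": 22},
    {"title": "Pharmacy B", "score": 0.7, "category": "pharmacy", "open_from": 0, "open_to": 24},
]

# It is 20:00 and the user has a reminder that involves pharmacies.
context = {"hour": 20, "interests": {"pharmacy"}}

open_now = filter_by_context(results, context)
reranked = rerank_by_context(open_now, context)
print([r["title"] for r in reranked])
```

The filter removes Pharmacy A because it is closed, and the boost then moves Pharmacy B ahead of the cafe even though its base score is lower.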

The people who tested my application created 25% of their reminders as groups of locations and found it useful as it helped them find places they weren’t aware of, facilitating opportunistic behavior. The course and velocity information reduced the number of false-positive information deliveries. Overall, the system worked well as a niche product.
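How course and velocity were used is not detailed in this post, so the following is only one plausible sketch: deliver a location-based reminder only if the user is moving roughly towards the place, and fall back to location alone when the user is almost standing still (where a course reading is unreliable). The thresholds, field names and fallback rule are all assumptions made for illustration.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing (0 = north, 90 = east) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def heading_towards(course_deg, target_bearing_deg, tolerance_deg=45.0):
    """True if the course is within the tolerance of the bearing to the target."""
    diff = abs((course_deg - target_bearing_deg + 180.0) % 360.0 - 180.0)
    return diff <= tolerance_deg


def should_deliver(user, place, min_speed_mps=0.5, tolerance_deg=45.0):
    """Decide whether to deliver information about a nearby place (assumed rule)."""
    if user["speed_mps"] < min_speed_mps:
        # Almost standing still: ignore the course and rely on location only.
        return True
    target = bearing_deg(user["lat"], user["lon"], place["lat"], place["lon"])
    return heading_towards(user["course_deg"], target, tolerance_deg)


# A place slightly to the east of the user (toy coordinates).
place = {"lat": 0.0, "lon": 0.01}
walking_east = {"lat": 0.0, "lon": 0.0, "course_deg": 90.0, "speed_mps": 1.5}
walking_west = {"lat": 0.0, "lon": 0.0, "course_deg": 270.0, "speed_mps": 1.5}

print(should_deliver(walking_east, place), should_deliver(walking_west, place))
```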

Tobias Berg

Solr 3.1 released

April 5 - 2011 | Tobias Berg

Last Friday, Solr 3.1 was released along with Lucene 3.1. This might seem like a big step from the previous version, 1.4.1, but it is an effect of the merged development for Solr and Lucene that took place a year ago. The Solr version now reflects the Lucene version that is used.

For a complete list of new features and enhancements, you can read the release notes. Some of the most interesting features, though, are:

  • Extended dismax (edismax) query parser. It’s an enhancement over dismax, supports full Lucene query syntax etc.
  • Spatial search (i.e., we can now enable geo-search; sort by distance, boost by distance etc.)
  • Numeric range facets.
  • Lots of optimizations and performance improvements, including better Unicode and 64-bit JVM support.

Update: There’s a good list of features and enhancements at Sematext’s blog.

I’m really keen on the Spatial Search, which opens up a new set of applications, especially for Mobile Search, where you have the advantage of knowing the position of the user.
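To give an idea of what a spatial query could look like, here is a small sketch that builds the request parameters for a "nearby" search: filter to a radius around the user and sort by distance. The parameter names (`{!geofilt}`, `sfield`, `pt`, `d`, `geodist()`) follow my understanding of Solr's spatial search documentation; the host, the handler path and the location field name `store` are assumptions and need to match your own setup and schema.

```python
from urllib.parse import urlencode


def nearby_query(text, lat, lon, radius_km, field="store"):
    """Build query parameters for a distance-filtered, distance-sorted search.

    field is the name of a location field in the schema (assumed here).
    """
    params = {
        "q": text,
        "defType": "edismax",      # the extended dismax query parser
        "fq": "{!geofilt}",        # keep only documents within d km of pt
        "sfield": field,
        "pt": f"{lat},{lon}",
        "d": radius_km,
        "sort": "geodist() asc",   # nearest first
    }
    return urlencode(params)


# Hypothetical local Solr instance; adjust host and path to your installation.
query_string = nearby_query("pharmacy", 59.3293, 18.0686, 5)
url = "http://localhost:8983/solr/select?" + query_string
print(url)
```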

I’m glad the community pulled off this release after the merge with Lucene, and it will be fun to start working with it. What’s your favorite feature in 3.1? Drop a comment!

Daniel Ling

Open source tools for text analytics

March 21 - 2011 | Daniel Ling

Recently, both clients of Findwise and the Enterprise Search community in general have been showing increasing interest in text analytics as a way to get higher business value out of their (often large) volumes of unstructured information.

Text Analytics merges techniques from linguistics, computer science, machine learning and statistics, and many of the central algorithms in this field are publicly available as open source tools and packages with easily accessible APIs. While many customers of commercial Enterprise Search solutions, such as Autonomy, IBM Omnifind, Microsoft FAST ESP, etc., have long benefitted from some sort of Text Analytics (e.g. Entity Extraction, Keyword Extraction and document summarization), the open source components have now come a long way in providing alternative, free-of-charge solutions with similar performance and feature sets.

As every modern enterprise search architecture today has some kind of document processing that is extensible by additional stages or APIs (for example the Open Pipeline with Solr or the pipeline that comes with Microsoft FAST), the opportunity for plugging new text analytics stages into existing search implementations is open and ready for new innovation.

Among the most popular applications of text analytics that have emerged lately are customized entity extraction, sentiment analysis and document classification – each with a set of open source alternatives (such as Balie, OpenNLP and GATE) readily available for customization and implementation in your document processing.

Regardless of your industry domain, these techniques open up a wide variety of new ways to interpret your content and discover new trends in your unstructured textual data – be it through sentiment analysis to support the decision-making process, trend analysis or the relevance model of search, entity extraction in order to navigate your content by entities (such as company name or person), the enhancement of your texts by metadata tagging, or finding similar and related content.
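To show the shape of such a document processing stage, here is a deliberately simple sketch: a stage that looks up known names in the text and adds them to the document as entity metadata, which could then be used for navigation. The gazetteer, the field names and the pipeline function are made up for illustration; a real stage would more likely delegate the extraction to a toolkit such as OpenNLP or GATE.

```python
import re

# Hypothetical gazetteer of known entities, grouped by type.
GAZETTEER = {
    "company": ["Findwise", "Google", "Microsoft"],
    "person": ["Lou Rosenfeld", "Peter Morville"],
}


def entity_stage(doc):
    """Pipeline stage: tag the document with the known entities found in its text."""
    text = doc["text"]
    entities = {}
    for label, names in GAZETTEER.items():
        # Whole-word match so that a name is not found inside a longer word.
        found = [n for n in names if re.search(r"\b" + re.escape(n) + r"\b", text)]
        if found:
            entities[label] = found
    enriched = dict(doc)  # leave the input document untouched
    enriched["entities"] = entities
    return enriched


def run_pipeline(doc, stages):
    """Pass the document through each processing stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


doc = {"id": "1", "text": "Lou Rosenfeld spoke about search at Google."}
indexed = run_pipeline(doc, [entity_stage])
print(indexed["entities"])
```

Because each stage takes a document and returns an enriched copy, further stages (sentiment scoring, classification) could be appended to the list without changing the ones already there.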

How are you taking advantage of modern text analytics?