Archive for the ‘Conference’ Category

Martin Johansson

Data and Search Going Big?

April 25 - 2012 | Martin Johansson

A few enterprise search specialists from Findwise recently attended the Scandinavian Developer Conference 2012. One of the tracks was Big Data, which is very much related to search. It had some interesting talks about how to handle large amounts of data in an efficient way. Special thanks to Theo Hultberg, Jim Webber and Tim Berglund!

The theme was that you should choose a storage system which is well suited for the task. This may seem like an obvious point, but for a long time this was simply ignored; I’m talking about the era of relational databases. Don’t get me wrong, sometimes a relational database is the very best for the job, but in many cases it isn’t.

Data is jagged by nature, i.e. not all objects have the same properties. This is why we shouldn’t force them into a square table; instead, everything should be denormalized! The application accessing the data will be aware of the information structure and will handle it accordingly. This also avoids expensive assembly operations (such as joins) to get the data into the format we want when retrieving it. Why split up your data if you are going to assemble it over and over again? Also remember that disk space is cheap, so pre-compute as much as possible. The design of a Big Data system should be governed by how the data will be retrieved.
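To make the trade-off concrete, here is a minimal Python sketch that contrasts a normalized layout, which needs a join-like assembly on every read, with a denormalized one where each view is pre-computed at write time. The data, field names and helper functions are all invented for illustration; they do not come from any of the talks.

```python
# Hypothetical data: names and fields are made up for illustration.

# Normalized: two "tables" that have to be joined on every read.
authors = {1: {"name": "Ada"}, 2: {"name": "Linus"}}
posts = [
    {"id": 10, "author_id": 1, "title": "Jagged data"},
    # Jagged: this post has a "tags" property, the other one does not.
    {"id": 11, "author_id": 2, "title": "Cheap disks", "tags": ["storage"]},
]


def read_normalized(post_id):
    """Assemble the view at read time (the 'join')."""
    post = next(p for p in posts if p["id"] == post_id)
    view = dict(post)
    view["author_name"] = authors[post["author_id"]]["name"]
    return view


# Denormalized: pre-compute each view once, keyed by how it will be retrieved.
post_views = {p["id"]: read_normalized(p["id"]) for p in posts}


def read_denormalized(post_id):
    """A single lookup; no assembly needed at read time."""
    return post_views[post_id]
```

The price is duplicated data (the author name is copied into every view) and the need to rebuild views when the source changes, which is exactly the "disk space is cheap" bet.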

Another step away from relational databases is the relaxation of some of the ACID properties: Atomicity, Consistency, Isolation and Durability. Again, this is along the lines of choosing the components best suited for the system. Decide which properties are must-haves and which are not so important.

Relaxing the ACID properties, such as consistency, can give great performance gains. The NoSQL database Cassandra is eventually consistent and its write performance scales linearly up to 288 nodes (and probably even higher) which gives a write performance of over 1 million writes per second!

However, relaxation of these properties is not a new concept in the world of search engines. When indexing a document, it will usually take a number of seconds before it is searchable. This is called eventual consistency, i.e. the state of the search engine will be brought from one valid state to another, within a sufficiently long period of time. Do we really need documents that were just submitted to the search engine to be searchable instantly? Most likely, no. Isolation is another property that is not crucial to a search engine. Since a document in an index doesn’t have any explicit relations to any other documents in the same index, there isn’t a great need for isolation. If two writes for the same document are submitted at the same time, there is probably something wrong in another part of the system.
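The delay between indexing and searchability can be illustrated with a toy index in Python. This is only a sketch under simplifying assumptions: the class `ToyIndex` and its methods are hypothetical and do not mirror the API of any real search engine. Writes are buffered and only become visible to queries after an explicit refresh.

```python
class ToyIndex:
    """Toy index: writes are buffered and become searchable after refresh()."""

    def __init__(self):
        self._searchable = {}  # doc_id -> text visible to queries
        self._pending = {}     # doc_id -> text waiting for the next refresh

    def add(self, doc_id, text):
        # No isolation: if the same document is written twice, the last write wins.
        self._pending[doc_id] = text

    def refresh(self):
        # Move the index from one valid state to the next.
        self._searchable.update(self._pending)
        self._pending.clear()

    def search(self, term):
        term = term.lower()
        return sorted(
            doc_id
            for doc_id, text in self._searchable.items()
            if term in text.lower().split()
        )


index = ToyIndex()
index.add("doc-1", "Big Data and search")
before = index.search("data")  # the document is not visible yet
index.refresh()
after = index.search("data")   # now it is
```

Between `add` and `refresh` the index still answers queries from its previous valid state, which is the eventual consistency described above.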

So what does all this mean for search? There is an interesting challenge in storing jagged data in large amounts and then making good use of it. To search in vast amounts of jagged data, you need a lot of query-time field mappings (to make relevant data searchable) … or do you? There is also the issue of retaining a good relevancy model, which is absolutely vital to a search engine. How do you measure the relevance of arbitrary metadata and then weigh it all together? Maybe we need to think in new ways about relevance altogether?

Whoever can solve these problems well, with a minimum of manual labor, is a name we’ll be hearing a lot in the future.


Paula Petcu

A look at European Conference on Information Retrieval (ECIR) 2012

April 18 - 2012 | Paula Petcu
The 34th European Conference on Information Retrieval was held 1-5 April 2012, in the lovely but crowded city of Barcelona, Spain. The core conference attracted over 100 attendees, with a total of 35 accepted full papers, 28 posters, and 7 demos being presented. As opposed to the previous year, which had 2 parallel sessions, this year’s conference ran as a single session. The accepted papers covered a diverse range of topics, and were divided into query representation, blog and online-community search, semi-structured retrieval, applications, evaluation, retrieval models, classification, categorisation and clustering, image and video retrieval, and systems efficiency.

The best paper award went to Guido Zuccon, Leif Azzopardi, Dell Zhang and Jun Wang for their work entitled “Top-k Retrieval using Facility Location Analysis”, presented by Leif Azzopardi during the retrieval models session. The authors propose using facility location analysis, taken from the discipline of operations research, to address the top-k retrieval problem of finding “the optimal set of k documents from a number of relevant documents given the user’s query”.
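For readers unfamiliar with facility location, the general idea can be sketched in a few lines of Python. Note that this is not the authors' method: it is a generic greedy heuristic for one classic facility-location objective, where k documents are picked so that every relevant document is similar to at least one picked document. The function name, the objective and the similarity matrix are assumptions made for illustration.

```python
def greedy_facility_location(sim, k):
    """Greedily pick k indices S to maximize sum_i max_{j in S} sim[i][j].

    sim is a square matrix of non-negative similarities between documents.
    """
    n = len(sim)
    selected = []
    best = [0.0] * n  # how well each document is covered by the selection so far

    for _ in range(min(k, n)):
        def gain(j):
            # Improvement in total coverage if document j were added.
            return sum(max(sim[i][j] - best[i], 0.0) for i in range(n))

        candidates = [j for j in range(n) if j not in selected]
        j_star = max(candidates, key=gain)
        selected.append(j_star)
        best = [max(best[i], sim[i][j_star]) for i in range(n)]

    return selected


# Hypothetical similarities: documents 0-1 form one cluster, 2-4 another.
sim = [
    [1.0, 0.9, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8, 0.5],
    [0.1, 0.1, 0.8, 1.0, 0.8],
    [0.1, 0.1, 0.5, 0.8, 1.0],
]
top2 = greedy_facility_location(sim, 2)
```

With this matrix the first pick is document 3, the most central one in the larger cluster, and the second pick comes from the other cluster, so the two results together cover both groups rather than repeating near-duplicates.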

Meanwhile, “Predicting IMDB Movie Ratings using Social Media” by Andrei Oghina, Mathias Breuss, Manos Tsagkias and Maarten de Rijke won the best poster award. With a different goal from the best paper, the authors of the poster experiment with a prediction model for rating movies using a set of qualitative and quantitative features extracted from the streams of two social media channels, YouTube and Twitter. Their findings show that the highest predictive performance is obtained by combining features from both channels, and they propose including other social media channels as future work.

The conference was preceded by a full day of workshops and tutorials running in parallel. I attended two workshops: Information Retrieval Over Query Sessions (SIR) during the morning and Task-Based and Aggregated Search (TBAS) in the afternoon. The second workshop ended with an interactive discussion. A third, full-day workshop was Searching 4 Fun!.

The last day was the Industry Day. Only 2 papers here, plus 5 oral contributions, and around 50 attendees. A strong focus of the talks given at the industry day was on opinion mining: four of the six participating companies/institutions presented work on sentiment analysis and opinion mining from social media streams. Jussi Karlgren, from Gavagai, argued that companies can use sentiment analysis of social media to, for example, find reviews or comments made about their product or service, analyse their market position, and predict price movements. Rianne Kaptein, from Oxyme, backed this up by adding that businesses are interested in what consumers say about their brand, products or campaigns on social media streams. Furthermore, Hugo Zaragoza from Websays identified two basic needs inside a company: a need for help in reading so that someone can act, and a need for help in explaining so that it can convince. Very interesting topic indeed, and research in this direction will advance as companies become more aware of the business gains from opinion mining of social media.

Overall, ECIR 2012 was a very inspiring conference. It also seemed a very friendly conference, offering many opportunities to network with fellow attendees. That said, several participants mentioned that the number of attendees at this year’s conference had decreased in comparison with previous years. The workshops and the core conference gave me the impression that ECIR has a strong focus on young researchers, as many of the accepted contributions had a student as first author and presenter at the conference. The fact that there was only one session running at a time was a good decision in my opinion, as the attendees were not forced to miss presentations. Nevertheless, the workshops and tutorials were running in parallel, and although the proceedings of the workshops will be made freely available, I still feel that I missed something that day. The industry day was very exciting, offering the opportunity to share ideas between academia and industry. However, there were not so many presentations, and the topics were not as diverse. I propose that next year Findwise should be among the speakers at the Industry track!

ECIR 2013 will be held in Moscow, Russia, on 24-28 March. See you there!


Kristian Norling

Search Analytics in Practice

March 1 - 2012 | Kristian Norling
Presentation made by @kristiannorling at IntraTeam 2012, 1st of March in Copenhagen.

Caroline Abrahamsson

Search in the Digital Workplace

February 9 - 2012 | Caroline Abrahamsson

Last week we (Caroline Abrahamsson and Kristian Norling) had the opportunity to act as moderators for a conference on the Digital Workplace in Stockholm. Amongst the many good presentations, the keynote by Jane McConnell was a gem. The Digital Workplace Trends report by Jane gives many insights into the intranet world, or as Jane and many others prefer to call it, the Digital Workplace (participants in the survey receive a free copy of the report, highly recommended!). One of the most interesting parts for us was the four different future scenarios that Jane described during her session and that the survey participants had voted on (on a scale of low, medium or high business value):

  • “My apps” – The intranet is a set of highly customized apps. People select what they need to do their jobs and build their own “intranet” like on an iPad.
  • “Smart systems” – The user experience is efficient and relevant because information is delivered in meaningful ways based on past behavior and context.
  • “People-centric” – Social networking, social tagging, location awareness, presence indicators and other technologies are integrated into processes and how people work daily.
  • “Super search” – Various search technologies come together to offer people greater relevance and control over vast amounts of information from inside and outside the enterprise.

p. 19 Digital Workplace Trends 2012

When Jane asked the audience at the conference if they thought Super Search had “high potential value”, a whopping 100% answered yes! In the Digital Workplace Trends report 70% of the participants considered Super Search to have a “high potential value”, and 20% of the leadership group has started implementing it.

The Digital Workplace: Redefining Productivity in the Information Age by Infocentric Research is another excellent (and free) source on the current state of the Digital Workplace. This report also mentions good search as very important for getting work done in the digital workplace:

“Imagine that each and every employee in your organization would spend 1 to 2 full working hours per day surfing the web and social media sites (such as Facebook, YouTube and Twitter) purely for private pleasure. Would that be acceptable for you? And even more important: would it leave your bottom line results unaffected?

The answer to both questions of course is clear “No”. But the bad news is that your employees spend just that amount of time for something even worse. And they do so with full allowance by management and in accordance to accepted work practices in your organization. What they do, what you do as well, is looking for information they need to do their job and ineffectively working with that information.”

p. 4 The Digital Workplace: Redefining Productivity in the Information Age

When reading these excellent reports it is quite obvious to us that a “Super Search”, i.e. an Enterprise Search solution that can reach all types of information, is very much in demand. Many organizations that have worked extensively with search for many years understand that this is actually a never-ending task. But search is still a very cost-effective and hands-on solution for many information and knowledge intensive tasks.

“Information based work is driven and determined by having the right information to perform the task at hand. For this, the information has to be there when needed. Looking for the right information to do something therefore constitutes one of the most relevant of all tasks. In fact, “searching” in all its forms is the most ubiquitous activity that information workers perform in their jobs”.

p. 15 The Digital Workplace: Redefining Productivity in the Information Age

To conclude, the new digital workplace is transforming the way we work, interact and communicate. The discussions during the conference showed that almost all organizations were in a transformation phase where the traditional intranet (with static pages updated by editors) is being complemented (and in some cases replaced altogether) with collaboration areas and flexible work tools. We look forward to this year’s development and hope to share some good cases with you, especially with regard to search, collaboration and mobility.

More reading on the Digital Workplace

Intranet Pioneer Mark Morell

Connaxions / Martin Risgaard 

The Intranet Benchmarking Forum

Kristian Norling

Text Analytics in Enterprise Search

January 11 - 2012 | Kristian Norling

A presentation made by Daniel Ling at Apache Lucene Eurocon in Barcelona, October 2011.

We think this is the first of many forthcoming presentations.

We also want to get more involved in the community in the future by doing presentations, sponsoring and contributing code. We hope to bring more news on this subject in the next few weeks. Enjoy the presentation:

Text Analytics in Enterprise Search, Daniel Ling, Findwise, Eurocon 2011 from Lucene Revolution on Vimeo.