User experience is a key criterion for evaluating an information system. Yet the evaluation method that has ruled the IR community for decades is to compute "scores" such as precision and NDCG over test collections. This method is cheap and repeatable but agrees only weakly with real user experience (Turpin & Scholer, 2006). Today deep neural networks help our systems make progress every day, but we need to ensure this "progress" heads in the correct direction.
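To make the conventional approach concrete, here is a minimal sketch of how an NDCG score is computed from graded relevance judgments. The function names and the example gain values are illustrative, not taken from any particular evaluation toolkit:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each result's gain is
    discounted by log2(rank + 1), so lower ranks count less."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k=None):
    """NDCG@k: DCG of the observed ranking divided by the DCG of
    the ideal (best possible) reordering of the same judgments."""
    observed = gains[:k] if k else gains
    ideal = sorted(gains, reverse=True)[:k] if k else sorted(gains, reverse=True)
    idcg = dcg(ideal)
    return dcg(observed) / idcg if idcg > 0 else 0.0

# A ranking with graded judgments (2 = highly relevant, 0 = not relevant):
print(round(ndcg([2, 0, 1], k=3), 4))  # prints 0.9502
```

The score depends only on the relevance judgments and the ranking, which is exactly why it is cheap and repeatable, and also why it can diverge from what users actually experience.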
I believe the discrepancy between evaluation methods and user experience is a key issue for IR and many other information systems. I tackle this challenge by developing automatic evaluation techniques that are faithful to user experience but cost less than collecting explicit user feedback. I have developed techniques that predict user experience from behavioral data in system logs, or even without involving real users at all, through user modeling and simulation.
Inferring User Experience from Search Logs
Current search engines rely on behavioral statistics such as click-through rates to diagnose user experience. I am a leading expert in this research direction, known for my work on predicting user satisfaction in search engines and intelligent voice assistants. My cost-benefit framework for predicting search engine user satisfaction has been widely used as a standard baseline in the past few years and has contributed to Microsoft Bing's online evaluation modules. I am also an expert on user interaction and experience in voice search. During an internship at Microsoft Research in 2014, I designed the first automatic framework for evaluating intelligent voice assistants based on heterogeneous behavioral signals, ranging from voice feedback to mobile screen touch.
Predicting User Experience Without Involving Real Users
Main collaborator: James Allan
My recent work tries to remove the data constraints of user experience modeling—only search engine companies have enough user traffic to evaluate systems in real time using the above methods. I recast conventional test-collection-based search evaluation (such as computing precision and NDCG scores) as the problem of predicting the potential search experience of users without involving real users. I designed improved evaluation metrics for IR systems based on user modeling and simulation. My latest work on this topic improves user experience modeling for search engines by assessing and predicting search result quality along dimensions other than relevance (the current main criterion), including novelty, understandability, credibility, and effort.
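The user-modeling view of evaluation can be illustrated with a well-known simulation-based metric, rank-biased precision (Moffat & Zobel): a simulated user scans results top-down and continues to the next result with a persistence probability p. This is a generic textbook example, not the specific metrics developed in my work:

```python
def rbp(gains, p=0.8):
    """Rank-biased precision: the expected gain of a simulated user
    who inspects result i+1 after result i with probability p.
    Score = (1 - p) * sum_i gain_i * p^i (0-indexed ranks)."""
    return (1 - p) * sum(g * p ** i for i, g in enumerate(gains))

ranking = [1, 0, 1]          # binary relevance of the top 3 results
print(rbp(ranking, p=0.5))   # impatient user: 0.5 * (1 + 0 + 0.25) = 0.625
```

Changing p changes the simulated user's behavior: an impatient user (low p) is dominated by the top result, while a patient user (high p) rewards relevant results deeper in the ranking. Richer user models generalize this idea beyond a single persistence parameter.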
Jiepu Jiang and James Allan. Correlation between system and user metrics in a session. In Proceedings of the First ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '16), 2016. (short paper)
Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, and Tasos Anastasakos. Understanding user satisfaction with intelligent assistants. In Proceedings of the First ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '16), 2016.
Jiepu Jiang, Ahmed Hassan Awadallah, Rosie Jones, Umut Ozertem, Imed Zitouni, Ranjitha Gurunath Kulkarni, and Omar Zia Khan. Automatic online evaluation of intelligent assistants. In Proceedings of the 24th International Conference on World Wide Web (WWW '15), 2015.
Jiepu Jiang, Ahmed Hassan Awadallah, Xiaolin Shi, and Ryen W. White. Understanding and predicting graded search satisfaction. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15), 2015.
Jiepu Jiang, Daqing He, Shuguang Han, Zhen Yue, and Chaoqun Ni. Contextual evaluation of query reformulations in a search session by user simulation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM '12), 2012. (poster)