User experience is a key criterion for judging how good a system is. Yet the evaluation method that has ruled the IR community for decades is to compute offline "scores" such as precision and NDCG. This method is cheap and repeatable but shows only limited agreement with real user experience (Turpin & Scholer, 2006). Today deep neural networks drive steady progress in our systems, but we need to ensure that this "progress" heads in the right direction.
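To make the offline "scores" concrete, below is a minimal sketch of NDCG over a ranked list of graded relevance judgments (the grades here are made up for illustration):

```python
import math

def dcg(grades):
    # Discounted cumulative gain: relevant results count more at higher ranks.
    return sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(ranked_grades, k=10):
    # Normalize by the DCG of the ideal (descending-grade) ordering,
    # so a perfect ranking scores 1.0.
    ideal = sorted(ranked_grades, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_grades[:k]) / denom if denom > 0 else 0.0

# A ranking that buries the best result below a marginal one scores under 1.0.
print(round(ndcg([1, 3, 2, 0]), 3))
```

The score summarizes a ranking with a single number, which is exactly what makes it cheap and repeatable, and also what limits how much it can say about an individual user's experience.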

I believe the discrepancy between evaluation methods and user experience is a key issue for IR and many other information systems. I tackle this challenge by developing automatic evaluation techniques that are faithful to user experience but cost far less than collecting explicit user feedback. I have developed techniques that predict user experience from behavioral data in system logs, or even without involving real users at all, through user modeling and simulation.

Inferring User Experience from Search Logs

Main collaborators: Ahmed Hassan Awadallah (MSR), Ryen White (Microsoft Health), Rosie Jones (Microsoft)

Current search engines rely on behavioral statistics such as click-through rates to diagnose user experience. I am a leading expert in this research direction, known for my work on predicting user satisfaction in search engines and intelligent voice assistants. My cost-benefit framework for predicting search engine user satisfaction has been widely used as a standard baseline in recent years and has contributed to Microsoft Bing's online evaluation modules. I am also an expert on user interaction and experience in voice search. During an internship at Microsoft Research in 2014, I designed the first automatic framework for evaluating intelligent voice assistants based on heterogeneous behavioral signals, ranging from voice feedback to mobile screen touches.
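To illustrate the general idea of inferring satisfaction from logs (this is a toy heuristic, not my cost-benefit framework), one classic signal is click dwell time: a click followed by roughly 30 seconds or more on the landing page is often treated as a "satisfied" click.

```python
# Toy illustration only; real systems learn weights over many behavioral signals.
def predict_satisfaction(clicks):
    """clicks: list of (dwell_seconds, was_last_action) tuples from a search log."""
    if not clicks:
        # Query abandonment without any click is treated as dissatisfaction here,
        # although in practice "good abandonment" also exists.
        return False
    # Heuristic: a long-dwell click (>= 30s) or a session-ending click
    # suggests the user found what they needed.
    return any(dwell >= 30 or last for dwell, last in clicks)

print(predict_satisfaction([(5, False), (42, False)]))  # True
```

A learned model replaces the fixed 30-second threshold with weights estimated from labeled sessions, and adds many more signals (query reformulations, scrolls, touches, voice feedback).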

Predicting User Experience Without Involving Real Users

Main collaborators: James Allan

My recent work aims to reduce the data requirements of user experience modeling: only search engine companies have enough user traffic to evaluate systems in real time with the methods above. I recast conventional test-collection-based search evaluation (such as computing precision and NDCG scores) as the problem of predicting the potential search experience of users without involving real users. I designed improved evaluation metrics for IR systems based on user modeling and simulation. My latest work on this topic improves user experience modeling for search engines by assessing and predicting search result quality along dimensions beyond relevance (the current main criterion), including novelty, understandability, credibility, and effort.
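As one example of a user-model-based metric (rank-biased precision of Moffat & Zobel, shown here for illustration rather than my own adaptive-persistence variants), a simulated user scans down the ranking and continues to the next result with a fixed persistence probability p:

```python
def rbp(rels, p=0.8):
    # Rank-biased precision: expected utility for a simulated user who
    # examines the result at rank i+1 with probability p**i, then stops.
    return (1 - p) * sum(rel * p**i for i, rel in enumerate(rels))

# An impatient user (small p) is dominated by the top result;
# a persistent user (large p) gives deeper results more weight.
print(round(rbp([1, 0, 1, 1], p=0.5), 4))
```

Making the persistence parameter adapt to the query and the observed result quality, instead of fixing it in advance, is the direction my adaptive-metric work explores.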

Related Publications:


Jiepu Jiang and James Allan. Adaptive Persistence for Search Effectiveness Measures. In Proceedings of the 26th ACM International Conference on Information and Knowledge Management (CIKM '17), 2017.

Jiepu Jiang and James Allan. Adaptive effort for search evaluation metrics. In Proceedings of the 38th European Conference on Information Retrieval (ECIR '16), 2016.

Jiepu Jiang and James Allan. Correlation between system and user metrics in a session. In Proceedings of the First ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '16), 2016. (short paper)

Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, and Tasos Anastasakos. Understanding user satisfaction with intelligent assistants. In Proceedings of the First ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '16), 2016.

Jiepu Jiang, Ahmed Hassan Awadallah, Rosie Jones, Umut Ozertem, Imed Zitouni, Ranjitha Gurunath Kulkarni, and Omar Zia Khan. Automatic online evaluation of intelligent assistants. In Proceedings of the 24th International Conference on World Wide Web (WWW '15), 2015.

Jiepu Jiang, Ahmed Hassan Awadallah, Xiaolin Shi, and Ryen W. White. Understanding and predicting graded search satisfaction. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15), 2015.

Jiepu Jiang and Daqing He. Simulating user selections of query suggestions. In the SIGIR '13 Workshop on Modeling User Behavior for Information Retrieval Evaluation, 2013.

Jiepu Jiang, Daqing He, Shuguang Han, Zhen Yue, and Chaoqun Ni. Contextual evaluation of query reformulations in a search session by user simulation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM '12), 2012. (poster)