The Nerthus Dataset

Bowel preparation (cleansing) is considered a key precondition for successful colonoscopy (endoscopic examination of the bowel). The degree of bowel cleansing directly affects the ability to detect disease and may influence decisions on screening and follow-up examination intervals. An accurate assessment of bowel preparation quality is therefore important. Despite the use of reliable and validated bowel preparation scales, the grading may vary from one doctor to another. An objective and automated assessment of bowel cleansing would help reduce such variation and optimize the use of medical resources. It would also be a valuable feature for automatic endoscopy reporting in the future. Here, we present Nerthus, a dataset containing videos from inside the gastrointestinal (GI) tract, showing different degrees of bowel cleansing. By providing this dataset, we invite multimedia researchers to contribute to the medical field by building systems that automatically evaluate the quality of bowel cleansing for colonoscopy. Such innovations would probably help improve the field of GI endoscopy.


Background

The large bowel, also called the colon or large intestine, is the lower part of the human gastrointestinal tract. It may be affected by severe diseases including cancer and chronic inflammation. Bowel cancer (colorectal cancer) is currently the third most common cancer worldwide, accounting for nearly 1.4 million new cases and 700,000 cancer deaths in 2012. The current gold standard for diagnostic and screening investigations of the large bowel is colonoscopy. This is a real-time video examination of the inside of the large bowel using a digital high-definition endoscope. Such endoscopic examinations are resource demanding and require both expensive technical equipment and trained personnel. Furthermore, the efficiency of colonoscopy depends on sufficient bowel cleansing to visualize the mucosa (a membrane that lines the GI tract), achieved by use of oral laxatives (substances that loosen stools and increase bowel movements) administered prior to the procedure. The quality of bowel preparation has been shown to influence both the colonoscopy completion rate and the detection of possible precursors of cancer (e.g., adenomas, which are benign tumors of epithelial tissue). The adenoma detection rate (ADR), which is inversely associated with a patient’s risk of developing colorectal cancer, has been proven to depend on the quality of bowel preparation. Therefore, the degree of bowel preparation is considered to be a reliable quality measure for colonoscopy. Quality of bowel preparation may also influence decisions on screening and follow-up intervals, since low-quality bowel preparation requires repeated colonoscopy. An accurate description of the bowel cleanliness is therefore needed. Despite the use of reliable and validated bowel preparation scales, the grading may vary from one doctor to another. An objective and automated assessment of bowel cleansing may help reduce such variation and optimize the use of medical resources.
Since endoscopic examinations are real-time investigations, both normal and abnormal findings have to be recorded and documented in written reports. Automatic report generation will therefore probably reduce the time doctors spend on paperwork and thereby free more time for patient care. To our knowledge, a standardized and automatic reporting system that ensures high-quality endoscopy reports does not exist. Assessment of bowel cleanliness would be a valuable feature for such automatic endoscopy reporting in the future. In this area of research, researchers are starting to see the synergies between multimedia and medical systems. The development of real-time classification systems is a perfect match at the intersection of medicine, multimedia systems and image/video retrieval. Prototypes like the EIR system, which targets analysis of medical videos for detection of abnormalities, may be a starting point. However, such systems require a lot of data for development, training and testing. To the best of our knowledge, no medical dataset exists for this type of data. In this paper, we therefore present the Nerthus dataset. It contains 21 videos with a total of 5,525 frames, annotated and verified by medical doctors (experienced endoscopists). The videos are divided into four classes of predefined bowel-preparation qualities. Our initial experiments indicate potential for improvement. In many cases, we are able to detect the annotated bowel cleansing quality. However, to deliver an automatic and reliable system for endoscopy units, more work is needed. By providing the Nerthus dataset, we invite researchers to contribute in order to improve important systems for automatic assessment and reporting for GI endoscopy.


Bowel Preparation Quality

Traditionally, the bowel preparation quality has been categorized as poor, adequate or good. Such classification of bowel cleanliness often lacks clear definitions, and the judgement on quality tends to be subjective. This may result in significant inter-observer variation. In addition, such a traditional categorization relies on a global assessment of bowel cleanliness, which does not account for differences in cleansing between bowel segments. Poor quality of preparation in one segment may then result in a low overall grading, despite an otherwise perfectly cleaned bowel. To minimize the inter-endoscopist variation, new score-based methods of assessing bowel cleanliness have been introduced during the last decade. State-of-the-art scoring systems include the Boston bowel preparation scale (BBPS) and the Ottawa bowel preparation scale (OBPS). Both scales divide the bowel into three sections (right, middle and left) and score the bowel cleansing within each section according to a defined numeric scale. OBPS uses segmental scores ranging from 0 to 4 in addition to a global three-point fluid-quantity rating, which requires estimation of residual liquid. In contrast, the BBPS uses only a four-point scoring system (ranging from 0 to 3). Figure 2 illustrates the segmental division of the large bowel used for bowel preparation assessment according to BBPS and OBPS. In this paper, we use BBPS, as it is probably the best validated and most frequently used scoring system in both routine clinical and screening settings today. The BBPS is tested and validated to assess the cleanliness at withdrawal. It does not take into account whether the endoscopist has performed any additional cleansing maneuvers, which reflects the actual practice of colonoscopy. The definitions of the BBPS segmental scores are given in table 1. The segmental scores range from 0 to 3, where 0 is the worst and 3 is the best quality of bowel preparation.
Examples of the different categories are shown in the Dataset Details section. In real colonoscopy examinations, a segmental score is assigned to each of the three bowel segments, and the three scores are summed to a total score ranging from 0 to 9. In the Nerthus dataset, however, all videos are recorded in the left part of the bowel. Automatic detection of scope position and total score calculation are therefore irrelevant for this dataset, and only the quality of bowel preparation expressed as a segmental score is of value here. For the future development of automated systems, detecting position and assessing the total BBPS score will be of interest.
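The relationship between segmental and total BBPS scores described above can be illustrated with a minimal Python sketch. The function name bbps_total and the segment keys "right", "middle" and "left" are our own illustrative choices and are not part of the dataset. Since all Nerthus videos come from the left part of the bowel, this total cannot be computed from Nerthus alone.

```python
from typing import Dict

# Segment names follow the three-part division described in the text.
SEGMENTS = ("right", "middle", "left")


def bbps_total(segment_scores: Dict[str, int]) -> int:
    """Sum the three BBPS segmental scores (each 0-3) into a total of 0-9."""
    if set(segment_scores) != set(SEGMENTS):
        raise ValueError(f"expected exactly the segments {SEGMENTS}")
    for name, score in segment_scores.items():
        # Each segmental score must be an integer on the four-point scale.
        if not isinstance(score, int) or not 0 <= score <= 3:
            raise ValueError(f"score for {name!r} must be an integer from 0 to 3")
    return sum(segment_scores.values())


# Hypothetical examination: right = 2, middle = 3, left = 1.
example_total = bbps_total({"right": 2, "middle": 3, "left": 1})
```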


Data Collection

The data is collected using equipment as shown in figure 3 at Bærum hospital, Vestre Viken Hospital Trust in Norway. Furthermore, the videos are annotated by one or more medical experts from the Cancer Registry of Norway. A selection of the videos will in addition be annotated by several medical experts from Norway, Sweden, UK, US and Canada through a web-based test. These video clips will be marked as the gold standard within the dataset and will be released as an addition to the Nerthus dataset with higher-quality bowel preparation assessments.


Dataset Details

The Nerthus dataset consists of 21 videos with a total of 5,525 frames, annotated and verified by medical doctors (experienced endoscopists), and includes 4 classes showing the four-point BBPS-defined bowel-preparation quality. The number of videos per class varies from 1 to 10. The number of frames per class varies from 500 to 2,700. The number of videos and frames is sufficient for different tasks, e.g., image retrieval, machine learning, deep learning and transfer learning. The videos have a resolution of 720x576, and the dataset is organized by sorting the videos into separate folders named according to their BBPS bowel preparation quality score. Most of the included videos and images have a green picture in each frame, illustrating the position and configuration of the endoscope inside the bowel. This is obtained by use of an electromagnetic imaging system (ScopeGuide, Olympus Europe) and may support the interpretation of the image. This type of information may be important for later investigations on segmental position within the bowel, but must be handled with care when assessing bowel preparation quality.
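Because the videos are sorted into one folder per BBPS score, a folder name can serve directly as a class label. The following Python sketch indexes such a layout. The exact folder and file names in the download are not specified here, so the code only assumes "one subfolder per class"; the function names and any path passed in are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def index_by_score(root: Path) -> Dict[str, List[Path]]:
    """Map each class folder name under `root` to the sorted files it contains.

    Assumption: every direct subfolder of `root` is one BBPS class and
    holds the files for that class. Files directly in `root` are ignored.
    """
    index: Dict[str, List[Path]] = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        index[class_dir.name] = sorted(f for f in class_dir.iterdir() if f.is_file())
    return index


def summarize(index: Dict[str, List[Path]]) -> Dict[str, int]:
    """Return the number of files per class label."""
    return {label: len(files) for label, files in index.items()}
```

Calling summarize(index_by_score(Path("path/to/nerthus"))), with the placeholder path replaced by the local download location, would give a per-class count that can be checked against the numbers stated above.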


Applications of the Dataset

Our vision is that the available data may eventually help researchers develop systems that improve health care in the context of endoscopic diagnosis of the GI tract. Adequate bowel preparation (cleansing) is required to achieve high-quality colonoscopy examinations. Despite the use of reliable and validated bowel preparation assessment scales, the grading may vary from one doctor to another. By providing the Nerthus dataset, we invite multimedia researchers to contribute to the medical field by building systems that can automatically and consistently evaluate the quality of bowel cleansing. Innovations in this area that contribute computer-aided assessment and automatic reporting may potentially improve the field of GI endoscopy. In the end, the improved quality of GI tract investigations will probably significantly reduce mortality and the number of luminal GI disease incidents. With respect to direct use in multimedia research, the main application area of Nerthus is automatic evaluation of the quality of bowel cleansing. The provided dataset can thus be used in several scenarios where the aim is to develop and evaluate algorithmic analysis of images. By using the same collection of data, researchers can more easily compare approaches and experimental results, and results can more easily be reproduced. In particular, in the area of image retrieval and object detection, Nerthus can play an important initial role: the image collection can be divided into training and test sets for the development and evaluation of various image retrieval and object localization methods, including search-based systems, neural networks, video analysis, information retrieval, machine learning, object detection, deep learning, computer vision, data fusion and big data processing.
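When dividing the collection into training and test sets, frames from the same video tend to look very similar, so one possible precaution is to split by video rather than by frame. The Python sketch below shows this idea under our own assumptions; it is not a split prescribed by the dataset authors, and the (video_id, frame_name) pairs are a hypothetical representation of the data. Classes that contain only a single video cannot appear in both sets with this approach.

```python
import random
from typing import List, Sequence, Tuple

Frame = Tuple[str, str]  # (video_id, frame_name), a hypothetical representation


def split_by_video(
    frames: Sequence[Frame], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[Frame], List[Frame]]:
    """Split frames into (train, test) so that no video contributes to both."""
    video_ids = sorted({video_id for video_id, _ in frames})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(video_ids)
    # Hold out at least one video for testing.
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_ids = set(video_ids[:n_test])
    train = [f for f in frames if f[0] not in test_ids]
    test = [f for f in frames if f[0] in test_ids]
    return train, test
```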


Suggested Metrics

Related work in this area uses many different metrics, sometimes with different names in the medical field and in computer science (information retrieval). Here, we provide a short list of the most important ones. For future research, in addition to describing the dataset with respect to the total number of images, the number of images in each class and the total number of positives, it is useful to report as many of the metrics below as possible in order to enable an indirect comparison with older work:
True positive (TP) - The number of correctly identified positive samples, i.e., frames with an endoscopic finding that are correctly identified as frames with an endoscopic finding.
True negative (TN) - The number of correctly identified negative samples, i.e., frames without an endoscopic finding that are correctly identified as frames without an endoscopic finding.
False positive (FP) - The number of samples wrongly identified as positive, commonly called a "false alarm", i.e., frames without an endoscopic finding that are erroneously identified as frames with an endoscopic finding.
False negative (FN) - The number of samples wrongly identified as negative, i.e., frames with an endoscopic finding that are erroneously identified as frames without an endoscopic finding.
Recall (REC) - This metric is also frequently called sensitivity, probability of detection and true positive rate, and it is the ratio of samples that are correctly identified as positive among all existing positive samples.
Precision (PREC) - This metric is also frequently called the positive predictive value, and shows the ratio of samples that are correctly identified as positive among the returned samples (the fraction of retrieved samples that are relevant).
Specificity (SPEC) - This metric is frequently called the true negative rate, and shows the ratio of negatives that are correctly identified as such (e.g., the fraction of frames without an endoscopic finding that are correctly identified as negative).
Accuracy (ACC) - The percentage of correctly identified samples, positive and negative, among all samples.
Matthews correlation coefficient (MCC) - MCC takes into account true and false positives and negatives, and is a balanced measure even if the classes are of very different sizes.
F1 score (F1) - A measure of a test’s accuracy by calculating the harmonic mean of the precision and recall.
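The metrics listed above can all be derived from the four counts TP, TN, FP and FN. The following Python sketch uses the standard textbook formulas; the function name binary_metrics and the convention of returning 0.0 when a denominator is zero are our own assumptions. For the four BBPS classes, one common option is to compute these values per class in a one-vs-rest fashion.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute REC, PREC, SPEC, ACC, F1 and MCC from confusion counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    rec = ratio(tp, tp + fn)    # recall / sensitivity / true positive rate
    prec = ratio(tp, tp + fp)   # precision / positive predictive value
    spec = ratio(tn, tn + fp)   # specificity / true negative rate
    acc = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * prec * rec, prec + rec)  # harmonic mean of PREC and REC
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_denominator)
    return {"REC": rec, "PREC": prec, "SPEC": spec, "ACC": acc, "F1": f1, "MCC": mcc}


# Made-up counts for illustration only, not results from the dataset.
example = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```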
In addition to the above metrics, system performance metrics such as processing speed and resource consumption are of interest. In our work, we have used the achieved frame rate (frames per second, FPS) as a metric, since real-time feedback is important.
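A frame rate of this kind can be estimated by timing a processing function over a sequence of frames. The Python sketch below is one simple way to do so and is not the measurement procedure used in our work; measure_fps and the process_frame callback are hypothetical names, and the callback stands in for whatever classifier is being evaluated.

```python
import time
from typing import Callable, Iterable


def measure_fps(process_frame: Callable[[object], object], frames: Iterable[object]) -> float:
    """Return the number of frames processed per second of wall-clock time."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)  # e.g., classify one frame
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return count / elapsed if elapsed > 0 else float("inf")
```

Note that the measured time includes fetching each frame from the iterable, so frame decoding counts toward the result if the iterable decodes video lazily.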


Download

All documents and papers that report experimental results based on the Nerthus dataset should include a reference to the following paper:

Konstantin Pogorelov, Kristin Ranheim Randel, Thomas de Lange, Sigrun Losada Eskeland, Carsten Griwodz, Dag Johansen, Concetto Spampinato, Mario Taschwer, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, Pål Halvorsen, Nerthus: A Bowel Preparation Quality Video Dataset, In MMSys'17 Proceedings of the 8th ACM on Multimedia Systems Conference, Pages 170-174, Taipei, Taiwan, June 20-23, 2017.

BibTeX:
@inproceedings{Pogorelov:2017:NBP:3083187.3083216,
author = {Pogorelov, Konstantin and Randel, Kristin Ranheim and de Lange, Thomas and Eskeland, Sigrun Losada and Griwodz, Carsten and Johansen, Dag and Spampinato, Concetto and Taschwer, Mario and Lux, Mathias and Schmidt, Peter Thelin and Riegler, Michael and Halvorsen, P{\aa}l},
title = {Nerthus: A Bowel Preparation Quality Video Dataset},
booktitle = {Proceedings of the 8th ACM on Multimedia Systems Conference},
series = {MMSys'17},
year = {2017},
isbn = {978-1-4503-5002-0},
location = {Taipei, Taiwan},
pages = {170--174},
numpages = {5},
url = {http://doi.acm.org/10.1145/3083187.3083216},
doi = {10.1145/3083187.3083216},
acmid = {3083216},
publisher = {ACM},
address = {New York, NY, USA},
}


Contact

Email konstantin (_at_) simula (_dot_) no if you have any questions about the dataset or our research activities. We always welcome collaboration and joint research!