Daniela's PhD

Implementation of MediaPipe 2D detector and application to MPI-INF-3DHP dataset

MediaPipe as a 2D detector

We concluded during the last meeting that the performance of OpenPose was not good enough for our needs or for a robust 3D detection. For that reason, I added MediaPipe as another option for the 2D detector.

MediaPipe proved to be much more effective than OpenPose at detecting 2D poses in the Human 3.6M dataset. As we can see in the following image, the colored crosses (projections of the 3D detections) match the squares (MediaPipe detections) almost perfectly.

Despite these good results, the quantitative evaluation is still a challenge: the MediaPipe skeleton (squares) is different from the ground truth skeleton (black crosses), which creates a gap between corresponding joints and makes it impossible to truly evaluate the results.

The following table shows the results of the evaluation using MediaPipe as the detector. Using all the joints, the average error is 4 cm; using only the best matching joints, the error is still 3 cm. Compared with other methodologies this is not good enough, but I believe this higher error is caused by the differences between the skeletons.
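To make the two figures above concrete, here is a minimal sketch of how an average joint error can be computed over all joints or over a chosen subset. It is not the evaluation code used for the table: the function name `mean_joint_error`, the array layout (frames, joints, 3) and the toy values are all assumptions made for illustration.

```python
import numpy as np

def mean_joint_error(pred, gt, joint_idx=None):
    """Mean Euclidean distance between predicted and ground truth joints.

    pred, gt: arrays of shape (n_frames, n_joints, 3) in the same units.
    joint_idx: optional list of joint indices to restrict the evaluation to
               (hypothetical way of selecting the "best matching" joints).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if joint_idx is not None:
        pred = pred[:, joint_idx]
        gt = gt[:, joint_idx]
    # Distance per joint and per frame, then averaged over both.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy data only (not results from the datasets): 1 frame, 3 joints, metres.
gt = np.zeros((1, 3, 3))
pred = np.array([[[0.01, 0.0, 0.0],
                  [0.0, 0.02, 0.0],
                  [0.0, 0.0, 0.06]]])

error_all = mean_joint_error(pred, gt)             # every joint
error_subset = mean_joint_error(pred, gt, [0, 1])  # joints 0 and 1 only
```

Restricting the joints this way only hides the skeleton mismatch for the excluded joints; it does not remove the offset between the two skeleton definitions for the joints that remain.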

Application to the MPI-INF-3DHP dataset

I then decided to try the algorithm on a different dataset. I chose MPI-INF-3DHP, another widely used dataset, in the hope that its ground truth skeleton would be similar to the ones we have, and because OpenPose offers a selection of different skeletons, one of which is the MPI one.

This dataset has between 8 and 14 cameras, depending on the section of the dataset, and includes different actions with both 2D and 3D ground truth.

After making all the necessary adaptations to run this dataset, I extracted the 2D poses using OpenPose configured for the MPI skeleton. The following image shows that the performance of OpenPose using this skeleton on this dataset is even worse than in the previous tests, with half the arms not even being detected and the legs completely in the wrong place. I discarded OpenPose again as an option for 2D detections.

I also tried MediaPipe and obtained the same type of results as in Human 3.6M: accurate detections, but very displaced from the ground truth, as you can see in the following image. Again, this makes it very difficult to evaluate the accuracy of our 3D skeletons.

Proof of concept - Algorithm working with 2D ground truth

To prove that the algorithm is working correctly, I used the 2D ground truth provided by the MPI dataset in place of a 2D detector.

As we can see in the following images, our algorithm's pose predictions perfectly match the 2D ground truth (squares) and the 3D ground truth (black crosses).

The following tables show the errors for a subset of 25 frames using the 2D ground truth in the MPI dataset. They show that our methodology is indeed doing what it is supposed to do: the errors are approximately zero.
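The kind of check behind this proof of concept can be sketched as a reprojection test: project the 3D joints into a camera and measure the pixel distance to the 2D joints. This is only an illustration under stated assumptions: an ideal pinhole camera without lens distortion, and hypothetical names (`project_points`, `reprojection_error`) and toy camera values that do not come from our code or from the MPI calibration files.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project (n, 3) world points to (n, 2) pixels with an ideal pinhole model."""
    cam = points_3d @ R.T + t      # world frame -> camera frame
    uv = cam @ K.T                 # apply the intrinsic matrix
    return uv[:, :2] / uv[:, 2:3]  # perspective divide

def reprojection_error(points_3d, points_2d, K, R, t):
    """Mean pixel distance between projected 3D joints and 2D joints."""
    diff = project_points(points_3d, K, R, t) - points_2d
    return float(np.linalg.norm(diff, axis=1).mean())

# Toy camera (assumed values): 1000 px focal length, principal point (640, 360),
# camera axes aligned with the world and the world origin 3 m in front of it.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 3.0])

joints_3d = np.array([[0.0, 0.0, 0.0],
                      [0.2, -0.5, 0.1],
                      [-0.3, 0.4, -0.2]])
# Stand-in for the 2D ground truth: the exact projections of the 3D joints.
joints_2d = project_points(joints_3d, K, R, t)

error_px = reprojection_error(joints_3d, joints_2d, K, R, t)
```

When the 2D input and the 3D skeleton share the same joint definition, as they do with the dataset's own ground truth, this error has no systematic offset, which is consistent with the near-zero errors in the tables.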

How to solve the evaluation problem?

As previously discussed, the gap between the 2D detectors' skeletons and the ground truth's skeleton creates a difficulty in evaluating the performance of our algorithm.

While developing the proof of concept with the 2D ground truth I had an idea: I could induce random errors in the 2D ground truth, as well as induce occlusions in some of the cameras. This would allow us to keep the exact same skeleton in all the phases of the process and also to control the experiments by inducing different error percentages or creating different types of occlusions in different cameras.
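A minimal sketch of this idea follows, assuming the 2D ground truth is stored as an array of shape (cameras, joints, 2). The function name `perturb_detections`, the Gaussian pixel noise, the per-joint occlusion probability and the NaN marker for missing joints are all illustrative choices, not decisions already made for the experiments.

```python
import numpy as np

def perturb_detections(gt_2d, noise_std_px=0.0, occlusion_prob=0.0,
                       occluded_cameras=(), seed=None):
    """Simulate an imperfect 2D detector from the 2D ground truth.

    gt_2d: array of shape (n_cameras, n_joints, 2) in pixels.
    noise_std_px: standard deviation of the Gaussian noise added to each coordinate.
    occlusion_prob: probability that any single joint is dropped in a camera.
    occluded_cameras: indices of cameras in which every joint is dropped.
    Returns (noisy, visible): noisy has NaN for occluded joints,
    visible is a boolean array of shape (n_cameras, n_joints).
    """
    rng = np.random.default_rng(seed)
    gt_2d = np.asarray(gt_2d, dtype=float)
    noisy = gt_2d + rng.normal(0.0, noise_std_px, size=gt_2d.shape)
    # Random per-joint occlusions, then whole-camera occlusions.
    visible = rng.random(gt_2d.shape[:2]) >= occlusion_prob
    for cam in occluded_cameras:
        visible[cam] = False
    noisy[~visible] = np.nan
    return noisy, visible

# Toy example: 3 cameras, 5 joints, 2 px noise, camera 1 fully occluded.
gt_2d = np.zeros((3, 5, 2))
noisy, visible = perturb_detections(gt_2d, noise_std_px=2.0,
                                    occluded_cameras=[1], seed=0)
```

Fixing the seed keeps an experiment repeatable, so the same perturbed detections can be fed to different algorithms when comparing them.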

On-going tasks

Add the link length restrictions to the optimization (optimization weights are very low when compared to other algorithms - needs fixing)
Calibrate an entire video and output a video of the 3D pose
Discuss the possibility of describing the skeleton with Denavit–Hartenberg parameters
Compare with different algorithms
Improve the first guess by using the optimized skeleton of the previous frame when frames are consecutive
Evaluation with state-of-the-art metrics (MPJPE)

July 2023