Controlling web app through the camera and hand gestures | by Deividas Maciejauskas | Mar, 2024


Now that we are getting hand pose detections, let’s explore in more detail what information they contain:

// First, we receive an array of detected hands
[
  {
    // Detected hand ('Right' or 'Left')
    handedness: 'Right',
    // A value between 0 and 1, indicating the confidence of the prediction
    score: 0.98,
    // An array of 21 points that mark the landmarks of the hand
    keypoints: [
      {
        x: 77.00782109658839, // x coordinate
        y: 71.67626027006992, // y coordinate
        name: 'wrist', // name of the landmark
      },
      ...
    ],
    // The same landmarks as keypoints, but with an additional z coordinate
    keypoints3D: [...]
  }
]
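To work with this structure, we mostly need to look landmarks up by name. A minimal sketch of that (the `hands` array below mimics the detector output shown above, and `getLandmark` is a helper name of my own, not part of the detection API):

```javascript
// Mimics the detector output shown above (keypoints truncated to two entries).
const hands = [
  {
    handedness: 'Right',
    score: 0.98,
    keypoints: [
      { x: 77.0078, y: 71.6763, name: 'wrist' },
      { x: 67.2229, y: 31.6324, name: 'index_finger_tip' },
    ],
  },
];

// Look a landmark up by its name.
const getLandmark = (hand, name) =>
  hand.keypoints.find((kp) => kp.name === name);

const wrist = getLandmark(hands[0], 'wrist');
console.log(wrist); // { x: 77.0078, y: 71.6763, name: 'wrist' }
```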

We’re mostly interested in the keypoints, as they contain the information we can use to classify the desired hand pose. Before we move forward, let’s visualize those landmarks for a better understanding of our data:

Landmarks visualization

Each of those landmarks has x and y coordinates, but we cannot use those coordinates directly for classification: their values vary widely with the hand’s position and scale in the frame, which makes robust results hard to achieve. To obtain more robust data features, we can calculate the distance between the points using this formula:

√[(x₂ - x₁)² + (y₂ - y₁)²]
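In JavaScript, this Euclidean distance is a one-liner (a minimal sketch; the landmark objects follow the keypoint shape shown earlier, and `distance` is a helper name of my own):

```javascript
// Euclidean distance between two landmarks: √[(x₂ - x₁)² + (y₂ - y₁)²]
const distance = (a, b) => Math.hypot(b.x - a.x, b.y - a.y);

// Quick sanity check with a 3-4-5 right triangle:
console.log(distance({ x: 0, y: 0 }, { x: 3, y: 4 })); // 5
```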

We don’t need to calculate the distance between every pair of landmarks. Calculating the distance between each fingertip and the wrist already gives us useful data features:

For the palm hand pose, we have something similar to:

{
  ...
  keypoints: [
    {
      "x": 67.2229,
      "y": 31.6324,
      "name": "index_finger_tip"
    },
    {
      "x": 32.8413,
      "y": 155.6068,
      "name": "wrist"
    },
    ...
  ]
}

If we apply the provided formula for distance calculation to the available landmarks, we will receive approximately 128.65 pixels as the distance between the index finger tip and the wrist. Now, if we were to do the same for the fist hand pose, we would get approximately 65.42 pixels as the distance between the index finger tip and the wrist.

The palm distance is almost double the fist distance! I think it’s now quite clear where I’m heading with this reasoning. If we take six points (the wrist and each fingertip) and calculate the distance from the wrist to each fingertip, we get something similar to:

The palm is represented by the red color, and the fist by the blue color.
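Extracting those five fingertip-to-wrist distances can be sketched as below (an illustrative helper of my own; the landmark names, such as `thumb_tip` and `index_finger_tip`, follow the naming used in the detector output above):

```javascript
const distance = (a, b) => Math.hypot(b.x - a.x, b.y - a.y);

const TIP_NAMES = [
  'thumb_tip',
  'index_finger_tip',
  'middle_finger_tip',
  'ring_finger_tip',
  'pinky_finger_tip',
];

// Turn one hand's keypoints into a feature vector of
// fingertip-to-wrist distances (one number per detected fingertip).
function fingertipDistances(keypoints) {
  const byName = new Map(keypoints.map((kp) => [kp.name, kp]));
  const wrist = byName.get('wrist');
  return TIP_NAMES.filter((name) => byName.has(name)).map((name) =>
    distance(byName.get(name), wrist)
  );
}

// With the palm example from earlier (only two landmarks were shown there):
const features = fingertipDistances([
  { x: 32.8413, y: 155.6068, name: 'wrist' },
  { x: 67.2229, y: 31.6324, name: 'index_finger_tip' },
]);
console.log(features.map((d) => d.toFixed(2))); // [ '128.65' ]
```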

It is easy to distinguish the palm from the fist just by looking at these distances. For the classification, we will use K-means clustering, which will create a cluster (like a category) for each hand pose. In our case, we want to classify two hand poses, palm and fist. That’s why we use K-means with the number of clusters equal to 2, clustering all the distances into two groups. The output will be something similar to:

[
  [
    60.44377635123625,
    51.557028154899726,
    44.623663749718425,
    39.69141225804538,
    38.27270597176409
  ],
  [
    71.86672601184169,
    88.76003536712001,
    107.09756812003087,
    103.72602043189468,
    88.34780049860731
  ]
]
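To make the idea concrete, here is a minimal one-dimensional K-means with K = 2 (an illustrative sketch of my own, not the implementation from the post; centroids are initialized to the min and max value to keep the result deterministic):

```javascript
// 1-D K-means with k = 2: splits distance values into two clusters
// (e.g. fist-like small distances vs. palm-like large distances).
function kmeans2(values, iterations = 10) {
  // Deterministic init: smallest and largest value as starting centroids.
  let centroids = [Math.min(...values), Math.max(...values)];
  let clusters = [[], []];
  for (let i = 0; i < iterations; i++) {
    clusters = [[], []];
    // Assign each value to its nearest centroid.
    for (const v of values) {
      const idx =
        Math.abs(v - centroids[0]) <= Math.abs(v - centroids[1]) ? 0 : 1;
      clusters[idx].push(v);
    }
    // Update each centroid to its cluster's mean (keep it if the cluster is empty).
    centroids = clusters.map((c, j) =>
      c.length ? c.reduce((s, v) => s + v, 0) / c.length : centroids[j]
    );
  }
  return clusters;
}

console.log(kmeans2([100, 2, 101, 1, 102, 3]));
// → [ [ 2, 1, 3 ], [ 100, 101, 102 ] ]
```

In practice you would run this on the collected fingertip-to-wrist distances; the two resulting clusters then act as the "fist" and "palm" categories.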

I know what you are probably thinking now. Yes, this looks reasonable and doesn’t seem very hard. But if you’ve never tried doing something similar yourself, it can be a bigger task than it looks. Don’t worry, I’ve got you covered: there is another brief post on how to implement this classification step by step.


