1. Problem Definition

Video-based research and applications have undoubtedly become a popular field, covering intelligent surveillance, human-machine interaction, content-based video retrieval and so on. However, it is also a research direction full of challenges. We want to design and implement a simple system that recognises a limited set of actions. For example, if we want the system to recognise walking, running and jumping, then given an unlabelled video containing one of these actions, the system should tell us which action it has detected, as shown in Figure 1. For more details about human activity analysis, please refer to [1].

2. Design

In this section, I describe the considerations behind the design of the system and briefly introduce the relevant techniques in each module. The overall processing flowchart is shown in Figure 2.

2.1 Foreground Object Extraction

We are interested in the actions in a video, so the foreground object matters more to us than the background. We need an efficient way to detect and segment foreground objects from a video stream. Once we have the foreground object in each frame, we can concentrate on analysing the useful data. A commonly used approach to extracting foreground objects from an image sequence is background subtraction, which applies when the video is captured by a stationary camera and assumes that the background is the same in all videos. This requirement is too strict for us, because our videos may come from anywhere in the world and have different backgrounds. We need a method that can adapt, at least to some degree, to changes in the background. The Gaussian Mixture Model (GMM) [7] and Optical Flow [2] methods are considered promising. In GMM, each pixel in the scene is modelled by a mixture of \(K\) Gaussian distributions (typically \(K\) is between 3 and 5). The optical flow method exploits the consistency of optical flow over a short period of time and is able to detect foreground objects in complex outdoor scenes; its main drawback is its high computational cost. After weighing the strengths and shortcomings of the three methods, we choose GMM as the foundation of the foreground object extraction module in the system.
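For reference, the per-pixel model described in [7] can be written as follows. This is a sketch of the standard formulation; the symbols \(w_{i,t}\), \(\mu_{i,t}\), \(\Sigma_{i,t}\) and \(B\) come from that paper and are not necessarily the names used in our implementation. The probability of observing pixel value \(X_t\) at time \(t\) is

\[ P(X_t) = \sum_{i=1}^{K} w_{i,t}\, \mathcal{N}\!\left(X_t;\ \mu_{i,t},\ \Sigma_{i,t}\right) \]

With the \(K\) Gaussians sorted by decreasing \(w/\sigma\), the first \(B\) of them are treated as the background model, where \(T\) is the background threshold:

\[ B = \arg\min_{b} \left( \sum_{k=1}^{b} w_{k} > T \right) \]

A pixel that matches none of these \(B\) distributions is marked as foreground.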

2.2 Action Feature Extraction

Action feature extraction is the key component of an action recognition system. A method that both describes the characteristics of actions appropriately and magnifies the differences between actions as much as possible would be a breakthrough in the field. Here, the well-known Motion History Image (MHI) [6] is used to generate a compact, descriptive representation of the motion information in the video. MHI compresses the whole motion sequence into a single image, which is in effect a weighted sum of past foreground objects whose weights decay back through time. Therefore, an MHI contains the past foreground objects, with the most recent one brighter than the earlier ones.
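The decaying-weight idea can be sketched in a few lines of NumPy. This is a minimal illustration, not our actual implementation: it assumes the commonly used update rule in which moving pixels are set to a maximum value `tau` and all other pixels decay by one per step. The function names and the `step` parameter (for subsampling the mask sequence) are hypothetical.

```python
import numpy as np

def update_mhi(mhi, fg_mask, tau):
    """One MHI step: foreground pixels are set to tau, the rest decay by 1 (floored at 0)."""
    decayed = np.maximum(mhi - 1, 0)
    return np.where(fg_mask, np.float32(tau), decayed)

def build_mhi(masks, tau=None, step=1):
    """Fold a sequence of binary foreground masks into one MHI scaled to [0, 1].

    masks: list of 2-D arrays (non-zero = foreground).
    step:  use every step-th mask only (assumed subsampling scheme).
    tau:   maximum brightness; defaults to the number of masks used.
    """
    sampled = masks[::step]
    if tau is None:
        tau = len(sampled)
    mhi = np.zeros(sampled[0].shape, dtype=np.float32)
    for mask in sampled:
        mhi = update_mhi(mhi, mask.astype(bool), tau)
    return mhi / tau  # most recent motion is brightest

# A one-pixel "object" moving left to right across a 1x3 image.
masks = [np.array([[1, 0, 0]]), np.array([[0, 1, 0]]), np.array([[0, 0, 1]])]
print(build_mhi(masks))
```

In the toy example the rightmost pixel, which moved last, ends up brightest, and the earlier positions fade towards zero.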

2.3 Dimensionality Reduction

Because the obtained motion features are high-dimensional, dimensionality reduction becomes an essential part of the system. On the one hand, it extracts the main features of the data; on the other hand, it helps to weaken the influence of noise. There are many algorithms, such as PCA [4,5], LPP [5], LDA [5], Laplacian Eigenmaps [5] and their kernel versions. We decide to use PCA in our dimensionality reduction module. PCA is a simple, classical method whose advantage lies in preserving much of the intrinsic information in the data. That is to say, the compressed data can be used for reconstruction.

2.4 Learning and Recognition

Our ultimate goal is to make the system recognize actions correctly, so the classifier used for learning and recognition plays a crucial role in the system. Classifiers such as the Bayes classifier [9], decision trees [8] and support vector machines (SVM) [10] have been widely used in machine learning. I prefer to use SVM in this module. Originally, SVM was a technique for binary classification; later it was extended to regression and clustering problems. SVM maps feature vectors into a higher-dimensional space using a kernel function and builds an optimal separating hyperplane fitted to the training data to perform classification.

3. Implementation and Experiments

3.1 Dataset

All the video samples in the experiments are made with Pose Pro 2012 and show five people in the same scene waving hands, jumping and walking. In addition, the shooting angle of the camera in the scene ranges from \(-50^{\circ}\) to \(50^{\circ}\). Figure 4 shows a few frames of each action.

We split the whole dataset into a training set and a testing set, as shown in Table 1. The suggested ratio between training data and testing data is 7:3.

3.2 Extracting Foreground Objects

From Figure 5, we can see that the foreground object extraction is not as good as we expected. What we get are the moving parts of the object, not the whole object. GMM only captures moving parts whose colour differs noticeably from the background, which seems to be a drawback of GMM. However, this does not affect the subsequent motion feature extraction much: seen another way, the kind of action an object is performing is determined only by its moving parts.

In addition, the parameters of GMM, especially the background threshold \(T\) (\(T=0.7\) in our experiment), influence its performance greatly. Because the optimal parameters change with the surrounding environment, tuning them is not easy.

3.3 Constructing MHI

Figure 6 shows the corresponding motion history images. As mentioned before, an MHI is in effect a weighted superimposition of foreground objects. There is no need to include every extracted foreground object in the MHI. It is better to sample the sequence of foreground objects and integrate every \(N\)-th item (\(N=3\) in our experiment) into the MHI. This trick not only reduces the computational cost but also magnifies the difference between similar actions such as walking and running. The main weakness of MHI is that, with only one camera, its performance depends heavily on the shooting angle. For instance, when the camera is directly in front of a person, it may be difficult to tell from an MHI whether he is walking or standing.

3.4 Dimensionality Reduction and Reconstruction

We compress the MHI to simplify the action features further and so reduce the burden on the learning and recognition module. PCA is a linear method and easy to implement. The compressed data can be used to construct an approximation of the original data. In Figure 7, the images in the first column are the original motion history images, and the images in the second, third and fourth columns are reconstructed from compressed data with 10, 20 and 60 principal components respectively. It is easy to see that the more principal components are used, the more similar the reconstructed image is to the original. With \(10\) principal components the result already looks good enough, so we compress our MHI with \(10\) principal components here.
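The compress-then-reconstruct step can be sketched with NumPy's SVD. This is an illustrative sketch under stated assumptions, not our actual code: each row of `X` is assumed to be one flattened MHI, and the function names are hypothetical.

```python
import numpy as np

def pca_fit(X, k):
    """Return the data mean and the top-k principal directions (rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_compress(X, mean, components):
    """Project centred data onto the principal directions."""
    return (X - mean) @ components.T

def pca_reconstruct(Z, mean, components):
    """Map compressed coordinates back to the original space."""
    return Z @ components + mean

# Toy data standing in for 20 flattened MHIs of 50 pixels each.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
for k in (2, 5, 10):
    mean, comps = pca_fit(X, k)
    X_hat = pca_reconstruct(pca_compress(X, mean, comps), mean, comps)
    print(k, np.linalg.norm(X - X_hat))  # error shrinks as k grows
```

The reconstruction error can only decrease as more components are kept, which matches the trend across the columns of Figure 7.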

3.5 Optimizing Parameters of SVM

We choose the radial basis function (RBF) as the kernel function of the SVM. Its parameter \(\gamma\) is critical to performance. In addition, the penalty factor \(C\) plays an important role in the soft margin, which allows for mislabelled samples. The optimal parameters must be found for each specific dataset in order to obtain the best recognition result. Cross-validation [3] and grid search [3] are used in this optimization.
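To make the role of \(\gamma\) concrete, here is a small sketch of the standard RBF kernel, \(K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)\). The function name is hypothetical and the snippet is only an illustration of the formula, not part of our system.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# The same pair of points looks less similar as gamma grows.
x, y = [1.0, 0.0], [0.0, 0.0]
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(x, y, gamma))
```

A large \(\gamma\) makes the kernel fall off quickly with distance, so each training sample influences only its close neighbours; a small \(\gamma\) gives a smoother decision boundary.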

3.5.1 Cross Validation

The cross-validation procedure helps to guard against overfitting. In \(v\)-fold cross-validation, we first divide the training set into \(v\) subsets of equal size (\(v=10\) in our experiment). Each subset in turn is tested using the classifier trained on the remaining \(v-1\) subsets. Thus, each instance of the training set is predicted exactly once, and the cross-validation accuracy is the percentage of data that are correctly classified.
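The procedure can be sketched in plain Python as follows. This is a minimal sketch, not our implementation: `fit_predict` is a hypothetical callable that trains a classifier on the given training data and returns predicted labels for the test samples, and the data are assumed to have been shuffled beforehand.

```python
def k_fold_indices(n, v):
    """Split range(n) into v contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, v)
    folds, start = [], 0
    for i in range(v):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(X, y, v, fit_predict):
    """Fraction of samples predicted correctly when each fold is held out once.

    fit_predict(train_X, train_y, test_X) -> list of predicted labels
    (hypothetical interface standing in for the real classifier).
    """
    correct = 0
    for fold in k_fold_indices(len(X), v):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in fold])
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)

print(k_fold_indices(10, 3))
```

Every sample index appears in exactly one fold, so every sample is predicted exactly once.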

3.5.2 Grid Search

We recommend a “grid search” on \(C\) and \(\gamma\) using cross-validation to find approximately optimal parameters. Various pairs of \((C, \gamma)\) values are tried and the one with the best cross-validation accuracy is picked. Trying exponentially growing sequences of \(C\) and \(\gamma\) is a practical way to identify good parameters (for example, \(C = 2^{-5}, 2^{-3}, \cdots, 2^{15}\) and \(\gamma = 2^{-15}, 2^{-13}, \cdots, 2^{3}\)). Grid search may well not find the optimal pair of \((C,\gamma)\). However, since more elaborate search methods based on approximations or heuristics are too costly for us, we are satisfied with an approximately optimal pair obtained at low cost. Furthermore, grid search is easily parallelized because each \((C,\gamma)\) pair is evaluated independently. Since a complete grid search may still be time-consuming, we recommend a coarse-to-fine (neighbourhood) search: we first use a coarse grid to identify a “better” region, then conduct a finer grid search on that region. At each stage we specify an interval for \(C\) (or \(\gamma\)) together with the grid spacing; all grid points of \((C,\gamma)\) are then tried to find the one giving the highest cross-validation accuracy. Finally, the best parameters are used to train on the whole training set and generate the final model. Figure 8 is the contour plot of cross-validation accuracy during the parameter optimization.
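The coarse-to-fine search can be sketched as below. This is an illustration only: `score` is a hypothetical callable that returns the cross-validation accuracy for a given \((C, \gamma)\), and the size of the fine region (two exponent units either side of the coarse optimum, with spacing 0.25) is an assumption made for the example, not the setting used in our experiment.

```python
def grid_search(score, c_exps, g_exps):
    """Try every (2**c, 2**g) pair; return (best_score, best_c_exp, best_g_exp)."""
    best = None
    for ce in c_exps:
        for ge in g_exps:
            s = score(2.0 ** ce, 2.0 ** ge)
            if best is None or s > best[0]:
                best = (s, ce, ge)
    return best

def coarse_to_fine(score):
    """Coarse grid over the ranges quoted in the text, then a finer grid nearby."""
    # Coarse stage: C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3.
    _, ce, ge = grid_search(score, range(-5, 16, 2), range(-15, 4, 2))
    # Fine stage (assumed region and spacing): +/- 2 in the exponent, step 0.25.
    fine_c = [ce + d * 0.25 for d in range(-8, 9)]
    fine_g = [ge + d * 0.25 for d in range(-8, 9)]
    return grid_search(score, fine_c, fine_g)
```

Because each call to `score` is independent of the others, the two nested loops are the part that could be distributed across processes.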

3.6 Action Recognition

Table 2 shows the confusion matrix obtained from our system. Each diagonal element is the number of correctly recognised samples of the corresponding action in the testing set. The element in the \(i\)-th row and \(j\)-th column is the number of samples of the \(j\)-th action that are labelled as the \(i\)-th action. The recognition accuracy is quite satisfying and the system meets our basic requirements, even though we only tested it on a special dataset.
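The row/column convention above (row = predicted action, column = true action) can be made concrete with a short sketch. The function names and the toy labels are hypothetical and do not reproduce the numbers in Table 2.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """m[i][j] = number of samples of true class j that were labelled as class i."""
    index = {c: k for k, c in enumerate(classes)}
    n = len(classes)
    m = [[0] * n for _ in range(n)]
    for t, p in zip(true_labels, predicted_labels):
        m[index[p]][index[t]] += 1  # row = predicted, column = true
    return m

def accuracy(m):
    """Share of samples on the diagonal, i.e. correctly recognised."""
    total = sum(sum(row) for row in m)
    return sum(m[k][k] for k in range(len(m))) / total

# Toy labels, for illustration only.
classes = ["wave", "jump", "walk"]
true = ["wave", "wave", "jump", "jump", "walk", "walk"]
pred = ["wave", "jump", "jump", "jump", "walk", "wave"]
m = confusion_matrix(true, pred, classes)
print(m, accuracy(m))
```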

4. Summary

Clearly, the system has a number of theoretical flaws, most of which were pointed out in the design section. In addition, some aspects need to be improved:

  • Insufficient training data. Videos containing the same action may differ a lot from each other because of illumination, shooting angle, background and so on.
  • The MHI used to extract action features places strict requirements on the videos, and it seems to fail to uncover deeper properties of each action. That said, it is sometimes difficult to distinguish one action from another without an explicit definition.
  • No ability to recognise an action while it is still in progress from just a few frames, let alone complex group actions.

Plenty of work still needs to be done to build an excellent and practical action recognition system:

  • Developing an algorithm that describes the features of an action more efficiently and accurately.
  • Comparing other dimensionality reduction algorithms, such as LDA, with PCA. Perhaps LDA works better than PCA in supervised learning.
  • Semi-supervised learning sometimes performs better than supervised learning; it may be worth trying.

In short, we need to explore the intrinsic characteristics of the specific data and the relationships between them.

References

[1] J.K. Aggarwal and M.S. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):16, 2011.

[2] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43–77, 1994.

[3] C.W. Hsu, C.C. Chang, C.J. Lin, et al. A practical guide to support vector classification, 2003.

[4] A. Hyvärinen, J. Karhunen, and E. Oja. Principal component analysis, 2001.

[5] E. Kokiopoulou, J. Chen, and Y. Saad. Trace optimization and eigen-problems in dimension reduction methods. Numerical Linear Algebra with Applications, 18(3):565–602, 2011.

[6] H. Meng, N. Pears, M. Freeman, and C. Bailey. Motion history histograms for human action recognition. Embedded Computer Vision, pages 139–162, 2009.

[7] C. Stauffer and W.E.L. Grimson. Learning patterns of activity using real-time tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):747–757, 2000.

[8] Decision tree. http://en.wikipedia.org/wiki/Decision_tree.

[9] Naive bayes classifier. http://en.wikipedia.org/wiki/Naive_Bayes_classifier.

[10] Support vector machine. http://en.wikipedia.org/wiki/Support_vector_machine.
