Visualization of My Research Interests

Categories DM


Tools

  1. Mendeley
  2. Python
  3. WordArt.com

Steps

  1. Export papers from Mendeley to a .bib file.
  2. Clean the text via Python nltk.
  3. Count phrase frequencies via Python (code on GitHub; a rough sketch of steps 1-4 follows this list).
  4. Filter via Python nltk:
    • remove phrases with low frequency;
    • remove meaningless phrases.
  5. Import the output file into WordArt.
  6. Visualization.
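
A rough sketch of steps 1-4, assuming the Mendeley export is saved as library.bib; the regex-based title extraction, the frequency threshold, and the "word;frequency" output format are my own placeholders rather than the actual GitHub script:

```python
# Minimal sketch: count word frequencies over paper titles exported from
# Mendeley. A real .bib parser (e.g. the bibtexparser package) would be
# more robust than the regex used here.
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download('punkt')       # tokenizer model
nltk.download('stopwords')   # English stopword list

with open('library.bib', encoding='utf-8') as f:
    bib = f.read()

# Steps 1-2: pull out every "title = {...}" field and tokenize it.
titles = re.findall(r'title\s*=\s*\{(.+?)\}', bib, flags=re.IGNORECASE)
stop = set(stopwords.words('english'))
words = [w.lower()
         for title in titles
         for w in nltk.word_tokenize(title)
         if w.isalpha() and w.lower() not in stop]

# Steps 3-4: count frequencies and drop rare words.
counts = {w: c for w, c in Counter(words).items() if c >= 2}

# One "word;frequency" pair per line; adjust to the format WordArt expects.
with open('wordart_input.csv', 'w', encoding='utf-8') as f:
    for w, c in sorted(counts.items(), key=lambda x: -x[1]):
        f.write(f'{w};{c}\n')
```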
 

Lecture Note of Scientometrics

Categories DM

Recently, Prof. Liu JianGuo from SUFE gave a lecture on scientometrics at our center. The content mainly revolved around real applications and presented very interesting problems. The lecture had four parts:

  • Ranking of Research Institutions Based on Citation Relations
  • Research of Deep Learning Based Quantitative Trading Strategies
  • Quantitative Trading Strategies Based on the Degree of Attention of Stocks
  • Which Kinds of Disclosed Information Can Help You Get a Loan From P2P

Group Meeting Notes (2018-05-12)

Categories DM

Two talks today: one on traffic flow in trajectory data, and one on deep learning applied to recommender systems.

Since both talks were application-oriented and typically problem-driven, I will only list some interesting problems here. As for the methods, I am not yet familiar with deep learning, so I will not comment on them.

Research problems related to trajectory traffic flow:

  • Predicting taxi supply and demand in a given area
  • Traffic light control (traffic lights and traffic cameras)

Among these, the first problem can be abstracted as a regression problem on time-series data, which is close to traditional machine learning. The second problem rarely appears in the directions I know about, because it involves decision making: the cameras provide traffic flow data, which serves as the basis for deciding when (or how frequently) to switch the lights, and each decision in turn affects the traffic flow at the next time step. This process can be modeled as reinforcement learning; a toy sketch follows.
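
A toy tabular Q-learning sketch of this formulation. The state (discretized queue lengths from the cameras plus the current light phase), the action set, the simulator, and the reward are all made-up placeholders, not from the talk:

```python
# Toy tabular Q-learning for the traffic-light setting sketched above.
# State: (capped north-south queue, capped east-west queue, current phase),
# a stand-in for what the cameras would report. All numbers are made up.
import random

ACTIONS = [0, 1]                 # 0 = keep the current phase, 1 = switch
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
Q = {}                           # (state, action) -> estimated value

def q(s, a):
    return Q.get((s, a), 0.0)

def step(state, action):
    """Toy simulator: random arrivals; the green direction drains its queue."""
    ns, ew, phase = state
    if action == 1:
        phase = 1 - phase
    ns += random.randint(0, 2)
    ew += random.randint(0, 2)
    if phase == 0:
        ns = max(0, ns - 3)
    else:
        ew = max(0, ew - 3)
    next_state = (min(ns, 10), min(ew, 10), phase)   # cap -> finite state space
    return next_state, -(ns + ew)                    # reward: fewer waiting cars

state = (0, 0, 0)
for _ in range(50000):
    if random.random() < EPS:                        # epsilon-greedy exploration
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q(state, a))
    nxt, r = step(state, action)
    # standard Q-learning update
    Q[(state, action)] = q(state, action) + ALPHA * (
        r + GAMMA * max(q(nxt, a) for a in ACTIONS) - q(state, action))
    state = nxt
```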

In addition, Yao also mentioned some current hot topics in trajectory traffic flow research:

  • coordination in the transportation network
  • continual adaptation to a changing environment
  • event detection
  • data quality (sparsity, noise)

These are indeed interesting problems, but given the lab's current environment, we cannot work on them. I believe most students share this feeling: what the lab studies, though still within the mainstream of data mining, has drifted far from applications; and our fundamental theoretical research in data mining is not particularly deep either, so we are in a rather awkward position.

There is some disappointment, but looking on the bright side, the problems we study are at least ones we like and are curious about. Even if, for now, they do nothing to advance the field or society, we are still moving forward, and making a little progress each time is itself a happy thing.

Besides, the current environment and research content do help long-term development to some extent:

Although what we study is traditional data mining and machine learning, mostly shallow models, the fundamental problems involved are also the ones deep learning cares about. Compared with students who started with DL from the beginning, we are more familiar with the various foundations, such as compressed sensing, linear algebra, and all kinds of convex and non-convex optimization, as well as some unfashionable "small tricks": anchor-point learning, information propagation, margin maximization, and so on. These ideas also exist in DL and are still waiting to be explored there. With an attitude of active learning and absorbing whatever is useful, I believe long-term accumulation will pay off.

Ten years already.

Structural SVMs and Their Application in Recommender Systems [Seminar Note]

Categories DM
Paper Sharing: Predicting Diverse Subsets Using Structural SVMs [ICML’08]

Motivation
Diversity in retrieval tasks reduces redundancy and surfaces more information.
e.g., a set of documents with high topic diversity ensures that fewer users abandon the query because no results are relevant to them.
In short, high diversity covers more needs of different users, though accuracy may suffer.

Preliminary
Candidate set: x = {x_i}, i = 1, ..., n.
Topic set: T = {T_i}, i = 1, ..., n; T_i contains x_i, and different topic sets may overlap.

Idea
The topic set T is unknown, so the learning problem is to find a function that predicts y in the absence of T.
Is T a latent variable?
– In general, the subtopics are unknown. We instead assume that the candidate set contains discriminating features that separate subtopics from each other, primarily based on word frequencies.
The goal is to select K documents from x that maximize subtopic coverage.

Keypoint: Diversity -> Covering more subtopics -> Covering more words

Method Overview

D1, D2, D3 are three documents; V1, V2, ..., V5 are words.
Words are weighted by importance (more distinct words carry more information).

Suppose D1 is selected in the first iteration, covering V3, V4, and V5; in the second iteration, we then focus only on V1 and V2.

Remark:
– Feature space based on word frequency
– Optimizes for subtopic loss (Structural-SVM)

The process of this model is sophisticated: the feature space is based on word frequency and is further divided into "bags of words" (subtopics).

From my point of view, one reason could be that each document has too many words, so dividing documents into subtopics is reasonable, and this reduces the overlap between the subtopics of different documents.
In each iteration, we learn the most representative subtopic and then choose the related document, until K documents are obtained; a minimal sketch of this greedy selection is given below.
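
A minimal sketch of that greedy selection step, reusing the D1/V1-V5 toy example from above; the word weights here are hand-picked placeholders, whereas the paper learns them with the structural SVM:

```python
# Greedy weighted word coverage: in each iteration, pick the document whose
# still-uncovered words carry the most total weight.
docs = {'D1': {'V3', 'V4', 'V5'},
        'D2': {'V1', 'V3'},
        'D3': {'V1', 'V2'}}
weight = {'V1': 0.9, 'V2': 0.8, 'V3': 0.7, 'V4': 0.6, 'V5': 0.5}  # placeholders

def greedy_select(docs, weight, k):
    selected, covered = [], set()
    for _ in range(k):
        # marginal gain = total weight of a document's uncovered words
        best = max(docs.keys() - set(selected),
                   key=lambda d: sum(weight[v] for v in docs[d] - covered))
        selected.append(best)
        covered |= docs[best]
    return selected

print(greedy_select(docs, weight, 2))  # ['D1', 'D3']
```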

Remark: the Structural SVM repeatedly finds the next most violated constraint until the set of constraints is a good approximation.

Comments
This paper is very interesting.
My doubts are:

  1. Can word frequency reflect the true relevance of a document to a certain topic?
  2. How to find subtopics?

Further Reading
Learning to Recommend Accurate and Diverse Items [WWW’17]

An Intro to Subspace Clustering [Seminar Note]

Categories DM

Subspace Clustering


Often only a subset of the features is relevant to the formation of clusters, especially in high-dimensional data. From this point of view, a number of clustering methods have been proposed to find clusters in different subspaces of a dataset.

Two Perspectives

  • Subspaces exist in the data, so we search for the most representative feature subsets.
  • As clusters lie in different subspaces, the data are denser within each subspace cluster, and an instance in a cluster can be represented by the other instances in the same cluster. From this perspective, we seek to learn a representation of the data that satisfies X = XC.

The first perspective motivates many data mining algorithms (see the survey by Parsons et al. [1]), but due to their complexity, these algorithms cannot handle large-scale datasets. Recently, the majority of subspace clustering research has taken the second perspective. Depending on the assumption made, sparsity or the low-rank property, these methods can be broadly divided into sparse subspace clustering [2] and low-rank subspace clustering [3]. A rough sketch of the sparse variant is given below.
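
A rough sketch of the second perspective, using the lasso relaxation of sparse subspace clustering [2] on synthetic data; the subspace dimensions, the regularization strength, and the use of scikit-learn solvers are my own choices:

```python
# Self-expressive model X = XC: regress each point on all other points with
# an l1 penalty, then run spectral clustering on the affinity |C| + |C|^T.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

def sample_subspace(n, dim, ambient):
    """n points from a random dim-dimensional subspace of R^ambient."""
    basis = rng.standard_normal((ambient, dim))
    return (basis @ rng.standard_normal((dim, n))).T

# Toy data: 40 points from each of two 2-dimensional subspaces of R^5.
X = np.vstack([sample_subspace(40, 2, 5), sample_subspace(40, 2, 5)])
n = X.shape[0]

C = np.zeros((n, n))
for i in range(n):
    others = np.delete(np.arange(n), i)        # enforce diag(C) = 0
    model = Lasso(alpha=0.01, max_iter=10000)
    model.fit(X[others].T, X[i])               # x_i as a sparse mix of the rest
    C[i, others] = model.coef_

W = np.abs(C) + np.abs(C).T                    # symmetric affinity matrix
labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(W)
print(labels)
```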

Note that learning a representation of the data by itself in the form X = XC is very simple. To improve this model, a line of algorithms brings dictionary learning into subspace clustering, learning a clean dictionary and an informative code, which yields X = DC. That is a big topic; see the surveys by Zhang et al. [4] and Bao et al. [5].

Paper Sharing: Deep Adaptive Clustering [6]


Motivation
In image clustering, existing methods often ignore the interplay between feature learning and clustering.

Method
DAC is based on a deep network, so we only give the flowchart. First, a trained ConvNet is used to generate features, which guarantees a basic capacity for separation. Based on the learned features, traditional similarity learning is applied to find similar pairs and dissimilar pairs, analogous to must-link and cannot-link constraints in network mining. After obtaining those constraints, DAC goes back and retrains the ConvNet; that is one iteration. A sketch of the pair-selection step is given below.
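
A sketch of just the pair-selection step in that loop, with random vectors standing in for the ConvNet features; the cosine similarity measure and the two thresholds are placeholder choices of mine, not the paper's exact values:

```python
# DAC-style pair selection: only confidently similar / dissimilar pairs
# become constraints. In DAC these pairs supervise the next round of ConvNet
# training, and the thresholds tighten as iterations proceed.
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((100, 10))                              # fake ConvNet outputs
features /= np.linalg.norm(features, axis=1, keepdims=True)   # l2-normalize

sim = features @ features.T                                   # cosine similarities
upper, lower = 0.95, 0.60                                     # confidence thresholds

must_link = np.argwhere(np.triu(sim > upper, k=1))            # "similar" pairs
cannot_link = np.argwhere(np.triu(sim < lower, k=1))          # "dissimilar" pairs
print(len(must_link), 'similar pairs,', len(cannot_link), 'dissimilar pairs')
```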

The key points of DAC are:

  • It adopts a classification framework for the clustering task.
  • The learned features tend toward one-hot vectors thanks to a constraint introduced in DAC, so clustering can be performed by locating the largest response of the learned features.

Doubts
The performance of DAC strongly depends on the initialization of the ConvNet, which is learned by another method.

Others
There are other works that apply "supervised" models to the clustering task, for example [7].

References

[1] Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data: a review[J]. Acm Sigkdd Explorations Newsletter, 2004, 6(1): 90-105.

[2] Elhamifar E, Vidal R. Sparse subspace clustering[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009: 2790-2797.

[3] Vidal R, Favaro P. Low rank subspace clustering (LRSC)[J]. Pattern Recognition Letters, 2014, 43: 47-61.

[4] Zhang Z, Xu Y, Yang J, et al. A survey of sparse representation: algorithms and applications[J]. IEEE access, 2015, 3: 490-530.

[5] Bao C, Ji H, Quan Y, et al. Dictionary learning for sparse coding: Algorithms and convergence analysis[J]. IEEE transactions on pattern analysis and machine intelligence, 2016, 38(7): 1356-1369.

[6] Chang J, Wang L, Meng G, et al. Deep Adaptive Image Clustering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 5879-5887.

[7] Liu H, Han J, Nie F, et al. Balanced Clustering with Least Square Regression[C]//AAAI. 2017: 2231-2237.

Notes on Feature Selection

Categories DM

I plan to do a piece of work related to attributed networks. Unlike a purely topological network, the nodes of an attributed network carry "features", so in this problem there is, besides the network's adjacency matrix, also a matrix formed by the node attributes. Studying attributed networks has very practical significance for understanding the generative mechanisms of networks, such as how friendships form in social networks, and it can also help better solve derived problems such as community detection and link prediction.

While reading up on the area, I found that part of the research on attributed networks focuses on feature selection. Because the feature dimensionality of nodes in attributed networks is generally low and the features are not sparse, feature selection is feasible there. By analogy with social networks, an edge between two nodes does not necessarily arise because all of their features are similar; it may arise because of a few particular aspects, so feature selection is also somewhat justified in theory.

Reading the related surveys, I found a tutorial at KDD'17 that is quite relevant to the problem I want to solve. I am sharing my reading notes here:

Notes on Matrix Methods (to be continued)

Categories DM

In an era when deep learning dominates, it is of very practical value to reflect on the ideas, schools, and methodologies of traditional machine learning and data mining.

In most problems, a matrix is a two-dimensional array: rows represent items, and the corresponding columns represent the items' features. Starting from compressed sensing, matrix-based methods took off: methods based on low-rank decomposition and sparse representation emerged one after another and achieved remarkable results in image processing, surpassed only in recent years by deep learning algorithms. The factorization form of NMF has long existed; there has been little innovation in its principles recently, and most work remains at the application level.

The advantage of matrix-based methods is that various forms of constraints can be added to achieve the desired effect (whether it actually works well is another matter ⊙﹏⊙). As long as the objective is convex, it is relatively easy to solve; for non-convex problems there are now, thanks to the efforts of many leading researchers, plenty of solution frameworks as well. The disadvantage is that matrix-based algorithms perform well on linear problems but are not well suited to nonlinear ones. Hence probabilistic matrix factorization (PMF), which introduces probability into matrix factorization and is widely used in recommender systems; a toy sketch of such a factorization is given below.
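
A minimal sketch of a constrained matrix model of the kind discussed above: l2-regularized matrix factorization on a partially observed rating matrix (the MAP estimate of PMF with Gaussian priors reduces to this objective); the data, rank, and hyperparameters are all made up:

```python
# Toy l2-regularized matrix factorization, fit by gradient descent on the
# observed entries only.
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(20, 15)).astype(float)   # toy user-item ratings
mask = rng.random(R.shape) < 0.3                      # ~30% of entries observed

k, lam, lr = 4, 0.1, 0.01                             # rank, l2 weight, step size
U = 0.1 * rng.standard_normal((R.shape[0], k))
V = 0.1 * rng.standard_normal((R.shape[1], k))

for _ in range(500):
    E = mask * (R - U @ V.T)                          # residual on observed entries
    U += lr * (E @ V - lam * U)                       # gradient step on U
    V += lr * (E.T @ U - lam * V)                     # gradient step on V

E = mask * (R - U @ V.T)
print('train RMSE:', np.sqrt((E ** 2).sum() / mask.sum()))
```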

My thinking here is still quite preliminary, with no references added, so it is not very scholarly. I've dug this hole and will fill it in slowly~