1
May 2015
WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 11, Downloads (12 Months): 100, Downloads (Overall): 211
Short documents are typically represented by very sparse vectors in the space of terms. In this case, traditional techniques for calculating text similarity result in measures that are very close to zero, since documents, even very similar ones, have very few or no terms in common. In ...
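As a minimal sketch of the sparsity problem described in this abstract, the following Python snippet computes cosine similarity over raw term counts. The example texts and the `cosine` helper are illustrative assumptions and do not come from the paper.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the raw term-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical short texts: the first pair is related in meaning but shares no terms.
related = cosine("cheap flights paris", "low cost airfare france")
overlap = cosine("cheap flights paris", "cheap hotels paris")
```

Here `related` is zero even though both texts are about inexpensive travel to France, which is the failure mode that motivates enriching the representation with term correlations.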
Keywords:
short text clustering, nystrom approximation
Title:
Short-Text Clustering using Statistical Semantics
Keywords:
short text clustering
Abstract:
... common. In order to alleviate this limitation, the representation of short-text segments should be enriched by incorporating information about correlation between ...
References:
K. Verma, M. K. Jadon, and A. K. Pujari, "Clustering Short-Text Using Non-negative Matrix Factorization of Hadamard Product of Similarities," Information Retrieval Technology Lecture Notes in Computer Science, Volume 8281, pages 145--155, 2013.
Z. Faguo, Z. Fan, Y. Bingru, Y. Xingang, "Research on Short Text Classification Algorithm Based on Statistics and Rules," In proceedings of third International Symposium on Electronic Commerce and Security, pages 3--7, 2010.
X. Yan, J. Guo, Sh. Liu, X. Cheng, Y. Wang, "Clustering Short Text Using Ncut weighted Non-negative Matrix Factorization," CIKM 12 Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2259--2262, 2012.
P. Ferragina, U. Scaiella, "Fast and Accurate Annotation of Short Texts with Wikipedia Pages," Software, IEEE, Volume:29, Issue: 1, 2011.
X. Hu, N. Sun, C. Zhang, T. Chua, "Exploiting internal and external semantics for the clustering of Short texts using world knowledge," In Proc. CIKM Hong Kong, China, pages 919--928, Nov. 2009.
Full Text:
Short-Text Clustering using Statistical Semantics. Sepideh Seifzadeh, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1, sseifzad@uwaterloo.ca; Ahmed ... 2015, Florence, Italy. ACM 978-1-4503-3473-0/15/05. http://dx.doi.org/10.1145/2740908.2742474. 1. INTRODUCTION In social media, users usually post short texts. Twitter limits the length of each Tweet to 140 characters; ... therefore, developing data mining techniques to handle the large volume of short texts has become an important goal [1]. Text document clustering has been ... to organize document databases and discover similarity and topics among documents. Short text clustering is more challenging than regular text clustering; due to the ... documents require different or more adapted approaches. The representation of short-text segments needs to get enriched by incorporating information about correlation ...
given categories, short text classification requires sufficient number of training examples to achieve the high ... achieve the high accuracy; therefore, using a totally unsupervised technique for short text grouping is more efficient. An overview of the approaches in ... of the approaches in the literature to tackle the problem of short text clustering is provided in this section. One possible approach is modifying the ... the term weighting in short documents. They have mentioned that for short texts using tf-idf is not very efficient since term frequency in ... proposed a simple and fast method for entity disambiguation (entity linking) for short texts using Wikipedia. Hu et al. [12] exploited features from Wikipedia ... to be in the same cluster or not. Correlation matrix for short texts is very sparse (mostly zero) as a lot of words do ... data matrix. 3. PROPOSED METHOD Finding correlation among terms is important especially in short text clustering where limited knowledge is available. If using external semantics ...
... to alleviate this limitation, we need to enrich the representation of short-text segments by incorporating information about correlation between terms. In other words, ...
... Yan et al. [9] and is using term-term correlation for short-text clustering. In this set of experiments NMI and computational time are ...
... approximation is also 5000 (k=5000). 5. CONCLUSION In this work, the effectiveness of short-text document clustering algorithms has been improved. As shown in the ...
... X. Yan, J. Guo, Sh. Liu, X. Cheng, Y. Wang, "Clustering Short Text Using Ncut weighted Non-negative Matrix Factorization," CIKM 12 Proceedings of the 21st ... pages 888--905, 2000. [11] P. Ferragina, U. Scaiella, "Fast and Accurate Annotation of Short Texts with Wikipedia Pages," Software, IEEE, Volume:29, Issue: 1, 2011. [12] X. Hu, ...
2
October 2016
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 4, Downloads (12 Months): 75, Downloads (Overall): 75
Collections that contain a large number of short texts are becoming increasingly common (e.g., tweets, reviews, etc.). Analytical tasks (such as classification, clustering, etc.) involving short texts could be challenging due to the lack of context and owing to their sparseness. An often encountered problem is ...
Keywords:
document expansion, short text expansion
Keywords:
short text expansion
Abstract:
Collections that contain a large number of short texts are becoming increasingly common (e.g., tweets, reviews, etc.). Analytical tasks ... reviews, etc.). Analytical tasks (such as classification, clustering, etc.) involving short texts could be challenging due to the lack of context and ... the task. A standard technique used in the handling of short texts is expanding them before subjecting them to the task. However, ... before subjecting them to the task. However, existing works on short text expansion suffer from certain limitations: (i) they depend on domain ... on several Machine Learning, NLP and Information Retrieval tasks on short texts (such as short text classification, clustering, entity disambiguation, and the like) without using task ... same time, our technique is capable of learning to expand short texts in a task-specific way. That is, the same technique that ... is, the same technique that is applied to expand a short text in two different tasks is able to learn to produce ...
References:
S. Banerjee, K. Ramanathan, and A. Gupta. Clustering short texts using wikipedia. SIGIR, 2007.
M. Chen, X. Jin, and D. Shen. Short text classification improved by learning multi-granularity topics. IJCAI, 2011.
T. Kenter and M. de Rijke. Short text similarity with word embeddings. CIKM, 2015.
M. C. Lintean and V. Rus. Measuring semantic similarity in short texts through greedy pairing and word semantics. FLAIRS, 2012.
M. Sahami and T. D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. WWW, 2006.
D. Vitale, P. Ferragina, and U. Scaiella. Classification of short texts by deploying topical annotations. ECIR, 2012.
J. Xu, P. Wang, G. Tian, B. Xu, J. Zhao, F. Wang, and H. Hao. Short text clustering via convolutional neural networks. NAACL-HLT, 2015.
Full Text:
... Ramakrishnan, Dept of CSE, IIT Bombay, ganesh@cse.iitb.ac.in. ABSTRACT Collections that contain a large number of short texts are becoming increasingly common (e.g., tweets, reviews, etc.). Analytical tasks (such as ... (e.g., tweets, reviews, etc.). Analytical tasks (such as classification, clustering, etc.) involving short texts could be challenging due to the lack of context and owing ... on the task. A standard technique used in the handling of short texts is expanding them before subjecting them to the task. However, ... them before subjecting them to the task. However, existing works on short text expansion suffer from certain limitations: (i) they depend on domain knowledge ... Learning, NLP and Information Retrieval tasks on short texts (such as short text classification, clustering, entity disambiguation, and the like) without using task ... is, the same technique that is applied to expand a short text in two different tasks is able to learn to produce different ... of the Internet, Web users are generating an increasing number of short texts, including [Footnote: Work done when the first author was interning with ...] them becomes increasingly important to many IR and Machine Learning applications. However, short texts are quite different from the traditional documents due to their ... 20, 25] attempt to address challenges associated with analytic tasks on short texts by introducing more context through expansion techniques. Broadly, these approaches follow ... use the short text as a search query and expand the short text using the top search results [1, 20, 25]. The other technique ... technique is to build a topic space and to project each short text into that space. These topics enrich the context of the short text [21, 2, 17, 5]. Although querying search engines using short texts can produce good expansions, it may not be an ideal solution for ...
... types of applications. For example, though tasks such as classification of short texts can be efficiently solved by these approaches [17, 5], they may ... they may not be suitable for tasks on learning to rank short texts. Primarily, this is due to the bag-of-words model assumption that ... bag-of-words model assumption that these approaches make while expanding the short texts, that is, the pseudo topic names that are appended to ... that is, the pseudo topic names that are appended to each short text as additional words help the classifiers to learn discriminative weights for each ... in the document [24]. While all these approaches provide ways to expand short texts, the expansion itself is independent of the task that uses ... tasks. (iii) It expands the short texts selectively, that is, a short text is expanded only if such an expansion helps the task; otherwise, ... the task; otherwise, it is not expanded. Blindly expanding all the short texts - as done in earlier approaches - in fact degrades ... 3.3, we comment more on choosing the right corpus. Using short text as a query, a language model or an IR system initially ... of the K candidate long texts as the expansion of the short text. The goal of learning is to produce a mapping of ... The goal of learning is to produce a mapping of the short texts to the long texts which best improves the task accuracy. Note ... particular, our framework learns a best mapping function M : {short text task} → {long text} to maximize the task accuracy. We ... We can present this function in an alternate form as M(short text) = argmax_{long text} Sim(short text, long text; task), which is a mapping function that finds ... mapping function that finds a maximally similar long text to a short text, given the task. The similarity is captured through a variety ... be adapted by many machine learning tasks that deal with the short texts • The introduction of various classes of task agnostic feature ...
thus learning to expand a short text in a task-specific way. • A block learning technique for learning feature ... Problem Formulation Let D_s = {s_i}_{i=1}^{n} be a set of n short texts, where s_i is the representation of ith short text. For example, s_i can be a TF-IDF vector representation of ... For example, s_i can be a TF-IDF vector representation of the short text, or it can simply be a bag of words. The ... task-specific. Let T(D_s) be a task that operates on the short texts D_s. For example, T can be a classification task and D_s ... can be a classification task and D_s is a set of short texts for training and testing the classifier. The underlying assumption is that ... the tasks include Clustering of short texts, Named Entity Recognition in short texts and Categorization of short texts using Wikipedia, to name a few. Let E(T(D_s)) be ... improve the accuracy of the task T is to expand the short texts ... D_s into bigger texts. Let D_e = {e_i}_{i=1}^{n} be the expanded short texts, ... where e_i is a representation for the expanded ith short text. For example, e_i can be a TF-IDF vector that represent ... identify the best possible long text l_i ∈ C for every short text ... s_i. This long text l_i is then used to expand the short text s_i. We use the operator ? to represent the expansion process ... the expansion process of obtaining an expanded text representation of a short text from a long text. That is, e_i = s_i ? l_i. The ... done in such a way that the task using the expanded short texts will have maximum improvement in accuracy. This makes the short text mapping/expansion ... This makes the short text mapping/expansion task-specific. That means that the same short text may map to different long texts in different tasks, depending ... M(s_i) to a long text l_i ∈ C for every short text s_i ∈ D_s, such that Equation 1 is satisfied with ... However, not all of these mappings are meaningful. For example, mapping a short text on "soccer" to an article on "photosynthesis" makes very little sense. ...
very little sense. To be a semantically correct expansion, the short text and its mapped long text have to be topically similar. Many ...
... be independent of the other short text terms, so that the short text likelihood is given by p(s_i | l_j) = ∏_{k=1}^{m} p(w_k | l_j). After ... language model approach has the following limitations in mapping a short text s_i to a long text l_i: (i) it always produces the ... candidate for the mapping of long texts {l_i^k ∈ C}_{k=1}^{K} for every short text s_i. Second, we define a mapping M_w to map a short text s_i to the best long text l_i from the K candidate ... variety of feature functions that measure the similarity of a short text to a long text. We group them into three classes: (i) ... topic T_i. Although this feature is not a function of short text s, it helps to measure the relevance of a long text ... of the corpus, which by construction includes the topic space of the short texts. For each topic T_i, one such similarity function is created. Differential ... l to the topic words without considering the words in the short text s. The intuition here is that we want to eliminate the ... the part of the score that comes from matching words between short text s and long text l. Since IR-based feature functions already capture ...
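The term-independence likelihood p(s_i | l_j) = ∏ p(w_k | l_j) in the excerpt above can be sketched as a unigram language model score. This is only an illustration: the additive smoothing constant `alpha`, the whitespace tokenizer, and the toy texts are assumptions that the excerpt does not specify.

```python
import math
from collections import Counter

def log_likelihood(short_text, long_text, vocab_size, alpha=1.0):
    """log p(s | l) under a unigram model of the long text.

    Terms are treated as independent, so the log-likelihood is a sum of
    per-term log-probabilities. Additive smoothing (alpha) is an assumption
    made here so that unseen terms do not produce log(0).
    """
    counts = Counter(long_text.lower().split())
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + alpha) / (total + alpha * vocab_size))
        for w in short_text.lower().split()
    )

# Hypothetical candidate long texts and a short query.
long_a = "wheat prices rose as wheat exports grew"
long_b = "the team won the soccer match"
short = "wheat exports"
vocab = set((long_a + " " + long_b + " " + short).split())

score_a = log_likelihood(short, long_a, len(vocab))
score_b = log_likelihood(short, long_b, len(vocab))
best = long_a if score_a > score_b else long_b
```

Ranking candidates by this score always returns some long text, even when none is a good match, which corresponds to the first limitation the excerpt lists.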
... max{0, v^T(w) · v(w′)}). It measures the term-level similarity between a short text s and a long text l by incorporating semantic similarity that is ... T_i through the word vectors, without considering the terms in the short text s. 2.3.4 Selective Expansion via the Bias Feature Our observation shows that ... Expansion via the Bias Feature Our observation shows that not all short texts need to be expanded to improve the accuracy of the task. ... to improve the accuracy of the task. Forcing expansion on all short texts sometimes reduces the task accuracy. In order to enforce selective ... bias feature is always set to -1: ?_bias(s, l) = −1. A short text is expanded only if Σ_f w_f ?_f(s, l) + w_bias ?_bias(s, l) ... weights w_f are learned for a given task such that expanding short texts using the mappings obtained from w_f best improves the task accuracy. ... the expanded texts D_e. By varying w_f, the mapping of short texts to long texts changes, which in turn changes the expansions D_s, ...
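The selective-expansion rule in the excerpt above can be sketched as a weighted feature sum plus a bias term fixed at -1. The right-hand side of the condition is elided in the excerpt, so the comparison against zero below is an assumption, as are the feature names and weight values.

```python
def should_expand(features, weights, w_bias):
    """Decide whether a short text should be expanded with a candidate long text.

    `features` maps feature names to similarity values for one (s, l) pair and
    `weights` maps the same names to learned weights. The bias feature is
    always -1, as in the excerpt; comparing the total against 0 is an
    assumption, since the excerpt truncates the inequality.
    """
    bias_feature = -1.0
    score = sum(weights[name] * value for name, value in features.items())
    return score + w_bias * bias_feature > 0.0

# Hypothetical feature values and weights, for illustration only.
pair_features = {"ir_score": 0.8, "topic_score": 0.5}
learned_weights = {"ir_score": 1.0, "topic_score": 0.5}

expand_low_bar = should_expand(pair_features, learned_weights, w_bias=1.0)
expand_high_bar = should_expand(pair_features, learned_weights, w_bias=1.2)
```

Under this reading the learned bias weight acts as a similarity threshold: raising it leaves more short texts unexpanded.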
... task results are achieved through the optimal mapping of the short texts to the long texts. The procedure for this joint learning is ... of our model parameters W and task parameters ? Input: Corpus C, Short Texts Training Set D^(train), Short Texts Development Set D^(dev), Task T Output: Model ... W ?. The updated weights W then produce new mappings/expansions for the short texts. The procedure repeats until no further update happens to ...
... and earlier works in the literature, which describe the handling of short texts. We demonstrate the effectiveness of our approach on two ... Expansion), which has been used throughout these experiments and evaluations. 3.1 Short Text Tasks In this paper, we evaluate classification and clustering tasks for short texts. Though our technique can be applied to other types of ... of ML tasks, we find that classification and clustering tasks on short texts have a good presence in the literature, which gives us an ... represent short and long texts as TF-IDF vectors. To expand a short text s_i using a long text l_i, we smooth the TF-IDF vector ... a long text l_i, we smooth the TF-IDF vector of the short text using that of the long text in the following manner: e_i ... parameter that controls how much importance has to be given to short text and long texts. When ? = 1, only the short text is used; when ? = 0, only the long text is used. ... ?: argmin_{w,?} L(T(D_e)) = argmin_{w,?} L(T({? s_i + (1 − ?) M_w(s_i)}_{i=1}^{n})) Note that short text is expanded only if similarity score is above a threshold as ... the796articles classified as "Corn" or "Wheat" and consider their titles as short texts. The task here is to classify the short titles into ... K candidate long texts by using the title as a short text query. The weights w_f are learned in order to choose a ... learned in order to choose a mapping long text for every short text such that expanding the short text by using the long text ... this dataset, we know the true long text for a short text (title), which is the long description of that news article. Using ... compute a best classifier using the true long texts of the short texts. This helps us to compare our technique against the true ... Entertainment. We use the titles and the short descriptions as the short texts for the classification task. 3.2.3 Web-Snippets Web-Snippets [17] dataset has around 12K short ... search snippets that are classified into seven classes.
Out of this, 10K short texts are used to train the classifier and 2K for testing. 3.2.4 ODPTweets ODPtweets (footnote 4) ...
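The TF-IDF smoothing step described in the experimental setup above interpolates the short-text vector with the vector of its mapped long text. A minimal sketch follows, assuming sparse vectors stored as dictionaries; the parameter name `mix` is hypothetical, because the original symbol was lost in extraction, and the toy weights are invented for illustration.

```python
def expand(short_vec, long_vec, mix):
    """Return mix * short_vec + (1 - mix) * long_vec over the union of terms.

    mix = 1 keeps only the short text and mix = 0 keeps only the long text,
    matching the two boundary cases stated in the excerpt.
    """
    terms = set(short_vec) | set(long_vec)
    return {
        t: mix * short_vec.get(t, 0.0) + (1.0 - mix) * long_vec.get(t, 0.0)
        for t in terms
    }

# Hypothetical TF-IDF weights for a news title and its mapped article.
title_vec = {"corn": 1.0}
article_vec = {"corn": 0.5, "harvest": 0.5}

expanded = expand(title_vec, article_vec, mix=0.5)
short_only = expand(title_vec, article_vec, mix=1.0)
long_only = expand(title_vec, article_vec, mix=0.0)
```

In the paper the interpolation parameter is optimized jointly with the feature weights; here it is simply passed in.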
... 2012. In our experiments, we randomly select 20,000 question titles as short texts from 20 different tags, as done in [23]. 3.3 Corpus Choosing the right ... these articles and retrieve the top K candidate long texts using short texts as queries. The weights w_f are learned such that expanding the ... 20, 25] our method does not force expansion on all the short texts. Expansion is done only if that helps improving the performance ... or has noise and (iii) a corpus that contains the short texts themselves. In the third scenario, we show that our method is ... and by measuring the improvement in the task's per- (footnote 6: http://www.stackoverflow.com/) Expt# | Dataset | Short Text | Corpus; 1 | Reuters21578 | News Title | Articles from entire Reuters21578 collection; 2 | Web-Snippets ... articles; 5 | TagMyNews | Title + RSS feed | Wikipedia articles + all short texts from the dataset; 6 | TagMyNews | Title + RSS feed | Wikipedia articles ...
... long text articles in the corpus are initially indexed. Using the short text as a query, the top ranked result from the IR system ... ranked result from the IR system is used to expand the short text. In our experiments, we used Lucene as the IR system. In ... Word2Vec based expansion, a word vector for every word in the short text is obtained using the word2vec tool. The average word vector is ... then computed from all these word vectors and appended to the short text to produce a long/expanded text. 3.5.1 Methodology for Classification Task There are various ... and classifier) are optimized by training the classifier on the expanded short texts from the training set and measuring the accuracy on the expanded ... dataset. The clustering performance is evaluated by comparing the clustering results of short texts with the tags/labels provided by the text corpus. The accuracy [4] ... is to demonstrate the improvement in the clustering accuracy of short texts through our expansion technique rather than the clustering method itself, we used ... For each of these experiments Table 1 shows the dataset, the short texts, and the universal corpus used. Experiments 1-4 show expansion ... articles as the corpus. That means, the true expansion of the short text (title) is present in the corpus. We consider the body of ... is close to the true expansion accuracy of 92.1%. Baselines/Comparing techniques | Accuracy(%) | NMI(%); Short Text Only | 26.3 | 30.2; Lucene First | 38.8 | 40.1; Word2Vec | 11.4 | 13.6; Jiaming Xu et al. ... texts from the dataset as short articles. The IR method retrieves these short texts as top results due to the high matching score and, hence, ... such scenarios. As shown in Table 2, the percentage of short texts not expanded is up to 92% in this case. Table 2 also ... in this case. Table 2 also shows comparison of classification accuracies reported in short text
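The Word2Vec baseline mentioned above averages the word vectors of a short text. The sketch below assumes a plain dictionary of embeddings in place of the word2vec tool; the toy vectors are hypothetical, and skipping out-of-vocabulary words is an assumption the excerpt does not state.

```python
def average_vector(text, embeddings):
    """Mean of the embedding vectors of the in-vocabulary words of `text`.

    Returns an empty list when no word of the text has an embedding.
    """
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"corn": [1.0, 0.0], "wheat": [0.0, 1.0]}

mean_vec = average_vector("Corn wheat prices", toy_embeddings)
empty_vec = average_vector("unknown words", toy_embeddings)
```

In the baseline, this mean vector is appended to the short text's own representation to form the expanded text.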
... our framework is task agnostic. However, it learns to expand the short texts for the classification and the clustering tasks in a way that ... framework powerful for adaption by many IR/ML/NLP tasks that deal with short texts. 3.6.3 Effect of Block Learning In the next set of experiments, ...
... this work we presented a technique for learning to expand short texts in a task-specific way. The expansion is such that the task ... with the expanded texts. Hence, our technique can be adapted to expand short texts for any task. To learn task-specific expansion, we presented several classes ... engines. WWW, 2007. [2] S. Banerjee, K. Ramanathan, and A. Gupta. Clustering short texts using wikipedia. SIGIR, 2007. [3] D. M. Blei, A. Y. Ng, ... indexing. KDE, 2005. [5] M. Chen, X. Jin, and D. Shen. Short text classification improved by learning multi-granularity topics. IJCAI, 2011. [6] A. M. Dai, C. ... wikipedia-based concept representation. PAKDD, 2009. [11] T. Kenter and M. de Rijke. Short text similarity with word embeddings. CIKM, 2015. [12] J. Lafferty and C. Zhai. ... 2001. [13] M. C. Lintean and V. Rus. Measuring semantic similarity in short texts through greedy pairing and word semantics. FLAIRS, 2012. [14] C. D. Manning, ... WI, 2006. [22] D. Vitale, P. Ferragina, and U. Scaiella. Classification of short texts by deploying topical annotations. ECIR, 2012. [23] J. Xu, P. Wang, G. ... G. Tian, B. Xu, J. Zhao, F. Wang, and H. Hao. Short text clustering via convolutional neural networks. NAACL-HLT, 2015. [24] S. Yagcioglu, E. Erdem, ...
3
October 2015
CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
Publisher: ACM
Bibliometrics:
Citation Count: 12
Downloads (6 Weeks): 71, Downloads (12 Months): 546, Downloads (Overall): 929
Determining semantic similarity between texts is important in many tasks in information retrieval such as search, query suggestion, automatic summarization and image finding. Many approaches have been suggested, based on lexical matching, handcrafted patterns, syntactic parse trees, external sources of structured semantic knowledge and distributional semantics. However, lexical features, like ...
Keywords:
word embeddings, short text similarity
Title:
Short Text Similarity with Word Embeddings
Keywords:
short text similarity
Abstract:
... typically of one sentence in length. We investigate whether determining short text similarity is possible using only semantic features, where by semantic we ... of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word ... means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. We ... to predict the semantic similarity of new, unlabelled pairs of short texts. We show on a publicly available evaluation set commonly ...
References:
M. C. Lintean and V. Rus. Measuring semantic similarity in short texts through greedy pairing and word semantics. In FLAIRS Conference, 2012.
P. Shrestha. Corpus-based methods for short text similarity. Rencontre des Étudiants Chercheurs en Informatique pour le Traitement automatique des Langues, 2011.
Full Text:
Short Text Similarity with Word Embeddings. Tom Kenter, Maarten de Rijke, tom.kenter@uva.nl, derijke@uva.nl, University of ... well-formed texts, typically of one sentence in length. We investigate whether determining short text similarity is possible using only semantic features, where by semantic we ... types of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of their respective word embeddings. ... the vector means of their respective word embeddings. The features representing labelled short text pairs are used to train a supervised learning algorithm. We use ... for lexical matching aim to determine whether the words in two short texts look alike, e.g., in terms of edit distance [31], lexical ... analysis, like dependency parses or syntactic trees, are often used for short text similarity [19, 39]. Linguistic tools such as parsers are commonly available ...
... word embeddings to the task of determining semantic similarity of short texts is going from word-level semantics to short-text-level semantics. This problem has been studied extensively over the past ... 39]. In this work we propose to go from word-level to short-text-level semantics by combining insights from methods based on external sources of ... types of meta-features from the comparison of the word vectors for short text pairs and from the vector means of their respective word ... embeddings, that have not been used before for the task of short text similarity matching. We show on a publicly available test collection that ...
... generation step of our method for semantic similarity of short texts 3. SHORT TEXT SIMILARITY WITH SEMANTICS ONLY To calculate semantic similarity between two short texts we use a supervised machine learning approach. Algorithm 1 shows the pseudocode ... property of our method of computing semantic textual similarity for short texts, and one which we leverage in our experiments as ... how we obtain the word embeddings themselves. 3.1 From word-level semantics to short-text-level semantics The meaning of longer pieces of text (containing multiple ...
... Saliency-weighted semantic network In Figure 1 the word embeddings of two short texts are represented as dots in a two-dimensional space. As can ... example - two-dimensional representation of the word embeddings for two short texts (each consisting of three terms), represented as transparent and opaque dots ... way of taking into account the distribution of terms in one short text in the semantic space compared to distribution of terms in another ... training corpus. The semantic similarity of term w with respect to short text s is represented by sem(w, s): sem(w, s) = max_{w′ ∈ s} f_sem(w, w′). (2) The function ... f_sts to be symmetrical. Calculating the semantic similarity between two short texts should yield the same score regardless of their order. Secondly, the ... their order. Secondly, the reason why the longest of the two short texts is summed over is that we do not want terms to ...
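Equation (2) above, sem(w, s) = max over w′ in s of f_sem(w, w′), can be sketched as follows. The excerpt truncates the definition of f_sem, so cosine similarity between word embeddings is used here as an assumption; the toy embeddings and the choice to ignore out-of-vocabulary terms are also hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sem(word, text_terms, emb):
    """sem(w, s): the best match of `word` against any term of the short text.

    Terms without an embedding are skipped; 0.0 is returned if none remain.
    """
    return max(
        (cosine(emb[word], emb[t]) for t in text_terms if t in emb),
        default=0.0,
    )

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_emb = {"car": [1.0, 0.0], "automobile": [0.8, 0.6], "banana": [0.0, 1.0]}

best_match = sem("car", ["automobile", "banana"], toy_emb)
no_match = sem("car", ["zebra"], toy_emb)
```

Taking the maximum means each term is scored by its closest counterpart in the other text, regardless of how dissimilar the remaining terms are.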
... the cosine similarities in the semantic space between all terms in short text s_1 and all terms in s_2. This gives us a matrix ... valuable features for a classifier that learns to predict semantic similarities between short texts. To verify this hypothesis we perform experiments without the ...
... behave differently from word2vec vectors when applied to the task of short text similarity. In our experiments we use two sets of publicly ... vectors and what settings work best for the task of predicting short text similarity, we calculate auxiliary word embeddings both with the word2vec ...
... are used. 4.5 Baselines As discussed in Section 2 the systems for detecting short text similarity as described in [20, 21, 38] are natural baselines to ... in Section 5 as it is commonly used for evaluation in short text similarity tasks [4, 17, 20, 21, 31, 38]. The set consists ...
... see whether our method of computing semantic textual similarity for short texts is biased we perform an error analysis concerning two important ...
... have described a generic and flexible method for semantic matching of short texts, which leverages word embeddings of different dimensionality, obtained by ... set that our generic, semantics-only method of computing semantic similarity between short texts outperforms all baseline approaches working under the same conditions, and ...
... features that can be used to yield state-of-the-art performance on the short text similarity task, even if no manually tuned features are added that ... of sentences in the corpus describing the relation in the triple. Short text similarity can be used to improve this confidence score. An evident limitation ...
... semantic text similarity. In *SEM-2013, 2012. [38] P. Shrestha. Corpus-based methods for short text similarity. Rencontre des Étudiants Chercheurs en Informatique pour le Traitement automatique des ... 2013. Introduction; Related work; Short Text Similarity with Semantics Only; From word-level semantics to short-text-level semantics; Saliency-weighted semantic network; Unweighted semantic network; Text level features; Distance between vectors ...
4
August 2011
WI-IAT '11: Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 03
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 2, Downloads (12 Months): 8, Downloads (Overall): 35
Single document summarization is useful for extracting the major ideas from huge amount of daily information. However, it's a challenge to distinguish the relative importance among terms. In this paper, we propose a Web-based approach to term verification. Search-results of extracted terms are utilized as their expanded representation, and their ...
Keywords:
term verification, short text, Web mining
References:
M. Sahami and T. D. Heilman, "A Web-based kernel function for measuring the similarity of short text snippets," Proceedings of the 15th International Conference on World Wide Web (WWW 2006), 2006, pp.377-386.
Full Text:
... verification on the representativeness of extracted terms. Keywords- term verification; short text; Web mining I. INTRODUCTION Automatic document summarization identified key concepts ... resources. Therefore, we focus on applying single-document summarization techniques in short texts. Two existing methods for single-document summarization include: abstraction-based and ... to automatically extract and verify the key concepts from single short text streams for users to understand and find more relevant information. ... effective language-independent approach to extracting and verifying major concepts from short text streams. First, statistical term extraction methods are used to extract ...
... multi-lingual processing, we adopted the PAT-tree-based local maxima method [12]. Short text similarity estimation has received attention in Web applications such as ... Huang et al. [2] proposed to build hierarchical classifiers for short text segments using search-result snippets as corpora. Sahami and Heilman [8] ... proposed a Web-based kernel function for measuring the similarity of short text snippets. Metzler, Dumais, and Meek [6] addressed content, stemmed, and ...
... or longer clauses. Pattern repetition might not be enough in short texts. We expect statistical term extraction to be more useful ... of extracting and verifying key concepts from streams of single short texts using statistical term extraction and Web-based term verification methods. In ... Heilman, "A Web-based kernel function for measuring the similarity of short text snippets," Proceedings of the 15th International Conference on World Wide ...
5
May 2013
WWW '13: Proceedings of the 22nd international conference on World Wide Web
Publisher: ACM
Bibliometrics:
Citation Count: 44
Downloads (6 Weeks): 42, Downloads (12 Months): 538, Downloads (Overall): 2,217
Uncovering the topics within short texts, such as tweets and instant messages, has become an important task for many content analysis applications. However, directly applying conventional topic models (e.g. LDA and PLSA) on such short texts may not work well. The fundamental reason lies in that conventional topic models implicitly ...
Keywords:
short text, content analysis, topic model, biterm
Title:
A biterm topic model for short texts
Keywords:
short text
Abstract:
Uncovering the topics within short texts, such as tweets and instant messages, has become an ... applying conventional topic models (e.g. LDA and PLSA) on such short texts may not work well. The fundamental reason lies in that ... paper, we propose a novel way for modeling topics in short texts, referred to as biterm topic model (BTM). Specifically, in BTM ... patterns at document-level. We carry out extensive experiments on real-world short text collections. The results demonstrate that our approach can discover more ...
References:
O. Jin, N. Liu, K. Zhao, Y. Yu, and Q. Yang. Transferring topical knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 775--784. ACM, 2011.
M. Sahami and T. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th international conference on World Wide Web, pages 377--386. ACM, 2006.
X. Yan, J. Guo, S. Liu, X. Cheng, and Y. Wang. Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In Proceedings of the SIAM International Conference on Data Mining. SIAM, 2013.
X. Yan, J. Guo, S. Liu, X.-q. Cheng, and Y. Wang. Clustering short text using ncut-weighted non-negative matrix factorization. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2259--2262, New York, NY, USA, 2012. ACM.
Full Text:
... Technology, CAS, Beijing, China 100190. yanxiaohui@software.ict.ac.cn, {guojiafeng, lanyanyan, cxq}@ict.ac.cn ABSTRACT Uncovering the topics within short texts, such as tweets and instant messages, has become an important ... this paper, we propose a novel way for modeling topics in short texts, referred to as biterm topic model (BTM). Specifically, in BTM we ... word co-occurrence patterns at document-level. We carry out extensive experiments on real-world short text collections. The results demonstrate that our approach can discover more prominent ... and questions in Q&A websites. Uncovering the topics of such short texts is crucial for a wide range of content analysis tasks, such ... news articles and academic papers), the lack of rich context in short texts makes the topic modeling a challenging problem. Conventional topic models, like PLSA ... co-occurrence patterns [5, 30]. Therefore, directly applying these models on short texts will suffer from the severe data sparsity problem (i.e. the sparse ... popular way to alleviate the sparsity problem is to aggregate short texts into lengthy pseudo-documents before training a standard topic model. For example, ... simplified data generation process may help alleviate the sparsity problem in short texts. However, it loses the flexibility to capture different topic ingredients ...
... approaches, in this paper, we propose a novel topic model for short texts to tackle the sparsity problem. The main idea comes from the ... propose a generative biterm topic model (BTM), which learns topics over short texts by directly modeling the generation of biterms in the whole ... review of related works. Section 3 introduces our model for short text topic modeling, and discusses its implementation in Section 4. Experimental results ... above deal with normal text without considering the specificity of short texts. 2.2 Topic Models on Short Texts. Early studies mainly focused on ... focused on exploiting external knowledge to enrich the representation of short texts. For example, Sahami et al. [28] suggested a search-snippet-based similarity measure ... For example, Sahami et al. [28] suggested a search-snippet-based similarity measure for short texts. Phan et al. [24] learned hidden topics from large external resources ... learned hidden topics from large external resources to enrich the representation of short texts. Jin et al. [19] learned topics on short texts via transfer learning from auxiliary long text data. These ways may ...
... However, due to the lack of specific topic models for short texts, some researchers directly applied conventional (or slightly modified) topic models ... topic modeling in Twitter, and suggested that new topic models for short texts are in demand. In our previous works, we developed methods based ... previous works, we developed methods based on non-negative matrix factorization for short text clustering [34] and topic learning [33] by exploiting global word co-occurrence information. ... them by proposing a more principled approach to model topics over short texts. To the best of our knowledge, the proposed topic model ... the proposed topic model is the first one focusing on general-domain short texts, which does not exploit any external knowledge. 3. OUR APPROACH Conventional topic ... document-level word co-occurrence patterns, whose effectiveness will be highly influenced in short text scenario where the word co-occurrence patterns become very sparse in each ... we propose a novel biterm topic model, which learns topics over short texts by directly modeling the generation of all the biterms (i.e. word co-occurrence ... refers to a proper text window containing meaningful word co-occurrences. In short texts, since documents are usually short and specific, we just take ... an individual context unit. We extract any two distinct words in a short text document as a biterm. For example, in the short text document ...
... modeling fine topics in documents. As we can see, even in short texts, there might be multiple topics in one document. BTM, shown in ... the frequency of the biterm b in the document d. In short texts, P(b|d) is nearly a uniform distribution over all biterms ...
... of BTM is about (l - 1)/2 times of LDA. In short texts, the average length of documents is very small, e.g. ... tuned via grid search: for LDA, α = 0.05 and on short text collections, and α = 50/K on the normal text collection, β ...
... Evaluation on Tweets2011 Collection. To verify the effectiveness of BTM on short texts, we carried out experiments on a standard short text collection, namely Tweets2011. It was published in TREC 2011 microblog track, which ... is 5.21. We compared BTM with three topic modeling methods on this short texts collection: (a) the standard LDA, which takes each tweet as a ...
... time. It suggests that LDA is not a good choice for short texts due to the data sparsity problem. One may wonder the impact of ... previous experiments, we have demonstrated the effectiveness of BTM on short texts. Although we propose BTM for the short text scenario, there is no limitation for our model to be applied ... the collection. This process is a little different from that in short texts. Recall that a biterm is defined as a word-pair co-occurred ...
... common but less readable. 6. CONCLUSION & FUTURE WORKS Topic modeling for short texts ... is an increasingly important task due to the prevalence of short texts on the Web. Compared to normal documents, short texts lack word frequency and context information, causing severe sparsity problems for ... In this paper, we propose a novel probabilistic topic model for short texts, namely biterm topic model (BTM). BTM can well capture the topics ... biterm topic model (BTM). BTM can well capture the topics within short texts as it explicitly models the word co-occurrence patterns and uses the ... in the whole corpus. We carried out experiments on two real-world short text collections and one normal text collection. The results demonstrated that BTM ... All these benefits make BTM a practicable choice for content analysis on short texts in a wide range of applications. To the best of our ... we are the first to propose a topic model for general short texts. However, there is still room to improve our work in ...
... Yu, and Q. Yang. Transferring topical knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM international conference on Information and ... Guo, S. Liu, X. Cheng, and Y. Wang. Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In Proceedings of the ... Yan, J. Guo, S. Liu, X.-q. Cheng, and Y. Wang. Clustering short text using ncut-weighted non-negative matrix factorization. In Proceedings of the 21st ACM international ...
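The BTM excerpts above state that any two distinct words in a short text document are taken as a biterm, and that biterms are modeled over the whole corpus. The following is a minimal sketch of that extraction step only, not the authors' implementation. The function names `extract_biterms` and `corpus_biterm_counts` are hypothetical, and whitespace tokenization, lowercasing, and treating a biterm as an unordered pair are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of distinct words in one short document.

    Assumption: a repeated word counts once, and pairs are stored in
    sorted order so that (a, b) and (b, a) are the same biterm.
    """
    words = sorted(set(tokens))
    return list(combinations(words, 2))


def corpus_biterm_counts(docs):
    """Aggregate biterm counts over the whole corpus (hypothetical helper)."""
    counts = Counter()
    for doc in docs:
        # Assumption: lowercase whitespace tokenization is good enough here.
        counts.update(extract_biterms(doc.lower().split()))
    return counts


if __name__ == "__main__":
    docs = ["apple store", "visit apple store"]
    for biterm, n in sorted(corpus_biterm_counts(docs).items()):
        print(biterm, n)
```

Counting over the corpus rather than per document is what lets sparse documents pool their co-occurrence evidence; the topic inference itself is outside the scope of this sketch.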
6
February 2017
ICMLC 2017: Proceedings of the 9th International Conference on Machine Learning and Computing
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 10, Downloads (12 Months): 10, Downloads (Overall): 10
Full text available:
PDF
While a hard clustering algorithm lets a pattern belong to a single cluster class, soft clustering allows patterns to belong to all cluster classes with different degrees of relationship. This is important in the case of short-text clustering, in which a small-sized text fragment such as a quotation or ...
Keywords:
Soft Short-Text Clustering, Centrality Measurement, PageRank
Title:
Soft Short-Text Clustering using PageRank as a Centrality Measure
Keywords:
Soft Short-Text Clustering
Abstract:
... degrees of relationship. This is important in the case of short-text clustering, in which a small-sized text fragment such as a ... more than one subject or topic. However, soft clustering of short-text is complicated by the computational difficulties inherent in defining cluster ... cluster centrality measures. Therefore, this paper proposes a new soft short-text clustering algorithm based on the use of PageRank as a ...
References:
Abdalgader, K., and Skabar, A. 2011. Short-Text Similarity Measurement Using Word Sense Disambiguation and Synonym Expansion. In Proceedings of the 23rd Australasian Joint Conference on Artificial Intelligence. (AI2010, Adelaide, Australia). vol. LNAI 6464, pp. 435--444.
Full Text:
The Association for Computing Machinery. Soft Short-Text Clustering using PageRank as a Centrality Measure. Khaled Abdalgader, Assistant ... degrees of relationship. This is important in the case of short-text clustering, in which a small-sized text fragment such as a ... more than one subject or topic. However, soft clustering of short-text is complicated by the computational difficulties inherent in defining cluster ... cluster centrality measures. Therefore, this paper proposes a new soft short-text clustering algorithm based on the use of PageRank as a ... Concepts: Computing methodologies → Artificial intelligence → Natural language processing → Lexical semantics. Keywords: Soft Short-Text Clustering; Centrality Measurement; PageRank. 1. INTRODUCTION Although text clustering at ... is well-established in the information retrieval literature, clustering at the short-text level is complicated by the fact that word co-occurrence (i.e., ... small-sized text fragments (e.g., sentences). To overcome this issue, several short-text similarity measures have recently been proposed [1][2][3][4][5]. These measures have ... of pairwise similarities, hence they can readily be applied to short-text clustering, as has been shown in [4]. However, spectral clustering ... different degrees of relationship. For example, in the case of short-text clustering, a text fragment may be related to a number ... of centrality, and this is problematic in the case of short-text clustering. While it is possible to define the centroid of ... because these (fragment) similarity measures are designed to compare only short-text fragments. The idea of using PageRank [6] as a centrality ... of document summarization, in which the objective is to rank short-text fragments according to their importance in the document or documents ... PageRank algorithm can be applied to a single set of short-texts
... (arg max expression not recoverable from the extraction). Algorithm 1. Soft PageRank Clustering. Input: Short-Texts (Sentences) to be clustered, S = {s_i | i = 1..N} ... step, with subsequent renormalization of cluster membership values. 2.1 Measuring Short-Text Similarity. To measure sentence similarity we use a variant of ...
... we have applied it to clustering selected famous quotations (i.e., short-texts). We believe that such quotations provide a rich context ...
... to perform soft clustering is useful in the context of short-text clustering, in which a text fragment may be related to ...
... tasks. 5. REFERENCES [1] Abdalgader, K., and Skabar, A. 2011. Short-Text Similarity Measurement Using Word Sense Disambiguation and Synonym Expansion. In ...
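The excerpts above describe using PageRank as a centrality measure over a set of short-texts related by pairwise similarities. The sketch below shows only that generic centrality step, computed by power iteration on a symmetric similarity matrix. It is not the paper's Algorithm 1, which is elided in the excerpt. The function name `pagerank_centrality`, the damping factor 0.85, the fixed iteration count, and the toy similarity matrix are all assumptions.

```python
def pagerank_centrality(sim, damping=0.85, iterations=100):
    """Weighted PageRank over a similarity matrix given as a list of lists.

    Assumptions: sim[i][j] >= 0 is the similarity between fragments i and j,
    the diagonal is zero, and every fragment is similar to at least one other
    (a row of all zeros would leak score mass in this simple version).
    """
    n = len(sim)
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to
            # how much of its total similarity is directed at i.
            incoming = sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


if __name__ == "__main__":
    # Toy example: fragment 0 is similar to both others, which are
    # not similar to each other, so fragment 0 should be most central.
    sim = [
        [0.0, 1.0, 1.0],
        [1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
    ]
    print(pagerank_centrality(sim))
```

A centrality score of this kind avoids having to define a centroid for fragments that can only be compared pairwise, which is the difficulty the excerpt raises.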
7
October 2010
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
Publisher: ACM
Bibliometrics:
Citation Count: 2
Downloads (6 Weeks): 4, Downloads (12 Months): 15, Downloads (Overall): 204
Full text available:
PDF
People are increasingly using more and more social softwares, generating flooding communications. User analytics may be performed to mine a person's activities on different social systems and extract patterns, be it interest patterns, social patterns, or work patterns. Such patterns may benefit both the individuals and the organizations the users ...
Keywords:
collaboration analytics, work patterns, clustering, short text
Keywords:
short text
Abstract:
... The challenges we face include the clustering of items with short texts and prioritizing/weighting data items based on importance/relevance. Our solutions to ... and use such information to enrich the semantics of the short texts contained in the activity instances on collaboration tools before clustering ...
References:
S. Banerjee, K. Ramanathan, and A. Gupta. Clustering short texts using wikipedia. In SIGIR '07: Proceedings of the international ACM SIGIR conference on Research and development in information retrieval, pages 787--788, 2007. ACM.
Full Text:
... user. The challenges we face include the clustering of items with short texts and prioritizing/weighting data items based on importance/relevance. Our solutions to those ... data and use such information to enrich the semantics of the short texts contained in the activity instances on collaboration tools before clustering those instances ... Wikipedia or WorkNet for semantic information of the words in the short texts on the collaboration tools. Second, a user's activities on various collaboration ...
... importance of the work item, determine and enrich the semantics of short texts in work items. We mine the semantic information about words ... In this section, we develop a novel approach to enrich the short texts with additional information to solve this problem. More specifically, we examine ... with Related Words. We first present the algorithm to enrich the short text in work items by utilizing the semantic similarity between words. To ... a work item Ij, we attempt to extend the short text of Ij by adding semantic information from the keyword dictionary. For every ...
MINING WORK AREAS. After enriching the short text of work items, the next step is to mine work areas ...
... much more efficient than LDA in terms of running time. 6. RELATED WORK Clustering short texts is a challenging problem, because short texts do not provide enough ... solutions that leverage external information sources to provide greater context for short texts. External information sources that have been considered include search engines ... solution further distinguishes from existing work in that it enriches short texts not only with related words from a dictionary, but also with ... work items where the short texts appear. To our knowledge, enriching short texts with people background information has not been studied in the ...
8
August 2012
SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Publisher: ACM
Bibliometrics:
Citation Count: 17
Downloads (6 Weeks): 21, Downloads (12 Months): 192, Downloads (Overall): 1,076
Full text available:
PDF
We propose a simple, scalable, and non-parametric approach for short text classification. Leveraging the well studied and scalable Information Retrieval (IR) framework, our approach mimics human labeling process for a piece of short text. It first selects the most representative and topical-indicative words from a given short text as query ...
Keywords:
short text classification, search and vote
Title:
Short text classification using very few words
Keywords:
short text classification
Abstract:
We propose a simple, scalable, and non-parametric approach for short text classification. Leveraging the well studied and scalable Information Retrieval (IR) ... our approach mimics human labeling process for a piece of short text. It first selects the most representative and topical-indicative words ... selects the most representative and topical-indicative words from a given short text as query words, and then searches for a small set ... words, and then searches for a small set of labeled short texts best matching the query words. The predicted category label is ...
References:
M. Chen, X. Jin, and D. Shen. Short text classification improved by learning multi-granularity topics. In IJCAI, pages 1776--1781, 2011.
X. Hu, N. Sun, C. Zhang, and T.-S. Chua. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In CIKM, pages 919--928, 2009.
Full Text:
... our approach mimics human labeling process for a piece of short text. It first selects the most representative and topical-indicative words from ... Search and Vote. 1. INTRODUCTION We are now dealing with much more short texts. Examples are snippets in search results, tweets, status updates, comments, ... results, tweets, status updates, comments, and reviews from various social platforms. Short texts are in general much shorter, noisier, and sparser, which calls for ... text classification. Short text classification is to assign a piece of short text one or more predefined categories. Most existing approaches try to enrich the ... categories. Most existing approaches try to enrich the representation of a short text ... using additional semantics. The semantics could be derived internally from the short text collection [3], externally from a collection of much longer documents ... collection of much longer documents in a similar domain as the short texts [5], or from much larger external sources such as Wikipedia and ... from Wikipedia pages using topic model. In the opposite direction of enriching short text representation, we propose to trim a short text representation to get a few most representative words for topical classification. ... framework mimic the human labeling process. Due to the length of short text (e.g., a search result snippet is usually fewer than 20 ... is assigned mainly based on the few keywords observed from the short text. More specifically, given a short text, we identify one or more words that best represent ... we identify one or more words that best represent the short text and formulate a weighted word query using each selected word and ... the one receiving highest votes from the search results (i.e., labeled short texts). Leveraging the well studied IR techniques, the proposed approach is extremely ... proposed approach is the selection of representative words from a short text as query words. Ideally, the query words should be (i) well ...
should be (i) well representing the main content of the short text, and (ii) topically indicative. A word is topically indicative if ...
... poorer than the results reported on the same dataset using enriched short text representation [1]. The enriched representation is a combination of word feature ... The enriched representation is a combination of word feature from the short text, and topic feature from a topic space derived from Wikipedia. ...
... We also note that our approach complements the research on enriching short text representation. Moreover, the non-parametric approach offers much more flexibility in handling ... word or phrase selection, (ii) relevance ranking techniques specifically for short texts, and (iii) short text ... representation enrichment. 5. REFERENCES [1] M. Chen, X. Jin, and D. Shen. Short text classification improved by learning multi-granularity topics. In IJCAI, pages 1776--1781, 2011. [2] ... T.-S. Chua. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In CIKM, pages 919--928, 2009. [4] X. H. Phan, ...
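The excerpts above outline a search-and-vote pipeline: pick a few representative words from the short text, search a set of labeled short texts with them, and predict the category that receives the most votes from the search results. The sketch below illustrates that flow under loud assumptions and is not the paper's method. The word-selection rule (prefer words with the lowest document frequency in the labeled set), the overlap-count ranking, and the names `select_query_words` and `search_and_vote` are all hypothetical stand-ins, since the excerpt does not give the actual scoring or query weighting.

```python
from collections import Counter


def select_query_words(tokens, doc_freq, k=2):
    """Pick up to k query words (assumption: rarest known words first)."""
    known = [w for w in set(tokens) if w in doc_freq]
    return sorted(known, key=lambda w: (doc_freq[w], w))[:k]


def search_and_vote(text, labeled, k_words=2, k_hits=3):
    """Classify `text` by majority vote over the best-matching labeled texts.

    `labeled` is a list of (short_text, label) pairs. Ranking by the number
    of matched query words is a placeholder for a real IR ranking function.
    """
    labeled_tokens = [(set(t.lower().split()), label) for t, label in labeled]
    doc_freq = Counter(w for toks, _ in labeled_tokens for w in toks)
    query = select_query_words(text.lower().split(), doc_freq, k_words)
    scored = [(sum(w in toks for w in query), label) for toks, label in labeled_tokens]
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])[:k_hits]
    votes = Counter(label for _, label in hits)
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    labeled = [
        ("cheap flights to paris", "travel"),
        ("hotel deals paris", "travel"),
        ("python list sort", "tech"),
        ("sort array in java", "tech"),
    ]
    print(search_and_vote("how to sort a python list", labeled))
```

Because no model is trained, adding a new category in this sketch only requires adding labeled examples, which matches the flexibility the excerpt attributes to a non-parametric approach.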
9
December 2015
The VLDB Journal — The International Journal on Very Large Data Bases: Volume 24 Issue 6, December 2015
Publisher: Springer-Verlag New York, Inc.
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 1, Downloads (12 Months): 27, Downloads (Overall): 39
Full text available:
PDF
Micro-blogging services have become major venues for information creation, as well as channels of information dissemination. Accordingly, monitoring them for relevant information is a critical capability. This is typically achieved by registering content-based subscriptions with the micro-blogging service. Such subscriptions are long-running queries that are evaluated against the stream of ...
Keywords:
Publish/subscribe, Short text matching, Stream processing
Title:
S³-TM: scalable streaming short text matching
Keywords:
Short text matching
Abstract:
... address this challenge, we present the S³-TM system for streaming short text matching. S³-TM is organized as a stream processing application, in ...
Full Text:
S³-TM: scalable streaming short text matching. The VLDB Journal (2015) 24:849-866. DOI 10.1007/s00778-015-0404-3. REGULAR PAPER. Fuat Basık, Buğra Gedik, Hakan Ferhatosmanoğlu, Mert ... To address this challenge, we present the S³-TM system for streaming short text matching. S³-TM is organized as a stream processing application, in the ... to the baseline multicast and outperforms the graph partitioning-based approaches. Keywords: Short text matching, Stream processing, Publish/subscribe. 1 Introduction. Micro-blogging has enjoyed wide adoption ... In this work, we present S³-TM, a stream processing-based solution to scalable short text matching under the content-based subscription model. We develop effective techniques and algorithms ...
... number of machines, to break the hot words apart. 1.5 Overload and load shedding. Finally, under unexpected spikes in ...
... subscriptions assigned to the same stage to reduce the overall matching load. Finally, the dispatcher stage is responsible ...
... much as possible. Since this baseline does not ... [Fig. 2, panels (a)-(d): Word network partitioning algorithms: a ...]
... more child nodes, each associated with a word, ... [Alg. 2: LASP, Load-Aware Subscription Placement. Data: S, subscription to ...]
... overload segment. The idea is that the system will increase the level of load shedding when ...
... 2.56 times better throughput than the baseline hash-based routing ... [Fig. 5, panels a-h: Relative ...]
... the topic-based model. The former can be explained by the reverse, that is low overlap among the ...
... of distributed content-based pub/sub middleware that use broker overlays. For instance, PADRES employs a network of brokers and clients ...
... contrast to StreamHub and Cobra, we take advantage of the short text matching problem domain to avoid the broadcast. Most importantly, our work ... delta. 8 Conclusion. In this paper, we presented S³-TM, a system for scalable streaming short text matching. S³-TM is designed to be run in a data center ...
... boolean model. ACM Trans. Database Syst. 19(2), 332-364 (1994). Outline: Abstract; 1 Introduction; 1.1 Publication routing; 1.2 Load balancing; 1.3 Subscription placement and matching; 1.4 ...
10
October 2012
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
Publisher: ACM
Bibliometrics:
Citation Count: 5
Downloads (6 Weeks): 5, Downloads (12 Months): 70, Downloads (Overall): 527
Full text available:
PDF
Non-negative matrix factorization (NMF) has been successfully applied in document clustering. However, experiments on short texts, such as microblogs, Q&A documents and news titles, suggest unsatisfactory performance of NMF. A major reason is that the traditional term weighting schemes, like binary weight and tfidf, cannot well capture the terms' ...
Keywords:
clustering, short text, NMF, normalized cut
Title:
Clustering short text using Ncut-weighted non-negative matrix factorization
Keywords:
short text
Abstract:
... has been successfully applied in document clustering. However, experiments on short texts, such as microblogs, Q&A documents and news titles, suggest ... cannot well capture the terms' discriminative power and importance in short texts, due to the sparsity of data. To tackle this ... sets show our weighting scheme significantly boosts NMF's performance on short text
Full Text:
Clustering Short Text Using Ncut-weighted Non-negative Matrix Factorization. Xiaohui Yan, Jiafeng Guo, Institute of Computing Technology, CAS, Beijing, ... (NMF) has been successfully applied in document clustering. However, experiments on short texts, such as microblogs, Q&A documents and news titles, suggest unsatisfactory ... tfidf, cannot well capture the terms' discriminative power and importance in short texts, due to the sparsity of data. To tackle this problem, ... two datasets show our weighting scheme significantly boosts NMF's performance on short text clustering. Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Information Search and ... N is the total number of documents in corpus. However, in short text both tf and idf are not very differentiable. Figure 1(a) shows ... information. Instead, we propose a novel term weighting scheme for NMF on short text clustering based on words co-occurrence information. This weight is derived ...
... weighting scheme has the problem of skew to high values in short texts. However, the Ncut-weight refrains from such problem by counting the term ...
... binary weights are least discriminative, while Kmeans works worst on these short texts. 5. CONCLUSION Term weighting is important for NMF in document clustering. ... weighting schemes, like binary weights and tfidf, are not effective for short text clustering. We have proposed a novel term weight called Ncut-weight, which measures ...
11
December 2011
AusDM '11: Proceedings of the Ninth Australasian Data Mining Conference - Volume 121
Publisher: Australian Computer Society, Inc.
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 5, Downloads (12 Months): 25, Downloads (Overall): 118
Full text available:
PDF
The automatic clustering of textual data according to their semantic concepts is a challenging, yet important task. Choosing an appropriate method to apply when clustering text depends on the nature of the documents being analysed. For example, traditional clustering algorithms can struggle to correctly model collections of very short text ...
Keywords:
background knowledge, text mining, clustering, short text
Title:
Enhancing short text clustering with small external repositories
Keywords:
short text
Abstract:
... clustering algorithms can struggle to correctly model collections of very short text due to their extremely sparse nature. In recent times, much ... attention has been directed to finding methods for adequately clustering short text. Many popular approaches employ large, external document repositories, such ... apply. This paper also employs external document collections to aid short text clustering performance. The external collections are referred to in this ... a separate collection of Background Knowledge is obtained for each short text dataset. However, this Background Knowledge contains several orders of magnitude ... is described where the Background Knowledge is used to re-express short text in terms of a much richer feature space. A discussion ...
References:
Banerjee, S., Ramanathan, K. & Gupta, A. (2007), Clustering short texts using wikipedia, in 'SIGIR'.
Hu, X., Sun, N., Zhang, C. & Chua, T. (2009), Exploiting internal and external semantics for the clustering of short texts using world knowledge, in 'CIKM'.
Sahami, M. & Heilman, T. (2006), A web-based kernel function for measuring the similarity of short text snippets, in 'WWW'.
Full Text:
Enhancing Short Text Clustering with Small External Repositories. Henry Petersen, Josiah Poon, School of Information ... example, traditional clustering algorithms can struggle to correctly model collections of very short text due to their extremely sparse nature. In recent times, much attention ... much attention has been directed to finding methods for adequately clustering short text. Many popular approaches employ large, external document repositories, such ... to apply. This paper also employs external document collections to aid short text clustering performance. The external collections are referred to in this paper ... a separate collection of Background Knowledge is obtained for each short text dataset. However, this Background Knowledge contains several orders of magnitude fewer documents ... approach is described where the Background Knowledge is used to re-express short text in terms of a much richer feature space. A discussion of ... datasets represented in the richer feature space. Keywords: Text Mining, Clustering, Short Text, Background Knowledge. 1 Introduction. The huge volume of information available through resources ...
... as Background Knowledge. Their work uses the Background Knowledge to map short text strings into an alternative representation called the Bridging space. Proceedings of the ... Knowledge need not be drawn from the same distribution as the short text documents, and can differ significantly in length and structure. Additionally, and in ... (2002)). We demonstrate the effectiveness of the Bridging space for short text clustering tasks. • A function is proposed for use clustering text ... distribution. Background documents may be substantially greater in length than the short text in the target collection. The only requirement is that the ... is a problem specific task. For example given a target collection of short text consisting of a set of technical paper titles we wished ... et al. (2008)). 2.1 Motivation. Due to the highly sparse nature of short text, it can be very difficult for algorithms to effectively model the ... difficult for algorithms to effectively model the co-occurrence structure of a short text collection. Within such a problem domain, it is highly likely that many ... and "notebook" are all semantically related, however when dealing with short text corpora are unlikely to appear together in a single document. Each individual ... possibility that such a pair of related words from the short text will co-occur in the larger documents (Zelikovitz (2002)). As a result of ...
... Space" (Weng & Poon (2006)). In order to represent a short text in the Bridging space we generate one feature for each document ... equal to the result of the cosine similarity between the short text and the corresponding Background document. More explicitly, given a vector ... then propose a clustering function that, when applied over a clustering of short texts represented in the Bridging space will provide a good measure ... a good measure of the overall quality of the solution. Given a short text collection and associated corpus of Background Knowledge, each Background document ...
... 1: The value produced by the functionfor any pair of short texts should depend onlyon features with non-zero similarity to both doc-uments ... onlyon features with non-zero similarity to both doc-uments (ie. the short texts both share at leastone term with the corresponding Background doc-ument).As ...
... considered purely as a direct measure of similarity between pairs of short texts, (2) appears to possess some interesting properties. When used as a ... claims, consider figure 2 which shows the attribute values for three short text snippets as well as a Background Knowledge corpus with three documents. ... an individual attribute can be considered to describe the similarity between a short text string and the topics described in a piece of background knowledge. ... background knowledge. It follows then that when applied to two short text snippets, the function described in (2) can be regarded as a ...
... text and each class label in the data set. 4 Related Work. Clustering short text based on semantic similarity is a problem that has seen much ... class of solution for measuring semantic similarity between pairs of short text involves the application of additional data not available in the original dataset ... al. (2007)) propose methods to employ the results of Google searches on short text strings to measure their similarity. While such algorithms have proven effective ... measure their similarity. While such algorithms have proven effective for suitably short text, they are inappropriate for application to longer documents. This is ... documents. This is due to the algorithms' use of the target short text snippet as Google queries. Our approach has no such limitation. Some researchers ...
... space can improve the ability of clustering algorithms to measure similarity between short text documents. We note that when clustering the 2CNews, 2CPhys and 3CPhys in ... each vector (i.e., from Background documents that share terms with both short text strings), and features for which both vectors are zero (i.e., Background ...
... that Background documents sharing no terms with one or both of the short text strings should not influence the result of the function. 6 Conclusions. In this ... Confere (AusDM'11), Ballarat, Australia 87 ground Knowledge to construct an alternative representation for short text called the Bridging space. We show that using Background Knowledge with ... similarity function has been shown to significantly increase the purity when clustering short text documents. An explicit comparison of small and large external document corpora was ... would be interesting. References. Banerjee, S., Ramanathan, K. & Gupta, A. (2007), Clustering short texts using wikipedia, in 'SIGIR'. Blei, D. M., Ng, A. Y. & ...
... Chua, T. (2009), Exploiting internal and external semantics for the clustering of short texts using world knowledge, in 'CIKM'. Jing, L., Zhou, L., Ng, M. & ...
12
April 2012
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web
Publisher: ACM
Bibliometrics:
Citation Count: 2
Downloads (6 Weeks): 7, Downloads (12 Months): 55, Downloads (Overall): 341
Full text available:
PDF
Partly due to the proliferance of microblog, short texts are becoming prominent. A huge number of short texts are generated every day, which calls for a method that can efficiently accommodate new data to incrementally adjust classification models. Naive Bayes meets such a need. We apply several smoothing models to ...
Keywords:
naive bayes, short text classification, smoothing
Title:
Enhancing naive bayes with various smoothing methods for short text classification
Keywords:
short text classification
Abstract:
Partly due to the proliferance of microblog, short texts are becoming prominent. A huge number of short texts are generated every day, which calls for a method that ... Naive Bayes for question topic classification, as an example of short text classification, and study their performance. The experimental results on a ...
13
August 2012
SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Publisher: ACM
Bibliometrics:
Citation Count: 4
Downloads (6 Weeks): 1, Downloads (12 Months): 17, Downloads (Overall): 134
Full text available:
PDF
Mobile SMS spam is on the rise and is a prevalent problem. While recent work has shown that simple machine learning techniques can distinguish between ham and spam with high accuracy, this paper explores the individual contributions of various textual features in the classification process. Our results reveal the surprising ...
Keywords:
short text, messaging, spam, feature selection, sms
Keywords:
short text
Full Text:
... Storage and Retrieval]: Information Search and Retrieval - Information filtering. General Terms: Algorithms, Experimentation. Keywords: SMS, Spam, Short Text, Feature Selection, Messaging. 1. INTRODUCTION. Spam detection has historically focused on ... phones and the advent of Twitter and other social media, short text spam detection has garnered interest and importance. Our work focuses on ... filters and methods for email spam are sufficiently accurate for identifying short text spam [1, 2, 5]. However, most of the work in this ...
14
July 2011
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Publisher: ACM
Bibliometrics:
Citation Count: 9
Downloads (6 Weeks): 6, Downloads (12 Months): 46, Downloads (Overall): 716
Full text available:
PDF
Recently, more and more short texts (e.g., ads, tweets) appear on the Web. Classifying short texts into a large taxonomy like ODP or Wikipedia category system has become an important mining task to improve the performance of many applications such as contextual advertising and topic detection for micro-blogging. In this ...
Keywords:
short text, large scale hierarchy, classification
Title:
Towards effective short text deep classification
Keywords:
short text
Abstract:
Recently, more and more short texts (e.g., ads, tweets) appear on the Web. Classifying short texts into a large taxonomy like ODP or Wikipedia category system ... semantic analysis is used to add more features for both short texts and categories. Second, we leverage information retrieval technologies to fetch ... technologies to fetch the most relevant categories for an input short text from thousands of candidates. Finally, a SVM classifier is applied ...
References:
B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text classification in twitter to improve information filtering. In SIGIR, pages 841--842, 2010.
Full Text:
Towards Effective Short Text Deep Classification. Xinruo Sun, Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University, Shanghai, ... & Knowledge Management Lab, Shanghai Jiao Tong University, Shanghai, China, yyu@apex.sjtu.edu.cn. ABSTRACT. Recently, more and more short texts (e.g., ads, tweets) appear on the Web. Classifying short texts into a large taxonomy like ODP or Wikipedia category system ... explicit semantic analysis is used to add more features for both short texts and categories. Second, we leverage information retrieval technologies to fetch the ... retrieval technologies to fetch the most relevant categories for an input short text from thousands of candidates. Finally, a SVM classifier is applied on ... years, with the emergence of the Social Web, more and more short texts such as ads, tags and tweets are generated and consumed by ... and tweets are generated and consumed by web applications and users. Classifying short texts into a large taxonomy like ODP or Wikipedia category system has ... faces two main challenges. Firstly, the information is expressed in very short text, which results in feature sparseness. Secondly, categories at deep levels of ... 978-1-4503-0757-4/11/07. There are two categories of research work related to our task: short text classification and deep classification. The former tries to deal with the ... classification. In this paper, we propose a multi-stage approach for effective short text deep classification. The contributions are threefolds. First we leverage explicit semantic ... employed to find a few most relevant categories to the input short text for further finer-grained classification, thus avoiding the poor performance of traditional ...
... conclude that our approach successfully improves the deep classification performance on short text. As of future work, we will try a more advanced ... B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text
15
July 2016
SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
Publisher: ACM
Bibliometrics:
Citation Count: 2
Downloads (6 Weeks): 79, Downloads (12 Months): 1,211, Downloads (Overall): 1,211
Full text available:
PDF
For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics from short texts is a critical and fundamental task. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, due to the length of each document, short ...
Keywords:
short texts, topic model, word embeddings
Title:
Topic Modeling for Short Texts with Auxiliary Word Embeddings
Keywords:
short texts
Abstract:
For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics from short texts is a critical and fundamental task. Conventional topic models largely ... of documents. However, due to the length of each document, short texts are much more sparse in terms of word co-occurrences. Data ... bottleneck for conventional topic models to achieve good results on short texts. On the other hand, when a human being interprets ... other hand, when a human being interprets a piece of short text, the understanding is not solely based on its content ... Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper. To this end, ... we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture (DMM) ... documents can be easily exploited to improve topic modeling for short texts. Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable ...
References:
M. Chen, X. Jin, and D. Shen. Short text classification improved by learning multi-granularity topics. In IJCAI, 2011.
X. Cheng, X. Yan, Y. Lan, and J. Guo. BTM: topic modeling over short texts. IEEE Trans. Knowl. Data Eng., 2014.
O. Jin, N. N. Liu, K. Zhao, Y. Yu, and Q. Yang. Transferring topical knowledge from auxiliary long texts for short text clustering. In CIKM, 2011.
T. Kenter and M. de Rijke. Short text similarity with word embeddings. In CIKM, 2015.
B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text classification in twitter to improve information filtering. In SIGIR, 2010.
A. Sun. Short text classification using very few words. In SIGIR, 2012.
X. Yan, J. Guo, Y. Lan, and X. Chen. A biterm topic model for short texts. In WWW, 2013.
J. Yin and J. Wang. A dirichlet multinomial mixture model-based approach for short text clustering. In SIGKDD, 2014.
Full Text:
Topic Modeling for Short Texts with Auxiliary Word Embeddings. Chenliang Li1, Haoran Wang1, Zhiqian Zhang1, Aixin Sun2, ... collection of documents. However, due to the length of each document, short texts are much more sparse in terms of word co-occurrences. Data sparsity ... a bottleneck for conventional topic models to achieve good results on short texts. On the other hand, when a human being interprets a ... the other hand, when a human being interprets a piece of short text, the understanding is not solely based on its content ... Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper. To this end, we ... end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture (DMM) model, ... external documents can be easily exploited to improve topic modeling for short texts. Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable or ... evaluation. Categories and Subject Descriptors: I.2.7 [Natural Language Processing]: Text analysis. Keywords: Topic Model, Short Texts, Word Embeddings. Permission to make digital or hard copies of ... and questions/answers, to name a few. Given the large volume of short texts available, effective and efficient models to discover the latent topics ... tasks, conventional topic models experience a large performance degradation over short texts because of limited word co-occurrence information in short texts. In other words, data sparsity impedes the generation of ... have been proposed to deal with the data sparsity problem in short texts. One strategy is to aggregate a subset of short texts to form a longer pesudo-document. Conventional topic models are then ...
... Another strategy is to restrict the document-topic distribution, such that each short text is sampled from a single topic, known as mixture of ... [Figure 1: GPU-DMM Overview] ... is limited to the short text corpus itself. Given two words having strong semantically relatedness but rarely ... for a more generic model that can be effectively applied to short texts, without the need of manually constructed thesauri, and not limited ... domains. In this paper, we propose a new topic model for short texts, named GPU-DMM. GPU-DMM is designed to leverage the general word ... no co-occurrences in the current collection of short texts being modeled. (Footnotes: 1. https://code.google.com/p/word2vec 2. Details are listed at https://github.com/3Top/word2vec-api) A filtering strategy is also introduced in GPU-DMM ... and effective topic model to learn the latent topic patterns over short texts. The model is flexible in directly taking in word ...
... based on word embeddings with GPU model. 2. On two real-world short text collections in two languages, we evaluate the proposed GPU-DMM against a ... evaluate the proposed GPU-DMM against a few state-of-the-art alternatives for short texts. Experimental results demonstrate our model's superiority, in topic coherence, text classification ... WORK. We review recent advances on learning better topic representations on short texts. We then focus on the models with word embeddings because ... because our model uses word embeddings as external knowledge. Topic Models for Short Texts. Conventional topic models such as pLSA and LDA are designed ... models suffer a lot from the data sparsity problem in short texts, leading to inferior topic inferences. Earlier studies focus on ... on exploiting external knowledge to help refine the topic inference of short texts. Phan et al. proposed to infer topic structure of short texts by using the learnt latent topics from Wikipedia [28]. Similarly, Jin ... from Wikipedia [28]. Similarly, Jin et al. infer latent topics of short texts for clustering by using auxiliary long texts [15]. These models require ... available in some domains and/or languages. Given the limited context information in short texts, many aggregation strategies have been studied by merging short texts into long pseudo-documents. Conventional topic modeling is then applied to infer ... news headlines. These studies suggest that topic models specifically designed for general short texts are imperative. A simple and effective topic model, named Dirichlet Multinomial ... Mixture (DMM) model, has been employed to discover latent topics in short texts in many tasks [40, 42]. DMM is based on the assumption ... from a single latent topic. Given the limited content of short texts, this assumption is reasonable and is proven to be more ... collapsed Gibbs Sampling algorithm for DMM and show its effectiveness in short text clustering. Due to its simplicity and effectiveness, we develop GPU-DMM ... topic model (SATM) for short texts [30]. SATM assumes that each short text is a segment of a long pseudo-document and shares the same ... involving both text aggregation and topic sampling is time-consuming. Topic Models for Short Texts with Word Embeddings. Word embeddings, first introduced in [33], have been ... al. [24]. They propose a topic model with word embeddings for short texts, called LF-DMM. Built based on DMM, LF-DMM replaces the topic-word multinomial ...
... continuous word embedding component. That is, each word in a short text is generated from either the Dirichlet multinomial distribution or the probability ... combine word embeddings and GPU model for solving sparsity problem of short texts. 3. GPU-DMM. As its name suggests, the proposed GPU-DMM model is ... a document are generated by using the same topic distribution. Given a short text corpus of D documents, with a vocabulary of size V, ... + V ? (2) 3.2 Auxiliary Word Embeddings. Because of the length of short texts, the words with high semantic relatedness may not frequently ... document frequency of the word in the Web snippet dataset. This short text dataset contains 12,340 Web search snippets (see Section 4.1 for ...
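The DMM assumption quoted in these excerpts (each short text is drawn from a single latent topic, and all of its words come from that topic) can be illustrated with a small generative sketch. This is only an illustration under assumptions: the topic proportions `theta`, the topic-word distributions `phi`, the toy vocabulary, and the function name `sample_dmm_document` are hypothetical fixed values chosen for the example. The Dirichlet priors, the Gibbs sampler, and the GPU extension discussed in the paper are not modelled here.

```python
import random


def sample_dmm_document(theta, phi, vocab, length, rng):
    """Draw one topic for the whole document, then draw every word from that topic."""
    topic = rng.choices(range(len(theta)), weights=theta, k=1)[0]
    words = rng.choices(vocab, weights=phi[topic], k=length)
    return topic, words


# Toy parameters (assumed for illustration, not estimated from any data).
vocab = ["web", "server", "goal", "match"]
theta = [0.5, 0.5]                   # mixture weights over two topics
phi = [
    [0.5, 0.5, 0.0, 0.0],            # topic 0 emits only "web" and "server"
    [0.0, 0.0, 0.5, 0.5],            # topic 1 emits only "goal" and "match"
]
rng = random.Random(0)
docs = [sample_dmm_document(theta, phi, vocab, 5, rng) for _ in range(20)]
```

Because a single topic is drawn per document, no sampled document mixes words from the two toy topics, which is the contrast with models that draw a topic per word.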
... word embeddings learned from large corpus would significantly enhance topic modeling on short texts. 3.3 Incorporating Word Embeddings by GPU. It has been validated that ... the global word relatedness knowledge, leading to a generic solution for short text topic modeling. Formally, given pre-trained word embeddings, we measure the semantic relatedness ... may not be the best choice for topic modeling in short texts, explained next. Word Filtering. Based on DMM, GPU-DMM samples a short text ... a short text from a single topic. All words in the short text are assigned to the same topic. Take a short text "info website web cern consortium server project world copy first" as ...
scenario is prevalent in many short texts and could be harmful to DMM based topic models. In ...
... number of semantically related words for each word appear in the short text corpus. KV is the computation required for calculating all p(z = ... not all these semantically related words appear in the underlying short texts being modeled. In this sense, ? is expected to be relatively ...
... topics by directly modeling the generation of word co-occurrence patterns in the short text corpus [39]. In BTM, a biterm is an unordered word pair ... a short context. Self-Aggregation based Topic Model (SATM) assumes that each short text is sampled from a long pseudo-document unobserved in the current ... A hard constraint in LF-DMM is that all words in the short text corpus should have corresponding word embeddings. Therefore we remove all ... each document, which has been proven to be inappropriate for short texts [13, 30, 42]. Hence, we exclude this model from the comparison. We ...
... in Chinese word segmentation might introduce some performance variations. 4.4 Evaluation by Short Text Classification. With topic modeling, we can represent each document with its ...
... the direct topic assignment. This is not a desired method for short texts. In the rest of this paper, we only report the ...
... in Table 6 suggest that GPU-DMM is a desired choice for short text topic modeling, with respect to both effectiveness and efficiency. 5. CONCLUSION. Unlike normal ... modeling, with respect to both effectiveness and efficiency. 5. CONCLUSION. Unlike normal documents, short texts carry limited context information, causing severe sparsity problems when applying ... global word co-occurrence knowledge to help distil better topics over short texts. Instead of extracting word semantic relatedness knowledge from external ... topic similarity for two semantically related words which rarely co-occur in short texts. We conduct extensive experiments on two real-world short text corpora. The experimental results show that GPU-DMM outperforms existing state-of-the-art alternatives in ... In NIPS, 2009. [4] M. Chen, X. Jin, and D. Shen. Short text classification improved by learning multi-granularity topics. In IJCAI, 2011. [5] Z. Chen and ... X. Yan, Y. Lan, and J. Guo. BTM: topic modeling over short texts. IEEE Trans. Knowl. Data Eng., 2014. [9] R. Collobert and ...
... text clustering. In CIKM, 2011. [16] T. Kenter and M. de Rijke. Short text similarity with word embeddings. In CIKM, 2015. [17] M. J. Kusner, Y. ... B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text classification in twitter to improve information filtering. In SIGIR, 2010. [35] A. ... in twitter to improve information filtering. In SIGIR, 2010. [35] A. Sun. Short text classification using very few words. In SIGIR, 2012. [36] C. Wang and ... Guo, Y. Lan, and X. Chen. A biterm topic model for short texts. In WWW, 2013. [40] J. Yin and J. Wang. A ... Yin and J. Wang. A dirichlet multinomial mixture model-based approach for short text clustering. In SIGKDD, 2014. [41] X. Yunqing, T. Nan, H. Amir, and ...
16
October 2011
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
Publisher: ACM
Bibliometrics:
Citation Count: 35
Downloads (6 Weeks): 19, Downloads (12 Months): 115, Downloads (Overall): 1,249
Full text available:
PDF
With the rapid growth of social Web applications such as Twitter and online advertisements, the task of understanding short texts is becoming more and more important. Most traditional text mining techniques are designed to handle long text documents. For short text messages, many of the existing techniques are not effective ...
Keywords:
statistical generative models, short text, unsupervised learning
Title:
Transferring topical knowledge from auxiliary long texts for short text clustering
Keywords:
short text
Abstract:
... such as Twitter and online advertisements, the task of understanding short texts is becoming more and more important. Most traditional text mining ... mining techniques are designed to handle long text documents. For short text messages, many of the existing techniques are not effective due ... be utilized as the auxiliary data when mining the target short texts data. In this article, we present a novel approach to ... In this article, we present a novel approach to cluster short text messages via transfer learning from auxiliary long text data. We ... We show that while some previous work exists that enhance short text clustering with related long texts, most of them ignore the ... can obtain superior performance over several state-of-art techniques for clustering short text
References:
X. Hu, N. Sun, C. Zhang, and T.-S. Chua. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proceeding of the 18th ACM conference on Information and knowledge management, CIKM '09, pages 919--928. ACM, 2009.
X. Quan, G. Liu, Z. Lu, X. Ni, and L. Wenyin. Short text similarity based on probabilistic topics. Knowledge and Information Systems, 25:473--491, December 2010.
M. Sahami and T. D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th international conference on World Wide Web, WWW '06, pages 377--386. ACM, 2006.
B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text classification in twitter to improve information filtering. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '10, pages 841--842. ACM, 2010.
Full Text:
Transferring Topical Knowledge from Auxiliary Long Texts for Short Text Clustering. Ou Jin, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China, kingohm@apex.sjtu.edu.cn. Nathan N. Liu, Hong ... applications such as Twitter and online advertisements, the task of understanding short texts is becoming more and more important. Most traditional text mining techniques ... text mining techniques are designed to handle long text documents. For short text messages, many of the existing techniques are not effective due to ... be utilized as the auxiliary data when mining the target short texts ... data. In this article, we present a novel approach to cluster short text messages via transfer learning from auxiliary long text data. We show ... text data. We show that while some previous works for enhancing short text clustering with related long texts exist, most of them ignore the ... length to be less than 140 characters. Effective techniques for mining short texts are crucial to these application domains. While many successful text mining ... text corpus such as blogs or newswires. Directly applying these methods on short texts ... often leads to poor results [12]. Compared with long texts, short text mining has to address two inherent difficulties caused by their highly ... of context information in the text. To alleviate the sparseness of short texts, a common approach is to conduct "pseudo relevance feedback" ... to conduct "pseudo relevance feedback" in order to enrich the original short text corpus with an additional set of auxiliary data consisting of semantically ... the data or topics are often directly combined with the original short texts, which are then processed by some traditional text mining models. While ...
... unrelated long texts or topics could hurt the performance on the short texts. The problem can become even more serious for unsupervised learning ... (DLDA), which jointly learns a set of target topics on the short texts and another set of auxiliary topics on the long texts while ... the auxiliary long texts are only partially related to the input short texts. Clustering experiments on two real world data sets consisting ... advertisements and twitter messages demonstrated consistent improvements over existing methods for short text clustering, especially when there are many irrelevant documents in the auxiliary ... Text. Due to its importance in popular web applications such as Twitter, short text mining has attracted growing interests in recent years. Hong et al. ... which utilizes search engines to provide additional context for the given short text, just like query expansion techniques. This similarity measure can also ... knowledge base to topics to improve the representation of the short texts. The knowledge base is crawled with selected seeds from several ... on how to acquire auxiliary data in order to enrich short text. They generally make the implicit assumption that the auxiliary ... assumption that the auxiliary data are semantically related to the input short texts, which is hardly true in practice due to the noisy ... auxiliary data set. However, the model is not designed for handling short texts. Moreover, it makes a strong assumption that the word co-clusters ... (DLDA). Due to its high dimensional yet extremely sparse representations, clustering short texts directly based on the bag of words representation can be very ... capture the semantic relationships between documents. Directly learning topic models on short text is much harder than on traditional long text. For this ... long text in the same domain and then make inference on short text to help the learning task on short texts. However, in highly dynamic domains like Twitter where novel ...
... knowledge base such as Wikipedia. Furthermore, for application domains like advertisement, short texts and long texts are often used for very different purposes. As ... consistent and inconsistent topical structures across domains when learning topics from short texts with an additional set of auxiliary long texts. In the ... Setting and Background. Let $W^{tar} = \{\vec{w}^{tar}_m\}_{m=1}^{M^{tar}}$ denote the set of short texts from the target domain (i.e., domain of interests), which are to be ... Wikipedia, or extracted from some documents relevant to the target short texts. In this work, we attempt to transfer the topical knowledge ... [16] and our work attempt to apply topic model for characterizing short text document collections, the focus of our works and their objectives are ...
... (C2C) shopping Web site and a collection of tweets (TWEETs) from Twitter. The short texts in the ADs data set are a collection of text-based ... in the Table 3, from which we can see that the short texts in each data set contain a very small fraction of ... the number of words in the corresponding long texts. Moreover, the short texts in the TWEETs data set contain very few words on average ...
... that our model does not require any correspondence structures between the short texts and the long texts, such as those indicating which tweet contains which ... CLUTO 1 with the TF-IDF representation. LDA-short: Learning a LDA from short texts and directly clustering with the ? learned. LDA-long: Learning a LDA from ... Learning a LDA from long texts and then apply to short text, which is proposed in [12]. We use GibbsLDA++ 2, the ... Learning a LDA from the combined collection of long texts and short texts as proposed in [20]. Self-Taught Clustering (STC): This is a state ...
clustering the short texts gave the poorest performance. LDA-short outperformed Direct on both data sets, which ... auxiliary long texts. Interestingly, the LDA-long model, which ignores all the short texts, performs better than LDA-short. We believe that this is because ... LDA-short. We believe that this is because learning topic models on short texts is inherently much more difficult. Both variations of the proposed DLDA ...
... explicitly considering domain inconsistency, we can effectively improve the robustness of short text clustering with auxiliary long texts. When dealing with the auxiliary documents without ... paper, we presented a type of novel topic model for enhancing short text clustering by incorporating auxiliary long texts. The model jointly learns ... difference between auxiliary data and target data, the clustering quality on short text can be improved. In the future, we wish to evaluate other ...
... B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text classification in twitter to improve information filtering. In Proceeding of the 33rd ...
17
July 2010
SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Publisher: ACM
Bibliometrics:
Citation Count: 101
Downloads (6 Weeks): 92, Downloads (12 Months): 811, Downloads (Overall): 5,774
Full text available:
PDF
In microblogging services such as Twitter, the users may become overwhelmed by the raw data. One solution to this problem is the classification of short text messages. As short texts do not provide sufficient word occurrences, traditional classification methods such as "Bag-Of-Words" have limitations. To address this problem, we propose ...
Keywords:
classification, feature selection, short text, twitter
Title:
Short text classification in twitter to improve information filtering
Keywords:
short text
Abstract:
... data. One solution to this problem is the classification of short text messages. As short texts do not provide sufficient word occurrences, traditional classification methods such ...
References:
Banerjee, S., Ramanathan, K., and Gupta, A. Clustering short text using Wikipedia. In Proc. SIGIR (Amsterdam, The Netherlands, July 2007), 787--788.
Hu, X., Sun, N., Zhang, C., and Chua, T.-S. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proc. CIKM (Hong Kong, China, Nov. 2009), 919--928.
Full Text:
... data. One solution to this problem is the classification of short text messages. As short texts do not provide sufficient word occurrences, traditional classification methods such ... and Retrieval]: Information filtering. General Terms: Algorithms, Performance, Experimentation. Keywords: Short text, classification, Twitter, feature selection. 1. INTRODUCTION. Twitter1 is a ... held devices like smart phones. Existing works on classification of short text messages integrate messages with meta-information from other information sources such ... well when there is meta-information or the context of the short text is extended with knowledge extracted using large collections. We propose ...
... 861-862. [2] Banerjee, S., Ramanathan, K., and Gupta, A. Clustering short text using Wikipedia. In Proc. SIGIR (Amsterdam, The Netherlands, July 2007), ... T.-S. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proc. CIKM (Hong Kong, China, Nov. ...
18
February 2017
WSDM '17: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 58, Downloads (12 Months): 146, Downloads (Overall): 146
Full text available:
PDF
We aim at summarizing answers in community question-answering (CQA). While most previous work focuses on factoid question-answering, we focus on the non-factoid question-answering. Unlike factoid CQA, non-factoid question-answering usually requires passages as answers. The shortness, sparsity and diversity of answers form interesting challenges for summarization. To tackle these challenges, we ...
Keywords:
document summarization, community question-answering, short text processing, sparse coding
Keywords:
short text processing
Abstract:
... sentence is represented as a feature vector trained from a short text convolutional neural network model. We then use these sentence representations ...
References:
A. Severyn and A. Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In SIGIR, 2015.
H. Wang, Z. Lu, H. Li, and E. Chen. A dataset for research on short-text conversations. In EMNLP, 2013.
Y. Zhao, S. Liang, Z. Ren, J. Ma, E. Yilmaz, and M. de Rijke. Explainable user clustering in short text streams. In SIGIR, 2016.
Full Text:
... each sentence is represented as a feature vector trained from a short text convolutional neural network model. We then use these sentence representations to ... to state-of-the-art baselines in terms of ROUGE metrics. Keywords: Community question-answering; Sparse coding; Short text processing; Document summarization. * These three authors contributed equally to the ...
process using traditional representation of short text, based on term frequency or latent topic modeling [34]. (3) ... tackle the challenge of sparsity in sentence presentation, we apply short text convolutional neural networks (CNNs) to model sentences given the input ...
... texts using Wikipedia articles. • To tackle the sparsity problem in short texts, we apply convolutional neural networks to model sentences from ...
... in many language processing tasks, such as text classification [14] and short text ranking [39], by efficiently embedding sentences into low-dimensional vector space. Given a ...
... Lu, H. Li, and E. Chen. A dataset for research on short-text conversations. In EMNLP, 2013. [47] M. Wang. A survey of answer ...
19
November 2014
CIKM '14: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management
Publisher: ACM
Bibliometrics:
Citation Count: 7
Downloads (6 Weeks): 17, Downloads (12 Months): 147, Downloads (Overall): 512
Full text available:
PDF
Most existing approaches for text classification represent texts as vectors of words, namely "Bag-of-Words." This text representation results in a very high dimensionality of feature space and frequently suffers from surface mismatching. Short texts make these issues even more serious, due to their shortness and sparsity. In this paper, we ...
Keywords:
msn channel, short text classification, taxonomy knowledge, query recommendation
Title:
Concept-based Short Text Classification and Ranking
Keywords:
short text classification
Abstract:
... dimensionality of feature space and frequently suffers from surface mismatching. Short texts make these issues even more serious, due to their shortness ... and sparsity. In this paper, we propose using "Bag-of-Concepts" in short text representation, aiming to avoid the surface mismatching and handle the ... Based on "Bag-of-Concepts," a novel framework is proposed for lightweight short text classification applications. By leveraging a large taxonomy knowledgebase, it learns ... learns a concept model for each category, and conceptualizes a short text to a set of relevant concepts. A concept-based similarity mechanism ... A concept-based similarity mechanism is presented to classify the given short text to the most similar category. One advantage of this mechanism ... category. One advantage of this mechanism is that it facilitates short text ranking after classification, which is needed in many applications, such ...
References:
M. Chen, X. Jin, and D. Shen. Short text classification improved by learning multi-granularity topics. In IJCAI, pages 1776--1781. AAAI Press, 2011.
X. Hu, N. Sun, C. Zhang, and T.-S. Chua. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In CIKM, pages 919--928. ACM, 2009.
Y. Song, H. Wang, Z. Wang, H. Li, and W. Chen. Short text conceptualization using a probabilistic knowledgebase. In IJCAI, pages 2330--2336. AAAI Press, 2011.
A. Sun. Short text classification using very few words. In SIGIR, pages 1145--1146. ACM, 2012.
Z. Wang, H. Wang, and Z. Hu. Head, modifier, and constraint detection in short texts. In ICDE, pages 280--291, 2014.
Full Text:
Concept-based Short Text Classification and Ranking. Fang Wang, Zhongyuan Wang, Zhoujun Li ... dimensionality of feature space and frequently suffers from surface mismatching. Short texts make these issues even more serious, due to their shortness and ... shortness and sparsity. In this paper, we propose using "Bag-of-Concepts" in short text representation, aiming to avoid the surface mismatching and handle the synonym ... problem. Based on "Bag-of-Concepts," a novel framework is proposed for lightweight short text classification applications. By leveraging a large taxonomy knowledgebase, it learns a ... concepts. A concept-based similarity mechanism is presented to classify the given short text to the most similar category. One advantage of this mechanism is ... category. One advantage of this mechanism is that it facilitates short text ranking after classification, which is needed in many applications, such as ... in the text. Poor performance of BoW is unavoidable in Short Text Classification (STC), because short texts (e.g., search queries, tweets, or Facebook status) are sparse, noisy, and ... ambiguous. The drawback of surface-based similarity is even more serious in short texts, since two short texts with similar meanings do not necessarily ... necessarily share many words. Most existing work tries to expand the short text by leveraging search engines [31, 11, 37] or external knowledge bases ... still weak in semantics. "Bag-of-Words" limits text classification in many applications, especially in short texts or other lightweight online applications that require faster training and new ... and new words adaption. As an alternative, we propose representing the short text from a higher perspective of concepts, rather than directly using terms ...
... and "Honda" can be represented by Car. These generic concepts can benefit short text classification. For example, although the two short texts "Beyonce named People's ... Replace surface matching with semantic similarity: BoC measures the similarity between short texts from the higher concept level, rather than surface level, which ... word changing. We propose a novel framework based on "Bag-of-Concepts" for lightweight short-text oriented classification applications (BocSTC). Specifically, given the training texts per ... classified, the framework needs to understand its content and associate the short text with the relevant concepts (here we call this operation Conceptualization [35]). ... the conceptual expression, we propose a concept-based similarity mechanism for short text classification and ranking. The framework is suitable for many online lightweight ... follows: • We propose using "Bag-of-Concepts" as a replacement of "Bag-of-Words" in short text representation. This enables us to compute short text similarities at the semantic level of concepts. • Based on "Bag-of-Concepts," we ... level of concepts. • Based on "Bag-of-Concepts," we propose a novel framework for short text classification (BocSTC). Our framework is much more applicable for online lightweight ...
... we use the concept cluster in multiple ways including sense detection in short texts and subtopic representation in the targeted categories. 3. THE FRAMEWORK In this section, ... FRAMEWORK In this section, we present the novel framework (BocSTC) for lightweight short-text oriented classification applications (Fig. 3). It consists of two components: offline ... Figure 3: The framework of concept-based short-text classification and ranking. ... set. A large taxonomy knowledgebase is exploited for ...
... More details are given in subsection 3.1. At runtime, when a short text is coming, we need to first understand its main topics ... main topics by leveraging conceptualization. This enables us to translate the short text to "Bag-of-Concepts," so as to classify it in the same concept ... Finally, we get the concept model CM_l for the class CL_l. 3.2 Short Text Conceptualization. Short Text Conceptualization aims to abstract a set of most ... set of most representative concepts that can best describe the short text [35, 39]. To avoid over abstracting, specific entities are preferred during ... abstracting, specific entities are preferred during entity recognition from the short text. Therefore, we first detect all possible entities and then remove ... and then remove those contained by others. For example, given the short text "windows phone app," the recognized entity set will be {"windows phone," ...
... issue affecting the conceptualization accuracy. A typical example is to understand the short text "apple ipad", where "apple" has two senses, namely [footnote 3] Stop concepts generally ... the same sense with its near context. Formally, we conceptualize the short text st_i in the following two steps: Sense Detection: This step aims to ... this subsection, we describe the similarity-based mechanism to classify the given short text st_i and rank the items assigned to a class CL_l. 3.3.1 Classification. The ... their similarities to the user query. In our framework, as each short text st_i assigned to CL_l has a similarity score, we can rank ...
... concept clusters as subtopics of the class, we can diversify the short texts by subtopic Proportionality proposed by Dang et al [12]. 4. CHANNEL-BASED QUERY RECOMMENDATION. In this ... or any user preference logs; ii) a need to understand the short text (user query); iii) requiring both classification and ranking (how to ... BocSTC captures query topics from the concept level, by leveraging the short text conceptualization module mentioned in subsection 3.2. Query conceptualization infers typical concepts ...
... paper are mainly related to short text classification and query recommendation. 6.1 Short Text Classification. Short text classification delivers short texts (e.g., queries, tweets and comments) to some pre-defined categories based on ... extra context information through search engines [31, 11, 37]. The expanded short texts are regarded as long texts and can be classified with long ... a set of explicit or implicit topics and then connect the short text through these topics. Using pre-defined topics or taxonomy relaxes the dependence ...
... classification. 7. CONCLUSION. In this paper, we propose a novel framework for short text classification and ranking applications. Compared with existing approaches for short text classification, our framework has two advantages: i) It measures the ... it suitable for online lightweight applications that need to deal with short texts with requirements of fast learning and word changing adaption. We demonstrate ... TIST, 2(3):27, 2011. [7] M. Chen, X. Jin, and D. Shen. Short text classification improved by learning multi-granularity topics. In IJCAI, pages 1776--1781. AAAI Press, ...
... T.-S. Chua. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In CIKM, pages 919--928. ACM, 2009. [18] L. Huang. ... of semantic knowledge. In WWW, pages 697--706. ACM, 2007. [37] A. Sun. Short text classification using very few words. In SIGIR, pages 1145--1146. ACM, 2012. [38] ... H. Wang, and Z. Hu. Head, modifier, and constraint detection in short texts. In ICDE, pages 280--291, 2014. [40] W. Wu, H. Li, ...
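The "Bag-of-Concepts" steps described in the snippets for this entry (detect entities and drop those contained in others, map the short text to concepts, build one concept model per class, then classify by similarity) can be illustrated with a minimal Python sketch. This is not the authors' BocSTC implementation: the `TAXONOMY` dictionary, the function names, and the training examples are hypothetical stand-ins for the large taxonomy knowledgebase, and cosine similarity is assumed here because the snippets do not specify the similarity measure.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy taxonomy (entity -> concepts); the paper uses a large
# taxonomy knowledgebase, which this dictionary only imitates.
TAXONOMY = {
    "toyota": ["car"],
    "honda": ["car"],
    "beyonce": ["singer", "celebrity"],
    "windows": ["operating system"],
    "phone": ["device"],
    "windows phone": ["smartphone", "device"],
}


def detect_entities(text, max_len=3):
    """Detect all known entities, then remove those contained in others."""
    tokens = text.lower().split()
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in TAXONOMY:
                found.add(phrase)
    # Pad with spaces so containment is checked on whole words only.
    return {
        e for e in found
        if not any(e != o and f" {e} " in f" {o} " for o in found)
    }


def conceptualize(text):
    """Translate a short text into a bag of concepts."""
    bag = Counter()
    for entity in detect_entities(text):
        bag.update(TAXONOMY[entity])
    return bag


def cosine(a, b):
    """Cosine similarity between two concept bags (an assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train_concept_models(labelled_texts):
    """Build one concept model per class by summing the concept bags."""
    models = {}
    for text, label in labelled_texts:
        models.setdefault(label, Counter()).update(conceptualize(text))
    return models


def classify(text, models):
    """Return the most similar class and its similarity score."""
    bag = conceptualize(text)
    scores = {label: cosine(bag, model) for label, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented training data, for illustration only.
models = train_concept_models([
    ("toyota recall news", "autos"),
    ("beyonce new album", "music"),
])
label, score = classify("honda dealer", models)
```

Because each classified text carries a similarity score, the texts assigned to one class can be sorted by that score, which is the ranking step the snippets mention. Note that "honda" never appears in the invented training data; it is matched to the "autos" class only through the shared concept "car".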
20
August 2012
KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Publisher: ACM
Bibliometrics:
Citation Count: 25
Downloads (6 Weeks): 29, Downloads (12 Months): 274, Downloads (Overall): 1,293
Full text available:
PDF
Recent years have witnessed the explosive growth of online social media. Weibo, a Twitter-like online social network in China, has attracted more than 300 million users in less than three years, with more than 1000 tweets generated in every second. These tweets not only convey the factual information, but also ...
Keywords:
online social media, sentiment analysis, weibo, chinese short text
Keywords:
chinese short text
Full Text:
... China. Copyright 2012 ACM 978-1-4503-1462-6/12/08 ...$15.00. General Terms: Measurement, Experimentation. Keywords: Sentiment Analysis; Chinese Short Text; Online Social Media; Weibo. 1. INTRODUCTION. The development of online social ...