
Applications (4)


In this section we cover a few other types of applications of deep learning that are different from the standard object recognition, speech recognition and natural language processing tasks discussed above. Part III of this book will expand that scope even further to tasks that remain primarily research areas.

One of the major families of applications of machine learning in the information technology sector is the ability to make recommendations of items to potential users or customers. Two major types of applications can be distinguished: online advertising and item recommendations (often these recommendations are still for the purpose of selling a product). Both rely on predicting the association between a user and an item, either to predict the probability of some action (the user buying the product, or some proxy for this action) or the expected gain (which may depend on the value of the product) if an ad is shown or a recommendation is made regarding that product to that user.

Often, this association problem is handled like a supervised learning problem: given some information about the item and about the user, predict the proxy of interest (user clicks on ad, user enters a rating, user clicks on a “like” button, user buys product, user spends some amount of money on the product, user spends time visiting a page for the product, etc). This often ends up being either a regression problem (predicting some conditional expected value) or a probabilistic classification problem (predicting the conditional probability of some discrete event).

The early work on recommender systems relied on minimal information as inputs for these predictions: the user ID and the item ID. In this context, the only way to generalize is to rely on the similarity between the patterns of values of the target variable for different users or for different items. Suppose that user 1 and user 2 both like items A, B and C. From this, we may infer that user 1 and user 2 have similar tastes. If user 1 likes item D, then this should be a strong cue that user 2 will also like D. Algorithms based on this principle come under the name of collaborative filtering.

Both non-parametric approaches (such as nearest-neighbor methods based on the estimated similarity between patterns of preferences) and parametric methods are possible. Parametric methods often rely on learning a distributed representation (also called an embedding) for each user and for each item.
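As a minimal sketch of the non-parametric flavor (illustrative only; the function and variable names are assumptions, not from the text), a nearest-neighbor method might predict an unseen rating by averaging the ratings given by the most similar users:

```python
import numpy as np

def knn_predict(R, mask, user, item, k=5):
    """Predict R[user, item] as the mean rating of the k users most
    similar to `user` (cosine similarity over observed preference patterns)
    who have actually rated `item`."""
    raters = np.where(mask[:, item])[0]       # users who rated this item
    raters = raters[raters != user]
    if raters.size == 0:
        return R[mask].mean()                 # fall back to the global mean
    X = np.where(mask, R, 0.0)                # unobserved entries treated as 0
    sims = X[raters] @ X[user] / (
        np.linalg.norm(X[raters], axis=1) * np.linalg.norm(X[user]) + 1e-9)
    top = raters[np.argsort(sims)[-k:]]       # k most similar raters
    return R[top, item].mean()
```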

Bilinear prediction of the target variable (such as a rating) is a simple parametric method that is highly successful and often found as a component of state-of-the-art systems. The prediction is obtained by the dot product between the user embedding and the item embedding (possibly corrected by constants that depend only on either the user ID or the item ID). Let $\hat{\boldsymbol{R}}$ be the matrix containing our predictions, $\boldsymbol{A}$ a matrix with user embeddings in its rows, and $\boldsymbol{B}$ a matrix with item embeddings in its columns. Let $\boldsymbol{b}$ and $\boldsymbol{c}$ be vectors that contain respectively a kind of bias for each user (representing how grumpy or positive that user is in general) and for each item (representing its general popularity). The bilinear prediction is thus obtained as follows:

$$\hat{R}_{u,i} = b_u + c_i + \sum_j A_{u,j} B_{j,i} \tag{12.20}$$
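In code, equation 12.20 is just a biased matrix product. The following sketch (NumPy; the dimensions and variable names are assumptions for illustration) computes $\hat{\boldsymbol{R}}$ for all user-item pairs at once:

```python
import numpy as np

n_users, n_items, d = 100, 50, 8          # d = embedding dimension (assumed)
rng = np.random.default_rng(0)

A = rng.normal(size=(n_users, d))         # user embeddings in rows
B = rng.normal(size=(d, n_items))         # item embeddings in columns
b = rng.normal(size=n_users)              # per-user bias ("grumpiness")
c = rng.normal(size=n_items)              # per-item bias (popularity)

# R_hat[u, i] = b[u] + c[i] + sum_j A[u, j] * B[j, i]   (equation 12.20)
R_hat = b[:, None] + c[None, :] + A @ B
```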

Typically one wants to minimize the squared error between predicted ratings $\hat{R}_{u,i}$ and actual ratings $R_{u,i}$. User embeddings and item embeddings can then be conveniently visualized when they are first reduced to a low dimension (two or three), or they can be used to compare users or items against each other, just like word embeddings.

One way to obtain these embeddings is by performing a singular value decomposition of the matrix $\boldsymbol{R}$ of actual targets (such as ratings). This corresponds to factorizing $\boldsymbol{R} = \boldsymbol{U}\boldsymbol{D}\boldsymbol{V}'$ (or a normalized variant) into the product of two factors, the lower-rank matrices $\boldsymbol{A} = \boldsymbol{U}\boldsymbol{D}$ and $\boldsymbol{B} = \boldsymbol{V}'$. One problem with the SVD is that it treats the missing entries in an arbitrary way, as if they corresponded to a target value of 0. Instead, we would like to avoid paying any cost for the predictions made on missing entries. Fortunately, the sum of squared errors on the observed ratings can also be easily minimized by gradient-based optimization.
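A minimal sketch of this idea (the function name and hyperparameters are illustrative, assuming a boolean mask of observed entries): gradient descent on the squared error restricted to observed ratings, so missing entries contribute no cost:

```python
import numpy as np

def factorize_observed(R, mask, d=8, lr=0.01, steps=1000, seed=0):
    """Fit R ~ A @ B by gradient descent on squared error over observed
    entries only (mask == True); biases are omitted for brevity."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    A = 0.1 * rng.normal(size=(n_users, d))
    B = 0.1 * rng.normal(size=(d, n_items))
    for _ in range(steps):
        E = np.where(mask, A @ B - R, 0.0)  # residuals; zero cost on missing entries
        gA, gB = E @ B.T, A.T @ E           # gradients of 0.5 * sum(E**2)
        A -= lr * gA
        B -= lr * gB
    return A, B
```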

The SVD and the bilinear prediction of equation 12.20 both performed very well in the competition for the Netflix prize (Bennett and Lanning, 2007), which aimed at predicting ratings for films based only on previous ratings by a large set of anonymous users. Many machine learning experts participated in this competition, which took place between 2006 and 2009. It raised the level of research in recommender systems using advanced machine learning and yielded improvements in recommender systems. Even though it did not win by itself, the simple bilinear prediction or SVD was a component of the ensemble models presented by most of the competitors, including the winners (Töscher et al., 2009; Koren, 2009).

Beyond these bilinear models with distributed representations, one of the first uses of neural networks for collaborative filtering is based on the restricted Boltzmann machine (RBM) undirected probabilistic model (Salakhutdinov et al., 2007). RBMs were an important element of the ensemble of methods that won the Netflix competition (Töscher et al., 2009; Koren, 2009). More advanced variants on the idea of factorizing the ratings matrix have also been explored in the neural networks community (Salakhutdinov and Mnih, 2008).

However, there is a basic limitation of collaborative filtering systems: when a new item or a new user is introduced, its lack of rating history means that there is no way to evaluate its similarity with other items or users (respectively), or the degree of association between, say, that new user and existing items. This is called the problem of cold-start recommendations. A general way of solving the cold-start recommendation problem is to introduce extra information about the individual users and items. For example, this extra information could be user profile information or features of each item. Systems that use such information are called content-based recommender systems. The mapping from a rich set of user features or item features to an embedding can be learned through a deep learning architecture (Huang et al., 2013; Elkahky et al., 2015).

Specialized deep learning architectures such as convolutional networks have also been applied to learn to extract features from rich content such as musical audio tracks for music recommendation (van den Oord et al., 2013). In that work, the convolutional net takes acoustic features as input and computes an embedding for the associated song. The dot product between this song embedding and the embedding for a user is then used to predict whether a user will listen to the song.

When making recommendations to users, an issue arises that goes beyond ordinary supervised learning and into the realm of reinforcement learning. Many recommendation problems are most accurately described theoretically as contextual bandits (Langford and Zhang, 2008; Lu et al., 2010). The issue is that when we use the recommendation system to collect data, we get a biased and incomplete view of the preferences of users: we only see the responses of users to the items they were recommended and not to the other items. In addition, in some cases we may not get any information on users for whom no recommendation has been made (for example, with ad auctions, it may be that the price proposed for an ad was below a minimum price threshold, or did not win the auction, so the ad is not shown at all). More importantly, we get no information about what outcome would have resulted from recommending any of the other items. This would be like training a classifier by picking one class $\hat{y}$ for each training example $\boldsymbol{x}$ (typically the class with the highest probability according to the model) and then only getting as feedback whether this was the correct class or not. Clearly, each example conveys less information than in the supervised case, where the true label $y$ is directly accessible, so more examples are necessary.

Worse, if we are not careful, we could end up with a system that continues picking the wrong decisions even as more and more data is collected, because the correct decision initially had a very low probability: until the learner picks that correct decision, it does not learn about the correct decision. This is similar to the situation in reinforcement learning where only the reward for the selected action is observed.

In general, reinforcement learning can involve a sequence of many actions and many rewards. The bandits scenario is a special case of reinforcement learning, in which the learner takes only a single action and receives a single reward. The bandit problem is easier in the sense that the learner knows which reward is associated with which action. In the general reinforcement learning scenario, a high reward or a low reward might have been caused by a recent action or by an action in the distant past. The term contextual bandits refers to the case where the action is taken in the context of some input variable that can inform the decision. For example, we at least know the user identity, and we want to pick an item. The mapping from context to action is also called a policy. The feedback loop between the learner and the data distribution (which now depends on the actions of the learner) is a central research issue in the reinforcement learning and bandits literature.
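As an illustrative sketch (not from the original text), a simple parametric policy for a contextual bandit might score each candidate action linearly in the context vector and pick the highest-scoring one:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, d_context = 5, 8                  # e.g. candidate items, user-feature dim
W = rng.normal(size=(n_actions, d_context))  # one learned score vector per action

def policy(x):
    """Map a context vector x (e.g. user features) to an action by
    greedily picking the action with the highest estimated reward."""
    return int(np.argmax(W @ x))
```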

Reinforcement learning requires choosing a tradeoff between exploration and exploitation. Exploitation refers to taking actions that come from the current, best version of the learned policy: actions that we know will achieve a high reward.

Exploration refers to taking actions specifically in order to obtain more training data. If we know that given context $\boldsymbol{x}$, action $a$ gives us a reward of 1, we do not know whether that is the best possible reward. We may want to exploit our current policy and continue taking action $a$ in order to be relatively sure of obtaining a reward of 1. However, we may also want to explore by trying action $a'$. We do not know what will happen if we try action $a'$. We hope to get a reward of 2, but we run the risk of getting a reward of 0. Either way, we at least gain some knowledge.

Exploration can be implemented in many ways, ranging from occasionally taking random actions intended to cover the entire space of possible actions, to model-based approaches that compute a choice of action based on its expected reward and the model’s amount of uncertainty about that reward.
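The simplest strategy of the first kind, epsilon-greedy action selection, might look like this (a generic sketch; the names are assumptions):

```python
import numpy as np

def epsilon_greedy(estimated_rewards, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon take a uniformly random action (exploration);
    otherwise take the action with the highest estimated reward (exploitation)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(estimated_rewards)))  # explore
    return int(np.argmax(estimated_rewards))              # exploit
```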

Many factors determine the extent to which we prefer exploration or exploitation. One of the most prominent factors is the time scale we are interested in. If the agent has only a short amount of time to accrue reward, then we prefer more exploitation. If the agent has a long time to accrue reward, then we begin with more exploration so that future actions can be planned more effectively with more knowledge. As time progresses and our learned policy improves, we move toward more exploitation.

Supervised learning has no tradeoff between exploration and exploitation, because the supervision signal always specifies which output is correct for each input. There is no need to try out different outputs to determine whether one is better than the model's current output; we always know that the label is the best output.

Another difficulty arising in the context of reinforcement learning, besides the exploration-exploitation tradeoff, is the difficulty of evaluating and comparing different policies. Reinforcement learning involves interaction between the learner and the environment. This feedback loop means that it is not straightforward to evaluate the learner's performance using a fixed test set of input values. The policy itself determines which inputs will be seen. Dudik et al. (2011) present techniques for evaluating contextual bandits.

Deep learning approaches have been very successful in language modeling, machine translation and natural language processing due to the use of embeddings for symbols (Rumelhart et al., 1986a) and words (Deerwester et al., 1990; Bengio et al., 2001). These embeddings represent semantic knowledge about individual words and concepts. A research frontier is to develop embeddings for phrases and for relations between words and facts. Search engines already use machine learning for this purpose but much more remains to be done to improve these more advanced representations.

One interesting research direction is determining how distributed representations can be trained to capture the relations between two entities. These relations allow us to formalize facts about objects and how objects interact with each other.

In mathematics, a binary relation is a set of ordered pairs of objects. Pairs that are in the set are said to have the relation, while those that are not in the set do not.

For example, we can define the relation "is less than" on the set of entities $\{1, 2, 3\}$ by defining the set of ordered pairs $\mathbb{S} = \{(1,2), (1,3), (2,3)\}$. Once this relation is defined, we can use it like a verb. Because $(1,2) \in \mathbb{S}$, we say that 1 is less than 2. Because $(2,1) \notin \mathbb{S}$, we cannot say that 2 is less than 1. Of course, the entities that are related to one another need not be numbers. We could define a relation is_a_type_of containing tuples like (dog, mammal).
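In code, such a relation is literally a set of ordered pairs, for example (an illustrative sketch):

```python
# The "is less than" relation on {1, 2, 3}, stored as a set of ordered pairs.
is_less_than = {(1, 2), (1, 3), (2, 3)}
assert (1, 2) in is_less_than        # 1 is less than 2
assert (2, 1) not in is_less_than    # we cannot say 2 is less than 1

# Entities need not be numbers:
is_a_type_of = {("dog", "mammal")}
```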

In the context of AI, we think of a relation as a sentence in a syntactically simple and highly structured language. The relation plays the role of a verb, while two arguments to the relation play the role of its subject and object. These sentences take the form of a triplet of tokens (subject, verb, object), with values (entity$_i$, relation$_j$, entity$_k$).

We can also define an attribute, a concept analogous to a relation, but taking only one argument: (entity$_i$, attribute$_j$). For example, we could define the has_fur attribute and apply it to entities like dog.

Many applications require representing relations and reasoning about them. How should we best do this within the context of neural networks?

Machine learning models of course require training data. We can infer relations between entities from training datasets consisting of unstructured natural language. There are also structured databases that identify relations explicitly. A common structure for these databases is the relational database, which stores this same kind of information, albeit not formatted as three-token sentences. When a database is intended to convey commonsense knowledge about everyday life or expert knowledge about an application area to an artificial intelligence system, we call the database a knowledge base. Knowledge bases range from general ones like Freebase, OpenCyc, WordNet, or Wikibase, to more specialized knowledge bases like GeneOntology. Representations for entities and relations can be learned by considering each triplet in a knowledge base as a training example and maximizing a training objective that captures their joint distribution (Bordes et al., 2013a).

In addition to training data, we also need to define a model family to train. A common approach is to extend neural language models to model entities and relations. Neural language models learn a vector that provides a distributed representation of each word. They also learn about interactions between words, such as which word is likely to come after a sequence of words, by learning functions of these vectors. We can extend this approach to entities and relations by learning an embedding vector for each relation. In fact, the parallel between modeling language and modeling knowledge encoded as relations is so close that researchers have trained representations of such entities by using both knowledge bases and natural language sentences (Bordes et al., 2011, 2012; Wang et al., 2014a) or combining data from multiple relational databases (Bordes et al., 2013b). Many possibilities exist for the particular parametrization associated with such a model. Early work on learning about relations between entities (Paccanaro and Hinton, 2000) posited highly constrained parametric forms ("linear relational embeddings"), often using a different form of representation for the relation than for the entities. For example, Paccanaro and Hinton (2000) and Bordes et al. (2011) used vectors for entities and matrices for relations, with the idea that a relation acts like an operator on entities. Alternatively, relations can be considered as any other entity (Bordes et al., 2012), allowing us to make statements about relations, but more flexibility is put in the machinery that combines them in order to model their joint distribution.
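As one concrete example of such a parametrization (a sketch under assumed names and dimensions, in the spirit of translation-based models such as TransE, Bordes et al., 2013a), both entities and relations get embedding vectors, and a triplet is scored by how well the relation vector translates the subject onto the object:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding dimension (assumed)
entities = {name: rng.normal(size=d) for name in ("dog", "cat", "mammal")}
relations = {name: rng.normal(size=d) for name in ("is_a_type_of",)}

def score(subject, relation, obj):
    """Higher (less negative) score = more plausible fact. After training
    (not shown), true triplets should satisfy subject + relation ~ object."""
    e_s, e_r, e_o = entities[subject], relations[relation], entities[obj]
    return -np.linalg.norm(e_s + e_r - e_o)
```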

A practical short-term application of such models is link prediction: predicting missing arcs in the knowledge graph. This is a form of generalization to new facts based on old facts. Most of the knowledge bases that currently exist have been constructed through manual labor, which tends to leave many, and probably the majority, of true relations absent from the knowledge base. See Wang et al. (2014b), Lin et al. (2015) and Garcia-Duran et al. (2015) for examples of such an application.

Evaluating the performance of a model on a link prediction task is difficult because we have only a dataset of positive examples (facts that are known to be true). If the model proposes a fact that is not in the dataset, we are unsure whether the model has made a mistake or discovered a new, previously unknown fact. The metrics are thus somewhat imprecise and are based on testing how the model ranks a held-out set of known true positive facts compared to other facts that are less likely to be true.

A common way to construct interesting examples that are probably negative (facts that are probably false) is to begin with a true fact and create corrupted versions of that fact, for example by replacing one entity in the relation with a different entity selected at random. The popular precision at 10% metric counts how many times the model ranks a "correct" fact among the top 10% of all corrupted versions of that fact.
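A sketch of this evaluation (assuming a `score` function like the one above; the names and the ranking convention are illustrative):

```python
def precision_at_10pct(true_facts, all_entities, score):
    """Fraction of true (s, r, o) facts ranked in the top 10% of the pool
    formed by the fact itself plus all object-corrupted versions of it."""
    hits = 0
    for s, r, o in true_facts:
        true_score = score(s, r, o)
        corrupted = [score(s, r, e) for e in all_entities if e != o]
        rank = 1 + sum(c > true_score for c in corrupted)
        if rank <= 0.10 * (len(corrupted) + 1):
            hits += 1
    return hits / len(true_facts)
```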

Another application of knowledge bases and distributed representations for them is word-sense disambiguation (Navigli and Velardi, 2005; Bordes et al., 2012), which is the task of deciding which of the senses of a word is the appropriate one in some context.

Eventually, knowledge of relations combined with a reasoning process and understanding of natural language could allow us to build a general question answering system. A general question answering system must be able to process input information and remember important facts, organized in a way that enables it to retrieve and reason about them later. This remains a difficult open problem which can only be solved in restricted “toy” environments. Currently, the best approach to remembering and retrieving specific declarative facts is to use an explicit memory mechanism, as described in section 10.12. Memory networks were first proposed to solve a toy question answering task (Weston et al., 2014). Kumar et al. (2015) have proposed an extension that uses GRU recurrent nets to read the input into the memory and to produce the answer given the contents of the memory.

This concludes part II, which has described modern practices involving deep networks, comprising all of the most successful methods. Generally speaking, these methods involve using the gradient of a cost function to find the parameters of a model that approximates some desired function. With enough training data, this approach is extremely powerful. We now turn to part III, in which we step into the territory of research: methods that are designed to work with less training data or to perform a greater variety of tasks, where the challenges are more difficult and not as close to being solved as the situations we have described so far.
