Towards Understanding End-of-trip Instructions in a Taxi Ride Scenario


1 Towards Understanding End-of-trip Instructions in a Taxi Ride Scenario
Deepthi Karkada*1, Ramesh Manuvinakurike*2, Kallirroi Georgila2 (* equal contributions; 1 Intel Corp.; 2 USC Institute for Creative Technologies). Ongoing project; feedback and ideas welcome.

2 Motivation End-of-trip scenario in a taxi ride
(e.g., Uber, Lyft, taxi cab, etc.). Before the final 'bye', the last few exchanges in a taxi ride are usually the rider telling the driver where they would prefer to be dropped off. Driver: Where would you like me to stop? Passenger: Could you please drop me off in front of the white car?

3 Motivation Understanding [and responding to] such instructions requires complex vision and language understanding capabilities. Driver: Where would you like me to stop? Passenger: Could you please drop me off in front of the white car?

4 Motivation Understanding [and responding to] such instructions requires complex vision and language understanding capabilities. Vision component: object labeling, position identification, etc. [Scene image with labeled objects: white van, blue car, street sign, black car, white car, red car] Driver: Where would you like me to stop? Passenger: Could you please drop me off in front of the white car?

5 Motivation Understanding [and responding to] such instructions requires complex vision and language understanding capabilities. Vision component: object labeling, position identification, etc. Language component: dialogue-act classification (e.g., Request); parameter identification (e.g., Action, Referent, etc.). [Scene image with labeled objects: blue car, white van, street sign, black car, white car, red car] Driver: Where would you like me to stop? Passenger: Could you please [ACTION: drop me off] [DD: in front of] [REF: the white car]?

6 Motivation Understanding [and responding to] such instructions requires complex vision and language understanding capabilities. Vision component: object labeling, position identification, etc. Language component: dialogue-act classification (e.g., Request); parameter identification (e.g., Action, Referent, etc.). Combining information from the vision and language modalities: referent identification; target-location identification. [Scene image with labeled objects: blue car, white van, street sign, black car, white car, red car] Driver: Where would you like me to stop? Passenger: Could you please [ACTION: drop me off] [DD: in front of] [REF: the white car]?
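To illustrate the parameter-identification output format, here is a minimal rule-based sketch in Python. The lexicons and the tagging heuristic are hypothetical illustrations; the actual system would learn these spans rather than match them with fixed patterns.

```python
import re

# Hypothetical lexicons for illustration only.
ACTIONS = r"(drop me off|stop|park|pull over)"
DIRECTIONS = r"(in front of|behind|next to|across from)"

def tag_instruction(utterance):
    """Tag an end-of-trip instruction with ACTION / DD / REF spans."""
    text = utterance.lower().rstrip("?.! ")
    action = re.search(ACTIONS, text)
    direction = re.search(DIRECTIONS, text)
    if not (action and direction):
        return None  # dialogue act is not a drop-off Request
    # Everything after the directional description is taken as the referent.
    return {"ACTION": action.group(),
            "DD": direction.group(),
            "REF": text[direction.end():].strip()}

print(tag_instruction("Could you please drop me off in front of the white car?"))
# {'ACTION': 'drop me off', 'DD': 'in front of', 'REF': 'the white car'}
```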

7 End-of-trip in a taxi ride
Outline: Images data collection (synthetic images; real-world images). Instructions data collection. Annotations. Models (referent identification). Conclusions & future work (destination identification; real-world images dataset).

8 End-of-trip in a taxi ride
Outline: Images data collection (synthetic images; real-world images). Instructions data collection. Annotations. Models (referent identification). Conclusions & future work (destination identification; real-world images dataset). In this work, we're interested only in single-utterance user instructions, e.g.: 'Stop right in front of the cop car', 'Please park behind the street sign', etc.

9 Images construction (Synthetic)
Synthetic images: constructed from vehicle and object templates (vehicles: limo, generic car, van, taxi, cop car; sidewalk objects: street lamp, tree, fire hydrant, safety cone). Bird's-eye view. Objects are placed on the sidewalk; vehicles are parked on either side of the street in a left-hand-drive scenario; the target location is placed on the street randomly. Pros: abstracts away the vision problem; easier to construct large sets of images with known configurations of objects. Cons: not real; doesn't capture the dynamics of a real-world scenario. [Example synthetic image with target location]
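The synthetic-image construction described above can be sketched as a small sampling procedure. The template names are taken from the slide; the layout logic (fixed parking slots per side, two sidewalk objects) is an assumption for illustration.

```python
import random

# Template names from the slide; layout logic is an assumption.
VEHICLES = ["limo", "generic car", "van", "taxi", "cop car"]
SIDEWALK_OBJECTS = ["street lamp", "tree", "fire hydrant", "safety cone"]

def generate_scene(slots_per_side=5, seed=None):
    """Sample a bird's-eye street scene: vehicles parked on either side
    of the street, two sidewalk objects, and a random target location."""
    rng = random.Random(seed)
    return {
        "left_side": [rng.choice(VEHICLES) for _ in range(slots_per_side)],
        "right_side": [rng.choice(VEHICLES) for _ in range(slots_per_side)],
        "sidewalk": rng.sample(SIDEWALK_OBJECTS, 2),
        # Target encoded as (side of street, parking-slot index).
        "target": (rng.choice(["left", "right"]), rng.randrange(slots_per_side)),
    }

scene = generate_scene(seed=42)
```

Seeding the generator makes each image configuration reproducible, which is what makes large sets with known object configurations cheap to build.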

10 Images construction (Real)
Real images: constructed using Google Street View; target locations placed manually on the street at random. Pros: real scenes that capture the dynamics of a real-world scenario. Cons: expensive dataset construction (image annotations, target-location identification, etc.).

11 User instructions collection
We used the Amazon Mechanical Turk (AMT) crowdsourcing platform for the corpus collection. The AMT users, called turkers, are presented with a hypothetical scenario: they're asked to imagine a taxi ride in which the location where they prefer the taxi to stop is marked by a red cross. If they were to instruct the driver, how would they do it? Three such descriptions were collected for variety.

12 A typical description
[ACTION: Please stop] [DIRECTION: in front of] [REFERENT: the cop car]
[ACTION: Park] [DIRECTION: behind] [REFERENT: the blue pickup truck]

13 Sample data from the synthetic 2D images

14 Sample data from the real-world 3D street images
Target location description samples with annotations:
Stop next to the first white car you see. → [ACTION: Stop] [DD: next to] [REF: the first white car you see]
Stop next to the car behind the blue car. → [ACTION: Stop] [DD: next to] [REF: the car behind the blue car]
Stop next to the white car. → [ACTION: Stop] [DD: next to] [REF: the white car]

15 Directional descriptions
Natural language description annotation statistics:
            Actions  Referents  Directional descriptions
Synthetic     273      408              372
Real-world    173      217              219
The inter-rater reliability for word-level annotations (Cohen's kappa) is 0.81.
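The reported inter-rater reliability (Cohen's kappa = 0.81) can be computed for any pair of word-level annotation sequences with a short function. The label sequences below are toy examples, not the actual annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' word-level label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two annotators' marginal label frequencies.
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Toy sequences for illustration:
a = ["ACTION", "DD", "REF", "REF", "O"]
b = ["ACTION", "DD", "REF", "O", "O"]
print(round(cohens_kappa(a, b), 2))  # 0.74
```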

16 Baseline model: Task pipeline
Task pipeline for identification of the target location using the user descriptions:

17 Referent identification
[Image: which object is the 'blue car'?] Identify the vehicle/object described among the distractors. Once the referent (language) has been identified, we locate the object (vision). The object ground-truth labels are used. Embeddings are extracted for the visual objects and the referents.
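A minimal sketch of matching a referent phrase to object labels via embeddings. The toy word vectors below stand in for the pretrained embeddings used in this work; the vectors, the averaging scheme, and the argmax decision are illustrative assumptions, not the actual model.

```python
import math

# Made-up word vectors standing in for pretrained embeddings.
EMB = {
    "white": [1.0, 0.0, 0.2], "blue": [0.0, 1.0, 0.2],
    "car":   [0.3, 0.3, 1.0], "van":  [0.2, 0.2, 0.9],
    "the":   [0.1, 0.1, 0.1],
}

def embed(phrase):
    """Average the word vectors of a phrase, skipping unknown words."""
    vecs = [EMB[w] for w in phrase.lower().split() if w in EMB]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def resolve_referent(referent, object_labels):
    """Pick the scene object whose label embedding is closest to the referent."""
    ref_vec = embed(referent)
    return max(object_labels, key=lambda lab: cosine(ref_vec, embed(lab)))

print(resolve_referent("the white car", ["blue car", "white van", "white car"]))
# white car
```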


19 Comparison of reference-resolution approaches
The embedding-based approach performs better than a simple substring-matching method. Embeddings trained on a Wikipedia corpus perform better than embeddings trained on the in-domain utterances.
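The simple substring-matching baseline mentioned above can be sketched as follows; its failure on paraphrases (e.g., "vehicle" for "car") illustrates why the embedding-based method helps. This is a plausible reconstruction of the baseline, not the exact implementation.

```python
def substring_match(referent, object_labels):
    """Naive baseline: return the first object label that occurs verbatim
    in the referent phrase, else None."""
    ref = referent.lower()
    for label in object_labels:
        if label.lower() in ref:
            return label
    return None

print(substring_match("the first white car you see", ["blue car", "white car"]))
# white car
print(substring_match("the vehicle by the hydrant", ["blue car", "white car"]))
# None  <- no label occurs verbatim, so the baseline fails on paraphrases
```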

20 A typical description Directional description:
Describes the direction to stop w.r.t. the referent.

21 A typical description Directional description:
Describes the direction to stop w.r.t. the referent. The same description can refer to multiple directions (example: 'next to'). Context, such as the direction of motion, is important.

22 A typical description Directional description:
Describes the direction to stop w.r.t. the referent. The same description can refer to multiple directions (example: 'next to'). Context, such as the direction of motion, is important. The approach we will take is region prediction; predicting regions from the directional descriptions and the referents is challenging.

23 Future work: Target location/region identification
Once the referent is identified, the goal is to identify the target location. The target location is a function of r (distance from the referent) and θ (direction from the referent).
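Under this formulation, the target location is a polar offset from the referent's position. A sketch, assuming a 2D coordinate frame with θ in radians (the coordinate convention is an assumption):

```python
import math

def target_location(referent_xy, r, theta):
    """Target position as a polar offset from the referent:
    r = distance, theta = direction (radians) from the referent."""
    x, y = referent_xy
    return (x + r * math.cos(theta), y + r * math.sin(theta))

# e.g. 3 m directly "in front of" a referent at (10, 2), with theta = 0
# taken as the direction of motion:
print(target_location((10.0, 2.0), r=3.0, theta=0.0))  # (13.0, 2.0)
```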

24 Future work We're extending this work with a multi-turn dialogue corpus. Use the BDD corpus for the real-world setting, as its images are pre-annotated. Transfer learning: identify car brands, colors, etc. BDD: Berkeley DeepDrive dataset (Yu et al., 2018); Cars dataset (Krause et al., 2018).

25 Contributions A novel corpus containing user descriptions of target locations for synthetic and real-world street images. The natural language description annotations along with the visual annotations for the task of target location prediction. A baseline model for the task of identification of referents from user descriptions.

26 Thank you Special thanks for discussions: David Traum, Ron Artstein, Maike Paetzel

