1478849 - AddressSanitizer: heap-use-after-free [@ IPC::Channel::Unsound_IsClosed] with READ of size 8

Reporter

Description

•

6 years ago

The attached crash information was submitted via the ASan Nightly Reporter on mozilla-central-asan-nightly revision 63.0a1-20180725103029-https://hg.mozilla.org/mozilla-central/rev/0b9c0b211eb3d84c5afdb64f4a91d079d0aa4c75.

For detailed crash information, see attachment.

Christian Holler (:decoder)

Reporter

Comment 1

•

6 years ago

Attached file Detailed Crash Information — Details

Daniel Veditz [:dveditz]

Updated

•

6 years ago

Group: core-security → media-core-security

Christian Holler (:decoder)

Reporter

Updated

•

6 years ago

No longer blocks: asan-nightly-project

Andrew McCreight [:mccr8]

Comment 3

•

6 years ago

It looks like maybe ~GMPServiceChild is touching a channel after it was destroyed, via mContentParents. Maybe ~ContentParent needs to call RemoveGMPContentParent()?

Nils, any idea who could look at this?

Flags: needinfo?(drno)

Nils Ohlmeier [:drno]

Updated

•

6 years ago

Flags: needinfo?(drno)

Nils Ohlmeier [:drno]

Comment 4

•

6 years ago

Byron, could you please take a look at this?
Ask cpearce for advice if needed.

Flags: needinfo?(docfaraday)

Nils Ohlmeier [:drno]

Updated

•

6 years ago

Rank: 8

Priority: -- → P1

Byron Campen [:bwc]

Comment 5

•

6 years ago

Wow, look at all those bare pointers in the IPC code.

Byron Campen [:bwc]

Comment 6

•

6 years ago

So, it is extremely hard to tell what is going on here, but I have a guess. The only place we create a DeleteTask<IPC::Channel> is here:

https://searchfox.org/mozilla-central/source/ipc/glue/ProtocolUtils.cpp#691

GMPContentParent, the thing we're destroying on main (the crashing thread) happens to be an IToplevelProtocol. What I think has happened is we are destroying a GMPContentParent, and are currently running the d'tor code for the IToplevelProtocol. This dispatches a DeleteTask<IPC::Channel> to Chrome_~dThread. Most of the time, this is fine, because that runnable takes a little bit of time to run. But when it is executed before the GMPContentParent is done tearing down, we get a UAF.

This is a bug in the IPC code, I think.

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Comment 7

•

6 years ago

The quick-and-dirty fix to this particular crash would be to destroy |mState| before dispatching the DeleteTask, I think. But given the web of bare pointers in this code, I am not sure.

Nils Ohlmeier [:drno]

Comment 8

•

6 years ago

Jim, could you help me to find an IPC expert have a look at this and Byron guess work in comment #6?

Flags: needinfo?(jmathies)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 9

•

6 years ago

I've looked at the code, and comment #6 looks right.  The ToplevelState contains the MessageChannel, whose destructor looks at MessageChannel::mLink, which in this case is-a ProcessLink, which has a raw pointer (“weak reference”) to the Transport a.k.a. IPC::Channel.

So that means that IToplevelProtocol::mTrans (owning reference to the Transport) must outlive that object's IProtocol::mState (borrowed reference to the Transport).  And it doesn't, but comment #6's suggestion of clearing mState before sending the runnable looks reasonable.

Also, IToplevelProtocol::mTrans itself doesn't seem to be used much; it looks doable to move the Transport ownership into ProcessLink.  That might want to be a followup instead.

(Maybe not directly relevant, but I see that bug 1283744 has some historical context on Transport management.)

Assignee: nobody → jld

Component: Audio/Video → IPC

Flags: needinfo?(jmathies)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 10

•

6 years ago

(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #9)
> Also, IToplevelProtocol::mTrans itself doesn't seem to be used much; it
> looks doable to move the Transport ownership into ProcessLink.  That might
> want to be a followup instead.

Not quite.  The transport owner isn't always IToplevelProtocol::mTrans; for the initial actor (like PContent or PGPU) it seems to be 
ChildThread::channel_ on the child side and ChildProcessHost::channel_ on the parent side.  IToplevelProtocol::mTrans appears to be used only for toplevel actors created via Endpoint::Bind.  Simplifying that is *definitely* outside the scope of this bug.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 11

•

6 years ago

Attached patch Patch — Details — Splinter Review

Attachment #9001807 - Flags: review?(continuation)

Andrew McCreight [:mccr8]

Updated

•

6 years ago

Attachment #9001807 - Flags: review?(continuation) → review+

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 12

•

6 years ago

About the possibility of regressions from other raw-pointer issues, I haven't noticed any while looking through the code, but to be a little more rigorous about it:

This patch causes the destruction of mState to happen-before the destruction of mTrans, when previously they were unordered, so if there's a vulnerability from that ordering then it was already possible without the patch (especially because that ordering seems to have been what usually happened in practice, e.g. in our ASan CI runs).

It also causes mState to be destroyed before the other members of IToplevelProtocol and IProtocol, but they're all POD (integers, enums, raw pointers not touched by destructors) so there shouldn't be any interactions there.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 13

•

6 years ago

Also, don't rely on this without double-checking it, but I think this isn't usefully exploitable: the only thing we're doing to the freed `Channel` is reading the `channel_impl_` pointer, and then that's used only to read a `bool` which is used to decide whether or not to MOZ_CRASH.

There is a virtual call to MessageLink::Unsound_IsClosed in this, but the object it's called on is the ProcessLink, which isn't freed (MessageChannel::mLink) until later in MessageChannel::Clear.  The methods that are called on the Channel and ChannelImpl are non-virtual.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 14

•

6 years ago

Comment on attachment 9001807 [details] [diff] [review]
Patch

This bug doesn't have a severity rating yet, but I'm requesting sec-approval in case it needs it.


[Security approval request comment]
* How easily could an exploit be constructed based on the patch?

Not simple, but probably doable: it's a use-after-free race condition between two threads in the parent process, and it looks like content could trigger it by creating/destroying EME or WebRTC contexts but it's not trivial to delay the thread that needs to be slower.  There'd also need to be some parent-process heap manipulation to do anything with it.  And see also comment #13 about severity.

* Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

There are no comments in this version, but it's hard to hide: it's pretty obviously rearranging destruction order near something async/thread related.

* Which older supported branches are affected by this flaw?

All of them.  This code hasn't changed much since bug 1283744 (and it may have been more broken before that).

* Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

The patch crossports cleanly to beta and esr60.

* How likely is this patch to cause regressions; how much testing does it need?

Not likely; see comment #12.

Attachment #9001807 - Flags: sec-approval?

Christian Holler (:decoder)

Reporter

Comment 15

•

6 years ago

Fwiw, unlike all other use-after-free reports in the ASan Nightly Project, this one is actually frequent. I got multiple reports from different people, in total 6 reports. That might sound low, but given a total population of around 100 clients, this is very frequent for such an issue. So this might indeed have some stability impact as well.


:jld, what is the MOZ_CRASH that you spoke of in comment 13? Can we see that in crash-stats?

Daniel Veditz [:dveditz]

Comment 16

•

6 years ago

going with sec-moderate based on comment 13

Although it's not directly this crash, comment 5 is worrying.

Keywords: csectype-uaf, sec-moderate

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 17

•

6 years ago

https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/ipc/glue/MessageChannel.cpp#804-831

There are a lot of "MessageChannel destroyed without being closed" crashes; all the ones I looked at were during shutdown, when the cycle collector destroyed a ContentParent, and I don't know if they're caused by this bug or because the channel is actually still open.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 18

•

6 years ago

Comment on attachment 9001807 [details] [diff] [review]
Patch

sec-moderate, so doesn't need approval.

Landed: https://hg.mozilla.org/integration/mozilla-inbound/rev/9c34c00e44d0b021fb1d5ac4bf464b9de4955374

Attachment #9001807 - Flags: sec-approval?

Daniel Veditz [:dveditz]

Updated

•

6 years ago

Flags: sec-bounty+

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 19

•

6 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/9c34c00e44d0b021fb1d5ac4bf464b9de4955374
https://hg.mozilla.org/mozilla-central/rev/9c34c00e44d0

Group: media-core-security → core-security-release

Status: NEW → RESOLVED

Closed: 6 years ago

status-firefox63: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla63

Ryan VanderMeulen [:RyanVM]

Comment 20

•

6 years ago

Please request Beta and ESR60 approval on this when you get a chance.

status-firefox61: --- → wontfix

status-firefox62: --- → affected

status-firefox-esr52: --- → wontfix

status-firefox-esr60: --- → affected

tracking-firefox62: --- → +

tracking-firefox63: --- → +

tracking-firefox-esr60: --- → 62+

Flags: needinfo?(jld)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 21

•

6 years ago

Comment on attachment 9001807 [details] [diff] [review]
Patch

Approval Request Comment
[Feature/Bug causing the regression]: Unclear, because it's a very old bug — at least since the bug 1283744 cleanup and probably before that.
[User impact if declined]: This is a use-after-free that can be triggered by content; I have doubts about its exploitability, but I wouldn't bet on that.  See comment #16; also see comment #15 about frequency in real-world use.
[Is this code covered by automated tests?]: Yes; this is core IPC code.
[Has the fix been verified in Nightly?]: Landed in m-c in comment #19.
[Needs manual test from QE? If yes, steps to reproduce]: No.  It's an intermittent bug and we don't have STR.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: Low risk.
[Why is the change risky/not risky?]: This essentially just imposes an ordering on some things we were previously doing in parallel; see comment #12 for more analysis.
[String changes made/needed]: None.

Flags: needinfo?(jld)

Attachment #9001807 - Flags: approval-mozilla-esr60?

Attachment #9001807 - Flags: approval-mozilla-beta?

Ryan VanderMeulen [:RyanVM]

Comment 22

•

6 years ago

Comment on attachment 9001807 [details] [diff] [review]
Patch

Fixes an IPC security issue. Approved for 62.0b20 and ESR 60.2.

Attachment #9001807 - Flags: approval-mozilla-esr60?

Attachment #9001807 - Flags: approval-mozilla-esr60+

Attachment #9001807 - Flags: approval-mozilla-beta?

Attachment #9001807 - Flags: approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 23

•

6 years ago

uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/cc960a07cd0e

status-firefox62: affected → fixed

Christian Holler (:decoder)

Reporter

Comment 24

•

6 years ago

When you find some time, it would be very helpful if you could answer the following questions so I can work on improving ASan Nightly further. In particular, many of the reports we saw about use-after-free were sent only once, not multiple times. So I'm asking myself what changes to ASan Nightly I can make to improve overall uaf detection rate:

1) Do you think more frequent GC/CC would have helped detecting this problem more frequently?

2) Do you think GC/CC on specific DOM events (e.g. beforeunload, pagehide, DOMContentLoaded) would have helped?

3) Do you think triggering the event loop in the specific child processes would have helped here? (e.g. dummy XMLHttpRequest onbeforeunload, useful in fuzzing).

4) Do you have any other ideas to reproduce this particular problem better or do you think any of the above points (or other approaches) might help to find other use-after-free issues in areas that you work in?

Any feedback would be greatly appreciated. Thanks!

Flags: needinfo?(jld)

Comment hidden (obsolete)

https://hg.mozilla.org/releases/mozilla-esr60/rev/07aedc90a6c1

Ryan VanderMeulen [:RyanVM]

Comment 26

•

6 years ago

backout

Backed out from esr60 for bustage.
https://treeherder.mozilla.org/logviewer.html#?job_id=195535566&repo=mozilla-esr60

https://hg.mozilla.org/releases/mozilla-esr60/rev/d3e100fe45ae459b9e44f5be686bfcebf2b6c392

status-firefox-esr60: fixed → affected

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 27

•

6 years ago

60 predates bug 1451363 and the state factoring, but I forgot about that while digging through older history.  Mea culpa.

Pre-1451363, the MessageChannel is an instance variable of each autogenerated abstract actor class (the PFooSide classes, which inherit from IToplevelProtocol), named mChannel but not the same as IProtocol::mChannel, and accessible through the virtual GetIPCChannel methods.

Which means that the MessageChannel and ProcessLink are already destroyed before entering ~IToplevelProtocol, and this bug is actually a regression caused by bug 1451363, and ESR 60 isn't affected.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 28

•

6 years ago

(In reply to Christian Holler (:decoder) from comment #24)
> 4) Do you have any other ideas to reproduce this particular problem better
> or do you think any of the above points (or other approaches) might help to
> find other use-after-free issues in areas that you work in?

For this UAF, an async runnable/task needs to run before something else that's done synchronously, which more or less means that the first thread needs to be preempted.  This is the kind of thing that rr chaos mode approaches more generally, but for this specific case we might get some mileage out of doing a sched_yield() and/or a short sleep when posting a runnable/task to another thread.

Flags: needinfo?(jld)

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox-esr60: affected → unaffected

tracking-firefox-esr60: 62+ → -

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

Attachment #9001807 - Flags: approval-mozilla-esr60+ → approval-mozilla-esr60-

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

status-firefox-esr52: wontfix → unaffected

Bogdan Maris, Desktop QA

Updated

•

6 years ago

Flags: qe-verify-

Whiteboard: [post-critsmash-triage]

Al Billings [:abillings - ex-MoCo]

Updated

•

6 years ago

Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main62+]

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Updated

•

6 years ago

Updated

•

5 years ago

Group: core-security-release

Bob Clary [:bc] (inactive)

Updated

•

4 years ago

Updated

•

4 years ago

Blocks: asan-maintenance

Detailed Crash Information 6 years ago Christian Holler (:decoder) 31.61 KB, text/plain		Details
Patch 6 years ago Jed Davis [:jld] ⟨⏰\|UTC-7⟩ ⟦he/him⟧ 759 bytes, patch	mccr8 : review+ RyanVM : approval-mozilla-beta+ RyanVM : approval-mozilla-esr60-	Details \| Diff \| Splinter Review