Closed
Bug 1443748
Opened 6 years ago
Closed 6 years ago
Crash in mozilla::ipc::IPDLParamTraits<T>::Write
Categories
(Core :: IPC, defect, P1)
Tracking
()
RESOLVED
FIXED
mozilla64
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | 63+ | fixed |
firefox58 | --- | unaffected |
firefox59 | --- | unaffected |
firefox60 | - | wontfix |
firefox61 | - | wontfix |
firefox62 | --- | wontfix |
firefox63 | + | fixed |
firefox64 | + | fixed |
People
(Reporter: calixte, Assigned: haik)
References
(Blocks 1 open bug)
Details
(4 keywords, Whiteboard: [post-critsmash-triage][adv-main63+][adv-esr60.3+])
Crash Data
Attachments
(4 files, 1 obsolete file)
876 bytes,
patch
|
abillings
:
sec-approval+
|
Details | Diff | Splinter Review |
876 bytes,
patch
|
pascalc
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
876 bytes,
patch
|
jcristau
:
approval-mozilla-release-
|
Details | Diff | Splinter Review |
868 bytes,
patch
|
RyanVM
:
approval-mozilla-esr60+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-e468b93b-cfe9-4200-b1f2-b6b490180307. ============================================================= Top 10 frames of crashing thread: 0 XUL mozilla::ipc::IPDLParamTraits<mozilla::ipc::FileDescriptor>::Write ipc/glue/FileDescriptor.cpp:235 1 XUL std::__1::__function::__func<mozilla::net::PNeckoParent::OnMessageReceived ipc/glue/IPDLParamTraits.h:60 2 XUL mozilla::net::ExtensionJARFileOpener::SendBackFD clang/include/c++/v1/functional:1916 3 XUL mozilla::detail::RunnableMethodImpl<mozilla::net::ExtensionJARFileOpener*, nsresult xpcom/threads/nsThreadUtils.h:1149 4 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1040 5 XUL NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:459 6 XUL nsBaseAppShell::NativeEventCallback widget/nsBaseAppShell.cpp:98 7 XUL nsAppShell::ProcessGeckoEvents widget/cocoa/nsAppShell.mm:436 8 CoreFoundation CoreFoundation@0xa7320 9 CoreFoundation CoreFoundation@0x8821c ============================================================= There is 1 crash in nightly 60 with buildid 20180306100123. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1440511. [1] https://hg.mozilla.org/mozilla-central/rev?node=5467d67a2577
Flags: needinfo?(nika)
Comment 1•6 years ago
|
||
I suspect this is not actually a new crash, but merely a signature change, since we moved some code around. That being said, I can't find the previous crash...
Comment 2•6 years ago
|
||
I think what :froydnj said is probably true, but I also can't find the original crash. I took a look anyway. It looks like a null pointer crash, and the only pointer which has any chance of being null there is the aActor pointer. This is what the code which uses it looks like (not linking to searchfox, because it's generated): WeakPtr<PNeckoParent> self__ = this; GetExtensionFDResolver resolver = [this, self__, id__, seqno__](const FileDescriptor& aParam) { if ((!(self__))) { NS_WARNING("Not resolving response because actor is dead."); return; } if ((mState) == (PNecko::__Dead)) { NS_WARNING("Not resolving response because actor is destroyed."); return; } bool resolve__ = true; FileDescriptor fd; fd = mozilla::Move(aParam); IPC::Message* reply__ = PNecko::Reply_GetExtensionFD(id__); WriteIPDLParam(reply__, self__, resolve__); // Sentinel = 'resolve__' (reply__)->WriteSentinel(3997392463); WriteIPDLParam(reply__, self__, fd); The only way I can see that self__ could become null between the check at the top and the call to Write is if one of the other calls (like WriteIPDLParam or something) caused the PNecko to be destroyed, which I have no idea how it would happen?
Flags: needinfo?(nika)
Updated•6 years ago
|
Priority: -- → P3
Updated•6 years ago
|
Updated•6 years ago
|
Crash Signature: [@ mozilla::ipc::IPDLParamTraits<T>::Write] → [@ mozilla::ipc::IPDLParamTraits<T>::Write]
[@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::IPDLParamTraits<T>::Write]
OS: Mac OS X → All
Comment 3•6 years ago
|
||
this is a UAF crash in many cases (resetting the priority to re-evaluate under these circumstances).
Updated•6 years ago
|
Group: core-security → dom-core-security
Updated•6 years ago
|
tracking-firefox60:
--- → ?
Comment 4•6 years ago
|
||
:mystor can you take another look based on comment 3, since it seems this isn't actually null much of the time?
Comment 5•6 years ago
|
||
Sorry I don't have much time to look into this right now, I tried to think through the way this could happen, and I don't have any ideas.
Flags: needinfo?(nika)
Updated•6 years ago
|
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Comment 7•6 years ago
|
||
Unactionable sec-high bugs make the list of open sec-high bug less useful.
Flags: needinfo?(continuation)
Comment 8•6 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #6) > Why was this closed? The Dev says he has no ideas of how to fix this so this bug is just going to sit open until... In that situation, critsmash triage resolves them as incomplete so the list of sec-high (or critical) bugs is a list of bugs that are actually being worked on to a resolution.
Comment 9•6 years ago
|
||
No sense tracking a bug that isn't going anywhere. That said, Fx60 betas are still showing a pretty regular crash rate here...
Updated•6 years ago
|
Comment 10•6 years ago
|
||
Given the high crash rate here, it might be more appropriate to have this marked as stalled but to keep it open. However, given the below, I'm not yet sure we're stalled. Peter Van Der Beken and I looked into this a bit and with Jeff Muizelaar's help, we were able to get to https://bit.ly/2LJRfGm vs https://bit.ly/2osbsHy (we weren't sure if the template parameter in the signature was catching too many unrelated crashes). It looks like the crashes with mozilla::net::ExtensionJARFileOpener::SendBackFD() on the stack are the most common. Haik touched that code last with reviews by Jim so I'll needinfo them here to see if they have ideas.
Status: RESOLVED → REOPENED
Flags: needinfo?(jmathies)
Flags: needinfo?(haftandilian)
Resolution: INCOMPLETE → ---
Assignee | ||
Comment 12•6 years ago
|
||
I don't know what's causing this yet, but given what :nika and :philipp said above, I suspect this is due to inadequate synchronization in the NeckoParent handling of the async GetExtensionFD message. We handle that message by offloading work to another thread then jumping to the main thread to send back the result. The GetExtensionFD message is sent from a child process when it needs to get the FD for an extension JAR file so it can open and read the JAR without having filesystem access to it directly. This async message is handled in the Necko parent process. The JAR file is opened in the parent process and the FD is sent back to the child. The JAR file is opened off the main thread in ExtensionJARFileOpener::OpenFile() and then the FD is sent back from the main thread in ExtensionJARFileOpener::SendBackFD(). Tear down of the NeckoParent needs to wait for any ExtensionJARFileOpener instances to complete, but that is not being done. As a fix, we could keep a list of outstanding ExtensionJARFileOpener instances and then wait for them to all complete duing teardown. I need to do some more debugging to confirm that this is the problem. Taking this bug.
Assignee: nobody → haftandilian
Flags: needinfo?(haftandilian)
Priority: -- → P1
Comment 13•6 years ago
|
||
(In reply to Haik Aftandilian [:haik] from comment #12) > Tear down of the > NeckoParent needs to wait for any ExtensionJARFileOpener instances to > complete, but that is not being done. As a fix, we could keep a list of > outstanding ExtensionJARFileOpener instances and then wait for them to all > complete duing teardown. What I don't understand here is: if the actor is destroyed before the resolve function is called, the captured WeakPtr self__ should be null here: https://crash-stats.mozilla.com/sources/highlight/?url=https://gecko-generated-sources.s3.amazonaws.com/ccc17ebc155a410d6b3a59b3247a330b6e1af1ea1217cdeb22906fbcd899a89c5b1711fcfa6472957bff18c5957fe964a8d521b445c95db3e7e534fe63987c49/ipc/ipdl/PNeckoParent.cpp#L-2321 And in that case we shouldn't be able to reach the point where the crash is reported. I was wondering if maybe something was happening on the wrong thread, but I didn't see any obvious way that would happen from looking at the code.
Comment 14•6 years ago
|
||
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #13) > What I don't understand here is: See also comment #2.
Comment 15•6 years ago
|
||
A thing that's useful to know: on amd64, accesses to non-canonical addresses (where it's not a sign-extended 48-bit value) raise #GP, not #PF, and the CPU doesn't supply the address in question to the OS. Linux and Mac report the address as 0, Windows as -1. The 32-bit versions of this crash are almost all 0xe5e5e5e5 (or the 64-bit sign extension of that value). Many of the 64-bit versions are the address we'd get if it were a non-canonical address, but some of them aren't. This suggests comparing two crashes from the same Firefox build: bp-677abd07-4382-4a24-94c1-1dad60180828 has crash address 0xe2d82c1351c; in the Raw Dump tab, rsi and rdi have the same value. bp-5e96cbb7-72a6-4933-8bab-3765c0180825 reports crash address 0x0, but in the raw dump, rsi and rdi are 0xe5e5e5e5e5e5e5e5 So maybe the value we're getting from self__ is the result of loading from a dangling pointer, which would explain why it's not null. Specifically: Are we copy-constructing or destroying a copy of the resolver function off-main-thread? It contains a non-thread-safe WeakPtr, so that could cause a lost update to the WeakReference refcount and explain what we're seeing. WeakPtr has thread-safety assertions, but they're debug-only, so if this happens only in a timing-sensitive edge case they might not catch it.
Comment 16•6 years ago
|
||
bp-ccf7bb2a-d6b7-4ca1-bc27-ce8dc0180907 and bp-c8630a42-07c4-4c92-b6fb-ff4bd0180909 are from one of our builds, so I can do some diassembly. The bad address really is the aActor argument to IPDLParamTraits<FileDescriptor>::Write, and it does appear to be result of converting self__ to a raw pointer (and is the same value checked against 0 at the start of the function).
Assignee | ||
Comment 17•6 years ago
|
||
Working with Jed, we've got an explanation for the crash. In short, the ExtensionJARFileOpener (and therefore the WeakPtr within the resolve closure) can be deleted off the main thread. WeakPtr isn't thread safe and in this case needs to be deleted on the main thread to avoid racing with other instances of WeakPtr<PNeckoParent> referring to the same PNeckoParent. This could cause the WeakReference's reference counting to be off which can cause it to be prematurely deleted. ExtensionJARFileOpener gets deleted on the OpenFile thread if the SendBackFD Runnable it creates to send back the FD on the main thread completes before ExtensionJARFileOpener is released by the OpenFile thread.
Comment 18•6 years ago
|
||
For the record, I was wrong about the assertions. Destroying a WeakPtr doesn't check thread safety; other operations (creation, assignment, dereference) do.
Assignee | ||
Comment 19•6 years ago
|
||
This patch clears the resolver on the main thread after it is used in SendBackFD. If the final Release of ExtensionJARFileOpener ends up being called after SendBackFD on the worker thread, the resolver will be already deleted and we'll avoid this crash because we won't be deleting a WeakPtr<PNeckoParent> off the main thread. I intentionally did not add comments because this is a security bug. I haven't reproduced the reported crash, but I have reproduced the WeakPtr being deleted off the main thread which could lead to the crash. I'm still testing the patch.
Attachment #9009664 -
Flags: review?(jld)
Assignee | ||
Updated•6 years ago
|
Flags: needinfo?(jmathies)
Comment 20•6 years ago
|
||
Comment on attachment 9009664 [details] [diff] [review] Patch to avoid the crash be ensuring the PNeckoParent resolver is deleted on the main thread Review of attachment 9009664 [details] [diff] [review]: ----------------------------------------------------------------- I like the simplicity of this approach. ::: netwerk/protocol/res/ExtensionProtocolHandler.cpp @@ +193,5 @@ > private: > + virtual ~ExtensionJARFileOpener() > + { > + if (!NS_IsMainThread()) { > + MOZ_RELEASE_ASSERT(mResolve == nullptr); This part of the patch seems like a relatively big hint about what the bug is, but whoever does the sec-approval would have a better idea of how much of a concern that is. Specifically, this should work without the main thread check — as far as I can see, once we create one of these objects, every path either eventually calls SendBackFD or leaks it (if dispatch fails during shutdown).
Attachment #9009664 -
Flags: review?(jld) → review+
Assignee | ||
Comment 21•6 years ago
|
||
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #20) > Comment on attachment 9009664 [details] [diff] [review] > ::: netwerk/protocol/res/ExtensionProtocolHandler.cpp > @@ +193,5 @@ > > private: > > + virtual ~ExtensionJARFileOpener() > > + { > > + if (!NS_IsMainThread()) { > > + MOZ_RELEASE_ASSERT(mResolve == nullptr); > > This part of the patch seems like a relatively big hint about what the bug > is, but whoever does the sec-approval would have a better idea of how much > of a concern that is. Thanks for the review. I'll err on the side of obfuscation and remove the new ASSERT. I'll use another bug to add more comments and add the ASSERT back once this lands. > Specifically, this should work without the main thread check — as far as I > can see, once we create one of these objects, every path either eventually > calls SendBackFD or leaks it (if dispatch fails during shutdown). I agree.
Assignee | ||
Comment 22•6 years ago
|
||
Removed the destructor ASSERT. Carrying forward r+.
Attachment #9009664 -
Attachment is obsolete: true
Attachment #9010110 -
Flags: review+
Assignee | ||
Updated•6 years ago
|
Attachment #9010110 -
Flags: review+
Assignee | ||
Comment 23•6 years ago
|
||
Comment on attachment 9010110 [details] [diff] [review] Patch to avoid the crash by ensuring the PNeckoParent resolver is deleted on the main thread [Security approval request comment] How easily could an exploit be constructed based on the patch? Not easily. The ramifications of the bug aren't clear from the patch. The UAF occurs in the parent process, but the code is triggered from the child process via an IPC message. A compromised child process could repeatedly trigger the code to run in the parent and eventually hit the race condition causing the UAF. The compromised child would also have to control allocations in order to leverage the UAD. I don't think web content or an extension could repeatedly trigger the code to run in the parent, hitting the UAF. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? No, but please review the check-in comment. Which older supported branches are affected by this flaw? All (ESR60, Release, Beta, Nightly) If not all supported branches, which bug introduced the flaw? Do you have backports for the affected branches? No If not, how different, hard to create, and risky will they be? The patches should all be the same 1-line change and should not be hard to create. Given the small size of the patch, I think the change is well understood and low risk. How likely is this patch to cause regressions; how much testing does it need? Low probability of causing regressions. Standard bug fix testing. Existing automated tests will exercise the changed code.
Attachment #9010110 -
Flags: sec-approval?
Updated•6 years ago
|
status-firefox64:
--- → affected
tracking-firefox63:
--- → +
tracking-firefox64:
--- → +
tracking-firefox-esr60:
--- → 63+
Comment 24•6 years ago
|
||
Comment on attachment 9010110 [details] [diff] [review] Patch to avoid the crash by ensuring the PNeckoParent resolver is deleted on the main thread sec-approval+ for trunk. Let's get Beta and ESR60 patches nominated as well.
Attachment #9010110 -
Flags: sec-approval? → sec-approval+
Assignee | ||
Comment 25•6 years ago
|
||
This problem was introduced by my fix for bug 1334550 "Proxy moz-extension protocol requests to the parent process". Working on getting the other patches ready.
Assignee | ||
Comment 26•6 years ago
|
||
Identical patch for Beta.
Assignee | ||
Comment 27•6 years ago
|
||
Identical patch for Release.
Assignee | ||
Comment 28•6 years ago
|
||
Identical patch for ESR60.
Assignee | ||
Comment 29•6 years ago
|
||
Comment on attachment 9010773 [details] [diff] [review] Beta patch [Security approval request comment] The code changes are the same as on Nightly. See comment 23.
Attachment #9010773 -
Flags: sec-approval?
Assignee | ||
Comment 30•6 years ago
|
||
Comment on attachment 9010774 [details] [diff] [review] Release patch [Security approval request comment] The code changes are the same as on Nightly. See comment 23.
Attachment #9010774 -
Flags: sec-approval?
Assignee | ||
Comment 31•6 years ago
|
||
Comment on attachment 9010775 [details] [diff] [review] ESR60 patch [Security approval request comment] The code changes are the same as on Nightly. See comment 23.
Attachment #9010775 -
Flags: sec-approval?
Assignee | ||
Comment 32•6 years ago
|
||
:RyanVM informed me sec-approval is only needed for the Nightly patch. I'll update patches to request standard approvals.
Assignee | ||
Comment 33•6 years ago
|
||
Comment on attachment 9010773 [details] [diff] [review] Beta patch Approval Request Comment [Feature/Bug causing the regression]: Bug 1334550 [User impact if declined]: User is exposed to a UAF crash in the parent process. [Is this code covered by automated tests?]: Yes [Has the fix been verified in Nightly?]: No, not yet on Nightly due to it being a security bug. [Needs manual test from QE? If yes, steps to reproduce]: No specific tests that reproduce the crash. Try builds have been shown to exercise the changed code, but don't cause the crash because that requires a race condition that has not been reproduced. [List of other uplifts needed for the feature/fix]: None [Is the change risky?]: No [Why is the change risky/not risky?]: The change is small and well understood. [String changes made/needed]: None
Attachment #9010773 -
Flags: sec-approval? → approval-mozilla-beta?
Assignee | ||
Comment 34•6 years ago
|
||
Comment on attachment 9010774 [details] [diff] [review] Release patch Approval Request Comment [Feature/Bug causing the regression]: Bug 1334550 [User impact if declined]: User is exposed to a UAF crash in the parent process. [Is this code covered by automated tests?]: Yes [Has the fix been verified in Nightly?]: No, not yet on Nightly due to it being a security bug. [Needs manual test from QE? If yes, steps to reproduce]: No specific tests that reproduce the crash. Try builds have been shown to exercise the changed code, but don't cause the crash because that requires a race condition that has not been reproduced. [List of other uplifts needed for the feature/fix]: None [Is the change risky?]: No [Why is the change risky/not risky?]: The change is small and well understood. [String changes made/needed]: None
Attachment #9010774 -
Flags: approval-mozilla-release?
Assignee | ||
Updated•6 years ago
|
Attachment #9010774 -
Flags: sec-approval?
Assignee | ||
Comment 35•6 years ago
|
||
Comment on attachment 9010775 [details] [diff] [review] ESR60 patch [Approval Request Comment] If this is not a sec:{high,crit} bug, please state case for ESR consideration: User impact if declined: sec-high Fix Landed on Version: Not yet landed. Risk to taking this patch (and alternatives if risky): Low risk due to the size of the change. It's in code used for loading JAR files which is used by extensions so it's frequently used. String or UUID changes made by this patch: None See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #9010775 -
Flags: sec-approval? → approval-mozilla-esr60?
Comment 36•6 years ago
|
||
I filed bug 1493045 to assert thread-safety on WeakPtr destruction. In this case, that would have caught the wrong-thread destruction even if it wouldn't have caused a lost update to the refcount, which is probably the much more common case, and hopefully (1) failed often enough that we'd have caught it in CI, and (2) pointed much more directly at the problem. (I'm not linking the bugs with a See Also because I'd like to not put up a big sign saying the other bug is inspired by a live security issue.)
Comment 37•6 years ago
|
||
Comment on attachment 9010774 [details] [diff] [review] Release patch Unless this is being actively exploited it should go in 63.
Attachment #9010774 -
Flags: approval-mozilla-release? → approval-mozilla-release-
Comment 38•6 years ago
|
||
Comment on attachment 9010773 [details] [diff] [review] Beta patch Uplift approved for 63 beta 9, thanks.
Attachment #9010773 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 39•6 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/bad3f36435488b47c23c46ab8e8ee7aa74d62464
Comment 40•6 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/bad3f3643548
Group: dom-core-security → core-security-release
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Comment 41•6 years ago
|
||
I also filed bug 1493342 about most of the 64-bit crashes having the wrong address.
Comment 42•6 years ago
|
||
uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/817efa50e613e4f8fc704fb106b13654ab0d6688
Updated•6 years ago
|
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Comment 43•6 years ago
|
||
Comment on attachment 9010775 [details] [diff] [review] ESR60 patch Fixes a sec-high crash. Approved for ESR 60.3.
Attachment #9010775 -
Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
Comment 44•6 years ago
|
||
uplift |
https://hg.mozilla.org/releases/mozilla-esr60/rev/b80f94262165
Updated•6 years ago
|
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main63+][adv-esr60.3+]
Updated•5 years ago
|
Group: core-security-release
You need to log in
before you can comment on or make changes to this bug.
Description
•