Closed Bug 1294095 Opened 8 years ago Closed 8 years ago

Crash in mozilla::DataChannelConnection::SctpDtlsOutput

Categories

(Core :: WebRTC: Networking, defect, P1)

49 Branch
x86
Windows
defect

Tracking


VERIFIED FIXED
mozilla51
Tracking Status
firefox48 --- wontfix
firefox49 + verified
firefox-esr45 49+ verified
firefox50 + verified
firefox51 + verified

People

(Reporter: philipp, Assigned: jesup)

References

Details

(Keywords: crash, regression, sec-critical, Whiteboard: [adv-main49+][adv-esr45.4+])

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-c237a768-7c3a-4c28-b18c-ad34f2160810.
=============================================================
Crashing Thread (54)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::DataChannelConnection::SctpDtlsOutput(void*, void*, unsigned int, unsigned char, unsigned char) 	netwerk/sctp/datachannel/DataChannel.cpp:691
1 	xul.dll 	sctp_lowlevel_chunk_output 	netwerk/sctp/src/netinet/sctp_output.c:5010
2 	mozglue.dll 	je_malloc 	memory/mozjemalloc/jemalloc.c:6184
3 	xul.dll 	sctp_abort_an_association 	netwerk/sctp/src/netinet/sctputil.c:4212
4 	xul.dll 	sctp_threshold_management 	netwerk/sctp/src/netinet/sctp_timer.c:167
5 	xul.dll 	sctp_t3rxt_timer 	netwerk/sctp/src/netinet/sctp_timer.c:832
6 	xul.dll 	sctp_timeout_handler 	netwerk/sctp/src/netinet/sctputil.c:1706
7 	xul.dll 	sctp_handle_tick 	netwerk/sctp/src/netinet/sctp_callout.c:155
8 	xul.dll 	user_sctp_timer_iterate 	netwerk/sctp/src/netinet/sctp_callout.c:194
9 	kernel32.dll 	BaseThreadInitThunk 	
10 	ntdll.dll 	__RtlUserThreadStart 	
11 	ntdll.dll 	_RtlUserThreadStart

crashes with this signature have been increasing in number since the 49 nightly cycle & the signature is currently making up 0.3% of browser crashes in 49.0b1.
the crash addresses show the e5e5 poison pattern, i.e. the crash is touching freed (poisoned) memory

I wonder if this is another instance of the sctp timer iterator issue?  Byron?

I don't see how peer could point to freed memory, or peer->mSTS, while the sctp stack is still open.  In DataChannelConnection::Destroy() we close() the sctp socket on the STS thread (mSTS), and the runnable that does that holds a RefPtr to the DataChannelConnection, so we can't free the DataChannelConnection until we've finished the close().  This implies that usrsctp_close() could leave a dangling timer reference about to fire...
Assignee: nobody → rjesup
Group: media-core-security
Rank: 5
Flags: needinfo?(tuexen)
Flags: needinfo?(docfaraday)
Keywords: sec-critical
Priority: -- → P1
I don't really see how this pointer can go wrong... Are you also tearing down the SCTP stack at the same time?
Flags: needinfo?(tuexen)
So we're not calling sctp_finish until the browser shuts down, so the timer thread will continue to operate.  However, as per above, we are certain that the pointer will be valid until usrsctp_close() returns.  I would hope we would not get a callback to send a packet (per the stack) after close() returns...
(In reply to Michael Tüxen from comment #2)
> I don't really see how this pointer can go wrong... Are you also tearing
> down the SCTP stack at the same time?

The stack is not being torn down, but we are calling |usrsctp_deregister_address| and |usrsctp_close|. Should that be enough?
Flags: needinfo?(docfaraday)
I think so.
(In reply to Byron Campen [:bwc] from comment #4)
> (In reply to Michael Tüxen from comment #2)
> > I don't really see how this pointer can go wrong... Are you also tearing
> > down the SCTP stack at the same time?
> 
> The stack is not being torn down, but we are calling
> |usrsctp_deregister_address| and |usrsctp_close|. Should that be enough?

Note: the order of usrsctp_deregister_address() and usrsctp_close() isn't fixed here (and they don't run on the same thread).  Could that impact it?  We could always call deregister_address before usrsctp_close, though.  Calling it afterwards would also be possible; we'd have to pass a RefPtr to the connection along with the DestroyOnSTS(), and move the deregister into DestroyOnSTS (along with the usrsctp_close()).
When looking at the traces at https://crash-stats.mozilla.com/signature/?product=Firefox&version=49.0b&signature=mozilla%3A%3ADataChannelConnection%3A%3ASctpDtlsOutput&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1
it is
* always Windows as the platform
* always when a timer (either the HEARTBEAT timer or a data retransmission timer) fires
* always when this results in aborting the association due to excessive retransmissions.
So how do you think usrsctp_close() is in the game here?
A question: when do you call usrsctp_deregister_address()? When the PeerConnection dies? When it gets closed? Do you leave some SCTP associations open?
Based on that I could do some testing with the usrsctp stack under valgrind, which should give a hint whether the current code handles the situation well. This wouldn't mean that the code you are using handles it, but possibly I can identify a bug in the current code...
So the only way the DataChannelConnection object can be freed is if we go through Destroy() first, which fires off DestroyOnSTS (which does usrsctp_close()) to the STS thread, then calls deregister_address (on the main/calling thread).  Thus due to thread ordering the DestroyOnSTS may run before or after deregister_address.  I presume the crashes are due to a race between the abort timer going off and the teardown of the peerconnection.

After Destroy() returns, the DataChannelConnection object will be released, which (if that's the last ref) will delete it.  (Various things in-flight might hold a ref to the connection, but generally it will be freed on return from Destroy()).

This pattern assumes that after deregister_address, no callbacks can occur (the DataChannelConnection ptr (raw ptr) is associated with the socket callback).  If instead we have to do usrsctp_close to stop callbacks, or both that and deregister, we could move deregister to before we fire off DestroyOnSTS(), or if deregister must come after the close, move it to DestroyOnSTS and also pass a reference to DataChannelConnection to DestroyOnSTS, so that the connection object can't be deleted until DestroyOnSTS completes.
Flags: needinfo?(tuexen)
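To make the lifetime argument in comment 10 concrete, here is a toy model of the teardown ordering (std::shared_ptr stands in for RefPtr, a std::queue for the STS event queue; names are illustrative, not the real Mozilla API):

```cpp
#include <functional>
#include <memory>
#include <queue>

// "usrsctp_close() ran" / "~DataChannelConnection ran" markers.
struct TeardownLog {
  bool closed = false;
  bool destroyed = false;
};

struct Connection {
  explicit Connection(TeardownLog* l) : log(l) {}
  ~Connection() { log->destroyed = true; }
  void DestroyOnSTS() { log->closed = true; }  // usrsctp_close() would go here
  TeardownLog* log;
};

// Returns true iff close() completed strictly before the object was freed.
inline bool CloseRunsBeforeDelete(TeardownLog& log) {
  std::queue<std::function<void()>> stsQueue;  // stands in for the STS thread
  {
    auto conn = std::make_shared<Connection>(&log);
    // Destroy(): the dispatched runnable captures a strong ref, so the
    // caller dropping its last ref (end of this scope) cannot free it.
    stsQueue.push([conn] { conn->DestroyOnSTS(); });
  }
  bool aliveWhileQueued = !log.destroyed;  // the queued runnable keeps it alive
  stsQueue.front()();  // the STS "thread" runs DestroyOnSTS()
  stsQueue.pop();      // runnable destroyed; last ref drops; object freed
  return aliveWhileQueued && log.closed && log.destroyed;
}
```

This shows only why DestroyOnSTS cannot race the destructor; it says nothing about the timer thread, which holds no such ref.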
Let me test a bit...
Flags: needinfo?(tuexen)
This signature is still on the rise for 49 beta 2 & 3, though not huge volume. Tracking.
So, looking through the stuff that |usrsctp_close| calls, I don't see an |sctp_timer_stop| for |SCTP_TIMER_TYPE_SEND|, which is the timer type that is blowing us up here. I only see an |sctp_timer_stop| for |SCTP_TIMER_TYPE_NEWCOOKIE|. Is there something else that |usrsctp_close| does that ought to be stopping (or at least neutering) this timer?
Flags: needinfo?(tuexen)
It is for example stopped when we clear the net in the sctp_free_remote_addr() macro.

My current guess is that it is fine that the timer fires. The problem is that as a result
of the firing, the upper layer gets notified and probably deregisters the pointer to the lower
layer. I need to understand the C++ stuff...
I can add code so that the lower layer pointer is no longer provided after it is deregistered;
that might solve the issue. But I would like to understand the issue first. Not sure if I
can reproduce the problem with firefox... I was busy with my dayjob earlier this week...
Flags: needinfo?(tuexen)
Proposed wrapper idea, though Michael's idea may be better
Attachment #8782576 - Flags: feedback?(docfaraday)
OK, I looked at the code. It is not as easy as I thought to add the protection. And I have no
idea why this is happening.

If you just want to prevent the problem from occurring, add

	usrsctp_sysctl_set_sctp_assoc_rtx_max_default(0xffffffff);

after
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/datachannel/DataChannel.cpp#351

This will basically turn off the detection that the peer is dead. In all error logs I looked at, this was the code path where the problem occurred.

This doesn't solve the real problem (which I currently haven't been able to identify based on the information available) but will avoid it.
Flags: needinfo?(rjesup)
Comment on attachment 8782576 [details] [diff] [review]
wrap and lock DataChannelConnection for SCTP callbacks

Review of attachment 8782576 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/sctp/datachannel/DataChannel.h
@@ +296,5 @@
>    nsCOMPtr<nsIThread> mInternalIOThread;
> +
> +  // Temporary wrapper object to use with sctp library callbacks so we can
> +  // neuter it on Destroy().  Leaked on purpose when we delete this object.
> +  ConnectionWrapper *mTempWrapper;

Oof. Leaking a mutex on purpose? Surely there's a better fix.
Attachment #8782576 - Flags: feedback?(docfaraday) → feedback-
Hmmm.  That will turn off detection of a failed connection.  In theory ICE Consent Freshness should eventually kick in and mark it as failed.  There might even be advantages in not having an independent timeout for datachannels... since that will be confusing for JS apps (they'd see close events but the peerconnection may still be "connected")

However... while the stacks are all from association timeout, if we do this we may retransmit "forever", and in the case where we timed out, we might try to send a packet retransmit instead, and I'm concerned that this would use the bad ptr - unless the normal retransmit is blocked by the deregister/close.
(In reply to Byron Campen [:bwc] from comment #17)
> Comment on attachment 8782576 [details] [diff] [review]
> wrap and lock DataChannelConnection for SCTP callbacks
> 
> Review of attachment 8782576 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/sctp/datachannel/DataChannel.h
> @@ +296,5 @@
> >    nsCOMPtr<nsIThread> mInternalIOThread;
> > +
> > +  // Temporary wrapper object to use with sctp library callbacks so we can
> > +  // neuter it on Destroy().  Leaked on purpose when we delete this object.
> > +  ConnectionWrapper *mTempWrapper;
> 
> Oof. Leaking a mutex on purpose? Surely there's a better fix.

Leaking a mutex is better than a UAF.

However, I wonder if I can do some magic with atomics... or do some deferred cleanup of these (clean them up on a 10s timer).  Initially I'd hoped to do this with a bare ptr to minimize the leak, but that opens us up to holes when trying to clear it (inherently from a different thread)
Flags: needinfo?(rjesup)
I could also use a single static Mutex for all DataChannelConnections.... likely no contention (or minimal), and the leak would then just be a class ConnectionWrapper { explicit ConnectionWrapper(DataChannelConnection* foo) : mConnection(foo) {} RefPtr<DataChannelConnection> mConnection; }
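A minimal sketch of that wrapper idea (names like ConnectionWrapper and Neuter are hypothetical, and std::mutex stands in for Mozilla's StaticMutex; the real patch keeps a RefPtr, here a raw pointer for brevity):

```cpp
#include <mutex>

class DataChannelConnection;  // stand-in forward declaration

// One process-wide mutex shared by all wrappers; contention should be
// minimal since SCTP callbacks and Destroy() rarely overlap.
static std::mutex gWrapperMutex;

class ConnectionWrapper {
 public:
  explicit ConnectionWrapper(DataChannelConnection* c) : mConnection(c) {}

  // Called from Destroy(): after this returns, callbacks see nullptr and
  // can no longer touch the (soon to be freed) connection.
  void Neuter() {
    std::lock_guard<std::mutex> lock(gWrapperMutex);
    mConnection = nullptr;
  }

  // Called from the SCTP output callback: runs fn against the connection
  // if it is still alive; returns false (drop the packet) otherwise.
  bool WithConnection(void (*fn)(DataChannelConnection*)) {
    std::lock_guard<std::mutex> lock(gWrapperMutex);
    if (!mConnection) {
      return false;
    }
    fn(mConnection);
    return true;
  }

 private:
  DataChannelConnection* mConnection;  // validity guarded by gWrapperMutex
};
```

The wrapper itself is what gets leaked on purpose; with the shared static mutex the leak is just this small object, not a mutex per connection.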
(In reply to Randell Jesup [:jesup] from comment #18)
> Hmmm.  That will turn off detection of a failed connection.  In theory ICE
> Consent Freshness should eventually kick in and mark it as failed.  There
> might even be advantages in not having an independent timeout for
> datachannels... since that will be confusing for JS apps (they'd see close
> events but the peerconnection may still be "connected")
> 
> However... while the stacks are all from association timeout, if we do this
> we may retransmit "forever", and in the case where we timed out, we might
> try to send a packet retransmit instead, and I'm concerned that this would
> use the bad ptr - unless the normal retransmit is blocked by the
> deregister/close.

We have retransmitted multiple times before we do the cleanup. So I think
the problem is NOT related to the timeout (either HEARTBEAT or DATA retransmit),
but related to the cleanup. Which, I guess, also somehow involves the freeing
of the DTLS connection object. That is why I think my change would work around
the issue.

You can also:
* make sure you call usrsctp_close() before usrsctp_deregister().
* make sure you do not free whatever the pointer given to usrsctp points to before
  calling usrsctp_deregister().
That would be (I'm guessing here) the real fix. However, it is not within the usrsctp stack.

I also think that the ICE layer detects that the peer is unreachable earlier than SCTP
does with the used default settings.
Flags: needinfo?(rjesup)
Switched to a shared StaticMutex for locking access to the DataChannelConnection pointer
Attachment #8782954 - Flags: review?(docfaraday)
Attachment #8782576 - Attachment is obsolete: true
(In reply to Michael Tüxen from comment #21)
> (In reply to Randell Jesup [:jesup] from comment #18)
> > Hmmm.  That will turn off detection of a failed connection.  In theory ICE
> > Consent Freshness should eventually kick in and mark it as failed.  There
> > might even be advantages in not having an independent timeout for
> > datachannels... since that will be confusing for JS apps (they'd see close
> > events but the peerconnection may still be "connected")
> > 
> > However... while the stacks are all from association timeout, if we do this
> > we may retransmit "forever", and in the case where we timed out, we might
> > try to send a packet retransmit instead, and I'm concerned that this would
> > use the bad ptr - unless the normal retransmit is blocked by the
> > deregister/close.
> 
> We have retransmitted multiple times before we do the cleanup. So I think
> the problem is NOT related to the timeout (either HEARTBEAT or DATA
> retransmit),
> but related to the cleanup. Which, I guess, also involves somehow the freeing
> of the DTLS connection object. That is why I think my change would work
> around
> the issue.
> 
> You can also:
> * make sure you call usrsctp_close() before usrsctp_deregister().
> * make sure you do not free whatever the pointer given to usrsctp points to
> before
>   calling usrsctp_deregister().

You mean usrsctp_deregister_address, right? Because if that will fix this, that's really easy.
Yes, I meant usrsctp_deregister_address(). If that is easy, I would say do that and let us see
if that fixes the issue in the field.
(In reply to Michael Tüxen from comment #14)
> It is for example stopped when we clear the net in the
> sctp_free_remote_addr() macro.
> 

I added an abort to this macro, and it never fires when running our mochitest suite...
|sctp_timer_stop| for |SCTP_TIMER_TYPE_SEND| is not called (transitively) by either |usrsctp_close| or |usrsctp_deregister_address|. Is there some other way this kind of timer could be getting canceled?
Flags: needinfo?(tuexen)
The retransmission timers (like all timers) are stopped in sctp_free_assoc() for each path:

	TAILQ_FOREACH(net, &asoc->nets, sctp_next) {
		(void)SCTP_OS_TIMER_STOP(&net->rxt_timer.timer);
		net->rxt_timer.self = NULL;
		(void)SCTP_OS_TIMER_STOP(&net->pmtu_timer.timer);
		net->pmtu_timer.self = NULL;
		(void)SCTP_OS_TIMER_STOP(&net->hb_timer.timer);
		net->hb_timer.self = NULL;
	}
Flags: needinfo?(tuexen)
Hmm. sctp_free_assoc() also calls:

	TAILQ_FOREACH_SAFE(net, &asoc->nets, sctp_next, nnet) {
#ifdef INVARIANTS
		if (SCTP_BASE_INFO(ipi_count_raddr) == 0) {
			panic("no net's left alloc'ed, or list points to itself");
		}
#endif
		TAILQ_REMOVE(&asoc->nets, net, sctp_next);
		sctp_free_remote_addr(net);
	}
So could it be that you don't tear down the associations in your test cases?
So, we can call close() before deregister_address() (per my comments in comment 10).  If that's all it takes, that's easy.
Right now, we call either close and then deregister_address, or deregister_address and then close (since they occur on separate threads).  There's no way the DataChannelConnection object can be released before both have happened.  The crashes we're seeing imply that we've freed the object, and thus close and deregister_address must have completed.
Flags: needinfo?(rjesup)
After calling usrsctp_deregister_address(), the usrsctp stack will still call the lower layer output routine using the provided pointer. I tested that. Avoiding that is harder than I thought, since most
of the address handling we normally use isn't used in the case where we deal with the pointer.

However, if you say that the DataChannelConnection isn't released before usrsctp_close() has been called AND usrsctp_deregister_address() has been called, then the sequence doesn't matter.

Just to be crystal clear; Here is what I get from the stack traces:

In all cases, SCTP detects via multiple retransmissions (10, to be precise, taking several minutes of probing) that the peer is dead. Then it calls the upper layer to let it know. There you are doing some cleanups. See
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/datachannel/DataChannel.cpp#1470
I'm not sure what happens there. Could it be that you are releasing the DataChannelConnection there?
After that callback, the SCTP stack will send out an ABORT. This is where FF crashes, since the DataChannelConnection object is gone at this point of time.

The missing piece for me is that I don't understand under which condition the DataChannelConnection gets freed (since I'm not familiar with C++)... I suspect that this might be done in the cleanup.
So that is why I'm proposing to make sure it never happens (until we understand what is going on and we have a way to test this).
So that sends an ON_DISCONNECTED message to the MainThread, which calls DataChannelConnection::CloseAll().

CloseAll() calls Close() on each DataChannel, and does some cleanup on pending opens (calls Close on them).  Lastly if it closed any channels, it does SCTP_STREAM_RESET_OUTGOING (in SendOutgoingStreamReset()).

None of these should affect the DataChannelConnection object itself, which is refcounted - so long as the PeerConnection (or anything else) has a RefPtr<> to the connection, it will not be deleted.  Before the PeerConnection will drop the RefPtr (which can delete the connection), it will call Destroy() on the connection, which will at that point call close() and deregister_address().  After those have been called, it *will* destroy the connection object.  (We leave the sctp stack available until browser shutdown; i.e. we normally *don't* call sctp_finish() until the browser exits, though I'd like to look at doing so.)

So it looks like the original assessment is correct - there's no easy way to avoid a race between an SCTP timeout and someone destroying the connection.  Now, I don't know that's what *did* happen (i.e. that it's not some other problem), but this is a plausible if unlikely race.

So, ugly as it is, we may have to go with the original idea, since without calling usrsctp_finish there's no way to ensure that the stack won't call us (and if multiple connections in different peerconnections are active, calling usrsctp_finish would be bad).
OK. So you are saying that we have a race:
(a) One thread processing a timeout which results in terminating the SCTP association.
(b) Another thread destroying the DataChannelConnection.
Any idea how (b) can happen? Which usrsctp_*() functions might be involved?
If you have a conjecture, I could look at the code and/or write a stress tester...
Thread B is the thread that opened the SCTP library, and from previous experience (bugs) must be the one that closes it.

<user shutdown of PeerConnection>
Thread A (Mainthread): 
  Destroy() {
     Send message to Thread B to close SCTP (DestroyOnSTSThread())
     usrsctp_deregister_address()
  }
  ~DataChannelConnection() - delete of the structure (object ptr) we registered with the open

Thread B (STS):
  DestroyOnSTSThread() {
    usrsctp_close()
  }

Browser Shutdown:
  usrsctp_finish();

Because of threading, DestroyOnSTSThread() can happen before or after deregister_address.  However, you're telling me that after deregister_address "the usrsctp stack will still call the lower layer output routine using the provided pointer", so it appears there's no way to avoid that other than usrsctp_finish (which we don't do until browser shutdown and all users of sctp have shut down).

I could make the message dispatch to the STS thread synchronous (though we prefer to avoid synchronous messages starting from MainThread, since that can block the main thread of the browser) and do deregister on STS or after it returns, or I could move deregister to before the message is sent.  But from your quote above I don't think that will help me (I might still get called back using the ptr even after deregister *and* close)
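The synchronous-dispatch option mentioned above can be sketched like this (toy code; a std::promise stands in for Mozilla's sync-dispatch machinery and the function names are made up — and note this is exactly the main-thread blocking Mozilla prefers to avoid):

```cpp
#include <functional>
#include <future>
#include <thread>

// The main thread blocks until the STS-side close has run, then calls
// deregister, making the ordering close() -> deregister_address()
// deterministic regardless of thread scheduling.
inline void SyncCloseThenDeregister(std::function<void()> usrsctpClose,
                                    std::function<void()> deregisterAddress) {
  std::promise<void> closed;
  std::thread sts([&] {        // stands in for the STS thread
    usrsctpClose();            // usrsctp_close() runs first...
    closed.set_value();
  });
  closed.get_future().wait();  // main thread blocks until close finished
  deregisterAddress();         // ...then usrsctp_deregister_address()
  sts.join();
}
```

As the comment above notes, this fixes only the relative ordering of the two calls; it does not help if the stack can still call back after both have completed.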
Just as a side note: usrsctp_finish() will tear down the stack if there are no endpoints anymore.

Doesn't the above mean that you could possibly delete the structure within thread A before thread B calls usrsctp_close()?

Please note that in the crashes we are looking at, the timer thread is involved. We run the notification code within the timer thread. So the cleanup will be triggered from that thread.

Again: I'm not sure about the timeframe you need to handle this problem. If you need a quick way to avoid it, think about raising the threshold for dead peer detection.
(In reply to Michael Tüxen from comment #34)
> Just as a side note: usrsctp_finish() will teardown the stack if there are
> not endpoints anymore.
> 
> Doesn't the above mean that you possibly delete the structure within thread
> A before thread B calls usrsctp_close().

   The Runnable that calls DestroyOnSTS holds a strong ref to the DataChannelConnection.
(In reply to Michael Tüxen from comment #30)
> In all cases, SCTP detects via multiple retransmissions (10 to be precisely,
> taking several minutes of probing) that the peer is dead. Then it calls the
> upper layer to let him know. There you are doing some cleanups. See
> https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/datachannel/
> DataChannel.cpp#1470

Is there an event that signals that the SCTP stack will definitely not be calling us back anymore?
Flags: needinfo?(tuexen)
Currently not. When aborting the association, we provide the notification first and then send an ABORT.
We are currently reworking the callback API to overcome several limitations of the current way, but that needs some further testing...
Flags: needinfo?(tuexen)
Comment 35 (DestroyOnSTS holding a strong ref) means we will not destroy it until close() returns.

However - this isn't on the timer thread.  Sounds like there's a race between close/etc on other threads (close() on STS in this case) and the timer firing; i.e. some missing interlock that stops the timer from firing if close happens.

This is a security bug, and as it's a UAF we really want to take a fix for it in the next week or two so it can go out in 49 -- even if that fix is my "leak a ptr" fix
So, we have unlocked a mutex in the crashing callstack:

https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_callout.c#155

It does look like this leaves a brief window that could allow the Destroy() code to finish at the same time the timer is popping.
Is there some way we could get the sctp stack to schedule an immediate timer that comes around as a callback on the timer thread? We could schedule this timer in DestroyOnSTS (and move the deregister there too), and once that timer pops we can dispose of the DataChannelConnection.

If there is not currently a way to do this, how hard would it be to make one?
Flags: needinfo?(tuexen)
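As a toy model of that "immediate timer" idea (all names hypothetical; a std::queue stands in for the timer thread's serialized task list): if the final strong ref is dropped by a task running on the timer thread itself, any timer callback queued before it necessarily runs against a live object.

```cpp
#include <functional>
#include <memory>
#include <queue>

struct TimerClient {
  explicit TimerClient(bool* d) : destroyed(d) {}
  ~TimerClient() { *destroyed = true; }
  void OnTimer() {}  // an ordinary SCTP timer callback
  bool* destroyed;
};

inline bool DisposedAfterPendingTimers(bool& destroyedFlag) {
  destroyedFlag = false;
  std::queue<std::function<void()>> timerThread;  // runs tasks in order
  auto conn = std::make_shared<TimerClient>(&destroyedFlag);
  // A timer callback is already queued when teardown begins:
  timerThread.push([conn] { conn->OnTimer(); });
  // Teardown: hand the last ref to an "immediate timer" instead of
  // dropping it here; it runs after all previously queued callbacks.
  timerThread.push([conn]() mutable { conn.reset(); });
  conn.reset();  // caller's ref gone; the queued tasks keep it alive
  timerThread.front()();                       // pending timer fires...
  bool aliveForPendingTimer = !destroyedFlag;  // ...on a live object
  timerThread.pop();
  timerThread.front()();  // disposal task drops the final ref
  timerThread.pop();
  return aliveForPendingTimer && destroyedFlag;
}
```

The open question in the thread is precisely whether usrsctp can offer such a scheduling hook; the sketch only shows why it would close the race if it existed.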
(In reply to Byron Campen [:bwc] from comment #39)
> So, we have unlocked a mutex in the crashing callstack:
> 
> https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/
> sctp_callout.c#155
> 
> It does look like this leaves a brief window that could allow the Destroy()
> code to finish at the same time the timer is popping.
The TIMERQ_LOCK just protects the list of timers. If the stored timer is stopped,
sctp_os_timer_next is adjusted.
Flags: needinfo?(tuexen)
(In reply to Byron Campen [:bwc] from comment #40)
> Is there some way we could get the sctp stack to schedule an immediate timer
> that comes around as a callback on the timer thread? We could schedule this
> timer in DestroyOnSTS (and move the deregister there too), and once that
> timer pops we can dispose of the DataChannelConnection.
> 
> If there is not currently a way to do this, how hard would it be to make one?

It is just software, so it can be changed.
However, the code has evolved and the base you are using is over a year old.
I would have to look at that code, and adding such a timer without a reproducible
test case to make sure we are actually working around the issue is something I would like
to avoid. I guess the simplest workaround for the crashes I have seen is to disable the
dead peer detection at the SCTP layer for now. Is there a reason not to go down this path?
If Randell feels safe adding his patch, it can be added in addition to setting the maximum
number of retransmits to 0xffffffff.
My only concern with the "retransmit forever" case is this:
> However... while the stacks are all from association timeout, if we do this we may retransmit
> "forever", and in the case where we timed out, we might try to send a packet retransmit instead, and
> I'm concerned that this would use the bad ptr - unless the normal retransmit is blocked by the
> deregister/close.

If close/deregister_address will block *that* timer (the retransmit) - then it could help.
All stack traces I have looked at showed that sctp_abort_an_association() was called from sctp_threshold_management. So it was always happening when SCTP was destroying the association,
not during all the retransmits before that.
That is why I believe it is related to the cleanup when handling dead peer detection.
I have not seen a trace where a normal retransmission triggered the crash. Have you?
OK. By looking at the code again, this might be a fix for the issue:

Move the "notify the ulp" code in
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctputil.c#4207
below the "notify the peer" code in
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctputil.c#4211

This would mean we send out the ABORT (using your object pointer) before notifying
the C++ code, which might free the pointer.

For consistency, we would need a similar swap around
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctputil.c#4066

However, without a test, we can't be sure. But it might be worth a try... At least
I would try to test it. What do you think?
Flags: needinfo?(rjesup)
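The proposed reordering can be modeled abstractly (toy C++, not the actual sctputil.c code) to show why sending the ABORT first closes the use-after-free window:

```cpp
#include <string>
#include <vector>

// sctp_abort_an_association() currently notifies the upper layer (which
// may end up freeing the DataChannelConnection) and only then sends the
// ABORT through that object's output callback. Swapping the two steps
// uses the pointer while it is still guaranteed valid.
inline std::vector<std::string> AbortAssociation(bool swapped) {
  std::vector<std::string> order;
  auto notify_ulp = [&] { order.push_back("notify-ulp"); };  // may free ptr
  auto send_abort = [&] { order.push_back("send-abort"); };  // uses ptr
  if (swapped) {
    send_abort();  // proposed order: touch the pointer first...
    notify_ulp();  // ...then let the upper layer start teardown
  } else {
    notify_ulp();
    send_abort();  // crash window: pointer may already be freed here
  }
  return order;
}
```

This is just the ordering argument, not a proof the fix is sufficient; as Michael says, without a reproducible test it remains a plausible candidate.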
It makes sense in any case, since when you notify the owner of the association failure, it may break the channel between usrsctp and the network.  (This becomes an issue once sctp travels in a tunnel via the callbacks, as we use it.)  As to whether this could cause the problem... maybe.  We will fire close events to JS, and it may well tear down the peerconnection in response.
Flags: needinfo?(rjesup)
I think it's worth landing this in any case.  I forget the details of the analysis, but perhaps this causes the race with a JS app that responds to onclose by tearing down the peerconnection.
Attachment #8783983 - Flags: review?(docfaraday)
Comment on attachment 8783983 [details] [diff] [review]
Swap order of notifications on association failure

Review of attachment 8783983 [details] [diff] [review]:
-----------------------------------------------------------------

This might do the trick? Worth a shot at any rate.
Attachment #8783983 - Flags: review?(docfaraday) → review+
Attachment #8782954 - Flags: review?(docfaraday)
Comment on attachment 8783983 [details] [diff] [review]
Swap order of notifications on association failure

[Security approval request comment]
How easily could an exploit be constructed based on the patch? Very hard

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? No - at most it is only of interest if it lands on a sec bug and is uplifted.  Also, patch has landed upstream - we could land it under a generic "cherry-pick mod from upstream" bug if we care.

Which older supported branches are affected by this flaw? All, though it appears as if other changes recently exposed the flaw (timing/references/etc).

If not all supported branches, which bug introduced the flaw? N/A

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? Trivial

How likely is this patch to cause regressions; how much testing does it need? Very low chance of regression; swaps the order of notifications.
Attachment #8783983 - Flags: sec-approval?
Attachment #8783983 - Flags: approval-mozilla-beta?
Attachment #8783983 - Flags: approval-mozilla-aurora?
Likely affects 48 and ESR45, but they may not have changes elsewhere that expose the issue.  They might still be exploitable though.
Comment on attachment 8783983 [details] [diff] [review]
Swap order of notifications on association failure

Approvals given. This is a tiny patch. We should take this on ESR45 as well.
Attachment #8783983 - Flags: sec-approval?
Attachment #8783983 - Flags: sec-approval+
Attachment #8783983 - Flags: approval-mozilla-beta?
Attachment #8783983 - Flags: approval-mozilla-beta+
Attachment #8783983 - Flags: approval-mozilla-aurora?
Attachment #8783983 - Flags: approval-mozilla-aurora+
Comment on attachment 8783983 [details] [diff] [review]
Swap order of notifications on association failure

[Approval Request Comment]
User impact if declined: sec-critical remains unfixed on this branch (fix already landed on other branches)

Fix Landed on Version: 51 (uplift to 49)

Risk to taking this patch (and alternatives if risky): very low risk

String or UUID changes made by this patch: none

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #8783983 - Flags: approval-mozilla-esr45?
Note: We might hold landing it on ESR45 until we see some crashstats data that confirms the fix in the field.
https://hg.mozilla.org/mozilla-central/rev/fab696bc92fd
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
sec critical, taking it to esr 45. Should be in 45.4
Attachment #8783983 - Flags: approval-mozilla-esr45? → approval-mozilla-esr45+
Group: media-core-security → core-security-release
Whiteboard: [adv-main49+][adv-esr45.4+]
7 crashes with this signature in Socorro in the last week, the most recent affected build being 49.0b6.
https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozilla%3A%3ADataChannelConnection%3A%3ASctpDtlsOutput

Closing this issue.
Blocks: 1311380
Group: core-security-release