Stochastic functions, i.e. `Variable.reinforce()`, were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice, users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.

We introduce the `torch.distributions` package to replace stochastic functions.

Your previous code typically looked like this:

```python
probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
action.reinforce(reward)
action.backward()
```

This is the new equivalent code:

```python
probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
```
Some loss functions can now compute per-sample losses in a mini-batch: the new `reduce=False` argument makes them return individual losses for each sample in the mini-batch instead of a single reduced value. For example:

```python
loss = nn.CrossEntropyLoss(..., reduce=False)
```

Currently supported losses are `MSELoss`, `NLLLoss`, `NLLLoss2d`, `KLDivLoss`, `CrossEntropyLoss`, `SmoothL1Loss`, and `L1Loss`.
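As a minimal sketch of how an unreduced loss can be used (the batch size, shapes, and labels below are made up for illustration), the loss now returns one value per sample, which you can weight or inspect before reducing it yourself:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# Hypothetical mini-batch: 4 samples, 3 classes.
logits = Variable(torch.randn(4, 3), requires_grad=True)
targets = Variable(torch.LongTensor([0, 2, 1, 1]))

criterion = nn.CrossEntropyLoss(reduce=False)
per_sample = criterion(logits, targets)  # one loss value per sample, shape (4,)
per_sample.mean().backward()             # reduce manually, e.g. after re-weighting
```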
We built a low-level profiler to help you identify bottlenecks in your models.

Let us start with an example:

```python
>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
... print(prof)
-----------------------------------  ----------  ---------
Name                                   CPU time  CUDA time
-----------------------------------  ----------  ---------
PowConstant                           142.036us    0.000us
N5torch8autograd9GraphRootE            63.524us    0.000us
PowConstantBackward                   184.228us    0.000us
MulConstant                            50.288us    0.000us
PowConstant                            28.439us    0.000us
Mul                                    20.154us    0.000us
N5torch8autograd14AccumulateGradE      13.790us    0.000us
N5torch8autograd5CloneE                 4.088us    0.000us
```

The profiler works for both CPU and CUDA models. For CUDA models, you have to run your python program with a special `nvprof` prefix. For example:

```
nvprof --profile-from-start off -o trace_name.prof -- python <your arguments>

# in python
>>> with torch.cuda.profiler.profile():
...     model(x)  # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)
```

Then, you can load `trace_name.prof` in PyTorch and print a summary profile report:

```python
>>> prof = torch.autograd.profiler.load_nvprof('trace_name.prof')
>>> print(prof)
```
Read the `torch.autograd.profiler` documentation for additional details.
Added higher-order gradients support for more layers, including upsampling in the `nearest` and `linear` modes.
- padding_mode="border"
expects a grid in the range of
- grid_sample
, and if the values are out of these bounds, padding with the value
- [-1, 1]
is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
- 0.0
Introducing `nn.utils.parameters_to_vector` and `nn.utils.vector_to_parameters`:

- `parameters_to_vector` takes `net.parameters()` and returns a 1D vector that contains all the parameters.
- `vector_to_parameters` takes a vector of flattened parameters and copies the values over to a network's parameters.
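As a small usage sketch (the tiny `nn.Linear` below is just for illustration), the two utilities let you pull all parameters out as one flat vector, modify it, and write it back:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

net = nn.Linear(3, 2)                          # tiny illustrative network (8 parameters)
vec = parameters_to_vector(net.parameters())   # 1D vector holding all parameters
vec = vec * 0.5                                # modify the flat vector
vector_to_parameters(vec, net.parameters())    # copy the modified values back into net
```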
`AdaptivePool*d` layers now let you leave certain output dimensions unspecified and infer them at runtime:

```python
# target output size of 10x7
m = nn.AdaptiveMaxPool2d((None, 7))
```
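For instance (the input size below is made up), the dimension given as `None` simply keeps the corresponding input size:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

m = nn.AdaptiveMaxPool2d((None, 7))            # first output dimension inferred at runtime
out = m(Variable(torch.randn(1, 64, 10, 9)))
print(out.size())                              # (1, 64, 10, 7): the None dimension stays 10
```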
Introduced `torch.erf` and `torch.erfinv` that compute the error function and the inverse error function of each element in the Tensor.
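A quick round-trip (input values made up) shows the two are inverses of one another:

```python
import torch

x = torch.Tensor([-0.5, 0.0, 0.5])
y = torch.erf(x)            # elementwise error function
print(torch.erfinv(y))      # recovers approximately [-0.5, 0.0, 0.5]
```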
Added `Tensor.put_` and `torch.take`, similar to `numpy.put` and `numpy.take`. Differences from the `numpy` equivalents:

- `numpy.take` has an optional `axis` argument, which behaves like `index_select`. This `axis` argument is not yet present.
- `numpy.put` repeats the values if necessary to make them as long as the indices. This behavior is not yet replicated.
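A brief sketch of the linear-indexing behaviour (tensor values made up): the indices address the tensor as if it were flattened, and `take`'s output has the same shape as the indices:

```python
import torch

t = torch.Tensor([[1, 2, 3],
                  [4, 5, 6]])
idx = torch.LongTensor([0, 4])           # linear (flattened) indices

print(torch.take(t, idx))                # [1, 5]
t.put_(idx, torch.Tensor([10, 50]))      # write 10 and 50 at flattened positions 0 and 4
print(t)                                 # [[10, 2, 3], [4, 50, 6]]
```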
Added `zeros` and `zeros_like` for sparse Tensors.

Casting 1-element Tensors to Python scalars now works, e.g. `int(torch.Tensor([5]))`.
Added `torch.cuda.get_device_name` and `torch.cuda.get_device_capability`, which do what the names say. Example:

```python
>>> torch.cuda.get_device_name(0)
'Quadro GP100'
>>> torch.cuda.get_device_capability(0)
(6, 0)
```
If one sets `torch.backends.cudnn.deterministic = True`, then the CuDNN convolutions use deterministic algorithms.

`torch.cuda.get_rng_state_all` and `torch.cuda.set_rng_state_all` are introduced to let you save / load the state of the random number generator over all GPUs at once.

`torch.cuda.empty_cache()` frees the cached memory blocks in PyTorch's caching allocator. This is useful when having long-running ipython notebooks while sharing the GPU with other processes.
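A minimal save/restore sketch for the per-GPU generator states (requires a CUDA build):

```python
import torch

states = torch.cuda.get_rng_state_all()   # one RNG state per visible GPU
# ... run code that consumes random numbers on the GPUs ...
torch.cuda.set_rng_state_all(states)      # restore every GPU's generator at once
```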
`softmax` and `log_softmax` now take a `dim` argument that specifies the dimension in which slices are taken for the softmax operation. `dim` allows negative dimensions as well (`dim = -1` will be the last dimension).
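For example (shapes made up), slices along the chosen dimension each sum to one:

```python
import torch
import torch.nn.functional as F
from torch.autograd import Variable

x = Variable(torch.randn(2, 5))
p = F.softmax(x, dim=-1)     # softmax over the last dimension
print(p.sum(1))              # each row sums to 1
```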
`torch.potrf` (Cholesky decomposition) is now differentiable and defined on `Variable`.

Removed all instances of `device_id` and replaced it with `device`, to make things consistent.
`torch.autograd.grad` now allows you to specify inputs that are unused in the autograd graph if you use `allow_unused=True`. This gets useful when using `torch.autograd.grad` in large graphs with lists of inputs / outputs. For example:

```python
x, y = Variable(...), Variable(...)
torch.autograd.grad(x * 2, [x, y])                     # errors
torch.autograd.grad(x * 2, [x, y], allow_unused=True)  # works
```
`pad_packed_sequence` now allows a `padding_value` argument that can be used instead of zero-padding.
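A short sketch (sizes made up): pack two sequences of different lengths, then unpack while filling the padded positions with a custom value:

```python
import torch
from torch.autograd import Variable
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of lengths 3 and 2, laid out as (seq_len, batch, features).
padded = Variable(torch.randn(3, 2, 4))
packed = pack_padded_sequence(padded, lengths=[3, 2])

output, lengths = pad_packed_sequence(packed, padding_value=-1.0)
print(output[2, 1])   # the padded step of the shorter sequence is filled with -1
```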
`Dataset` now has a `+` operator (which uses `ConcatDataset`). You can do something like `MNIST(...) + FashionMNIST(...)` for example, and you will get a concatenated dataset containing samples from both.
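As a self-contained illustration with toy tensors (the sizes are arbitrary):

```python
import torch
from torch.utils.data import TensorDataset

a = TensorDataset(torch.randn(5, 3), torch.zeros(5))
b = TensorDataset(torch.randn(7, 3), torch.ones(7))

combined = a + b        # a ConcatDataset under the hood
print(len(combined))    # 12: samples from both datasets
```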
`torch.distributed.recv` allows Tensors to be received from any sender (hence, `src` is optional). `recv` returns the rank of the sender.
Added `zero_()` to `Variable`.

`Variable.shape` returns the size of the Tensor (now made consistent with Tensor).

`torch.version.cuda` specifies the CUDA version that PyTorch was compiled with.

Added a missing `random_` function for CUDA.

`torch.load` and `torch.save` can now take a `pathlib.Path` object, which is a standard Python3 typed filepath object.
When loading a `state_dict` into another model (for example to fine-tune a pre-trained network), `load_state_dict` was strict on matching the key names of the parameters. Now we provide a `strict=False` option to `load_state_dict` where it only loads in parameters where the keys match, and ignores the other parameter keys.
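A minimal sketch of the pattern (the toy modules below are made up): the target model has an extra layer that is not in the checkpoint, and `strict=False` loads only the matching keys.

```python
import torch.nn as nn

pretrained = nn.Sequential(nn.Linear(10, 20))
model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 2))  # extra head not in the checkpoint

# Only the keys that match ("0.weight", "0.bias") are loaded; the rest keep their init.
model.load_state_dict(pretrained.state_dict(), strict=False)
```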
Added `nn.functional.embedding_bag` that is equivalent to `nn.EmbeddingBag`.
Performance improvements:

- The overhead of `torch` functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using our ATen library. This speeds-up models that are very small, such as small LSTMs and other common models seen in NLP.
- `nn.Embedding`'s renorm option is much faster on the GPU. For embedding dimensions of `100k x 128` and a batch size of 1024, it is 33x faster.
- Added dedicated CUDA kernels for group convolutions where `groups == nInputPlane` (depthwise convolution). Speedups range from 5x to 1000x for tested layer sizes; see the benchmark tables for more details.
- Fixed `optim.SGD`'s memory usage for sparse gradients (e.g. `nn.Embedding(..., sparse=True)`), reducing the usage on a user-provided test script by 10x.
- `torch.nn.utils.weight_norm` over the right-most dimensions is faster.
- The backward of `torch.norm` is sped up by ~1.5x.
- Improved the performance of `pack_padded_sequence`.
- Added a single-argument version of `torch.arange`, for example `torch.arange(10)`.
DLPack Tensors are cross-framework Tensor formats. We now have `torch.utils.to_dlpack(x)` and `torch.utils.from_dlpack(x)` to convert between DLPack and torch Tensor formats. The conversion has zero memory copy and hence is very efficient.
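A minimal round-trip sketch (assuming the helpers are exposed from the `torch.utils.dlpack` module, as in current releases):

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

x = torch.randn(3, 3)
capsule = to_dlpack(x)      # DLPack capsule sharing x's memory (no copy)
y = from_dlpack(capsule)    # back to a torch Tensor, still sharing the same memory

y[0, 0] = 42.0
print(x[0, 0])              # 42.0, since both views see the change
```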
ONNX is a common model interchange format that can currently be executed in Caffe2, CoreML, CNTK, MXNet, and Tensorflow. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.
If you are using a PyTorch binary with an insufficient CUDA version, a warning is printed to the user.

Fixed incoherent error messages in `load_state_dict`.
Bug fixes:

- Fixed CUDA lazy initialization so that it does not trigger on calls to `torch.manual_seed` (instead, the calls are queued and run when CUDA is initialized).
- If `x` is 2D, `x[[0, 3],]` was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do `x[[0, 3]]`.
- `x.sort(descending=True)` used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
- Constructing a Tensor from a NumPy array of a different dtype, such as `torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))` or `torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32))`, will now work by making a copy.
- `ones_like` and `zeros_like` now create Tensors on the same device as the original Tensor.
- `torch.multinomial` on the CPU would reshape the `prob_dist` input in-place. Fixed this to make sure the `prob_dist` input's shape is unchanged after the call to `multinomial`.
- `expand` and `expand_as` allow expanding an empty Tensor to another empty Tensor.
- When `[..., None, ...]` was given (i.e. newaxis placement in indexing was specified), PyTorch had different behavior from NumPy. This is made consistent with NumPy in all cases.
- `torch.HalfTensor` supports `numpy()` and `torch.from_numpy`.
- Added additional size checking for `torch.scatter`.
- Fixed `torch.tril` and `torch.triu` on the GPU for storage-offset Tensors (would return an incorrect result).
- Fixed kwargs parsing in `torch.topk`.
- Fixed `random_` on the CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor.
- Fixed a `ZeroDivisionError: float division by zero` when printing certain Tensors.
- `torch.gels` when `m > n` had a truncation bug on the CPU and returned incorrect results. Fixed.
- `any` and `all` work on empty Tensors on the CPU (previously errored out).
- Fixed `symeig` on CUDA for large matrices. The bug is that not enough space was being allocated for the workspace, causing some undefined behavior.
- Improved the numerical stability of `torch.var` and `torch.std` by using Welford's algorithm.
- The random number generator returned `uniform` samples with inconsistent bounds (inconsistency in the CPU implementation and running into a cublas bug). Now, all `uniform` sampled numbers will return within the bounds `[0, 1)`, across all types and devices.
- Fixed `torch.svd` to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings).
- Allow an empty index Tensor for `index_select` (instead of erroring out).
- Previously, when `eigenvector=False`, `symeig` returned some unknown value for the eigenvectors. Now we zero them out.
- Fixed `.type()` not converting the indices tensor for sparse Tensors.
- Fixed a bug in `type()` around non-default GPU input.
- When `torch.norm` returned `0.0`, the gradient was `NaN`. We now use the subgradient at `0.0`, so the gradient is `0.0`.
- `torch.prod`'s backward was failing on the GPU due to a type error; fixed.
- `torch.optim.lr_scheduler` is now imported by default.
- When `register_buffer("foo", ...)` is called and `self.foo` already exists, then instead of silently failing, a `KeyError` is now raised.
- Fixed loading of older checkpoints of RNN/LSTM modules which were missing `_data_ptrs` attributes.
- `nn.Embedding` had a hard error when using the `max_norm` option. This is fixed now.
- When using the `max_norm` option, the passed-in indices are written upon (by the underlying implementation). To fix this, pass a clone of the indices to the renorm kernel.
- `F.affine_grid` now can take non-contiguous inputs.
- If BatchNorm has only `1` value per channel in total, raise an error in training mode.
- If LogSoftmax has only 1 element, `-inf` was returned. Now this correctly returns `0.0`.
- Prevent numerical issues with `poisson_nll_loss` when `log_input=False` by adding a small epsilon.
- Allow kwargs-only inputs to DataParallel. This used to fail: `n = nn.DataParallel(Net()); out = n(input=i)`.
- Allow some parameters to be `requires_grad=False` in DistributedDataParallel.
- Fixed a bug in DistributedDataParallel when the model has no `buffers` (previously raised an incoherent error).
- Fixed `__get_state__` to be functional in `DistributedDataParallel` (was returning nothing).
- `model.zoo.load_url` now first attempts to use the `requests` library if available, and then falls back to `urllib`.
- Fixed an error when `default_collate` is passed a collection of `numpy.str_`.
Source: https://juejin.im/entry/5a27c1166fb9a04527257811