linux/drivers/thermal/thermal_core.c

1805 lines
46 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0
/*
* thermal.c - Generic Thermal Management Sysfs support.
*
* Copyright (C) 2008 Intel Corp
* Copyright (C) 2008 Zhang Rui <rui.zhang@intel.com>
* Copyright (C) 2008 Sujith Thomas <sujith.thomas@intel.com>
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/device.h>
#include <linux/err.h>
#include <linux/export.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 08:04:11 +00:00
#include <linux/slab.h>
#include <linux/kdev_t.h>
#include <linux/idr.h>
#include <linux/list_sort.h>
#include <linux/thermal.h>
#include <linux/reboot.h>
#include <linux/string.h>
#include <linux/of.h>
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
#include <linux/suspend.h>
#define CREATE_TRACE_POINTS
#include "thermal_trace.h"
#include "thermal_core.h"
#include "thermal_hwmon.h"
static DEFINE_IDA(thermal_tz_ida);
static DEFINE_IDA(thermal_cdev_ida);
static LIST_HEAD(thermal_tz_list);
static LIST_HEAD(thermal_cdev_list);
static LIST_HEAD(thermal_governor_list);
static DEFINE_MUTEX(thermal_list_lock);
static DEFINE_MUTEX(thermal_governor_lock);
static struct thermal_governor *def_governor;
/*
* Governor section: set of functions to handle thermal governors
*
* Functions to help in the life cycle of thermal governors within
* the thermal core and by the thermal governor code.
*/
static struct thermal_governor *__find_governor(const char *name)
{
struct thermal_governor *pos;
if (!name || !name[0])
return def_governor;
list_for_each_entry(pos, &thermal_governor_list, governor_list)
if (!strncasecmp(name, pos->name, THERMAL_NAME_LENGTH))
return pos;
return NULL;
}
/**
* bind_previous_governor() - bind the previous governor of the thermal zone
* @tz: a valid pointer to a struct thermal_zone_device
* @failed_gov_name: the name of the governor that failed to register
*
* Register the previous governor of the thermal zone after a new
* governor has failed to be bound.
*/
static void bind_previous_governor(struct thermal_zone_device *tz,
const char *failed_gov_name)
{
if (tz->governor && tz->governor->bind_to_tz) {
if (tz->governor->bind_to_tz(tz)) {
dev_err(&tz->device,
"governor %s failed to bind and the previous one (%s) failed to bind again, thermal zone %s has no governor\n",
failed_gov_name, tz->governor->name, tz->type);
tz->governor = NULL;
}
}
}
/**
* thermal_set_governor() - Switch to another governor
* @tz: a valid pointer to a struct thermal_zone_device
* @new_gov: pointer to the new governor
*
* Change the governor of thermal zone @tz.
*
* Return: 0 on success, an error if the new governor's bind_to_tz() failed.
*/
static int thermal_set_governor(struct thermal_zone_device *tz,
struct thermal_governor *new_gov)
{
int ret = 0;
if (tz->governor && tz->governor->unbind_from_tz)
tz->governor->unbind_from_tz(tz);
if (new_gov && new_gov->bind_to_tz) {
ret = new_gov->bind_to_tz(tz);
if (ret) {
bind_previous_governor(tz, new_gov->name);
return ret;
}
}
tz->governor = new_gov;
return ret;
}
int thermal_register_governor(struct thermal_governor *governor)
{
int err;
const char *name;
struct thermal_zone_device *pos;
if (!governor)
return -EINVAL;
mutex_lock(&thermal_governor_lock);
err = -EBUSY;
if (!__find_governor(governor->name)) {
bool match_default;
err = 0;
list_add(&governor->governor_list, &thermal_governor_list);
match_default = !strncmp(governor->name,
DEFAULT_THERMAL_GOVERNOR,
THERMAL_NAME_LENGTH);
if (!def_governor && match_default)
def_governor = governor;
}
mutex_lock(&thermal_list_lock);
list_for_each_entry(pos, &thermal_tz_list, node) {
/*
* only thermal zones with specified tz->tzp->governor_name
* may run with tz->govenor unset
*/
if (pos->governor)
continue;
name = pos->tzp->governor_name;
if (!strncasecmp(name, governor->name, THERMAL_NAME_LENGTH)) {
int ret;
ret = thermal_set_governor(pos, governor);
if (ret)
dev_err(&pos->device,
"Failed to set governor %s for thermal zone %s: %d\n",
governor->name, pos->type, ret);
}
}
mutex_unlock(&thermal_list_lock);
mutex_unlock(&thermal_governor_lock);
return err;
}
void thermal_unregister_governor(struct thermal_governor *governor)
{
struct thermal_zone_device *pos;
if (!governor)
return;
mutex_lock(&thermal_governor_lock);
if (!__find_governor(governor->name))
goto exit;
mutex_lock(&thermal_list_lock);
list_for_each_entry(pos, &thermal_tz_list, node) {
if (!strncasecmp(pos->governor->name, governor->name,
THERMAL_NAME_LENGTH))
thermal_set_governor(pos, NULL);
}
mutex_unlock(&thermal_list_lock);
list_del(&governor->governor_list);
exit:
mutex_unlock(&thermal_governor_lock);
}
int thermal_zone_device_set_policy(struct thermal_zone_device *tz,
char *policy)
{
struct thermal_governor *gov;
int ret = -EINVAL;
mutex_lock(&thermal_governor_lock);
mutex_lock(&tz->lock);
gov = __find_governor(strim(policy));
if (!gov)
goto exit;
ret = thermal_set_governor(tz, gov);
exit:
mutex_unlock(&tz->lock);
mutex_unlock(&thermal_governor_lock);
thermal_notify_tz_gov_change(tz, policy);
return ret;
}
int thermal_build_list_of_policies(char *buf)
{
struct thermal_governor *pos;
ssize_t count = 0;
mutex_lock(&thermal_governor_lock);
list_for_each_entry(pos, &thermal_governor_list, governor_list) {
count += sysfs_emit_at(buf, count, "%s ", pos->name);
}
count += sysfs_emit_at(buf, count, "\n");
mutex_unlock(&thermal_governor_lock);
return count;
}
static void __init thermal_unregister_governors(void)
{
struct thermal_governor **governor;
for_each_governor_table(governor)
thermal_unregister_governor(*governor);
}
static int __init thermal_register_governors(void)
{
int ret = 0;
struct thermal_governor **governor;
for_each_governor_table(governor) {
ret = thermal_register_governor(*governor);
if (ret) {
pr_err("Failed to register governor: '%s'",
(*governor)->name);
break;
}
pr_info("Registered thermal governor '%s'",
(*governor)->name);
}
if (ret) {
struct thermal_governor **gov;
for_each_governor_table(gov) {
if (gov == governor)
break;
thermal_unregister_governor(*gov);
}
}
return ret;
}
static int __thermal_zone_device_set_mode(struct thermal_zone_device *tz,
enum thermal_device_mode mode)
{
if (tz->ops.change_mode) {
int ret;
ret = tz->ops.change_mode(tz, mode);
if (ret)
return ret;
}
tz->mode = mode;
return 0;
}
thermal: core: Back off when polling thermal zones on errors Commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") introduced a polling mechanism by which the thermal core attampts to get a valid temperature value for thermal zones where the .get_temp() callback returns errors to start with (for example, due to initialization ordering woes). However, this polling is carried out periodically ad infinitum and every iteration of it causes a message to be printed to the kernel log which means a lot of log noise on systems where there are thermal zones that never get ready for some reason. It is also not really useful to continuously poll thermal zones that never respond. To address this, modify the thermal core to increase the delay between consecutive thermal zone temperature checks after every check that fails until it reaches a certain maximum value. At that point, the thermal zone in question will be disabled, but user space will be able to reenable it if it believes that the failure is transient. Also change the code to print messages regarding failed temperature checks to the kernel log only twice, once when the thermal zone's .get_temp() callback returns an error for the first time and once when disabling the given thermal zone. In addition, a dev_crit() message will be printed at that point if the given thermal zone contains a critical trip point to notify the system operator about the situation. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Link: https://lore.kernel.org/linux-acpi/CAGnHSE=RyPK++UG0-wAtVKgeJxe0uzFYgLxm+RUOKKoQquW=Ow@mail.gmail.com/ Reported-by: Tom Yan <tom.ty89@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2962033.e9J7NaK4W3@rjwysocki.net
2024-07-18 19:01:14 +00:00
static void thermal_zone_broken_disable(struct thermal_zone_device *tz)
{
struct thermal_trip_desc *td;
dev_err(&tz->device, "Unable to get temperature, disabling!\n");
/*
* This function only runs for enabled thermal zones, so no need to
* check for the current mode.
*/
__thermal_zone_device_set_mode(tz, THERMAL_DEVICE_DISABLED);
thermal_notify_tz_disable(tz);
for_each_trip_desc(tz, td) {
if (td->trip.type == THERMAL_TRIP_CRITICAL &&
td->trip.temperature > THERMAL_TEMP_INVALID) {
dev_crit(&tz->device,
"Disabled thermal zone with critical trip point\n");
return;
}
}
}
/*
* Zone update section: main control loop applied to each zone while monitoring
* in polling mode. The monitoring is done using a workqueue.
* Same update may be done on a zone by calling thermal_zone_device_update().
*
* An update means:
* - Non-critical trips will invoke the governor responsible for that zone;
* - Hot trips will produce a notification to userspace;
* - Critical trip point will cause a system shutdown.
*/
static void thermal_zone_device_set_polling(struct thermal_zone_device *tz,
unsigned long delay)
{
if (delay > HZ)
delay = round_jiffies_relative(delay);
mod_delayed_work(system_freezable_power_efficient_wq, &tz->poll_queue, delay);
}
thermal: core: Back off when polling thermal zones on errors Commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") introduced a polling mechanism by which the thermal core attampts to get a valid temperature value for thermal zones where the .get_temp() callback returns errors to start with (for example, due to initialization ordering woes). However, this polling is carried out periodically ad infinitum and every iteration of it causes a message to be printed to the kernel log which means a lot of log noise on systems where there are thermal zones that never get ready for some reason. It is also not really useful to continuously poll thermal zones that never respond. To address this, modify the thermal core to increase the delay between consecutive thermal zone temperature checks after every check that fails until it reaches a certain maximum value. At that point, the thermal zone in question will be disabled, but user space will be able to reenable it if it believes that the failure is transient. Also change the code to print messages regarding failed temperature checks to the kernel log only twice, once when the thermal zone's .get_temp() callback returns an error for the first time and once when disabling the given thermal zone. In addition, a dev_crit() message will be printed at that point if the given thermal zone contains a critical trip point to notify the system operator about the situation. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Link: https://lore.kernel.org/linux-acpi/CAGnHSE=RyPK++UG0-wAtVKgeJxe0uzFYgLxm+RUOKKoQquW=Ow@mail.gmail.com/ Reported-by: Tom Yan <tom.ty89@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2962033.e9J7NaK4W3@rjwysocki.net
2024-07-18 19:01:14 +00:00
static void thermal_zone_recheck(struct thermal_zone_device *tz, int error)
{
if (error == -EAGAIN) {
thermal_zone_device_set_polling(tz, THERMAL_RECHECK_DELAY);
return;
}
/*
* Print the message once to reduce log noise. It will be followed by
* another one if the temperature cannot be determined after multiple
* attempts.
*/
if (tz->recheck_delay_jiffies == THERMAL_RECHECK_DELAY)
dev_info(&tz->device, "Temperature check failed (%d)\n", error);
thermal_zone_device_set_polling(tz, tz->recheck_delay_jiffies);
tz->recheck_delay_jiffies += max(tz->recheck_delay_jiffies >> 1, 1ULL);
if (tz->recheck_delay_jiffies > THERMAL_MAX_RECHECK_DELAY) {
thermal_zone_broken_disable(tz);
/*
* Restore the original recheck delay value to allow the thermal
* zone to try to recover when it is reenabled by user space.
*/
tz->recheck_delay_jiffies = THERMAL_RECHECK_DELAY;
}
}
static void monitor_thermal_zone(struct thermal_zone_device *tz)
{
if (tz->passive > 0 && tz->passive_delay_jiffies)
thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
else if (tz->polling_delay_jiffies)
thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
}
static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
{
if (tz->governor)
return tz->governor;
return def_governor;
}
void thermal_governor_update_tz(struct thermal_zone_device *tz,
enum thermal_notify_event reason)
{
if (!tz->governor || !tz->governor->update_tz)
return;
tz->governor->update_tz(tz, reason);
}
static void thermal_zone_device_halt(struct thermal_zone_device *tz, bool shutdown)
{
/*
* poweroff_delay_ms must be a carefully profiled positive value.
* Its a must for forced_emergency_poweroff_work to be scheduled.
*/
int poweroff_delay_ms = CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS;
const char *msg = "Temperature too high";
dev_emerg(&tz->device, "%s: critical temperature reached\n", tz->type);
if (shutdown)
hw_protection_shutdown(msg, poweroff_delay_ms);
else
hw_protection_reboot(msg, poweroff_delay_ms);
}
void thermal_zone_device_critical(struct thermal_zone_device *tz)
{
thermal_zone_device_halt(tz, true);
}
EXPORT_SYMBOL(thermal_zone_device_critical);
void thermal_zone_device_critical_reboot(struct thermal_zone_device *tz)
{
thermal_zone_device_halt(tz, false);
}
static void handle_critical_trips(struct thermal_zone_device *tz,
const struct thermal_trip *trip)
{
trace_thermal_zone_trip(tz, thermal_zone_trip_id(tz, trip), trip->type);
if (trip->type == THERMAL_TRIP_CRITICAL)
tz->ops.critical(tz);
else if (tz->ops.hot)
tz->ops.hot(tz);
}
static void handle_thermal_trip(struct thermal_zone_device *tz,
struct thermal_trip_desc *td,
struct list_head *way_up_list,
struct list_head *way_down_list)
{
const struct thermal_trip *trip = &td->trip;
int old_threshold;
if (trip->temperature == THERMAL_TEMP_INVALID)
return;
/*
* If the trip temperature or hysteresis has been updated recently,
* the threshold needs to be computed again using the new values.
* However, its initial value still reflects the old ones and that
* is what needs to be compared with the previous zone temperature
* to decide which action to take.
*/
old_threshold = td->threshold;
td->threshold = trip->temperature;
if (tz->last_temperature >= old_threshold &&
thermal: core: Allow thermal zones to tell the core to ignore them The iwlwifi wireless driver registers a thermal zone that is only needed when the network interface handled by it is up and it wants that thermal zone to be effectively ignored by the core otherwise. Before commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") that could be achieved by returning an error code from the thermal zone's .get_temp() callback because the core did not really handle errors returned by it almost at all. However, commit a8a261774466 made the core attempt to recover from the situation in which the temperature of a thermal zone cannot be determined due to errors returned by its .get_temp() and is always invalid from the core's perspective. That was done because there are thermal zones in which .get_temp() returns errors to start with due to some difficulties related to the initialization ordering, but then it will start to produce valid temperature values at one point. Unfortunately, the simple approach taken by commit a8a261774466, which is to poll the thermal zone periodically until its .get_temp() callback starts to return valid temperature values, is at odds with the special thermal zone in iwlwifi in which .get_temp() may always return an error because its network interface may always be down. If that happens, every attempt to invoke the thermal zone's .get_temp() callback resulting in an error causes the thermal core to print a dev_warn() message to the kernel log which is super-noisy. To address this problem, make the core handle the case in which .get_temp() returns 0, but the temperature value returned by it is not actually valid, in a special way. Namely, make the core completely ignore the invalid temperature value coming from .get_temp() in that case, which requires folding in update_temperature() into its caller and a few related changes. On the iwlwifi side, modify iwl_mvm_tzone_get_temp() to return 0 and put THERMAL_TEMP_INVALID into the temperature return memory location instead of returning an error when the firmware is not running or it is not of the right type. Also, to clearly separate the handling of invalid temperature values from the thermal zone initialization, introduce a special THERMAL_TEMP_INIT value specifically for the latter purpose. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Closes: https://lore.kernel.org/linux-pm/20240715044527.GA1544@sol.localdomain/ Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201761 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: 6.10+ <stable@vger.kernel.org> # 6.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/4950004.31r3eYUQgx@rjwysocki.net [ rjw: Rebased on top of the current mainline ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-17 19:45:02 +00:00
tz->last_temperature != THERMAL_TEMP_INIT) {
thermal: core: Add trip thresholds for trip crossing detection The trip crossing detection in handle_thermal_trip() does not work correctly in the cases when a trip point is crossed on the way up and then the zone temperature stays above its low temperature (that is, its temperature decreased by its hysteresis). The trip temperature may be passed by the zone temperature subsequently in that case, even multiple times, but that does not count as the trip crossing as long as the zone temperature does not fall below the trip's low temperature or, in other words, until the trip is crossed on the way down. |-----------low--------high------------| |<--------->| | hyst | | | | -|--> crossed on the way up | <---|-- crossed on the way down However, handle_thermal_trip() will invoke thermal_notify_tz_trip_up() every time the trip temperature is passed by the zone temperature on the way up regardless of whether or not the trip has been crossed on the way down yet. Moreover, it will not call thermal_notify_tz_trip_down() if the last zone temperature was between the trip's temperature and its low temperature, so some "trip crossed on the way down" events may not be reported. To address this issue, introduce trip thresholds equal to either the temperature of the given trip, or its low temperature, such that if the trip's threshold is passed by the zone temperature on the way up, its value will be set to the trip's low temperature and thermal_notify_tz_trip_up() will be called, and if the trip's threshold is passed by the zone temperature on the way down, its value will be set to the trip's temperature (high) and thermal_notify_tz_trip_down() will be called. Accordingly, if the threshold is passed on the way up, it cannot be passed on the way up again until its passed on the way down and if it is passed on the way down, it cannot be passed on the way down again until it is passed on the way up which guarantees correct triggering of trip crossing notifications. If the last temperature of the zone is invalid, the trip's threshold will be set depending of the zone's current temperature: If that temperature is above the trip's temperature, its threshold will be set to its low temperature or otherwise its threshold will be set to its (high) temperature. Because the zone temperature is initially set to invalid and tz->last_temperature is only updated by update_temperature(), this is sufficient to set the correct initial threshold values for all trips. Link: https://lore.kernel.org/all/20220718145038.1114379-4-daniel.lezcano@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-09 15:01:48 +00:00
/*
* Mitigation is under way, so it needs to stop if the zone
* temperature falls below the low temperature of the trip.
* In that case, the trip temperature becomes the new threshold.
thermal: core: Add trip thresholds for trip crossing detection The trip crossing detection in handle_thermal_trip() does not work correctly in the cases when a trip point is crossed on the way up and then the zone temperature stays above its low temperature (that is, its temperature decreased by its hysteresis). The trip temperature may be passed by the zone temperature subsequently in that case, even multiple times, but that does not count as the trip crossing as long as the zone temperature does not fall below the trip's low temperature or, in other words, until the trip is crossed on the way down. |-----------low--------high------------| |<--------->| | hyst | | | | -|--> crossed on the way up | <---|-- crossed on the way down However, handle_thermal_trip() will invoke thermal_notify_tz_trip_up() every time the trip temperature is passed by the zone temperature on the way up regardless of whether or not the trip has been crossed on the way down yet. Moreover, it will not call thermal_notify_tz_trip_down() if the last zone temperature was between the trip's temperature and its low temperature, so some "trip crossed on the way down" events may not be reported. To address this issue, introduce trip thresholds equal to either the temperature of the given trip, or its low temperature, such that if the trip's threshold is passed by the zone temperature on the way up, its value will be set to the trip's low temperature and thermal_notify_tz_trip_up() will be called, and if the trip's threshold is passed by the zone temperature on the way down, its value will be set to the trip's temperature (high) and thermal_notify_tz_trip_down() will be called. Accordingly, if the threshold is passed on the way up, it cannot be passed on the way up again until its passed on the way down and if it is passed on the way down, it cannot be passed on the way down again until it is passed on the way up which guarantees correct triggering of trip crossing notifications. If the last temperature of the zone is invalid, the trip's threshold will be set depending of the zone's current temperature: If that temperature is above the trip's temperature, its threshold will be set to its low temperature or otherwise its threshold will be set to its (high) temperature. Because the zone temperature is initially set to invalid and tz->last_temperature is only updated by update_temperature(), this is sufficient to set the correct initial threshold values for all trips. Link: https://lore.kernel.org/all/20220718145038.1114379-4-daniel.lezcano@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-09 15:01:48 +00:00
*/
if (tz->temperature < trip->temperature - trip->hysteresis) {
list_add(&td->notify_list_node, way_down_list);
td->notify_temp = trip->temperature - trip->hysteresis;
thermal: core: Move passive polling management to the core Passive polling is enabled by setting the 'passive' field in struct thermal_zone_device to a positive value so long as the 'passive_delay_jiffies' field is greater than zero. It causes the thermal core to actively check the thermal zone temperature periodically which in theory should be done after crossing a passive trip point on the way up in order to allow governors to react more rapidly to temperature changes and adjust mitigation more precisely. However, the 'passive' field in struct thermal_zone_device is currently managed by governors which is quite problematic. First of all, only two governors, Step-Wise and Power Allocator, update that field at all, so the other governors do not benefit from passive polling, although in principle they should. Moreover, if the zone governor is changed from, say, Step-Wise to Fair-Share after 'passive' has been incremented by the former, it is not going to be reset back to zero by the latter even if the zone temperature falls down below all passive trip points. For this reason, make handle_thermal_trip() increment 'passive' to enable passive polling for the given thermal zone whenever a passive trip point is crossed on the way up and decrement it whenever a passive trip point is crossed on the way down. Also remove the 'passive' field updates from governors and additionally clear it in thermal_zone_device_init() to prevent passive polling from being enabled after a system resume just beacuse it was enabled before suspending the system. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com>
2024-04-30 15:52:33 +00:00
if (trip->type == THERMAL_TRIP_PASSIVE) {
tz->passive--;
WARN_ON(tz->passive < 0);
}
thermal: core: Add trip thresholds for trip crossing detection The trip crossing detection in handle_thermal_trip() does not work correctly in the cases when a trip point is crossed on the way up and then the zone temperature stays above its low temperature (that is, its temperature decreased by its hysteresis). The trip temperature may be passed by the zone temperature subsequently in that case, even multiple times, but that does not count as the trip crossing as long as the zone temperature does not fall below the trip's low temperature or, in other words, until the trip is crossed on the way down. |-----------low--------high------------| |<--------->| | hyst | | | | -|--> crossed on the way up | <---|-- crossed on the way down However, handle_thermal_trip() will invoke thermal_notify_tz_trip_up() every time the trip temperature is passed by the zone temperature on the way up regardless of whether or not the trip has been crossed on the way down yet. Moreover, it will not call thermal_notify_tz_trip_down() if the last zone temperature was between the trip's temperature and its low temperature, so some "trip crossed on the way down" events may not be reported. To address this issue, introduce trip thresholds equal to either the temperature of the given trip, or its low temperature, such that if the trip's threshold is passed by the zone temperature on the way up, its value will be set to the trip's low temperature and thermal_notify_tz_trip_up() will be called, and if the trip's threshold is passed by the zone temperature on the way down, its value will be set to the trip's temperature (high) and thermal_notify_tz_trip_down() will be called. Accordingly, if the threshold is passed on the way up, it cannot be passed on the way up again until its passed on the way down and if it is passed on the way down, it cannot be passed on the way down again until it is passed on the way up which guarantees correct triggering of trip crossing notifications. If the last temperature of the zone is invalid, the trip's threshold will be set depending of the zone's current temperature: If that temperature is above the trip's temperature, its threshold will be set to its low temperature or otherwise its threshold will be set to its (high) temperature. Because the zone temperature is initially set to invalid and tz->last_temperature is only updated by update_temperature(), this is sufficient to set the correct initial threshold values for all trips. Link: https://lore.kernel.org/all/20220718145038.1114379-4-daniel.lezcano@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-09 15:01:48 +00:00
} else {
td->threshold -= trip->hysteresis;
thermal: core: Add trip thresholds for trip crossing detection The trip crossing detection in handle_thermal_trip() does not work correctly in the cases when a trip point is crossed on the way up and then the zone temperature stays above its low temperature (that is, its temperature decreased by its hysteresis). The trip temperature may be passed by the zone temperature subsequently in that case, even multiple times, but that does not count as the trip crossing as long as the zone temperature does not fall below the trip's low temperature or, in other words, until the trip is crossed on the way down. |-----------low--------high------------| |<--------->| | hyst | | | | -|--> crossed on the way up | <---|-- crossed on the way down However, handle_thermal_trip() will invoke thermal_notify_tz_trip_up() every time the trip temperature is passed by the zone temperature on the way up regardless of whether or not the trip has been crossed on the way down yet. Moreover, it will not call thermal_notify_tz_trip_down() if the last zone temperature was between the trip's temperature and its low temperature, so some "trip crossed on the way down" events may not be reported. To address this issue, introduce trip thresholds equal to either the temperature of the given trip, or its low temperature, such that if the trip's threshold is passed by the zone temperature on the way up, its value will be set to the trip's low temperature and thermal_notify_tz_trip_up() will be called, and if the trip's threshold is passed by the zone temperature on the way down, its value will be set to the trip's temperature (high) and thermal_notify_tz_trip_down() will be called. Accordingly, if the threshold is passed on the way up, it cannot be passed on the way up again until its passed on the way down and if it is passed on the way down, it cannot be passed on the way down again until it is passed on the way up which guarantees correct triggering of trip crossing notifications. If the last temperature of the zone is invalid, the trip's threshold will be set depending of the zone's current temperature: If that temperature is above the trip's temperature, its threshold will be set to its low temperature or otherwise its threshold will be set to its (high) temperature. Because the zone temperature is initially set to invalid and tz->last_temperature is only updated by update_temperature(), this is sufficient to set the correct initial threshold values for all trips. Link: https://lore.kernel.org/all/20220718145038.1114379-4-daniel.lezcano@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-09 15:01:48 +00:00
}
} else if (tz->temperature >= trip->temperature) {
/*
* There is no mitigation under way, so it needs to be started
* if the zone temperature exceeds the trip one. The new
* threshold is then set to the low temperature of the trip.
*/
list_add_tail(&td->notify_list_node, way_up_list);
td->notify_temp = trip->temperature;
td->threshold -= trip->hysteresis;
thermal: core: Move passive polling management to the core Passive polling is enabled by setting the 'passive' field in struct thermal_zone_device to a positive value so long as the 'passive_delay_jiffies' field is greater than zero. It causes the thermal core to actively check the thermal zone temperature periodically which in theory should be done after crossing a passive trip point on the way up in order to allow governors to react more rapidly to temperature changes and adjust mitigation more precisely. However, the 'passive' field in struct thermal_zone_device is currently managed by governors which is quite problematic. First of all, only two governors, Step-Wise and Power Allocator, update that field at all, so the other governors do not benefit from passive polling, although in principle they should. Moreover, if the zone governor is changed from, say, Step-Wise to Fair-Share after 'passive' has been incremented by the former, it is not going to be reset back to zero by the latter even if the zone temperature falls down below all passive trip points. For this reason, make handle_thermal_trip() increment 'passive' to enable passive polling for the given thermal zone whenever a passive trip point is crossed on the way up and decrement it whenever a passive trip point is crossed on the way down. Also remove the 'passive' field updates from governors and additionally clear it in thermal_zone_device_init() to prevent passive polling from being enabled after a system resume just beacuse it was enabled before suspending the system. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com>
2024-04-30 15:52:33 +00:00
if (trip->type == THERMAL_TRIP_PASSIVE)
tz->passive++;
else if (trip->type == THERMAL_TRIP_CRITICAL ||
trip->type == THERMAL_TRIP_HOT)
handle_critical_trips(tz, trip);
}
}
static void thermal_zone_device_check(struct work_struct *work)
{
struct thermal_zone_device *tz = container_of(work, struct
thermal_zone_device,
poll_queue.work);
thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
}
static void thermal_zone_device_init(struct thermal_zone_device *tz)
{
struct thermal_instance *pos;
INIT_DELAYED_WORK(&tz->poll_queue, thermal_zone_device_check);
thermal: core: Allow thermal zones to tell the core to ignore them The iwlwifi wireless driver registers a thermal zone that is only needed when the network interface handled by it is up and it wants that thermal zone to be effectively ignored by the core otherwise. Before commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") that could be achieved by returning an error code from the thermal zone's .get_temp() callback because the core did not really handle errors returned by it almost at all. However, commit a8a261774466 made the core attempt to recover from the situation in which the temperature of a thermal zone cannot be determined due to errors returned by its .get_temp() and is always invalid from the core's perspective. That was done because there are thermal zones in which .get_temp() returns errors to start with due to some difficulties related to the initialization ordering, but then it will start to produce valid temperature values at one point. Unfortunately, the simple approach taken by commit a8a261774466, which is to poll the thermal zone periodically until its .get_temp() callback starts to return valid temperature values, is at odds with the special thermal zone in iwlwifi in which .get_temp() may always return an error because its network interface may always be down. If that happens, every attempt to invoke the thermal zone's .get_temp() callback resulting in an error causes the thermal core to print a dev_warn() message to the kernel log which is super-noisy. To address this problem, make the core handle the case in which .get_temp() returns 0, but the temperature value returned by it is not actually valid, in a special way. Namely, make the core completely ignore the invalid temperature value coming from .get_temp() in that case, which requires folding in update_temperature() into its caller and a few related changes. On the iwlwifi side, modify iwl_mvm_tzone_get_temp() to return 0 and put THERMAL_TEMP_INVALID into the temperature return memory location instead of returning an error when the firmware is not running or it is not of the right type. Also, to clearly separate the handling of invalid temperature values from the thermal zone initialization, introduce a special THERMAL_TEMP_INIT value specifically for the latter purpose. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Closes: https://lore.kernel.org/linux-pm/20240715044527.GA1544@sol.localdomain/ Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201761 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: 6.10+ <stable@vger.kernel.org> # 6.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/4950004.31r3eYUQgx@rjwysocki.net [ rjw: Rebased on top of the current mainline ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-17 19:45:02 +00:00
tz->temperature = THERMAL_TEMP_INIT;
thermal: core: Move passive polling management to the core Passive polling is enabled by setting the 'passive' field in struct thermal_zone_device to a positive value so long as the 'passive_delay_jiffies' field is greater than zero. It causes the thermal core to actively check the thermal zone temperature periodically which in theory should be done after crossing a passive trip point on the way up in order to allow governors to react more rapidly to temperature changes and adjust mitigation more precisely. However, the 'passive' field in struct thermal_zone_device is currently managed by governors which is quite problematic. First of all, only two governors, Step-Wise and Power Allocator, update that field at all, so the other governors do not benefit from passive polling, although in principle they should. Moreover, if the zone governor is changed from, say, Step-Wise to Fair-Share after 'passive' has been incremented by the former, it is not going to be reset back to zero by the latter even if the zone temperature falls down below all passive trip points. For this reason, make handle_thermal_trip() increment 'passive' to enable passive polling for the given thermal zone whenever a passive trip point is crossed on the way up and decrement it whenever a passive trip point is crossed on the way down. Also remove the 'passive' field updates from governors and additionally clear it in thermal_zone_device_init() to prevent passive polling from being enabled after a system resume just beacuse it was enabled before suspending the system. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com>
2024-04-30 15:52:33 +00:00
tz->passive = 0;
tz->prev_low_trip = -INT_MAX;
tz->prev_high_trip = INT_MAX;
list_for_each_entry(pos, &tz->thermal_instances, tz_node)
pos->initialized = false;
}
static void thermal_governor_trip_crossed(struct thermal_governor *governor,
struct thermal_zone_device *tz,
const struct thermal_trip *trip,
bool crossed_up)
{
if (trip->type == THERMAL_TRIP_HOT || trip->type == THERMAL_TRIP_CRITICAL)
return;
if (governor->trip_crossed)
governor->trip_crossed(tz, trip, crossed_up);
}
static void thermal_trip_crossed(struct thermal_zone_device *tz,
const struct thermal_trip *trip,
struct thermal_governor *governor,
bool crossed_up)
{
if (crossed_up) {
thermal_notify_tz_trip_up(tz, trip);
thermal_debug_tz_trip_up(tz, trip);
} else {
thermal_notify_tz_trip_down(tz, trip);
thermal_debug_tz_trip_down(tz, trip);
}
thermal_governor_trip_crossed(governor, tz, trip, crossed_up);
}
static int thermal_trip_notify_cmp(void *not_used, const struct list_head *a,
const struct list_head *b)
{
struct thermal_trip_desc *tda = container_of(a, struct thermal_trip_desc,
notify_list_node);
struct thermal_trip_desc *tdb = container_of(b, struct thermal_trip_desc,
notify_list_node);
return tda->notify_temp - tdb->notify_temp;
}
void __thermal_zone_device_update(struct thermal_zone_device *tz,
enum thermal_notify_event event)
{
struct thermal_governor *governor = thermal_get_tz_governor(tz);
struct thermal_trip_desc *td;
LIST_HEAD(way_down_list);
LIST_HEAD(way_up_list);
int low = -INT_MAX, high = INT_MAX;
thermal: core: Allow thermal zones to tell the core to ignore them The iwlwifi wireless driver registers a thermal zone that is only needed when the network interface handled by it is up and it wants that thermal zone to be effectively ignored by the core otherwise. Before commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") that could be achieved by returning an error code from the thermal zone's .get_temp() callback because the core did not really handle errors returned by it almost at all. However, commit a8a261774466 made the core attempt to recover from the situation in which the temperature of a thermal zone cannot be determined due to errors returned by its .get_temp() and is always invalid from the core's perspective. That was done because there are thermal zones in which .get_temp() returns errors to start with due to some difficulties related to the initialization ordering, but then it will start to produce valid temperature values at one point. Unfortunately, the simple approach taken by commit a8a261774466, which is to poll the thermal zone periodically until its .get_temp() callback starts to return valid temperature values, is at odds with the special thermal zone in iwlwifi in which .get_temp() may always return an error because its network interface may always be down. If that happens, every attempt to invoke the thermal zone's .get_temp() callback resulting in an error causes the thermal core to print a dev_warn() message to the kernel log which is super-noisy. To address this problem, make the core handle the case in which .get_temp() returns 0, but the temperature value returned by it is not actually valid, in a special way. Namely, make the core completely ignore the invalid temperature value coming from .get_temp() in that case, which requires folding in update_temperature() into its caller and a few related changes. On the iwlwifi side, modify iwl_mvm_tzone_get_temp() to return 0 and put THERMAL_TEMP_INVALID into the temperature return memory location instead of returning an error when the firmware is not running or it is not of the right type. Also, to clearly separate the handling of invalid temperature values from the thermal zone initialization, introduce a special THERMAL_TEMP_INIT value specifically for the latter purpose. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Closes: https://lore.kernel.org/linux-pm/20240715044527.GA1544@sol.localdomain/ Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201761 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: 6.10+ <stable@vger.kernel.org> # 6.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/4950004.31r3eYUQgx@rjwysocki.net [ rjw: Rebased on top of the current mainline ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-17 19:45:02 +00:00
int temp, ret;
if (tz->suspended || tz->mode != THERMAL_DEVICE_ENABLED)
return;
thermal: core: Allow thermal zones to tell the core to ignore them The iwlwifi wireless driver registers a thermal zone that is only needed when the network interface handled by it is up and it wants that thermal zone to be effectively ignored by the core otherwise. Before commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") that could be achieved by returning an error code from the thermal zone's .get_temp() callback because the core did not really handle errors returned by it almost at all. However, commit a8a261774466 made the core attempt to recover from the situation in which the temperature of a thermal zone cannot be determined due to errors returned by its .get_temp() and is always invalid from the core's perspective. That was done because there are thermal zones in which .get_temp() returns errors to start with due to some difficulties related to the initialization ordering, but then it will start to produce valid temperature values at one point. Unfortunately, the simple approach taken by commit a8a261774466, which is to poll the thermal zone periodically until its .get_temp() callback starts to return valid temperature values, is at odds with the special thermal zone in iwlwifi in which .get_temp() may always return an error because its network interface may always be down. If that happens, every attempt to invoke the thermal zone's .get_temp() callback resulting in an error causes the thermal core to print a dev_warn() message to the kernel log which is super-noisy. To address this problem, make the core handle the case in which .get_temp() returns 0, but the temperature value returned by it is not actually valid, in a special way. Namely, make the core completely ignore the invalid temperature value coming from .get_temp() in that case, which requires folding in update_temperature() into its caller and a few related changes. On the iwlwifi side, modify iwl_mvm_tzone_get_temp() to return 0 and put THERMAL_TEMP_INVALID into the temperature return memory location instead of returning an error when the firmware is not running or it is not of the right type. Also, to clearly separate the handling of invalid temperature values from the thermal zone initialization, introduce a special THERMAL_TEMP_INIT value specifically for the latter purpose. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Closes: https://lore.kernel.org/linux-pm/20240715044527.GA1544@sol.localdomain/ Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201761 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: 6.10+ <stable@vger.kernel.org> # 6.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/4950004.31r3eYUQgx@rjwysocki.net [ rjw: Rebased on top of the current mainline ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-17 19:45:02 +00:00
ret = __thermal_zone_get_temp(tz, &temp);
if (ret) {
thermal: core: Back off when polling thermal zones on errors Commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") introduced a polling mechanism by which the thermal core attampts to get a valid temperature value for thermal zones where the .get_temp() callback returns errors to start with (for example, due to initialization ordering woes). However, this polling is carried out periodically ad infinitum and every iteration of it causes a message to be printed to the kernel log which means a lot of log noise on systems where there are thermal zones that never get ready for some reason. It is also not really useful to continuously poll thermal zones that never respond. To address this, modify the thermal core to increase the delay between consecutive thermal zone temperature checks after every check that fails until it reaches a certain maximum value. At that point, the thermal zone in question will be disabled, but user space will be able to reenable it if it believes that the failure is transient. Also change the code to print messages regarding failed temperature checks to the kernel log only twice, once when the thermal zone's .get_temp() callback returns an error for the first time and once when disabling the given thermal zone. In addition, a dev_crit() message will be printed at that point if the given thermal zone contains a critical trip point to notify the system operator about the situation. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Link: https://lore.kernel.org/linux-acpi/CAGnHSE=RyPK++UG0-wAtVKgeJxe0uzFYgLxm+RUOKKoQquW=Ow@mail.gmail.com/ Reported-by: Tom Yan <tom.ty89@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2962033.e9J7NaK4W3@rjwysocki.net
2024-07-18 19:01:14 +00:00
thermal_zone_recheck(tz, ret);
thermal: core: Allow thermal zones to tell the core to ignore them The iwlwifi wireless driver registers a thermal zone that is only needed when the network interface handled by it is up and it wants that thermal zone to be effectively ignored by the core otherwise. Before commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") that could be achieved by returning an error code from the thermal zone's .get_temp() callback because the core did not really handle errors returned by it almost at all. However, commit a8a261774466 made the core attempt to recover from the situation in which the temperature of a thermal zone cannot be determined due to errors returned by its .get_temp() and is always invalid from the core's perspective. That was done because there are thermal zones in which .get_temp() returns errors to start with due to some difficulties related to the initialization ordering, but then it will start to produce valid temperature values at one point. Unfortunately, the simple approach taken by commit a8a261774466, which is to poll the thermal zone periodically until its .get_temp() callback starts to return valid temperature values, is at odds with the special thermal zone in iwlwifi in which .get_temp() may always return an error because its network interface may always be down. If that happens, every attempt to invoke the thermal zone's .get_temp() callback resulting in an error causes the thermal core to print a dev_warn() message to the kernel log which is super-noisy. To address this problem, make the core handle the case in which .get_temp() returns 0, but the temperature value returned by it is not actually valid, in a special way. Namely, make the core completely ignore the invalid temperature value coming from .get_temp() in that case, which requires folding in update_temperature() into its caller and a few related changes. On the iwlwifi side, modify iwl_mvm_tzone_get_temp() to return 0 and put THERMAL_TEMP_INVALID into the temperature return memory location instead of returning an error when the firmware is not running or it is not of the right type. Also, to clearly separate the handling of invalid temperature values from the thermal zone initialization, introduce a special THERMAL_TEMP_INIT value specifically for the latter purpose. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Closes: https://lore.kernel.org/linux-pm/20240715044527.GA1544@sol.localdomain/ Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201761 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: 6.10+ <stable@vger.kernel.org> # 6.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/4950004.31r3eYUQgx@rjwysocki.net [ rjw: Rebased on top of the current mainline ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-17 19:45:02 +00:00
return;
} else if (temp <= THERMAL_TEMP_INVALID) {
/*
* Special case: No valid temperature value is available, but
* the zone owner does not want the core to do anything about
* it. Continue regular zone polling if needed, so that this
* function can be called again, but skip everything else.
*/
thermal: core: Call monitor_thermal_zone() if zone temperature is invalid Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid") caused __thermal_zone_device_update() to return early if the current thermal zone temperature was invalid. This was done to avoid running handle_thermal_trip() and governor callbacks in that case which led to confusion. However, it went too far because monitor_thermal_zone() still needs to be called even when the zone temperature is invalid to ensure that it will be updated eventually in case thermal polling is enabled and the driver has no other means to notify the core of zone temperature changes (for example, it does not register an interrupt handler or ACPI notifier). Also if the .set_trips() zone callback is expected to set up monitoring interrupts for a thermal zone, it has to be provided with valid boundaries and that can only happen if the zone temperature is known. Accordingly, to ensure that __thermal_zone_device_update() will run again after a failing zone temperature check, make it call monitor_thermal_zone() regardless of whether or not the zone temperature is valid and make the latter schedule a thermal zone temperature update if the zone temperature is invalid even if polling is not enabled for the thermal zone. Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid") Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org> Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2764814.mvXUDI8C0e@rjwysocki.net [ rjw: Changed THERMAL_RECHECK_DELAY_MS to 250 ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28 12:10:03 +00:00
goto monitor;
thermal: core: Allow thermal zones to tell the core to ignore them The iwlwifi wireless driver registers a thermal zone that is only needed when the network interface handled by it is up and it wants that thermal zone to be effectively ignored by the core otherwise. Before commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") that could be achieved by returning an error code from the thermal zone's .get_temp() callback because the core did not really handle errors returned by it almost at all. However, commit a8a261774466 made the core attempt to recover from the situation in which the temperature of a thermal zone cannot be determined due to errors returned by its .get_temp() and is always invalid from the core's perspective. That was done because there are thermal zones in which .get_temp() returns errors to start with due to some difficulties related to the initialization ordering, but then it will start to produce valid temperature values at one point. Unfortunately, the simple approach taken by commit a8a261774466, which is to poll the thermal zone periodically until its .get_temp() callback starts to return valid temperature values, is at odds with the special thermal zone in iwlwifi in which .get_temp() may always return an error because its network interface may always be down. If that happens, every attempt to invoke the thermal zone's .get_temp() callback resulting in an error causes the thermal core to print a dev_warn() message to the kernel log which is super-noisy. To address this problem, make the core handle the case in which .get_temp() returns 0, but the temperature value returned by it is not actually valid, in a special way. Namely, make the core completely ignore the invalid temperature value coming from .get_temp() in that case, which requires folding in update_temperature() into its caller and a few related changes. On the iwlwifi side, modify iwl_mvm_tzone_get_temp() to return 0 and put THERMAL_TEMP_INVALID into the temperature return memory location instead of returning an error when the firmware is not running or it is not of the right type. Also, to clearly separate the handling of invalid temperature values from the thermal zone initialization, introduce a special THERMAL_TEMP_INIT value specifically for the latter purpose. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Closes: https://lore.kernel.org/linux-pm/20240715044527.GA1544@sol.localdomain/ Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201761 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: 6.10+ <stable@vger.kernel.org> # 6.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/4950004.31r3eYUQgx@rjwysocki.net [ rjw: Rebased on top of the current mainline ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-17 19:45:02 +00:00
}
thermal: core: Back off when polling thermal zones on errors Commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") introduced a polling mechanism by which the thermal core attampts to get a valid temperature value for thermal zones where the .get_temp() callback returns errors to start with (for example, due to initialization ordering woes). However, this polling is carried out periodically ad infinitum and every iteration of it causes a message to be printed to the kernel log which means a lot of log noise on systems where there are thermal zones that never get ready for some reason. It is also not really useful to continuously poll thermal zones that never respond. To address this, modify the thermal core to increase the delay between consecutive thermal zone temperature checks after every check that fails until it reaches a certain maximum value. At that point, the thermal zone in question will be disabled, but user space will be able to reenable it if it believes that the failure is transient. Also change the code to print messages regarding failed temperature checks to the kernel log only twice, once when the thermal zone's .get_temp() callback returns an error for the first time and once when disabling the given thermal zone. In addition, a dev_crit() message will be printed at that point if the given thermal zone contains a critical trip point to notify the system operator about the situation. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Link: https://lore.kernel.org/linux-acpi/CAGnHSE=RyPK++UG0-wAtVKgeJxe0uzFYgLxm+RUOKKoQquW=Ow@mail.gmail.com/ Reported-by: Tom Yan <tom.ty89@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2962033.e9J7NaK4W3@rjwysocki.net
2024-07-18 19:01:14 +00:00
tz->recheck_delay_jiffies = THERMAL_RECHECK_DELAY;
thermal: core: Allow thermal zones to tell the core to ignore them The iwlwifi wireless driver registers a thermal zone that is only needed when the network interface handled by it is up and it wants that thermal zone to be effectively ignored by the core otherwise. Before commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") that could be achieved by returning an error code from the thermal zone's .get_temp() callback because the core did not really handle errors returned by it almost at all. However, commit a8a261774466 made the core attempt to recover from the situation in which the temperature of a thermal zone cannot be determined due to errors returned by its .get_temp() and is always invalid from the core's perspective. That was done because there are thermal zones in which .get_temp() returns errors to start with due to some difficulties related to the initialization ordering, but then it will start to produce valid temperature values at one point. Unfortunately, the simple approach taken by commit a8a261774466, which is to poll the thermal zone periodically until its .get_temp() callback starts to return valid temperature values, is at odds with the special thermal zone in iwlwifi in which .get_temp() may always return an error because its network interface may always be down. If that happens, every attempt to invoke the thermal zone's .get_temp() callback resulting in an error causes the thermal core to print a dev_warn() message to the kernel log which is super-noisy. To address this problem, make the core handle the case in which .get_temp() returns 0, but the temperature value returned by it is not actually valid, in a special way. Namely, make the core completely ignore the invalid temperature value coming from .get_temp() in that case, which requires folding in update_temperature() into its caller and a few related changes. On the iwlwifi side, modify iwl_mvm_tzone_get_temp() to return 0 and put THERMAL_TEMP_INVALID into the temperature return memory location instead of returning an error when the firmware is not running or it is not of the right type. Also, to clearly separate the handling of invalid temperature values from the thermal zone initialization, introduce a special THERMAL_TEMP_INIT value specifically for the latter purpose. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Closes: https://lore.kernel.org/linux-pm/20240715044527.GA1544@sol.localdomain/ Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201761 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: 6.10+ <stable@vger.kernel.org> # 6.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/4950004.31r3eYUQgx@rjwysocki.net [ rjw: Rebased on top of the current mainline ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-07-17 19:45:02 +00:00
tz->last_temperature = tz->temperature;
tz->temperature = temp;
trace_thermal_temperature(tz);
thermal_genl_sampling_temp(tz->id, temp);
tz->notify_event = event;
for_each_trip_desc(tz, td) {
handle_thermal_trip(tz, td, &way_up_list, &way_down_list);
if (td->threshold <= tz->temperature && td->threshold > low)
low = td->threshold;
if (td->threshold >= tz->temperature && td->threshold < high)
high = td->threshold;
}
thermal_zone_set_trips(tz, low, high);
list_sort(NULL, &way_up_list, thermal_trip_notify_cmp);
list_for_each_entry(td, &way_up_list, notify_list_node)
thermal_trip_crossed(tz, &td->trip, governor, true);
list_sort(NULL, &way_down_list, thermal_trip_notify_cmp);
list_for_each_entry_reverse(td, &way_down_list, notify_list_node)
thermal_trip_crossed(tz, &td->trip, governor, false);
if (governor->manage)
governor->manage(tz);
thermal_debug_update_trip_stats(tz);
thermal: core: Call monitor_thermal_zone() if zone temperature is invalid Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid") caused __thermal_zone_device_update() to return early if the current thermal zone temperature was invalid. This was done to avoid running handle_thermal_trip() and governor callbacks in that case which led to confusion. However, it went too far because monitor_thermal_zone() still needs to be called even when the zone temperature is invalid to ensure that it will be updated eventually in case thermal polling is enabled and the driver has no other means to notify the core of zone temperature changes (for example, it does not register an interrupt handler or ACPI notifier). Also if the .set_trips() zone callback is expected to set up monitoring interrupts for a thermal zone, it has to be provided with valid boundaries and that can only happen if the zone temperature is known. Accordingly, to ensure that __thermal_zone_device_update() will run again after a failing zone temperature check, make it call monitor_thermal_zone() regardless of whether or not the zone temperature is valid and make the latter schedule a thermal zone temperature update if the zone temperature is invalid even if polling is not enabled for the thermal zone. Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid") Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org> Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2764814.mvXUDI8C0e@rjwysocki.net [ rjw: Changed THERMAL_RECHECK_DELAY_MS to 250 ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28 12:10:03 +00:00
monitor:
monitor_thermal_zone(tz);
}
static int thermal_zone_device_set_mode(struct thermal_zone_device *tz,
enum thermal_device_mode mode)
{
int ret;
mutex_lock(&tz->lock);
/* do nothing if mode isn't changing */
if (mode == tz->mode) {
mutex_unlock(&tz->lock);
return 0;
}
ret = __thermal_zone_device_set_mode(tz, mode);
if (ret) {
mutex_unlock(&tz->lock);
return ret;
}
__thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
mutex_unlock(&tz->lock);
if (mode == THERMAL_DEVICE_ENABLED)
thermal_notify_tz_enable(tz);
else
thermal_notify_tz_disable(tz);
return 0;
}
int thermal_zone_device_enable(struct thermal_zone_device *tz)
{
return thermal_zone_device_set_mode(tz, THERMAL_DEVICE_ENABLED);
}
EXPORT_SYMBOL_GPL(thermal_zone_device_enable);
int thermal_zone_device_disable(struct thermal_zone_device *tz)
{
return thermal_zone_device_set_mode(tz, THERMAL_DEVICE_DISABLED);
}
EXPORT_SYMBOL_GPL(thermal_zone_device_disable);
thermal: core: Rework thermal zone availability check In order to avoid running __thermal_zone_device_update() for thermal zones going away, the thermal zone lock is held around device_del() in thermal_zone_device_unregister() and thermal_zone_device_update() passes the given thermal zone device to device_is_registered(). This allows thermal_zone_device_update() to skip the __thermal_zone_device_update() if device_del() has already run for the thermal zone at hand. However, instead of looking at driver core internals, the thermal subsystem may as well rely on its own data structures for this purpose. Namely, if the thermal zone is not present in thermal_tz_list, it can be regarded as unavailable, which in fact is already the case in thermal_zone_device_unregister(). Accordingly, the device_is_registered() check in thermal_zone_device_update() can be replaced with checking whether or not the node list_head in struct thermal_zone_device is empty, in which case it is not there in thermal_tz_list. To make this work, though, it is necessary to initialize tz->node in thermal_zone_device_register_with_trips() before registering the thermal zone device and it needs to be added to thermal_tz_list and deleted from it under its zone lock. After the above modifications, the zone lock does not need to be held around device_del() in thermal_zone_device_unregister() any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-and-tested-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2023-12-08 19:20:00 +00:00
static bool thermal_zone_is_present(struct thermal_zone_device *tz)
{
return !list_empty(&tz->node);
}
void thermal_zone_device_update(struct thermal_zone_device *tz,
enum thermal_notify_event event)
{
mutex_lock(&tz->lock);
thermal: core: Rework thermal zone availability check In order to avoid running __thermal_zone_device_update() for thermal zones going away, the thermal zone lock is held around device_del() in thermal_zone_device_unregister() and thermal_zone_device_update() passes the given thermal zone device to device_is_registered(). This allows thermal_zone_device_update() to skip the __thermal_zone_device_update() if device_del() has already run for the thermal zone at hand. However, instead of looking at driver core internals, the thermal subsystem may as well rely on its own data structures for this purpose. Namely, if the thermal zone is not present in thermal_tz_list, it can be regarded as unavailable, which in fact is already the case in thermal_zone_device_unregister(). Accordingly, the device_is_registered() check in thermal_zone_device_update() can be replaced with checking whether or not the node list_head in struct thermal_zone_device is empty, in which case it is not there in thermal_tz_list. To make this work, though, it is necessary to initialize tz->node in thermal_zone_device_register_with_trips() before registering the thermal zone device and it needs to be added to thermal_tz_list and deleted from it under its zone lock. After the above modifications, the zone lock does not need to be held around device_del() in thermal_zone_device_unregister() any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-and-tested-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2023-12-08 19:20:00 +00:00
if (thermal_zone_is_present(tz))
__thermal_zone_device_update(tz, event);
mutex_unlock(&tz->lock);
}
EXPORT_SYMBOL_GPL(thermal_zone_device_update);
void thermal_zone_trip_down(struct thermal_zone_device *tz,
const struct thermal_trip *trip)
{
thermal_trip_crossed(tz, trip, thermal_get_tz_governor(tz), false);
}
int for_each_thermal_governor(int (*cb)(struct thermal_governor *, void *),
void *data)
{
struct thermal_governor *gov;
int ret = 0;
mutex_lock(&thermal_governor_lock);
list_for_each_entry(gov, &thermal_governor_list, governor_list) {
ret = cb(gov, data);
if (ret)
break;
}
mutex_unlock(&thermal_governor_lock);
return ret;
}
int for_each_thermal_cooling_device(int (*cb)(struct thermal_cooling_device *,
void *), void *data)
{
struct thermal_cooling_device *cdev;
int ret = 0;
mutex_lock(&thermal_list_lock);
list_for_each_entry(cdev, &thermal_cdev_list, node) {
ret = cb(cdev, data);
if (ret)
break;
}
mutex_unlock(&thermal_list_lock);
return ret;
}
int for_each_thermal_zone(int (*cb)(struct thermal_zone_device *, void *),
void *data)
{
struct thermal_zone_device *tz;
int ret = 0;
mutex_lock(&thermal_list_lock);
list_for_each_entry(tz, &thermal_tz_list, node) {
ret = cb(tz, data);
if (ret)
break;
}
mutex_unlock(&thermal_list_lock);
return ret;
}
struct thermal_zone_device *thermal_zone_get_by_id(int id)
{
struct thermal_zone_device *tz, *match = NULL;
mutex_lock(&thermal_list_lock);
list_for_each_entry(tz, &thermal_tz_list, node) {
if (tz->id == id) {
get_device(&tz->device);
match = tz;
break;
}
}
mutex_unlock(&thermal_list_lock);
return match;
}
/*
* Device management section: cooling devices, zones devices, and binding
*
* Set of functions provided by the thermal core for:
* - cooling devices lifecycle: registration, unregistration,
* binding, and unbinding.
* - thermal zone devices lifecycle: registration, unregistration,
* binding, and unbinding.
*/
/**
* thermal_bind_cdev_to_trip - bind a cooling device to a thermal zone
* @tz: pointer to struct thermal_zone_device
* @trip: trip point the cooling devices is associated with in this zone.
* @cdev: pointer to struct thermal_cooling_device
* @cool_spec: cooling specification for @trip and @cdev
*
* This interface function bind a thermal cooling device to the certain trip
* point of a thermal zone device.
* This function is usually called in the thermal zone device .bind callback.
*
* Return: 0 on success, the proper error value otherwise.
*/
static int thermal_bind_cdev_to_trip(struct thermal_zone_device *tz,
const struct thermal_trip *trip,
struct thermal_cooling_device *cdev,
struct cooling_spec *cool_spec)
{
struct thermal_instance *dev;
struct thermal_instance *pos;
bool upper_no_limit;
int result;
/* lower default 0, upper default max_state */
if (cool_spec->lower == THERMAL_NO_LIMIT)
cool_spec->lower = 0;
if (cool_spec->upper == THERMAL_NO_LIMIT) {
cool_spec->upper = cdev->max_state;
upper_no_limit = true;
} else {
upper_no_limit = false;
}
if (cool_spec->lower > cool_spec->upper || cool_spec->upper > cdev->max_state)
return -EINVAL;
dev = kzalloc(sizeof(*dev), GFP_KERNEL);
if (!dev)
return -ENOMEM;
dev->cdev = cdev;
dev->trip = trip;
dev->upper = cool_spec->upper;
dev->upper_no_limit = upper_no_limit;
dev->lower = cool_spec->lower;
dev->target = THERMAL_NO_TARGET;
dev->weight = cool_spec->weight;
result = ida_alloc(&tz->ida, GFP_KERNEL);
if (result < 0)
goto free_mem;
dev->id = result;
sprintf(dev->name, "cdev%d", dev->id);
result =
sysfs_create_link(&tz->device.kobj, &cdev->device.kobj, dev->name);
if (result)
goto release_ida;
snprintf(dev->attr_name, sizeof(dev->attr_name), "cdev%d_trip_point",
dev->id);
sysfs_attr_init(&dev->attr.attr);
dev->attr.attr.name = dev->attr_name;
dev->attr.attr.mode = 0444;
dev->attr.show = trip_point_show;
result = device_create_file(&tz->device, &dev->attr);
if (result)
goto remove_symbol_link;
snprintf(dev->weight_attr_name, sizeof(dev->weight_attr_name),
"cdev%d_weight", dev->id);
sysfs_attr_init(&dev->weight_attr.attr);
dev->weight_attr.attr.name = dev->weight_attr_name;
dev->weight_attr.attr.mode = S_IWUSR | S_IRUGO;
dev->weight_attr.show = weight_show;
dev->weight_attr.store = weight_store;
result = device_create_file(&tz->device, &dev->weight_attr);
if (result)
goto remove_trip_file;
mutex_lock(&cdev->lock);
list_for_each_entry(pos, &tz->thermal_instances, tz_node)
if (pos->trip == trip && pos->cdev == cdev) {
result = -EEXIST;
break;
}
if (!result) {
list_add_tail(&dev->tz_node, &tz->thermal_instances);
list_add_tail(&dev->cdev_node, &cdev->thermal_instances);
atomic_set(&tz->need_update, 1);
thermal_governor_update_tz(tz, THERMAL_TZ_BIND_CDEV);
}
mutex_unlock(&cdev->lock);
if (!result)
return 0;
device_remove_file(&tz->device, &dev->weight_attr);
remove_trip_file:
device_remove_file(&tz->device, &dev->attr);
remove_symbol_link:
sysfs_remove_link(&tz->device.kobj, dev->name);
release_ida:
ida_free(&tz->ida, dev->id);
free_mem:
kfree(dev);
return result;
}
/**
* thermal_unbind_cdev_from_trip - unbind a cooling device from a thermal zone.
* @tz: pointer to a struct thermal_zone_device.
* @trip: trip point the cooling devices is associated with in this zone.
* @cdev: pointer to a struct thermal_cooling_device.
*
* This interface function unbind a thermal cooling device from the certain
* trip point of a thermal zone device.
* This function is usually called in the thermal zone device .unbind callback.
*/
static void thermal_unbind_cdev_from_trip(struct thermal_zone_device *tz,
const struct thermal_trip *trip,
struct thermal_cooling_device *cdev)
{
struct thermal_instance *pos, *next;
mutex_lock(&cdev->lock);
list_for_each_entry_safe(pos, next, &tz->thermal_instances, tz_node) {
if (pos->trip == trip && pos->cdev == cdev) {
list_del(&pos->tz_node);
list_del(&pos->cdev_node);
thermal_governor_update_tz(tz, THERMAL_TZ_UNBIND_CDEV);
mutex_unlock(&cdev->lock);
goto unbind;
}
}
mutex_unlock(&cdev->lock);
return;
unbind:
thermal: remove dangling 'weight_attr' device file This file isn't getting removed while we unbind a device from thermal zone. And this causes following messages when the device is registered again: WARNING: CPU: 0 PID: 2228 at /home/viresh/linux/fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x70() sysfs: cannot create duplicate filename '/devices/virtual/thermal/thermal_zone0/cdev0_weight' Modules linked in: cpufreq_dt(+) [last unloaded: cpufreq_dt] CPU: 0 PID: 2228 Comm: insmod Not tainted 4.2.0-rc3-00059-g44fffd9473eb #272 Hardware name: SAMSUNG EXYNOS (Flattened Device Tree) [<c00153e8>] (unwind_backtrace) from [<c0012368>] (show_stack+0x10/0x14) [<c0012368>] (show_stack) from [<c053a684>] (dump_stack+0x84/0xc4) [<c053a684>] (dump_stack) from [<c002284c>] (warn_slowpath_common+0x80/0xb0) [<c002284c>] (warn_slowpath_common) from [<c00228ac>] (warn_slowpath_fmt+0x30/0x40) [<c00228ac>] (warn_slowpath_fmt) from [<c012d524>] (sysfs_warn_dup+0x60/0x70) [<c012d524>] (sysfs_warn_dup) from [<c012d244>] (sysfs_add_file_mode_ns+0x13c/0x190) [<c012d244>] (sysfs_add_file_mode_ns) from [<c012d2d4>] (sysfs_create_file_ns+0x3c/0x48) [<c012d2d4>] (sysfs_create_file_ns) from [<c03c04a8>] (thermal_zone_bind_cooling_device+0x260/0x358) [<c03c04a8>] (thermal_zone_bind_cooling_device) from [<c03c2e70>] (of_thermal_bind+0x88/0xb4) [<c03c2e70>] (of_thermal_bind) from [<c03c10d0>] (__thermal_cooling_device_register+0x17c/0x2e0) [<c03c10d0>] (__thermal_cooling_device_register) from [<c03c3f50>] (__cpufreq_cooling_register+0x3a0/0x51c) [<c03c3f50>] (__cpufreq_cooling_register) from [<bf00505c>] (cpufreq_ready+0x44/0x88 [cpufreq_dt]) [<bf00505c>] (cpufreq_ready [cpufreq_dt]) from [<c03d6c30>] (cpufreq_add_dev+0x4a0/0x7dc) [<c03d6c30>] (cpufreq_add_dev) from [<c02cd3ec>] (subsys_interface_register+0x94/0xd8) [<c02cd3ec>] (subsys_interface_register) from [<c03d785c>] (cpufreq_register_driver+0x10c/0x1f0) [<c03d785c>] (cpufreq_register_driver) from [<bf0057d4>] (dt_cpufreq_probe+0x60/0x8c [cpufreq_dt]) [<bf0057d4>] (dt_cpufreq_probe [cpufreq_dt]) from [<c02d03e4>] (platform_drv_probe+0x44/0xa4) [<c02d03e4>] (platform_drv_probe) from [<c02cead8>] (driver_probe_device+0x174/0x2b4) [<c02cead8>] (driver_probe_device) from [<c02ceca4>] (__driver_attach+0x8c/0x90) [<c02ceca4>] (__driver_attach) from [<c02cd078>] (bus_for_each_dev+0x68/0x9c) [<c02cd078>] (bus_for_each_dev) from [<c02ce2f0>] (bus_add_driver+0x19c/0x214) [<c02ce2f0>] (bus_add_driver) from [<c02cf490>] (driver_register+0x78/0xf8) [<c02cf490>] (driver_register) from [<c0009710>] (do_one_initcall+0x8c/0x1d4) [<c0009710>] (do_one_initcall) from [<c05396b0>] (do_init_module+0x5c/0x1b8) [<c05396b0>] (do_init_module) from [<c0086490>] (load_module+0xd34/0xed8) [<c0086490>] (load_module) from [<c0086704>] (SyS_init_module+0xd0/0x120) [<c0086704>] (SyS_init_module) from [<c000f480>] (ret_fast_syscall+0x0/0x3c) ---[ end trace 3be0e7b7dc6e3c4f ]--- Fixes: db91651311c8 ("thermal: export weight to sysfs") Acked-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
2015-07-23 09:02:32 +00:00
device_remove_file(&tz->device, &pos->weight_attr);
device_remove_file(&tz->device, &pos->attr);
sysfs_remove_link(&tz->device.kobj, pos->name);
ida_free(&tz->ida, pos->id);
kfree(pos);
}
static void thermal_release(struct device *dev)
{
struct thermal_zone_device *tz;
struct thermal_cooling_device *cdev;
if (!strncmp(dev_name(dev), "thermal_zone",
sizeof("thermal_zone") - 1)) {
tz = to_thermal_zone(dev);
thermal_zone_destroy_device_groups(tz);
mutex_destroy(&tz->lock);
complete(&tz->removal);
} else if (!strncmp(dev_name(dev), "cooling_device",
sizeof("cooling_device") - 1)) {
cdev = to_cooling_device(dev);
thermal_cooling_device_destroy_sysfs(cdev);
kfree_const(cdev->type);
ida_free(&thermal_cdev_ida, cdev->id);
kfree(cdev);
}
}
static struct class *thermal_class;
static inline
void print_bind_err_msg(struct thermal_zone_device *tz,
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
const struct thermal_trip *trip,
struct thermal_cooling_device *cdev, int ret)
{
dev_err(&tz->device, "binding cdev %s to trip %d failed: %d\n",
cdev->type, thermal_zone_trip_id(tz, trip), ret);
}
static void thermal_zone_cdev_bind(struct thermal_zone_device *tz,
struct thermal_cooling_device *cdev)
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
{
struct thermal_trip_desc *td;
if (!tz->ops.should_bind)
return;
mutex_lock(&tz->lock);
for_each_trip_desc(tz, td) {
struct thermal_trip *trip = &td->trip;
struct cooling_spec c = {
.upper = THERMAL_NO_LIMIT,
.lower = THERMAL_NO_LIMIT,
.weight = THERMAL_WEIGHT_DEFAULT
};
int ret;
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
if (!tz->ops.should_bind(tz, trip, cdev, &c))
continue;
ret = thermal_bind_cdev_to_trip(tz, trip, cdev, &c);
if (ret)
print_bind_err_msg(tz, trip, cdev, ret);
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
}
mutex_unlock(&tz->lock);
}
/**
* __thermal_cooling_device_register() - register a new thermal cooling device
* @np: a pointer to a device tree node.
* @type: the thermal cooling device type.
* @devdata: device private data.
* @ops: standard thermal cooling devices callbacks.
*
* This interface function adds a new thermal cooling device (fan/processor/...)
* to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
* to all the thermal zone devices registered at the same time.
* It also gives the opportunity to link the cooling device to a device tree
* node, so that it can be bound to a thermal zone created out of device tree.
*
* Return: a pointer to the created struct thermal_cooling_device or an
* ERR_PTR. Caller must check return value with IS_ERR*() helpers.
*/
static struct thermal_cooling_device *
__thermal_cooling_device_register(struct device_node *np,
const char *type, void *devdata,
const struct thermal_cooling_device_ops *ops)
{
struct thermal_cooling_device *cdev;
struct thermal_zone_device *pos = NULL;
unsigned long current_state;
thermal/core: fix a UAF bug in __thermal_cooling_device_register() When device_register() return failed, program will goto out_kfree_type to release 'cdev->device' by put_device(). That will call thermal_release() to free 'cdev'. But the follow-up processes access 'cdev' continually. That trggers the UAF bug. ==================================================================== BUG: KASAN: use-after-free in __thermal_cooling_device_register+0x75b/0xa90 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: dump_stack_lvl+0xe2/0x152 print_address_description.constprop.0+0x21/0x140 ? __thermal_cooling_device_register+0x75b/0xa90 kasan_report.cold+0x7f/0x11b ? __thermal_cooling_device_register+0x75b/0xa90 __thermal_cooling_device_register+0x75b/0xa90 ? memset+0x20/0x40 ? __sanitizer_cov_trace_pc+0x1d/0x50 ? __devres_alloc_node+0x130/0x180 devm_thermal_of_cooling_device_register+0x67/0xf0 max6650_probe.cold+0x557/0x6aa ...... Freed by task 258: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0x109/0x140 kfree+0x117/0x4c0 thermal_release+0xa0/0x110 device_release+0xa7/0x240 kobject_put+0x1ce/0x540 put_device+0x20/0x30 __thermal_cooling_device_register+0x731/0xa90 devm_thermal_of_cooling_device_register+0x67/0xf0 max6650_probe.cold+0x557/0x6aa [max6650] Do not use 'cdev' again after put_device() to fix the problem like doing in thermal_zone_device_register(). [dlezcano]: as requested by Rafael, change the affectation into two statements. Fixes: 584837618100 ("thermal/drivers/core: Use a char pointer for the cooling device name") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/20211015024504.947520-1-william.xuanziyang@huawei.com Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2021-10-15 02:45:04 +00:00
int id, ret;
if (!ops || !ops->get_max_state || !ops->get_cur_state ||
!ops->set_cur_state)
return ERR_PTR(-EINVAL);
if (!thermal_class)
return ERR_PTR(-ENODEV);
cdev = kzalloc(sizeof(*cdev), GFP_KERNEL);
if (!cdev)
return ERR_PTR(-ENOMEM);
ret = ida_alloc(&thermal_cdev_ida, GFP_KERNEL);
if (ret < 0)
goto out_kfree_cdev;
cdev->id = ret;
thermal/core: fix a UAF bug in __thermal_cooling_device_register() When device_register() return failed, program will goto out_kfree_type to release 'cdev->device' by put_device(). That will call thermal_release() to free 'cdev'. But the follow-up processes access 'cdev' continually. That trggers the UAF bug. ==================================================================== BUG: KASAN: use-after-free in __thermal_cooling_device_register+0x75b/0xa90 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: dump_stack_lvl+0xe2/0x152 print_address_description.constprop.0+0x21/0x140 ? __thermal_cooling_device_register+0x75b/0xa90 kasan_report.cold+0x7f/0x11b ? __thermal_cooling_device_register+0x75b/0xa90 __thermal_cooling_device_register+0x75b/0xa90 ? memset+0x20/0x40 ? __sanitizer_cov_trace_pc+0x1d/0x50 ? __devres_alloc_node+0x130/0x180 devm_thermal_of_cooling_device_register+0x67/0xf0 max6650_probe.cold+0x557/0x6aa ...... Freed by task 258: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0x109/0x140 kfree+0x117/0x4c0 thermal_release+0xa0/0x110 device_release+0xa7/0x240 kobject_put+0x1ce/0x540 put_device+0x20/0x30 __thermal_cooling_device_register+0x731/0xa90 devm_thermal_of_cooling_device_register+0x67/0xf0 max6650_probe.cold+0x557/0x6aa [max6650] Do not use 'cdev' again after put_device() to fix the problem like doing in thermal_zone_device_register(). [dlezcano]: as requested by Rafael, change the affectation into two statements. Fixes: 584837618100 ("thermal/drivers/core: Use a char pointer for the cooling device name") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/20211015024504.947520-1-william.xuanziyang@huawei.com Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2021-10-15 02:45:04 +00:00
id = ret;
cdev->type = kstrdup_const(type ? type : "", GFP_KERNEL);
if (!cdev->type) {
ret = -ENOMEM;
goto out_ida_remove;
}
mutex_init(&cdev->lock);
INIT_LIST_HEAD(&cdev->thermal_instances);
cdev->np = np;
cdev->ops = ops;
cdev->updated = false;
cdev->device.class = thermal_class;
cdev->devdata = devdata;
ret = cdev->ops->get_max_state(cdev, &cdev->max_state);
if (ret)
goto out_cdev_type;
/*
* The cooling device's current state is only needed for debug
* initialization below, so a failure to get it does not cause
* the entire cooling device initialization to fail. However,
* the debug will not work for the device if its initial state
* cannot be determined and drivers are responsible for ensuring
* that this will not happen.
*/
ret = cdev->ops->get_cur_state(cdev, &current_state);
if (ret)
current_state = ULONG_MAX;
thermal: Add cooling device's statistics in sysfs This extends the sysfs interface for thermal cooling devices and exposes some pretty useful statistics. These statistics have proven to be quite useful specially while doing benchmarks related to the task scheduler, where we want to make sure that nothing has disrupted the test, specially the cooling device which may have put constraints on the CPUs. The information exposed here tells us to what extent the CPUs were constrained by the thermal framework. The write-only "reset" file is used to reset the statistics. The read-only "time_in_state_ms" file shows the time (in msec) spent by the device in the respective cooling states, and it prints one line per cooling state. The read-only "total_trans" file shows single positive integer value showing the total number of cooling state transitions the device has gone through since the time the cooling device is registered or the time when statistics were reset last. The read-only "trans_table" file shows a two dimensional matrix, where an entry <i,j> (row i, column j) represents the number of transitions from State_i to State_j. This is how the directory structure looks like for a single cooling device: $ ls -R /sys/class/thermal/cooling_device0/ /sys/class/thermal/cooling_device0/: cur_state max_state power stats subsystem type uevent /sys/class/thermal/cooling_device0/power: autosuspend_delay_ms runtime_active_time runtime_suspended_time control runtime_status /sys/class/thermal/cooling_device0/stats: reset time_in_state_ms total_trans trans_table This is tested on ARM 64-bit Hisilicon hikey620 board running Ubuntu and ARM 64-bit Hisilicon hikey960 board running Android. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
2018-04-02 10:56:25 +00:00
thermal_cooling_device_setup_sysfs(cdev);
ret = dev_set_name(&cdev->device, "cooling_device%d", cdev->id);
if (ret)
goto out_cooling_dev;
ret = device_register(&cdev->device);
if (ret) {
/* thermal_release() handles rest of the cleanup */
put_device(&cdev->device);
return ERR_PTR(ret);
}
if (current_state <= cdev->max_state)
thermal_debug_cdev_add(cdev, current_state);
/* Add 'this' new cdev to the global cdev list */
mutex_lock(&thermal_list_lock);
list_add(&cdev->node, &thermal_cdev_list);
/* Update binding information for 'this' new cdev */
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
list_for_each_entry(pos, &thermal_tz_list, node)
thermal_zone_cdev_bind(pos, cdev);
list_for_each_entry(pos, &thermal_tz_list, node)
if (atomic_cmpxchg(&pos->need_update, 1, 0))
thermal_zone_device_update(pos,
THERMAL_EVENT_UNSPECIFIED);
mutex_unlock(&thermal_list_lock);
return cdev;
out_cooling_dev:
2022-05-11 02:06:05 +00:00
thermal_cooling_device_destroy_sysfs(cdev);
out_cdev_type:
kfree_const(cdev->type);
out_ida_remove:
ida_free(&thermal_cdev_ida, id);
out_kfree_cdev:
kfree(cdev);
return ERR_PTR(ret);
}
/**
* thermal_cooling_device_register() - register a new thermal cooling device
* @type: the thermal cooling device type.
* @devdata: device private data.
* @ops: standard thermal cooling devices callbacks.
*
* This interface function adds a new thermal cooling device (fan/processor/...)
* to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
* to all the thermal zone devices registered at the same time.
*
* Return: a pointer to the created struct thermal_cooling_device or an
* ERR_PTR. Caller must check return value with IS_ERR*() helpers.
*/
struct thermal_cooling_device *
thermal_cooling_device_register(const char *type, void *devdata,
const struct thermal_cooling_device_ops *ops)
{
return __thermal_cooling_device_register(NULL, type, devdata, ops);
}
EXPORT_SYMBOL_GPL(thermal_cooling_device_register);
/**
* thermal_of_cooling_device_register() - register an OF thermal cooling device
* @np: a pointer to a device tree node.
* @type: the thermal cooling device type.
* @devdata: device private data.
* @ops: standard thermal cooling devices callbacks.
*
* This function will register a cooling device with device tree node reference.
* This interface function adds a new thermal cooling device (fan/processor/...)
* to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
* to all the thermal zone devices registered at the same time.
*
* Return: a pointer to the created struct thermal_cooling_device or an
* ERR_PTR. Caller must check return value with IS_ERR*() helpers.
*/
struct thermal_cooling_device *
thermal_of_cooling_device_register(struct device_node *np,
const char *type, void *devdata,
const struct thermal_cooling_device_ops *ops)
{
return __thermal_cooling_device_register(np, type, devdata, ops);
}
EXPORT_SYMBOL_GPL(thermal_of_cooling_device_register);
static void thermal_cooling_device_release(struct device *dev, void *res)
{
thermal_cooling_device_unregister(
*(struct thermal_cooling_device **)res);
}
/**
* devm_thermal_of_cooling_device_register() - register an OF thermal cooling
* device
* @dev: a valid struct device pointer of a sensor device.
* @np: a pointer to a device tree node.
* @type: the thermal cooling device type.
* @devdata: device private data.
* @ops: standard thermal cooling devices callbacks.
*
* This function will register a cooling device with device tree node reference.
* This interface function adds a new thermal cooling device (fan/processor/...)
* to /sys/class/thermal/ folder as cooling_device[0-*]. It tries to bind itself
* to all the thermal zone devices registered at the same time.
*
* Return: a pointer to the created struct thermal_cooling_device or an
* ERR_PTR. Caller must check return value with IS_ERR*() helpers.
*/
struct thermal_cooling_device *
devm_thermal_of_cooling_device_register(struct device *dev,
struct device_node *np,
const char *type, void *devdata,
const struct thermal_cooling_device_ops *ops)
{
struct thermal_cooling_device **ptr, *tcd;
ptr = devres_alloc(thermal_cooling_device_release, sizeof(*ptr),
GFP_KERNEL);
if (!ptr)
return ERR_PTR(-ENOMEM);
tcd = __thermal_cooling_device_register(np, type, devdata, ops);
if (IS_ERR(tcd)) {
devres_free(ptr);
return tcd;
}
*ptr = tcd;
devres_add(dev, ptr);
return tcd;
}
EXPORT_SYMBOL_GPL(devm_thermal_of_cooling_device_register);
static bool thermal_cooling_device_present(struct thermal_cooling_device *cdev)
{
struct thermal_cooling_device *pos = NULL;
list_for_each_entry(pos, &thermal_cdev_list, node) {
if (pos == cdev)
return true;
}
return false;
}
/**
* thermal_cooling_device_update - Update a cooling device object
* @cdev: Target cooling device.
*
* Update @cdev to reflect a change of the underlying hardware or platform.
*
* Must be called when the maximum cooling state of @cdev becomes invalid and so
* its .get_max_state() callback needs to be run to produce the new maximum
* cooling state value.
*/
void thermal_cooling_device_update(struct thermal_cooling_device *cdev)
{
struct thermal_instance *ti;
unsigned long state;
if (IS_ERR_OR_NULL(cdev))
return;
/*
* Hold thermal_list_lock throughout the update to prevent the device
* from going away while being updated.
*/
mutex_lock(&thermal_list_lock);
if (!thermal_cooling_device_present(cdev))
goto unlock_list;
/*
* Update under the cdev lock to prevent the state from being set beyond
* the new limit concurrently.
*/
mutex_lock(&cdev->lock);
if (cdev->ops->get_max_state(cdev, &cdev->max_state))
goto unlock;
thermal_cooling_device_stats_reinit(cdev);
list_for_each_entry(ti, &cdev->thermal_instances, cdev_node) {
if (ti->upper == cdev->max_state)
continue;
if (ti->upper < cdev->max_state) {
if (ti->upper_no_limit)
ti->upper = cdev->max_state;
continue;
}
ti->upper = cdev->max_state;
if (ti->lower > ti->upper)
ti->lower = ti->upper;
if (ti->target == THERMAL_NO_TARGET)
continue;
if (ti->target > ti->upper)
ti->target = ti->upper;
}
if (cdev->ops->get_cur_state(cdev, &state) || state > cdev->max_state)
goto unlock;
thermal_cooling_device_stats_update(cdev, state);
unlock:
mutex_unlock(&cdev->lock);
unlock_list:
mutex_unlock(&thermal_list_lock);
}
EXPORT_SYMBOL_GPL(thermal_cooling_device_update);
static void thermal_zone_cdev_unbind(struct thermal_zone_device *tz,
struct thermal_cooling_device *cdev)
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
{
struct thermal_trip_desc *td;
mutex_lock(&tz->lock);
for_each_trip_desc(tz, td)
thermal_unbind_cdev_from_trip(tz, &td->trip, cdev);
mutex_unlock(&tz->lock);
}
/**
* thermal_cooling_device_unregister - removes a thermal cooling device
* @cdev: the thermal cooling device to remove.
*
* thermal_cooling_device_unregister() must be called when a registered
* thermal cooling device is no longer needed.
*/
void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
{
struct thermal_zone_device *tz;
if (!cdev)
return;
thermal/debugfs: Add thermal cooling device debugfs information The thermal framework does not have any debug information except a sysfs stat which is a bit controversial. This one allocates big chunks of memory for every cooling devices with a high number of states and could represent on some systems in production several megabytes of memory for just a portion of it. As the sysfs is limited to a page size, the output is not exploitable with large data array and gets truncated. The patch provides the same information than sysfs except the transitions are dynamically allocated, thus they won't show more events than the ones which actually occurred. There is no longer a size limitation and it opens the field for more debugging information where the debugfs is designed for, not sysfs. The thermal debugfs directory structure tries to stay consistent with the sysfs one but in a very simplified way: thermal/ -- cooling_devices |-- 0 | |-- clear | |-- time_in_state_ms | |-- total_trans | `-- trans_table |-- 1 | |-- clear | |-- time_in_state_ms | |-- total_trans | `-- trans_table |-- 2 | |-- clear | |-- time_in_state_ms | |-- total_trans | `-- trans_table |-- 3 | |-- clear | |-- time_in_state_ms | |-- total_trans | `-- trans_table `-- 4 |-- clear |-- time_in_state_ms |-- total_trans `-- trans_table The content of the files in the cooling devices directory is the same as the sysfs one except for the trans_table which has the following format: Transition Hits 1->0 246 0->1 246 2->1 632 1->2 632 3->2 98 2->3 98 Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: White space fixups, rebase ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-01-09 09:41:11 +00:00
thermal_debug_cdev_remove(cdev);
mutex_lock(&thermal_list_lock);
if (!thermal_cooling_device_present(cdev)) {
mutex_unlock(&thermal_list_lock);
return;
}
list_del(&cdev->node);
/* Unbind all thermal zones associated with 'this' cdev */
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
list_for_each_entry(tz, &thermal_tz_list, node)
thermal_zone_cdev_unbind(tz, cdev);
mutex_unlock(&thermal_list_lock);
device_unregister(&cdev->device);
}
EXPORT_SYMBOL_GPL(thermal_cooling_device_unregister);
thermal/core: Add a generic thermal_zone_get_trip() function The thermal_zone_device_ops structure defines a set of ops family, get_trip_temp(), get_trip_hyst(), get_trip_type(). Each of them is returning a property of a trip point. The result is the code is calling the ops everywhere to get a trip point which is supposed to be defined in the backend driver. It is a non-sense as a thermal trip can be generic and used by the backend driver to declare its trip points. Part of the thermal framework has been changed and all the OF thermal drivers are using the same definition for the trip point and use a thermal zone registration variant to pass those trip points which are part of the thermal zone device structure. Consequently, we can use a generic function to get the trip points when they are stored in the thermal zone device structure. This approach can be generalized to all the drivers and we can get rid of the ops->get_trip_*. That will result to a much more simpler code and make possible to rework how the thermal trip are handled in the thermal core framework as discussed previously. This change adds a function thermal_zone_get_trip() where we get the thermal trip point structure which contains all the properties (type, temp, hyst) instead of doing multiple calls to ops->get_trip_*. That opens the door for trip point extension with more attributes. For instance, replacing the trip points disabled bitmask with a 'disabled' field in the structure. Here we replace all the calls to ops->get_trip_* in the thermal core code with a call to the thermal_zone_get_trip() function. The thermal zone ops defines a callback to retrieve the critical temperature. As the trip handling is being reworked, all the trip points will be the same whatever the driver and consequently finding the critical trip temperature will be just a loop to search for a critical trip point type. Provide such a generic function, so we encapsulate the ops get_crit_temp() which can be removed when all the backend drivers are using the generic trip points handling. While at it, add the thermal_zone_get_num_trips() to encapsulate the code more and reduce the grip with the thermal framework internals. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20221003092602.1323944-2-daniel.lezcano@linaro.org
2022-10-03 09:25:34 +00:00
int thermal_zone_get_crit_temp(struct thermal_zone_device *tz, int *temp)
{
const struct thermal_trip_desc *td;
int ret = -EINVAL;
thermal/core: Add a generic thermal_zone_get_trip() function The thermal_zone_device_ops structure defines a set of ops family, get_trip_temp(), get_trip_hyst(), get_trip_type(). Each of them is returning a property of a trip point. The result is the code is calling the ops everywhere to get a trip point which is supposed to be defined in the backend driver. It is a non-sense as a thermal trip can be generic and used by the backend driver to declare its trip points. Part of the thermal framework has been changed and all the OF thermal drivers are using the same definition for the trip point and use a thermal zone registration variant to pass those trip points which are part of the thermal zone device structure. Consequently, we can use a generic function to get the trip points when they are stored in the thermal zone device structure. This approach can be generalized to all the drivers and we can get rid of the ops->get_trip_*. That will result to a much more simpler code and make possible to rework how the thermal trip are handled in the thermal core framework as discussed previously. This change adds a function thermal_zone_get_trip() where we get the thermal trip point structure which contains all the properties (type, temp, hyst) instead of doing multiple calls to ops->get_trip_*. That opens the door for trip point extension with more attributes. For instance, replacing the trip points disabled bitmask with a 'disabled' field in the structure. Here we replace all the calls to ops->get_trip_* in the thermal core code with a call to the thermal_zone_get_trip() function. The thermal zone ops defines a callback to retrieve the critical temperature. As the trip handling is being reworked, all the trip points will be the same whatever the driver and consequently finding the critical trip temperature will be just a loop to search for a critical trip point type. Provide such a generic function, so we encapsulate the ops get_crit_temp() which can be removed when all the backend drivers are using the generic trip points handling. While at it, add the thermal_zone_get_num_trips() to encapsulate the code more and reduce the grip with the thermal framework internals. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20221003092602.1323944-2-daniel.lezcano@linaro.org
2022-10-03 09:25:34 +00:00
if (tz->ops.get_crit_temp)
return tz->ops.get_crit_temp(tz, temp);
thermal/core: Add a generic thermal_zone_get_trip() function The thermal_zone_device_ops structure defines a set of ops family, get_trip_temp(), get_trip_hyst(), get_trip_type(). Each of them is returning a property of a trip point. The result is the code is calling the ops everywhere to get a trip point which is supposed to be defined in the backend driver. It is a non-sense as a thermal trip can be generic and used by the backend driver to declare its trip points. Part of the thermal framework has been changed and all the OF thermal drivers are using the same definition for the trip point and use a thermal zone registration variant to pass those trip points which are part of the thermal zone device structure. Consequently, we can use a generic function to get the trip points when they are stored in the thermal zone device structure. This approach can be generalized to all the drivers and we can get rid of the ops->get_trip_*. That will result to a much more simpler code and make possible to rework how the thermal trip are handled in the thermal core framework as discussed previously. This change adds a function thermal_zone_get_trip() where we get the thermal trip point structure which contains all the properties (type, temp, hyst) instead of doing multiple calls to ops->get_trip_*. That opens the door for trip point extension with more attributes. For instance, replacing the trip points disabled bitmask with a 'disabled' field in the structure. Here we replace all the calls to ops->get_trip_* in the thermal core code with a call to the thermal_zone_get_trip() function. The thermal zone ops defines a callback to retrieve the critical temperature. As the trip handling is being reworked, all the trip points will be the same whatever the driver and consequently finding the critical trip temperature will be just a loop to search for a critical trip point type. Provide such a generic function, so we encapsulate the ops get_crit_temp() which can be removed when all the backend drivers are using the generic trip points handling. While at it, add the thermal_zone_get_num_trips() to encapsulate the code more and reduce the grip with the thermal framework internals. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20221003092602.1323944-2-daniel.lezcano@linaro.org
2022-10-03 09:25:34 +00:00
mutex_lock(&tz->lock);
for_each_trip_desc(tz, td) {
const struct thermal_trip *trip = &td->trip;
if (trip->type == THERMAL_TRIP_CRITICAL) {
*temp = trip->temperature;
thermal/core: Add a generic thermal_zone_get_trip() function The thermal_zone_device_ops structure defines a set of ops family, get_trip_temp(), get_trip_hyst(), get_trip_type(). Each of them is returning a property of a trip point. The result is the code is calling the ops everywhere to get a trip point which is supposed to be defined in the backend driver. It is a non-sense as a thermal trip can be generic and used by the backend driver to declare its trip points. Part of the thermal framework has been changed and all the OF thermal drivers are using the same definition for the trip point and use a thermal zone registration variant to pass those trip points which are part of the thermal zone device structure. Consequently, we can use a generic function to get the trip points when they are stored in the thermal zone device structure. This approach can be generalized to all the drivers and we can get rid of the ops->get_trip_*. That will result to a much more simpler code and make possible to rework how the thermal trip are handled in the thermal core framework as discussed previously. This change adds a function thermal_zone_get_trip() where we get the thermal trip point structure which contains all the properties (type, temp, hyst) instead of doing multiple calls to ops->get_trip_*. That opens the door for trip point extension with more attributes. For instance, replacing the trip points disabled bitmask with a 'disabled' field in the structure. Here we replace all the calls to ops->get_trip_* in the thermal core code with a call to the thermal_zone_get_trip() function. The thermal zone ops defines a callback to retrieve the critical temperature. As the trip handling is being reworked, all the trip points will be the same whatever the driver and consequently finding the critical trip temperature will be just a loop to search for a critical trip point type. Provide such a generic function, so we encapsulate the ops get_crit_temp() which can be removed when all the backend drivers are using the generic trip points handling. While at it, add the thermal_zone_get_num_trips() to encapsulate the code more and reduce the grip with the thermal framework internals. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20221003092602.1323944-2-daniel.lezcano@linaro.org
2022-10-03 09:25:34 +00:00
ret = 0;
break;
}
}
mutex_unlock(&tz->lock);
return ret;
}
EXPORT_SYMBOL_GPL(thermal_zone_get_crit_temp);
/**
* thermal_zone_device_register_with_trips() - register a new thermal zone device
* @type: the thermal zone device type
* @trips: a pointer to an array of thermal trips
* @num_trips: the number of trip points the thermal zone support
* @devdata: private device data
* @ops: standard thermal zone device callbacks
* @tzp: thermal zone platform parameters
* @passive_delay: number of milliseconds to wait between polls when
* performing passive cooling
* @polling_delay: number of milliseconds to wait between polls when checking
* whether trip points have been crossed (0 for interrupt
* driven systems)
*
* This interface function adds a new thermal zone device (sensor) to
* /sys/class/thermal folder as thermal_zone[0-*]. It tries to bind all the
* thermal cooling devices registered at the same time.
* thermal_zone_device_unregister() must be called when the device is no
* longer needed. The passive cooling depends on the .get_trend() return value.
*
* Return: a pointer to the created struct thermal_zone_device or an
* in case of error, an ERR_PTR. Caller must check return value with
* IS_ERR*() helpers.
*/
struct thermal_zone_device *
thermal_zone_device_register_with_trips(const char *type,
const struct thermal_trip *trips,
int num_trips, void *devdata,
const struct thermal_zone_device_ops *ops,
const struct thermal_zone_params *tzp,
unsigned int passive_delay,
unsigned int polling_delay)
{
const struct thermal_trip *trip = trips;
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
struct thermal_cooling_device *cdev;
struct thermal_zone_device *tz;
struct thermal_trip_desc *td;
int id;
int result;
struct thermal_governor *governor;
if (!type || strlen(type) == 0) {
pr_err("No thermal zone type defined\n");
return ERR_PTR(-EINVAL);
}
if (strlen(type) >= THERMAL_NAME_LENGTH) {
pr_err("Thermal zone name (%s) too long, should be under %d chars\n",
type, THERMAL_NAME_LENGTH);
return ERR_PTR(-EINVAL);
}
if (num_trips < 0) {
pr_err("Incorrect number of thermal trips\n");
return ERR_PTR(-EINVAL);
}
if (!ops || !ops->get_temp) {
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
pr_err("Thermal zone device ops not defined or invalid\n");
return ERR_PTR(-EINVAL);
}
if (num_trips > 0 && !trips)
return ERR_PTR(-EINVAL);
if (polling_delay && passive_delay > polling_delay)
return ERR_PTR(-EINVAL);
if (!thermal_class)
return ERR_PTR(-ENODEV);
tz = kzalloc(struct_size(tz, trips, num_trips), GFP_KERNEL);
if (!tz)
return ERR_PTR(-ENOMEM);
if (tzp) {
tz->tzp = kmemdup(tzp, sizeof(*tzp), GFP_KERNEL);
if (!tz->tzp) {
result = -ENOMEM;
goto free_tz;
}
}
INIT_LIST_HEAD(&tz->thermal_instances);
thermal: core: Rework thermal zone availability check In order to avoid running __thermal_zone_device_update() for thermal zones going away, the thermal zone lock is held around device_del() in thermal_zone_device_unregister() and thermal_zone_device_update() passes the given thermal zone device to device_is_registered(). This allows thermal_zone_device_update() to skip the __thermal_zone_device_update() if device_del() has already run for the thermal zone at hand. However, instead of looking at driver core internals, the thermal subsystem may as well rely on its own data structures for this purpose. Namely, if the thermal zone is not present in thermal_tz_list, it can be regarded as unavailable, which in fact is already the case in thermal_zone_device_unregister(). Accordingly, the device_is_registered() check in thermal_zone_device_update() can be replaced with checking whether or not the node list_head in struct thermal_zone_device is empty, in which case it is not there in thermal_tz_list. To make this work, though, it is necessary to initialize tz->node in thermal_zone_device_register_with_trips() before registering the thermal zone device and it needs to be added to thermal_tz_list and deleted from it under its zone lock. After the above modifications, the zone lock does not need to be held around device_del() in thermal_zone_device_unregister() any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-and-tested-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2023-12-08 19:20:00 +00:00
INIT_LIST_HEAD(&tz->node);
ida_init(&tz->ida);
mutex_init(&tz->lock);
init_completion(&tz->removal);
init_completion(&tz->resume);
id = ida_alloc(&thermal_tz_ida, GFP_KERNEL);
if (id < 0) {
result = id;
goto free_tzp;
}
tz->id = id;
strscpy(tz->type, type, sizeof(tz->type));
tz->ops = *ops;
if (!tz->ops.critical)
tz->ops.critical = thermal_zone_device_critical;
tz->device.class = thermal_class;
tz->devdata = devdata;
tz->num_trips = num_trips;
for_each_trip_desc(tz, td) {
td->trip = *trip++;
/*
* Mark all thresholds as invalid to start with even though
* this only matters for the trips that start as invalid and
* become valid later.
*/
td->threshold = INT_MAX;
}
tz->polling_delay_jiffies = msecs_to_jiffies(polling_delay);
tz->passive_delay_jiffies = msecs_to_jiffies(passive_delay);
thermal: core: Back off when polling thermal zones on errors Commit a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") introduced a polling mechanism by which the thermal core attampts to get a valid temperature value for thermal zones where the .get_temp() callback returns errors to start with (for example, due to initialization ordering woes). However, this polling is carried out periodically ad infinitum and every iteration of it causes a message to be printed to the kernel log which means a lot of log noise on systems where there are thermal zones that never get ready for some reason. It is also not really useful to continuously poll thermal zones that never respond. To address this, modify the thermal core to increase the delay between consecutive thermal zone temperature checks after every check that fails until it reaches a certain maximum value. At that point, the thermal zone in question will be disabled, but user space will be able to reenable it if it believes that the failure is transient. Also change the code to print messages regarding failed temperature checks to the kernel log only twice, once when the thermal zone's .get_temp() callback returns an error for the first time and once when disabling the given thermal zone. In addition, a dev_crit() message will be printed at that point if the given thermal zone contains a critical trip point to notify the system operator about the situation. Fixes: a8a261774466 ("thermal: core: Call monitor_thermal_zone() if zone temperature is invalid") Link: https://lore.kernel.org/linux-acpi/CAGnHSE=RyPK++UG0-wAtVKgeJxe0uzFYgLxm+RUOKKoQquW=Ow@mail.gmail.com/ Reported-by: Tom Yan <tom.ty89@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2962033.e9J7NaK4W3@rjwysocki.net
2024-07-18 19:01:14 +00:00
tz->recheck_delay_jiffies = THERMAL_RECHECK_DELAY;
/* sys I/F */
/* Add nodes that are always present via .groups */
result = thermal_zone_create_device_groups(tz);
if (result)
goto remove_id;
/* A new thermal zone needs to be updated anyway. */
atomic_set(&tz->need_update, 1);
result = dev_set_name(&tz->device, "thermal_zone%d", tz->id);
if (result) {
thermal_zone_destroy_device_groups(tz);
goto remove_id;
}
result = device_register(&tz->device);
if (result)
goto release_device;
/* Update 'this' zone's governor information */
mutex_lock(&thermal_governor_lock);
if (tz->tzp)
governor = __find_governor(tz->tzp->governor_name);
else
governor = def_governor;
result = thermal_set_governor(tz, governor);
if (result) {
mutex_unlock(&thermal_governor_lock);
goto unregister;
}
mutex_unlock(&thermal_governor_lock);
if (!tz->tzp || !tz->tzp->no_hwmon) {
result = thermal_add_hwmon_sysfs(tz);
if (result)
goto unregister;
}
mutex_lock(&thermal_list_lock);
thermal: core: Rework thermal zone availability check In order to avoid running __thermal_zone_device_update() for thermal zones going away, the thermal zone lock is held around device_del() in thermal_zone_device_unregister() and thermal_zone_device_update() passes the given thermal zone device to device_is_registered(). This allows thermal_zone_device_update() to skip the __thermal_zone_device_update() if device_del() has already run for the thermal zone at hand. However, instead of looking at driver core internals, the thermal subsystem may as well rely on its own data structures for this purpose. Namely, if the thermal zone is not present in thermal_tz_list, it can be regarded as unavailable, which in fact is already the case in thermal_zone_device_unregister(). Accordingly, the device_is_registered() check in thermal_zone_device_update() can be replaced with checking whether or not the node list_head in struct thermal_zone_device is empty, in which case it is not there in thermal_tz_list. To make this work, though, it is necessary to initialize tz->node in thermal_zone_device_register_with_trips() before registering the thermal zone device and it needs to be added to thermal_tz_list and deleted from it under its zone lock. After the above modifications, the zone lock does not need to be held around device_del() in thermal_zone_device_unregister() any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-and-tested-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2023-12-08 19:20:00 +00:00
mutex_lock(&tz->lock);
list_add_tail(&tz->node, &thermal_tz_list);
thermal: core: Rework thermal zone availability check In order to avoid running __thermal_zone_device_update() for thermal zones going away, the thermal zone lock is held around device_del() in thermal_zone_device_unregister() and thermal_zone_device_update() passes the given thermal zone device to device_is_registered(). This allows thermal_zone_device_update() to skip the __thermal_zone_device_update() if device_del() has already run for the thermal zone at hand. However, instead of looking at driver core internals, the thermal subsystem may as well rely on its own data structures for this purpose. Namely, if the thermal zone is not present in thermal_tz_list, it can be regarded as unavailable, which in fact is already the case in thermal_zone_device_unregister(). Accordingly, the device_is_registered() check in thermal_zone_device_update() can be replaced with checking whether or not the node list_head in struct thermal_zone_device is empty, in which case it is not there in thermal_tz_list. To make this work, though, it is necessary to initialize tz->node in thermal_zone_device_register_with_trips() before registering the thermal zone device and it needs to be added to thermal_tz_list and deleted from it under its zone lock. After the above modifications, the zone lock does not need to be held around device_del() in thermal_zone_device_unregister() any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-and-tested-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2023-12-08 19:20:00 +00:00
mutex_unlock(&tz->lock);
/* Bind cooling devices for this zone */
thermal: core: Introduce .should_bind() thermal zone callback The current design of the code binding cooling devices to trip points in thermal zones is convoluted and hard to follow. Namely, a driver that registers a thermal zone can provide .bind() and .unbind() operations for it, which are required to call either thermal_bind_cdev_to_trip() and thermal_unbind_cdev_from_trip(), respectively, or thermal_zone_bind_cooling_device() and thermal_zone_unbind_cooling_device(), respectively, for every relevant trip point and the given cooling device. Moreover, if .bind() is provided and .unbind() is not, the cleanup necessary during the removal of a thermal zone or a cooling device may not be carried out. In other words, the core relies on the thermal zone owners to do the right thing, which is error prone and far from obvious, even though all of that is not really necessary. Specifically, if the core could ask the thermal zone owner, through a special thermal zone callback, whether or not a given cooling device should be bound to a given trip point in the given thermal zone, it might as well carry out all of the binding and unbinding by itself. In particular, the unbinding can be done automatically without involving the thermal zone owner at all because all of the thermal instances associated with a thermal zone or cooling device going away must be deleted regardless. Accordingly, introduce a new thermal zone operation, .should_bind(), that can be invoked by the thermal core for a given thermal zone, trip point and cooling device combination in order to check whether or not the cooling device should be bound to the trip point at hand. It takes an additional cooling_spec argument allowing the thermal zone owner to specify the highest and lowest cooling states of the cooling device and its weight for the given trip point binding. Make the thermal core use this operation, if present, in the absence of .bind() and .unbind(). Note that .should_bind() will be called under the thermal zone lock. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Huisong Li <lihuisong@huawei.com> Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org> Link: https://patch.msgid.link/9334403.CDJkKcVGEf@rjwysocki.net
2024-08-19 16:00:19 +00:00
list_for_each_entry(cdev, &thermal_cdev_list, node)
thermal_zone_cdev_bind(tz, cdev);
mutex_unlock(&thermal_list_lock);
thermal_zone_device_init(tz);
/* Update the new thermal zone and mark it as already updated. */
if (atomic_cmpxchg(&tz->need_update, 1, 0))
thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
thermal_notify_tz_create(tz);
thermal/debugfs: Add thermal debugfs information for mitigation episodes The mitigation episodes are recorded. A mitigation episode happens when the first trip point is crossed the way up and then the way down. During this episode other trip points can be crossed also and are accounted for this mitigation episode. The interesting information is the average temperature at the trip point, the undershot and the overshot. The standard deviation of the mitigated temperature will be added later. The thermal debugfs directory structure tries to stay consistent with the sysfs one but in a very simplified way: thermal/ `-- thermal_zones |-- 0 | `-- mitigations `-- 1 `-- mitigations The content of the mitigations file has the following format: ,-Mitigation at 349988258us, duration=130136ms | trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) | | 0 | passive | 65000 | 2000 | 130136 | 68227 | 62500 | 75625 | | 1 | passive | 75000 | 2000 | 104209 | 74857 | 71666 | 77500 | ,-Mitigation at 272451637us, duration=75000ms | trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) | | 0 | passive | 65000 | 2000 | 75000 | 68561 | 62500 | 75000 | | 1 | passive | 75000 | 2000 | 60714 | 74820 | 70555 | 77500 | ,-Mitigation at 238184119us, duration=27316ms | trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) | | 0 | passive | 65000 | 2000 | 27316 | 73377 | 62500 | 75000 | | 1 | passive | 75000 | 2000 | 19468 | 75284 | 69444 | 77500 | ,-Mitigation at 39863713us, duration=136196ms | trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) | | 0 | passive | 65000 | 2000 | 136196 | 73922 | 62500 | 75000 | | 1 | passive | 75000 | 2000 | 91721 | 74386 | 69444 | 78125 | More information for a better understanding of the thermal behavior will be added after. The idea is to give detailed statistics information about the undershots and overshots, the temperature speed, etc... As all the information in a single file is too much, the idea would be to create a directory named with the mitigation timestamp where all data could be added. Please note this code is immune against trip ordering but not against a trip temperature change while a mitigation is happening. However, this situation should be extremely rare, perhaps not happening and we might question ourselves if something should be done in the core framework for other components first. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: White space fixups, rebase ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-01-09 09:41:12 +00:00
thermal_debug_tz_add(tz);
return tz;
unregister:
device_del(&tz->device);
release_device:
put_device(&tz->device);
remove_id:
ida_free(&thermal_tz_ida, id);
free_tzp:
kfree(tz->tzp);
free_tz:
kfree(tz);
return ERR_PTR(result);
}
EXPORT_SYMBOL_GPL(thermal_zone_device_register_with_trips);
struct thermal_zone_device *thermal_tripless_zone_device_register(
const char *type,
void *devdata,
const struct thermal_zone_device_ops *ops,
const struct thermal_zone_params *tzp)
{
return thermal_zone_device_register_with_trips(type, NULL, 0, devdata,
ops, tzp, 0, 0);
}
EXPORT_SYMBOL_GPL(thermal_tripless_zone_device_register);
void *thermal_zone_device_priv(struct thermal_zone_device *tzd)
{
return tzd->devdata;
}
EXPORT_SYMBOL_GPL(thermal_zone_device_priv);
const char *thermal_zone_device_type(struct thermal_zone_device *tzd)
{
return tzd->type;
}
EXPORT_SYMBOL_GPL(thermal_zone_device_type);
int thermal_zone_device_id(struct thermal_zone_device *tzd)
{
return tzd->id;
}
EXPORT_SYMBOL_GPL(thermal_zone_device_id);
struct device *thermal_zone_device(struct thermal_zone_device *tzd)
{
return &tzd->device;
}
EXPORT_SYMBOL_GPL(thermal_zone_device);
/**
* thermal_zone_device_unregister - removes the registered thermal zone device
* @tz: the thermal zone device to remove
*/
void thermal_zone_device_unregister(struct thermal_zone_device *tz)
{
struct thermal_cooling_device *cdev;
struct thermal_zone_device *pos = NULL;
if (!tz)
return;
thermal/debugfs: Add thermal debugfs information for mitigation episodes The mitigation episodes are recorded. A mitigation episode happens when the first trip point is crossed the way up and then the way down. During this episode other trip points can be crossed also and are accounted for this mitigation episode. The interesting information is the average temperature at the trip point, the undershot and the overshot. The standard deviation of the mitigated temperature will be added later. The thermal debugfs directory structure tries to stay consistent with the sysfs one but in a very simplified way: thermal/ `-- thermal_zones |-- 0 | `-- mitigations `-- 1 `-- mitigations The content of the mitigations file has the following format: ,-Mitigation at 349988258us, duration=130136ms | trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) | | 0 | passive | 65000 | 2000 | 130136 | 68227 | 62500 | 75625 | | 1 | passive | 75000 | 2000 | 104209 | 74857 | 71666 | 77500 | ,-Mitigation at 272451637us, duration=75000ms | trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) | | 0 | passive | 65000 | 2000 | 75000 | 68561 | 62500 | 75000 | | 1 | passive | 75000 | 2000 | 60714 | 74820 | 70555 | 77500 | ,-Mitigation at 238184119us, duration=27316ms | trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) | | 0 | passive | 65000 | 2000 | 27316 | 73377 | 62500 | 75000 | | 1 | passive | 75000 | 2000 | 19468 | 75284 | 69444 | 77500 | ,-Mitigation at 39863713us, duration=136196ms | trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) | | 0 | passive | 65000 | 2000 | 136196 | 73922 | 62500 | 75000 | | 1 | passive | 75000 | 2000 | 91721 | 74386 | 69444 | 78125 | More information for a better understanding of the thermal behavior will be added after. The idea is to give detailed statistics information about the undershots and overshots, the temperature speed, etc... As all the information in a single file is too much, the idea would be to create a directory named with the mitigation timestamp where all data could be added. Please note this code is immune against trip ordering but not against a trip temperature change while a mitigation is happening. However, this situation should be extremely rare, perhaps not happening and we might question ourselves if something should be done in the core framework for other components first. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: White space fixups, rebase ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-01-09 09:41:12 +00:00
thermal_debug_tz_remove(tz);
mutex_lock(&thermal_list_lock);
list_for_each_entry(pos, &thermal_tz_list, node)
if (pos == tz)
break;
if (pos != tz) {
/* thermal zone device not found */
mutex_unlock(&thermal_list_lock);
return;
}
thermal: core: Rework thermal zone availability check In order to avoid running __thermal_zone_device_update() for thermal zones going away, the thermal zone lock is held around device_del() in thermal_zone_device_unregister() and thermal_zone_device_update() passes the given thermal zone device to device_is_registered(). This allows thermal_zone_device_update() to skip the __thermal_zone_device_update() if device_del() has already run for the thermal zone at hand. However, instead of looking at driver core internals, the thermal subsystem may as well rely on its own data structures for this purpose. Namely, if the thermal zone is not present in thermal_tz_list, it can be regarded as unavailable, which in fact is already the case in thermal_zone_device_unregister(). Accordingly, the device_is_registered() check in thermal_zone_device_update() can be replaced with checking whether or not the node list_head in struct thermal_zone_device is empty, in which case it is not there in thermal_tz_list. To make this work, though, it is necessary to initialize tz->node in thermal_zone_device_register_with_trips() before registering the thermal zone device and it needs to be added to thermal_tz_list and deleted from it under its zone lock. After the above modifications, the zone lock does not need to be held around device_del() in thermal_zone_device_unregister() any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-and-tested-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2023-12-08 19:20:00 +00:00
mutex_lock(&tz->lock);
list_del(&tz->node);
thermal: core: Rework thermal zone availability check In order to avoid running __thermal_zone_device_update() for thermal zones going away, the thermal zone lock is held around device_del() in thermal_zone_device_unregister() and thermal_zone_device_update() passes the given thermal zone device to device_is_registered(). This allows thermal_zone_device_update() to skip the __thermal_zone_device_update() if device_del() has already run for the thermal zone at hand. However, instead of looking at driver core internals, the thermal subsystem may as well rely on its own data structures for this purpose. Namely, if the thermal zone is not present in thermal_tz_list, it can be regarded as unavailable, which in fact is already the case in thermal_zone_device_unregister(). Accordingly, the device_is_registered() check in thermal_zone_device_update() can be replaced with checking whether or not the node list_head in struct thermal_zone_device is empty, in which case it is not there in thermal_tz_list. To make this work, though, it is necessary to initialize tz->node in thermal_zone_device_register_with_trips() before registering the thermal zone device and it needs to be added to thermal_tz_list and deleted from it under its zone lock. After the above modifications, the zone lock does not need to be held around device_del() in thermal_zone_device_unregister() any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-and-tested-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2023-12-08 19:20:00 +00:00
mutex_unlock(&tz->lock);
/* Unbind all cdevs associated with 'this' thermal zone */
list_for_each_entry(cdev, &thermal_cdev_list, node)
thermal_zone_cdev_unbind(tz, cdev);
mutex_unlock(&thermal_list_lock);
thermal: Fix deadlock in thermal thermal_zone_device_check 1851799e1d29 ("thermal: Fix use-after-free when unregistering thermal zone device") changed cancel_delayed_work to cancel_delayed_work_sync to avoid a use-after-free issue. However, cancel_delayed_work_sync could be called insides the WQ causing deadlock. [54109.642398] c0 1162 kworker/u17:1 D 0 11030 2 0x00000000 [54109.642437] c0 1162 Workqueue: thermal_passive_wq thermal_zone_device_check [54109.642447] c0 1162 Call trace: [54109.642456] c0 1162 __switch_to+0x138/0x158 [54109.642467] c0 1162 __schedule+0xba4/0x1434 [54109.642480] c0 1162 schedule_timeout+0xa0/0xb28 [54109.642492] c0 1162 wait_for_common+0x138/0x2e8 [54109.642511] c0 1162 flush_work+0x348/0x40c [54109.642522] c0 1162 __cancel_work_timer+0x180/0x218 [54109.642544] c0 1162 handle_thermal_trip+0x2c4/0x5a4 [54109.642553] c0 1162 thermal_zone_device_update+0x1b4/0x25c [54109.642563] c0 1162 thermal_zone_device_check+0x18/0x24 [54109.642574] c0 1162 process_one_work+0x3cc/0x69c [54109.642583] c0 1162 worker_thread+0x49c/0x7c0 [54109.642593] c0 1162 kthread+0x17c/0x1b0 [54109.642602] c0 1162 ret_from_fork+0x10/0x18 [54109.643051] c0 1162 kworker/u17:2 D 0 16245 2 0x00000000 [54109.643067] c0 1162 Workqueue: thermal_passive_wq thermal_zone_device_check [54109.643077] c0 1162 Call trace: [54109.643085] c0 1162 __switch_to+0x138/0x158 [54109.643095] c0 1162 __schedule+0xba4/0x1434 [54109.643104] c0 1162 schedule_timeout+0xa0/0xb28 [54109.643114] c0 1162 wait_for_common+0x138/0x2e8 [54109.643122] c0 1162 flush_work+0x348/0x40c [54109.643131] c0 1162 __cancel_work_timer+0x180/0x218 [54109.643141] c0 1162 handle_thermal_trip+0x2c4/0x5a4 [54109.643150] c0 1162 thermal_zone_device_update+0x1b4/0x25c [54109.643159] c0 1162 thermal_zone_device_check+0x18/0x24 [54109.643167] c0 1162 process_one_work+0x3cc/0x69c [54109.643177] c0 1162 worker_thread+0x49c/0x7c0 [54109.643186] c0 1162 kthread+0x17c/0x1b0 [54109.643195] c0 1162 ret_from_fork+0x10/0x18 [54109.644500] c0 1162 cat D 0 7766 1 0x00000001 [54109.644515] c0 1162 Call trace: [54109.644524] c0 1162 __switch_to+0x138/0x158 [54109.644536] c0 1162 __schedule+0xba4/0x1434 [54109.644546] c0 1162 schedule_preempt_disabled+0x80/0xb0 [54109.644555] c0 1162 __mutex_lock+0x3a8/0x7f0 [54109.644563] c0 1162 __mutex_lock_slowpath+0x14/0x20 [54109.644575] c0 1162 thermal_zone_get_temp+0x84/0x360 [54109.644586] c0 1162 temp_show+0x30/0x78 [54109.644609] c0 1162 dev_attr_show+0x5c/0xf0 [54109.644628] c0 1162 sysfs_kf_seq_show+0xcc/0x1a4 [54109.644636] c0 1162 kernfs_seq_show+0x48/0x88 [54109.644656] c0 1162 seq_read+0x1f4/0x73c [54109.644664] c0 1162 kernfs_fop_read+0x84/0x318 [54109.644683] c0 1162 __vfs_read+0x50/0x1bc [54109.644692] c0 1162 vfs_read+0xa4/0x140 [54109.644701] c0 1162 SyS_read+0xbc/0x144 [54109.644708] c0 1162 el0_svc_naked+0x34/0x38 [54109.845800] c0 1162 D 720.000s 1->7766->7766 cat [panic] Fixes: 1851799e1d29 ("thermal: Fix use-after-free when unregistering thermal zone device") Cc: stable@vger.kernel.org Signed-off-by: Wei Wang <wvw@google.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
2019-11-12 20:42:23 +00:00
cancel_delayed_work_sync(&tz->poll_queue);
thermal_set_governor(tz, NULL);
thermal_remove_hwmon_sysfs(tz);
ida_free(&thermal_tz_ida, tz->id);
ida_destroy(&tz->ida);
device_del(&tz->device);
put_device(&tz->device);
thermal_notify_tz_delete(tz);
wait_for_completion(&tz->removal);
kfree(tz->tzp);
kfree(tz);
}
EXPORT_SYMBOL_GPL(thermal_zone_device_unregister);
/**
* thermal_zone_get_zone_by_name() - search for a zone and returns its ref
* @name: thermal zone name to fetch the temperature
*
* When only one zone is found with the passed name, returns a reference to it.
*
* Return: On success returns a reference to an unique thermal zone with
* matching name equals to @name, an ERR_PTR otherwise (-EINVAL for invalid
* paramenters, -ENODEV for not found and -EEXIST for multiple matches).
*/
struct thermal_zone_device *thermal_zone_get_zone_by_name(const char *name)
{
struct thermal_zone_device *pos = NULL, *ref = ERR_PTR(-EINVAL);
unsigned int found = 0;
if (!name)
goto exit;
mutex_lock(&thermal_list_lock);
list_for_each_entry(pos, &thermal_tz_list, node)
if (!strncasecmp(name, pos->type, THERMAL_NAME_LENGTH)) {
found++;
ref = pos;
}
mutex_unlock(&thermal_list_lock);
/* nothing has been found, thus an error code for it */
if (found == 0)
ref = ERR_PTR(-ENODEV);
else if (found > 1)
/* Success only when an unique zone is found */
ref = ERR_PTR(-EEXIST);
exit:
return ref;
}
EXPORT_SYMBOL_GPL(thermal_zone_get_zone_by_name);
static void thermal_zone_device_resume(struct work_struct *work)
{
struct thermal_zone_device *tz;
tz = container_of(work, struct thermal_zone_device, poll_queue.work);
mutex_lock(&tz->lock);
tz->suspended = false;
thermal_debug_tz_resume(tz);
thermal_zone_device_init(tz);
thermal_governor_update_tz(tz, THERMAL_TZ_RESUME);
__thermal_zone_device_update(tz, THERMAL_TZ_RESUME);
complete(&tz->resume);
tz->resuming = false;
mutex_unlock(&tz->lock);
}
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
static int thermal_pm_notify(struct notifier_block *nb,
unsigned long mode, void *_unused)
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
{
struct thermal_zone_device *tz;
switch (mode) {
case PM_HIBERNATION_PREPARE:
case PM_RESTORE_PREPARE:
case PM_SUSPEND_PREPARE:
thermal: core: Fix thermal zone suspend-resume synchronization There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions: 1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list. 2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior. 3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init(). To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock. Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/ Reported-by: Bo Ye <bo.ye@mediatek.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-18 19:25:02 +00:00
mutex_lock(&thermal_list_lock);
list_for_each_entry(tz, &thermal_tz_list, node) {
mutex_lock(&tz->lock);
if (tz->resuming) {
/*
* thermal_zone_device_resume() queued up for
* this zone has not acquired the lock yet, so
* release it to let the function run and wait
* util it has done the work.
*/
mutex_unlock(&tz->lock);
wait_for_completion(&tz->resume);
mutex_lock(&tz->lock);
}
thermal: core: Fix thermal zone suspend-resume synchronization There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions: 1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list. 2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior. 3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init(). To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock. Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/ Reported-by: Bo Ye <bo.ye@mediatek.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-18 19:25:02 +00:00
tz->suspended = true;
mutex_unlock(&tz->lock);
}
mutex_unlock(&thermal_list_lock);
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
break;
case PM_POST_HIBERNATION:
case PM_POST_RESTORE:
case PM_POST_SUSPEND:
thermal: core: Fix thermal zone suspend-resume synchronization There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions: 1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list. 2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior. 3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init(). To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock. Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/ Reported-by: Bo Ye <bo.ye@mediatek.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-18 19:25:02 +00:00
mutex_lock(&thermal_list_lock);
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
list_for_each_entry(tz, &thermal_tz_list, node) {
thermal: core: Fix thermal zone suspend-resume synchronization There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions: 1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list. 2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior. 3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init(). To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock. Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/ Reported-by: Bo Ye <bo.ye@mediatek.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-18 19:25:02 +00:00
mutex_lock(&tz->lock);
cancel_delayed_work(&tz->poll_queue);
reinit_completion(&tz->resume);
tz->resuming = true;
/*
* Replace the work function with the resume one, which
* will restore the original work function and schedule
* the polling work if needed.
*/
INIT_DELAYED_WORK(&tz->poll_queue,
thermal_zone_device_resume);
/* Queue up the work without a delay. */
mod_delayed_work(system_freezable_power_efficient_wq,
&tz->poll_queue, 0);
thermal: core: Fix thermal zone suspend-resume synchronization There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions: 1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list. 2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior. 3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init(). To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock. Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/ Reported-by: Bo Ye <bo.ye@mediatek.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-18 19:25:02 +00:00
mutex_unlock(&tz->lock);
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
}
thermal: core: Fix thermal zone suspend-resume synchronization There are 3 synchronization issues with thermal zone suspend-resume during system-wide transitions: 1. The resume code runs in a PM notifier which is invoked after user space has been thawed, so it can run concurrently with user space which can trigger a thermal zone device removal. If that happens, the thermal zone resume code may use a stale pointer to the next list element and crash, because it does not hold thermal_list_lock while walking thermal_tz_list. 2. The thermal zone resume code calls thermal_zone_device_init() outside the zone lock, so user space or an update triggered by the platform firmware may see an inconsistent state of a thermal zone leading to unexpected behavior. 3. Clearing the in_suspend global variable in thermal_pm_notify() allows __thermal_zone_device_update() to continue for all thermal zones and it may as well run before the thermal_tz_list walk (or at any point during the list walk for that matter) and attempt to operate on a thermal zone that has not been resumed yet. It may also race destructively with thermal_zone_device_init(). To address these issues, add thermal_list_lock locking to thermal_pm_notify(), especially arount the thermal_tz_list, make it call thermal_zone_device_init() back-to-back with __thermal_zone_device_update() under the zone lock and replace in_suspend with per-zone bool "suspend" indicators set and unset under the given zone's lock. Link: https://lore.kernel.org/linux-pm/20231218162348.69101-1-bo.ye@mediatek.com/ Reported-by: Bo Ye <bo.ye@mediatek.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-12-18 19:25:02 +00:00
mutex_unlock(&thermal_list_lock);
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
break;
default:
break;
}
return 0;
}
static struct notifier_block thermal_pm_nb = {
.notifier_call = thermal_pm_notify,
/*
* Run at the lowest priority to avoid interference between the thermal
* zone resume work items spawned by thermal_pm_notify() and the other
* PM notifiers.
*/
.priority = INT_MIN,
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
};
static int __init thermal_init(void)
{
int result;
thermal/debugfs: Add thermal cooling device debugfs information The thermal framework does not have any debug information except a sysfs stat which is a bit controversial. This one allocates big chunks of memory for every cooling devices with a high number of states and could represent on some systems in production several megabytes of memory for just a portion of it. As the sysfs is limited to a page size, the output is not exploitable with large data array and gets truncated. The patch provides the same information than sysfs except the transitions are dynamically allocated, thus they won't show more events than the ones which actually occurred. There is no longer a size limitation and it opens the field for more debugging information where the debugfs is designed for, not sysfs. The thermal debugfs directory structure tries to stay consistent with the sysfs one but in a very simplified way: thermal/ -- cooling_devices |-- 0 | |-- clear | |-- time_in_state_ms | |-- total_trans | `-- trans_table |-- 1 | |-- clear | |-- time_in_state_ms | |-- total_trans | `-- trans_table |-- 2 | |-- clear | |-- time_in_state_ms | |-- total_trans | `-- trans_table |-- 3 | |-- clear | |-- time_in_state_ms | |-- total_trans | `-- trans_table `-- 4 |-- clear |-- time_in_state_ms |-- total_trans `-- trans_table The content of the files in the cooling devices directory is the same as the sysfs one except for the trans_table which has the following format: Transition Hits 1->0 246 0->1 246 2->1 632 1->2 632 3->2 98 2->3 98 Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: White space fixups, rebase ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-01-09 09:41:11 +00:00
thermal_debug_init();
result = thermal_netlink_init();
if (result)
goto error;
result = thermal_register_governors();
if (result)
goto unregister_netlink;
thermal_class = kzalloc(sizeof(*thermal_class), GFP_KERNEL);
if (!thermal_class) {
result = -ENOMEM;
goto unregister_governors;
}
thermal_class->name = "thermal";
thermal_class->dev_release = thermal_release;
result = class_register(thermal_class);
if (result) {
kfree(thermal_class);
thermal_class = NULL;
goto unregister_governors;
}
Thermal: handle thermal zone device properly during system sleep Current thermal code does not handle system sleep well because 1. the cooling device cooling state may be changed during suspend 2. the previous temperature reading becomes invalid after resumed because it is got before system sleep 3. updating thermal zone device during suspending/resuming is wrong because some devices may have already been suspended or may have not been resumed. Thus, the proper way to do this is to cancel all thermal zone device update requirements during suspend/resume, and after all the devices have been resumed, reset and update every registered thermal zone devices. This also fixes a regression introduced by: Commit 19593a1fb1f6 ("ACPI / fan: convert to platform driver") Because, with above commit applied, all the fan devices are attached to the acpi_general_pm_domain, and they are turned on by the pm_domain automatically after resume, without the awareness of thermal core. CC: <stable@vger.kernel.org> #3.18+ Reference: https://bugzilla.kernel.org/show_bug.cgi?id=78201 Reference: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Tested-by: Manuel Krause <manuelkrause@netscape.net> Tested-by: szegad <szegadlo@poczta.onet.pl> Tested-by: prash <prash.n.rao@gmail.com> Tested-by: amish <ammdispose-arch@yahoo.com> Tested-by: Matthias <morpheusxyz123@yahoo.de> Reviewed-by: Javi Merino <javi.merino@arm.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
2015-10-30 08:31:58 +00:00
result = register_pm_notifier(&thermal_pm_nb);
if (result)
pr_warn("Thermal: Can not register suspend notifier, return %d\n",
result);
return 0;
unregister_governors:
thermal_unregister_governors();
unregister_netlink:
thermal_netlink_exit();
error:
mutex_destroy(&thermal_list_lock);
mutex_destroy(&thermal_governor_lock);
return result;
}
postcore_initcall(thermal_init);