PDF Presentation - Uplinq
Brew MP<br />
<br />
Performance Tips<br />
Rich Stewart, Sr. Director, Technology<br />
QUALCOMM
Objective<br />
• Provide a general overview of performance<br />
considerations<br />
• Share BMP specific performance tips<br />
2
Agenda<br />
• Performance Concepts<br />
• BMP Architectural Overview<br />
• CPU Utilization Considerations<br />
• Memory Utilization Considerations<br />
• Power Utilization Considerations<br />
• Tools<br />
3
4<br />
Performance Concepts
General Performance<br />
• What is optimal performance<br />
– Execution time<br />
– Response time<br />
– Memory utilization<br />
– Power consumption<br />
• Optimal performance involves achieving a balance of all<br />
performance factors<br />
• Optimal performance involves sharing system resources<br />
• Optimal performance enables an optimal User Experience<br />
• The performance of a ‘crashed’ system is ZERO<br />
– user experience == 0<br />
5
Trading Off Throughput vs. Latency<br />
Signal + Worker Thread Example<br />
• Your application must balance work throughput with latency<br />
– Real-time apps (example: audio/video streams) must process small<br />
amounts of data regularly; low latency at the cost of context-switch<br />
overhead<br />
– Best-effort apps (example: downloads, file compression) are only<br />
concerned with reaching the “finish line” as soon as possible; low<br />
processing overhead at the cost of increased latency<br />
6
Throughput vs. Latency (cont.)<br />
• Example: A TCP throughput test application with configurable delay<br />
• Peak throughput occurs on this system at 10–12ms<br />
– Read too frequently and we burn excess CPU<br />
– Read too infrequently and we leave throughput on the table<br />
• TCP Window is filled after 20ms; throughput drops drastically<br />
7
Throughput vs. Latency (cont.)<br />
• Delay rendering/processing/parsing for as long as<br />
reasonable<br />
– Application context has higher priority than Packet<br />
Services context<br />
– Crunching incoming data before the whole transfer is<br />
done has user-experience penalties<br />
– Try to manage in-transfer screen updates (e.g., show<br />
progress bar instead of progressive rendering)<br />
8
Trading Off CPU vs. Memory Utilization<br />
• Your application must balance CPU and memory usage<br />
(space-time tradeoff)<br />
• Example: data compression<br />
• Advantage: lower filesystem usage<br />
• Disadvantage: higher resource load times<br />
• The nature of your application determines the critical<br />
resource<br />
• “Is 5x data compression worth a 2x time penalty”<br />
• “Is a 2x time speedup worth a 4x larger data file”<br />
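The space-time tradeoff can be seen in a small, self-contained sketch. It is not taken from the presentation: both function names and the bit-counting task are assumptions chosen for illustration. One variant spends 256 bytes of RAM on a lookup table to do one lookup per byte; the other uses no table and loops over the bits.

```c
#include <assert.h>
#include <stdint.h>

/* Time-optimized variant: spends 256 bytes of RAM on a table so that each
   input byte costs a single lookup. */
static uint8_t g_bits_in_byte[256];
static int g_table_ready = 0;

static void build_table(void)
{
    for (int v = 0; v < 256; v++) {
        int n = 0;
        for (int b = v; b != 0; b >>= 1)
            n += b & 1;
        g_bits_in_byte[v] = (uint8_t)n;
    }
    g_table_ready = 1;
}

int count_bits_table(const uint8_t *data, int len)
{
    assert(data != 0 && len >= 0);
    if (!g_table_ready)
        build_table();              /* one-time cost, paid on first use */
    int total = 0;
    for (int i = 0; i < len; i++)
        total += g_bits_in_byte[data[i]];
    return total;
}

/* Space-optimized variant: no table, but up to 8 iterations per byte. */
int count_bits_loop(const uint8_t *data, int len)
{
    assert(data != 0 && len >= 0);
    int total = 0;
    for (int i = 0; i < len; i++)
        for (uint8_t b = data[i]; b != 0; b >>= 1)
            total += b & 1;
    return total;
}
```

Which variant is preferable depends on the critical resource of the application, as the slide notes; only measurement on the target can settle it.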
9
Data Cache Performance Implications<br />
• Write cache-friendly code by using temporal and spatial<br />
locality; keep as much data in your caches as possible<br />
• Improves execution time and saves power!<br />
• Example: Process a two-dimensional array<br />
row-by-row (good spatial locality)<br />
for (i = 0; i < numRows; i++)<br />
for (j = 0; j < numColumns; j++)<br />
process(array[i][j]);<br />
column-by-column (bad spatial locality)<br />
for (j = 0; j < numColumns; j++)<br />
for (i = 0; i < numRows; i++)<br />
process(array[i][j]);<br />
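A compilable version of the two traversal orders is sketched below. The array size, the fill pattern, and the function names are assumptions made for the example. Both functions return the same sum; only the memory access pattern differs.

```c
#include <assert.h>

#define NUM_ROWS 64
#define NUM_COLS 64

/* Fills the grid with a known pattern so the two traversals can be compared. */
void fill_grid(int a[NUM_ROWS][NUM_COLS])
{
    assert(a != 0);
    for (int i = 0; i < NUM_ROWS; i++)
        for (int j = 0; j < NUM_COLS; j++)
            a[i][j] = i + j;
}

/* Row-by-row: the inner loop walks consecutive addresses (good locality). */
long sum_row_major(int a[NUM_ROWS][NUM_COLS])
{
    assert(a != 0);
    long sum = 0;
    for (int i = 0; i < NUM_ROWS; i++)
        for (int j = 0; j < NUM_COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-by-column: same result, but every access jumps NUM_COLS ints
   ahead, so far fewer accesses land in an already-loaded cache line. */
long sum_col_major(int a[NUM_ROWS][NUM_COLS])
{
    assert(a != 0);
    long sum = 0;
    for (int j = 0; j < NUM_COLS; j++)
        for (int i = 0; i < NUM_ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

Timing the two functions on a large array is a simple way to observe the effect on a given device.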
10
Event Handling<br />
• A backlog in the event queue translates into perceived delay for the user<br />
– Validate that the event arrival rate of your design, combined with the other system events, can be<br />
processed by the system without a queue becoming so deep that it impacts the user's perception<br />
of responsiveness, or that the system elects to drop events<br />
– When this threshold is exceeded, consider whether the "event" being processed is more<br />
optimally treated as data. Touch screen events are a common example. Handwriting recognition<br />
improves with the number of samples per second. Treating every sample as an event can flood<br />
the event handling mechanism. Grouping down events into a list of data, with a time-derived event,<br />
can improve performance.<br />
• Event prioritization and/or creating independent UI tasks can have an advantage<br />
– Take the example of a single-threaded UI downloading a web page. If single-threaded and polling<br />
a queue, the page download would not be interruptible until the download is complete. This is an<br />
extreme, hypothetical example.<br />
• Do not use the event queue as a polling mechanism<br />
– For example, in Brew®, do not post an event to yourself, check some conditions in the event<br />
handler, and then re-post the same event if the condition hasn't been satisfied yet. If you must<br />
periodically check some condition, use a timer with a non-zero delay and check the conditions<br />
in the timer callback handler.<br />
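The idea of treating high-rate samples as data rather than as individual events can be sketched as follows. Everything here (the Sample and SampleBatch types, the batch size, the function names) is hypothetical and platform-neutral; in a real applet the flush would be driven from a timer callback with a non-zero delay.

```c
#include <assert.h>

#define MAX_BATCH 32

typedef struct { int x, y; } Sample;      /* hypothetical touch sample */

typedef struct {
    Sample samples[MAX_BATCH];
    int    count;      /* samples waiting in the batch */
    int    dropped;    /* samples discarded because the batch was full */
    int    flushes;    /* number of batch deliveries so far */
} SampleBatch;

void batch_init(SampleBatch *b)
{
    assert(b != 0);
    b->count = 0;
    b->dropped = 0;
    b->flushes = 0;
}

/* Called for every raw sample: stores the data and posts no event. */
void batch_add(SampleBatch *b, int x, int y)
{
    assert(b != 0);
    if (b->count == MAX_BATCH) {
        b->dropped++;
        return;
    }
    b->samples[b->count].x = x;
    b->samples[b->count].y = y;
    b->count++;
}

/* Called from a periodic timer: hands the whole batch to the consumer in
   one go. Returns the number of samples delivered. */
int batch_flush(SampleBatch *b, void (*consume)(const Sample *, int))
{
    assert(b != 0);
    int n = b->count;
    if (n == 0)
        return 0;                 /* nothing to deliver, no work done */
    if (consume != 0)
        consume(b->samples, n);
    b->count = 0;
    b->flushes++;
    return n;
}
```

With this shape, ten samples arriving between two timer ticks cost one delivery instead of ten queued events.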
11
Strings<br />
• Use AEEstd.h in lieu of AEEStdLib.h<br />
• Unnecessary function calls (empty string check):<br />
– Bad:<br />
• if (0 == std_strlen(psz)) {...}<br />
– Good:<br />
• if ('\0' == *psz) {...}<br />
• Buffer string lengths when you need them multiple times<br />
– Bad:<br />
• std_strlcpy(dest, string1, MAX_STRING_LEN);<br />
• std_strlcat(dest, string2, MAX_STRING_LEN - std_strlen(string1));<br />
• std_strlcat(dest, string3, MAX_STRING_LEN - (std_strlen(string1) + std_strlen(string2)));<br />
– Good:<br />
• int length = std_strlcpy(dest, string1, MAX_STRING_LEN);<br />
• length = std_strlcat(dest, string2, MAX_STRING_LEN - length);<br />
• std_strlcat(dest, string3, MAX_STRING_LEN - length);<br />
• Concatenation, use strcat(), not strcpy():<br />
– Bad:<br />
• std_strlcpy(dest + std_strlen(dest), source, MAX_STRING_LEN - std_strlen(dest));<br />
– Good:<br />
• std_strlcat(dest, source, MAX_STRING_LEN);<br />
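The "buffer the length" tip can be illustrated in standard C. The helper below, append_at, is a hypothetical function written for this sketch and is not part of AEEstd.h. It carries the current length from call to call, so the destination is never re-scanned.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: appends src to dest at offset `used`, never writing
   more than `cap` bytes in total (terminator included). Returns the new
   length so the caller does not need to call strlen() on dest again. */
size_t append_at(char *dest, size_t cap, size_t used, const char *src)
{
    assert(dest != 0 && src != 0 && used < cap);
    size_t room = cap - used - 1;      /* bytes left before the terminator */
    size_t n = strlen(src);
    if (n > room)
        n = room;                      /* truncate rather than overflow */
    memcpy(dest + used, src, n);
    dest[used + n] = '\0';
    return used + n;
}
```

A caller starts with an empty buffer and a length of 0, then feeds each returned length into the next call; the source strings are each scanned once and the destination is never scanned.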
12
Strings (cont.)<br />
• Don't call strlen() on #defined strings or strings that are<br />
in-lined<br />
– Use (int)(sizeof(string)-1) or an appropriately defined macro<br />
• See if a built in string function exists to meet your<br />
use case<br />
– There are many built in string functions, some obscure<br />
• Iterate once through the string for complex operations<br />
– Each pass through a string can be viewed as a linear search in cost<br />
– For highly complex string operations consider implementing a<br />
state machine<br />
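Two of the tips above can be sketched in plain C: taking the length of a string literal at compile time, and doing a multi-part analysis in a single pass with a small state machine. LITERAL_LEN, GREETING, and scan_words are names invented for this example.

```c
#include <assert.h>

/* Length of a string literal, computed at compile time. Apply it only to
   literals or char arrays, never to a char pointer. */
#define LITERAL_LEN(s) ((int)(sizeof(s) - 1))

#define GREETING "Hello Brew"

/* One pass with a two-state machine: counts the words and finds the length
   of the longest word without any extra scan of the string. */
void scan_words(const char *psz, int *pnWords, int *pnLongest)
{
    assert(psz != 0 && pnWords != 0 && pnLongest != 0);
    enum { IN_SPACE, IN_WORD } state = IN_SPACE;
    int cur = 0;
    *pnWords = 0;
    *pnLongest = 0;
    for (; *psz != '\0'; psz++) {
        if (*psz == ' ' || *psz == '\t' || *psz == '\n') {
            state = IN_SPACE;
            cur = 0;
        } else {
            if (state == IN_SPACE) {   /* transition: a new word starts */
                (*pnWords)++;
                state = IN_WORD;
            }
            cur++;
            if (cur > *pnLongest)
                *pnLongest = cur;
        }
    }
}
```

Computing the same two results with separate helper calls would walk the string at least twice.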
13
Diagnostic code<br />
• Code which does not directly contribute to the user’s experience within<br />
an application reduces the overall user experience of the ‘system’<br />
14
15<br />
BMP Architectural Overview
Device Platform Overview<br />
Brew MP provides an extensive set of APIs into features at all<br />
layers of the device, including:<br />
• OS Services<br />
• General, Multimedia and Modem Services<br />
• Application, Widget, Windowing and UI Services<br />
Brew MP is both modular and extensible. Many of the services<br />
are offered as demand-loaded, digitally signed binary modules.<br />
This serves to reduce static RAM overhead, bolster compatibility<br />
between devices and support flexible configurations.<br />
The platform supports both cooperative and pre-emptive<br />
threading. It also supports process-based memory protection<br />
which complements the already robust least-privileged execution<br />
model of the platform.<br />
Brew MP has been optimized to leverage chipset modem<br />
and multimedia services. These services are exposed through<br />
immutable object-oriented APIs which support seamless access<br />
to hardware accelerated implementations when available.<br />
A key principle of Brew MP is platform extensibility. New features<br />
and APIs can be added without Qualcomm’s involvement. These<br />
features can be exposed through any supported language.<br />
[Figure: Brew MP architecture stack. Layers shown: Applications (3rd Party Apps); Application Environment; General Services, Graphics & Multimedia (HW Accelerated), Modem & Networking; OS Services; Kernel. Legend: 3rd Party Software, Brew MP Software, Chipset Software.]<br />
16
Leverage the Functionality Provided by<br />
the BMP System<br />
• Components in the Brew MP<br />
System are optimized for<br />
– Power<br />
– Execution performance<br />
– Memory utilization<br />
✔ Do IT!<br />
• Optimization of Brew MP System<br />
Components is chipset specific<br />
– Using these components will utilize<br />
the underlying hardware where<br />
possible<br />
✔Please utilize Brew MP APIs<br />
where applicable<br />
17
Brew MP API Usage Considerations<br />
• New Brew MP APIs are generally improved over<br />
Brew 3.1.5 legacy interfaces<br />
– Brew MP ISockPort replaces Brew 3.1.5’s ISocket interface<br />
• Reference:<br />
https://brewmobileplatform.qualcomm.com/devnet/docviewer.jspm<br />
ethod=show&id=5790&path=%2FdevEx%2Flibrary%2Ftechguides<br />
%2Fc%2Fnetworking%2Fnetworking_tech_guide%2Fhtml_out_oe<br />
m%2Fframeset.html<br />
– IFileSystem2<br />
– Minimize calls to GETTIMESECONDS and GETUPTIME<br />
• Apps frequently call these more often than necessary<br />
• These calls have a relatively high performance cost<br />
– GETTIMESECONDS has the higher overhead of the two<br />
18
Brew MP API Usage Considerations: Scheduling<br />
Work<br />
• Always use ISHELL_Resume to break an IApplet’s<br />
time-consuming tasks into smaller, interruptible chunks<br />
– A common performance mistake is to use<br />
ISHELL_PostEvent() for this purpose, which works<br />
functionally<br />
– ISHELL_PostEvent() is more costly than<br />
ISHELL_Resume<br />
• ISHELL_PostEvent() takes the destination applet class ID as an argument<br />
• It iterates through the applet database to match the class ID<br />
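The chunking pattern can be sketched independently of the platform. LongTask, task_init, task_step, and CHUNK_SIZE are hypothetical names; in an applet, task_step would be the callback that is rescheduled (through ISHELL_Resume, per the slide) while it reports that work remains. The registration call itself is left out because its signature is not shown in the presentation.

```c
#include <assert.h>

#define CHUNK_SIZE 100   /* items per slice; tune for responsiveness */

typedef struct {
    const int *items;
    int        total;
    int        next;     /* index of the first unprocessed item */
    long       result;
} LongTask;

void task_init(LongTask *t, const int *items, int total)
{
    assert(t != 0 && items != 0 && total >= 0);
    t->items = items;
    t->total = total;
    t->next = 0;
    t->result = 0;
}

/* Processes at most one chunk, then returns so other work can run.
   Returns 1 if more work remains (schedule another call), 0 when done. */
int task_step(LongTask *t)
{
    assert(t != 0);
    int end = t->next + CHUNK_SIZE;
    if (end > t->total)
        end = t->total;
    for (int i = t->next; i < end; i++)
        t->result += t->items[i];
    t->next = end;
    return t->next < t->total;
}
```

All state lives in the LongTask struct, so the task can be suspended between any two chunks without losing progress.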
19
20<br />
CPU Utilization Considerations
JPEG Decoding Optimization<br />
• 5Mp image decoding time vastly improved with 1/8 scaling factor<br />
• IDownSample API: integer downscaling while decoding (1/8, 1/4, 1/2)<br />
• AEECLSID_GenericViewerAsync: Prevent hogging UI while decoding<br />
• Also a great memory saving as the generated bitmaps are close to QVGA<br />
[Figure: IDownSample fidelity window and scaled decoded image. The scaling factor (1/8, 1/4, 1/2 or 1:1) is selected against the fidelity size: the highest scaling factor meeting the fidelity minimum size requirement.]<br />
ERR_BAIL(IEnv_CreateInstance(pMe->piEnv, AEECLSID_JPEGDecoder, (void **)&piImageDecoder));<br />
nErr = IIMAGEDECODER_QueryInterface(piImageDecoder, AEEIID_IDownSample, (void **)&piDownSample);<br />
if (AEE_SUCCESS == nErr)<br />
{<br />
if (NULL != piDownSample) {<br />
IDownSample_SetTargetFidelity(piDownSample, fidelityX, fidelityY);<br />
}<br />
}<br />
ERR_BAIL(IEnv_CreateInstance(pMe->piEnv, AEECLSID_GenericViewerAsync, (void **)&piImage));<br />
IImage_SetParm(piImage, IPARM_DECODER, (int)piImageDecoder, 0);<br />
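The selection rule described on this slide (use the strongest downscale whose output still meets the fidelity minimum size) can be written out as a small function. This is an interpretation of the slide text, not the decoder's actual implementation, and pick_downscale_divisor is a made-up name.

```c
#include <assert.h>

/* Returns the largest divisor (8, 4, 2 or 1) for which the scaled image is
   still at least fidelityX by fidelityY pixels. A divisor of 8 corresponds
   to the 1/8 scaling factor on the slide. */
int pick_downscale_divisor(int srcW, int srcH, int fidelityX, int fidelityY)
{
    static const int divisors[] = { 8, 4, 2 };
    assert(srcW > 0 && srcH > 0 && fidelityX > 0 && fidelityY > 0);
    for (int i = 0; i < 3; i++) {
        int d = divisors[i];
        if (srcW / d >= fidelityX && srcH / d >= fidelityY)
            return d;
    }
    return 1;   /* no downscale satisfies the minimum: decode at 1:1 */
}
```

For a 2592x1944 (5 Mp) source and a 320x240 fidelity window, this picks 8, which yields a 324x243 bitmap, in line with the slide's remark that the generated bitmaps are close to QVGA.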
21
Resource Usage: Reduce Number of Images<br />
• Reduce number of<br />
images and widgets to<br />
be used (e.g. If you<br />
have a key that shows<br />
lower and upper case,<br />
use one static widget<br />
per key instead of two<br />
to save time to display)<br />
This key can be done<br />
with one static widget<br />
instead of two<br />
22
Resource Usage: Use Bitmap Fonts or Simple<br />
Typeface Fonts<br />
• Font execution<br />
and memory costs<br />
– $ BitMap<br />
– $$ Simple True Type<br />
– $$$ Complex True Type<br />
• Font render quality<br />
– ✚ BitMap<br />
– ✚✚ Simple True Type<br />
– ✚✚✚Complex True<br />
Type<br />
23
Resource Usage: Merge Static Images<br />
With Background Images if Possible<br />
• If there are some static images that are always displayed, merge them<br />
into the background<br />
Screen to show<br />
different shapes on<br />
the top portion only<br />
But the lower<br />
portion remains the<br />
same, but these are<br />
always loaded!<br />
Background has a<br />
plain color<br />
Should use this as<br />
the background<br />
24
Resource Usage: Avoid Blending in Run-time<br />
• If two images can be composited at authoring time rather than runtime,<br />
they probably should be, as this can be a way to reduce use of alpha<br />
blending<br />
If these images are always blended in run time,<br />
and can be composited at authoring time<br />
Then, load this image instead<br />
of loading 2 images and then<br />
blending<br />
25
Resource Usage: Use Fixed Height for Lists<br />
• Use fixed height for<br />
different rows to be<br />
displayed on Forms,<br />
if multiple lines are<br />
needed, try to fit into<br />
a fixed height<br />
Fixed height<br />
to avoid<br />
computation<br />
during<br />
run-time.<br />
26
Resource Usage: Preloading and Reusing<br />
• Pre-load an application or images if the start-up time of<br />
that application or image is critical. This has a negative impact<br />
on boot time<br />
• Pre-loading can be deferred and doesn't need to occur on<br />
app startup. Once the app has started, a background task<br />
can be used to progressively pre-load resources that will be<br />
needed later on. If an image is hidden, don’t load it<br />
during app startup<br />
• Cache the local time offset. Update only on a widely spaced<br />
interval (minutes) or at the beginning of a short duration<br />
use case<br />
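Caching a value that is expensive to query, and refreshing it only on a widely spaced interval, might look like the sketch below. CachedValue, cache_init, and cache_get are hypothetical, and fake_offset_query merely stands in for a costly platform call such as the local time offset lookup.

```c
#include <assert.h>

typedef long (*QueryFn)(void);   /* stands in for an expensive platform call */

typedef struct {
    QueryFn query;
    long    value;       /* last value returned by query() */
    long    fetchedAt;   /* uptime in seconds when value was fetched */
    long    maxAge;      /* refresh interval in seconds */
    int     valid;
} CachedValue;

void cache_init(CachedValue *c, QueryFn query, long maxAgeSeconds)
{
    assert(c != 0 && query != 0 && maxAgeSeconds > 0);
    c->query = query;
    c->value = 0;
    c->fetchedAt = 0;
    c->maxAge = maxAgeSeconds;
    c->valid = 0;
}

/* Returns the cached value, calling query() only when the cache is empty
   or older than maxAge. */
long cache_get(CachedValue *c, long nowSeconds)
{
    assert(c != 0 && c->query != 0);
    if (!c->valid || nowSeconds - c->fetchedAt >= c->maxAge) {
        c->value = c->query();
        c->fetchedAt = nowSeconds;
        c->valid = 1;
    }
    return c->value;
}

/* Stand-in query used only to demonstrate the cache; counts its calls. */
int g_fakeQueryCalls = 0;

long fake_offset_query(void)
{
    g_fakeQueryCalls++;
    return -28800;       /* illustrative offset in seconds */
}
```

With a five-minute maxAge, any number of reads inside that window triggers a single underlying query.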
27
Resource Usage: JPEG/PNG vs BMP<br />
• PNG and JPG require performance costly decoding<br />
• Non-transparent (opaque) images may be converted to bitmaps<br />
– Increases the size of the bar file, but lowers load time<br />
• BMP (left), vs. PNG (right) format quality comparison<br />
bmp file: 232KBytes<br />
png file: 46KBytes<br />
• Use PNG if you would like to use the transparency feature<br />
A is transparent to B<br />
A is opaque to B<br />
28
Resource Usage: Use PNG Crush<br />
• Encoding PNGs without “rowfilters” gives faster decodes<br />
– Cost is a larger PNG file<br />
– Negative implications for EFS and RAM usage<br />
– Leads to an overall decrease in system performance<br />
• Recommended pngcrush options<br />
– pngcrush -f 0 to minimize RAM and EFS usage<br />
– pngcrush -rem tEXt to remove unneeded text added by<br />
authoring tools<br />
– pngcrush -rem gAMA -rem cHRM -rem iCCP -rem sRGB<br />
• To remove chunks for gamma and color correction tables<br />
• Use pngcrush -brute for broad spectrum optimizations<br />
29
Resource Usage: Additional PNG<br />
Performance Hints<br />
• Don’t use interlaced PNGs<br />
• ARGB with an unused “A” (alpha) channel: convert<br />
to RGB<br />
• Some images that are 24 or 32 bit can be more<br />
efficiently (and without loss of fidelity) encoded<br />
as a palettized PNG<br />
• Filmstrip type animations can be really large<br />
– Use an animated GIF instead if it does not cause an undue<br />
loss of fidelity<br />
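Whether the alpha channel of an ARGB image is actually unused can be checked at authoring time with a scan like the one below. This is an illustrative sketch: the function name is invented, and it assumes 32-bit pixels with alpha in the top byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when every pixel of a 32-bit ARGB buffer (alpha in the top
   byte) is fully opaque, meaning the alpha channel carries no information
   and the image could be stored as RGB instead. */
int alpha_is_unused(const uint32_t *argb, size_t pixelCount)
{
    assert(argb != 0);
    for (size_t i = 0; i < pixelCount; i++) {
        if ((argb[i] >> 24) != 0xFF)
            return 0;    /* found a translucent or transparent pixel */
    }
    return 1;
}
```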
30
31<br />
Memory Utilization<br />
Considerations
Resource Usage: Use Widget Properties<br />
• Use IWidget to set properties to display<br />
vs loading an image file<br />
– Smaller and faster<br />
– Loading<br />
if_textinput_qwerty_alphabet_bg.png<br />
(320x240 pixels)<br />
– Should use the following for better<br />
performance<br />
• tif_qwerty_alphabet_title.png (320x21 px)<br />
• IWidget_SetGradientStyle(GRADIENT_STYLE_CENTERVERT)<br />
• IWidget_SetGradientColor(219, 222, 227)<br />
• IWidget_SetBGColor(222,227,231)<br />
Background with<br />
a plain color<br />
Background with a plain<br />
color and gradient<br />
32
Storage Optimization: Use a Smaller Image for Tiles<br />
Instead of a Full Image<br />
• Use a smaller image that will be tiled, instead of a large image<br />
Large image: 159K Bytes<br />
Small image: 19K Bytes, can be tiled<br />
33
Database Applications Optimizations<br />
• Database Update — dbc_IStatement<br />
– Disable Journaling when data integrity is not required (SQLite Pragma)<br />
– Minimize db operations (query/update):<br />
• Use transactions<br />
static const char* cpszPragma = "PRAGMA journal_mode = PERSIST;";<br />
// static const char* cpszPragma = "PRAGMA journal_mode = OFF;";<br />
nErr = DBCConnection_New(pszPath, me->piEnv, po, eFlags, ppic);<br />
cpszUpdate = cpszPragma;<br />
ERR_BAIL(dbc_IConnection_CreateStatement(*ppic, cpszUpdate, &nTail, &pis));<br />
ERR_BAIL(dbc_IStatement_ExecuteUpdate(pis));<br />
– Use Cache (control cache size via SQLite Pragmas)<br />
• Database Queries<br />
– Refer to Database Technology Guide —<br />
https://brewmobileplatform.qualcomm.com/devnet/index.jsp#databases<br />
34
Memory Management Considerations<br />
• Avoid fragmenting heap<br />
– See next page for explanation of heap fragmentation<br />
• Consider using block allocator<br />
– for (i=0; i
Brew MP-Specific Techniques to Avoid Heap<br />
Fragmentation<br />
• Example of heap fragmentation<br />
1: [16KB used: first alloc][48KB free]<br />
2: [16KB used: first alloc][16KB used: second alloc][32KB free]<br />
3: [16KB used: first alloc][16KB used: second alloc][16KB used: third alloc][16KB free]<br />
4: [16KB used: first alloc][16KB free][16KB used: third alloc][16KB free]<br />
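A fixed-size block allocator is one way to avoid the pattern shown above, because every freed block can be reused exactly by the next allocation. The pool below is a generic sketch in standard C with made-up names and sizes. It is not a Brew MP API.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE  64
#define BLOCK_COUNT 8

typedef union Block {
    union Block  *next;                 /* valid only while the block is free */
    unsigned char bytes[BLOCK_SIZE];
} Block;

typedef struct {
    Block  storage[BLOCK_COUNT];        /* one contiguous region, carved once */
    Block *freeList;
} BlockPool;

void pool_init(BlockPool *p)
{
    assert(p != 0);
    p->freeList = 0;
    for (int i = BLOCK_COUNT - 1; i >= 0; i--) {
        p->storage[i].next = p->freeList;
        p->freeList = &p->storage[i];
    }
}

/* Returns one block, or NULL when the pool is exhausted. */
void *pool_alloc(BlockPool *p)
{
    assert(p != 0);
    Block *b = p->freeList;
    if (b != 0)
        p->freeList = b->next;
    return b;
}

/* Returns a block to the pool; the hole it leaves always fits the next
   allocation, so the pool itself cannot fragment. */
void pool_free(BlockPool *p, void *ptr)
{
    assert(p != 0);
    if (ptr == 0)
        return;
    Block *b = (Block *)ptr;
    b->next = p->freeList;
    p->freeList = b;
}
```

The tradeoff is internal waste: a request smaller than BLOCK_SIZE still consumes a whole block.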
36
Apply Brew MP Best Practices to Avoid Memory<br />
Corruptions<br />
• The performance of a ‘crashed’ system is ZERO<br />
• Release all BMP API pointers through IQI_RELEASEIF()<br />
– Ensures that only a non-null API pointer is released<br />
– After the release, the pointer is set to NULL<br />
• All FREE() calls should be replaced with FREEIF()<br />
– FREEIF() ensures that only a non-null pointer is freed<br />
– After the free, the pointer is set to NULL<br />
• Only use the versions of the helper functions that accept “size” as one of their<br />
parameters for boundary checking. This includes functions like strncpy,<br />
snprintf, etc. This avoids possible buffer overruns.<br />
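The free-then-null pattern and size-checked formatting can be shown in standard C. SAFE_FREE below is a hypothetical stand-in that mimics the behavior described for FREEIF(); it is not the Brew MP macro itself, and format_label is an invented helper.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Frees only a non-null pointer and then sets it to NULL, so a second
   call on the same variable is harmless. */
#define SAFE_FREE(p) do { if ((p) != NULL) { free(p); (p) = NULL; } } while (0)

/* Size-checked formatting: never writes past cap bytes.
   Returns 1 if the whole string fit, 0 if it was truncated. */
int format_label(char *dest, size_t cap, const char *name, int value)
{
    assert(dest != NULL && cap > 0 && name != NULL);
    int n = snprintf(dest, cap, "%s=%d", name, value);
    return n >= 0 && (size_t)n < cap;
}
```

Nulling the pointer at the point of release turns a later accidental use into an immediate, easy-to-diagnose null access instead of silent heap corruption.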
37
Some Simple Tips for Memory Leaks<br />
• Analyze any possible memory leak reported by BMP<br />
– Typical message from the logger<br />
• *AEEModule.c 00266 Warning—memory leak, freeing, 0xABCD1010, NONAME,<br />
size 328<br />
– Indicates a memory leak of 328 bytes from memory address 0xABCD1010<br />
– The source of the memory is NONAME because the memory block was<br />
not tagged<br />
– To enable the heap blocks to be tagged with the filename/linenumber, insert<br />
• “#define AEE_DEBUG_HEAP 1” before the app includes AEEStdLib.h.<br />
• Remove this #define before shipping a commercial product<br />
• Leverage the c_heaptracker app to track the app’s heap usage<br />
over time to see whether the available heap keeps decreasing for no<br />
reason<br />
38
39<br />
Power Utilization<br />
Considerations
Power Management<br />
• Avoid/minimize periodic wakeups<br />
• LEDs and LCD burn an enormous amount of power<br />
– Turn off the keyboard backlight when not needed<br />
– Turn off the display when not needed<br />
– Turn off the backlight when not needed<br />
• Apps responding “TRUE” to EVT_APP_NO_SLEEP<br />
defer entry to BLPM by 10 seconds<br />
• Responding “TRUE” burns power<br />
40
Power Management—Minimize Number of Wake-ups<br />
• Avoid using periodic timers to wake up the phone<br />
• For example, waking up every 5 seconds may reduce the standby<br />
time by 29%!<br />
Idle power<br />
consumption<br />
is really low<br />
Each wake up<br />
consumes lots<br />
of power<br />
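The effect of aligning periodic work can be estimated with a small counting sketch. It is purely illustrative: count_wakeups is a made-up function, the periods are arbitrary, and it models only the number of wake-ups, not measured power.

```c
#include <assert.h>

/* Counts the distinct wake-up instants in (0, horizon] for a set of
   periodic timers that all start at t = 0. Timers that fire at the same
   instant share a single wake-up. Time is in whole seconds. */
int count_wakeups(const int *periods, int numTimers, int horizon)
{
    assert(periods != 0 && numTimers > 0 && horizon > 0);
    int wakeups = 0;
    for (int t = 1; t <= horizon; t++) {
        for (int i = 0; i < numTimers; i++) {
            assert(periods[i] > 0);
            if (t % periods[i] == 0) {
                wakeups++;       /* at least one timer fires at time t */
                break;
            }
        }
    }
    return wakeups;
}
```

Two unaligned timers with 5 s and 7 s periods wake the device 22 times in 70 seconds; if both jobs can tolerate a shared 10 s period, the same window needs only 7 wake-ups.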
41
42<br />
Tools
Principles of Code Optimizations Using<br />
Profiling Tools<br />
• Timing data in the profiler report in Simulator is not the<br />
same as that of the phone. Sometimes they are not even<br />
proportional.<br />
• A profiling report from PC tools (e.g. GlowCode) is still helpful<br />
in shedding light on the behavior of your module on target.<br />
– The time spent in a function and its children. This lets you<br />
follow the “hot path”: the series of function calls responsible<br />
for the majority of the execution time.<br />
– The number of hits in a given function. Functions with more hits<br />
could be part of the “inner loop”, and should be looked at<br />
more carefully.<br />
43
Sample Output for basicmod1app Module<br />
Inner-loop: HandleEvent<br />
Hot-path: dbg_Message: 13.8%<br />
44
CS Heap Tracker (CHT)<br />
• CHT is a utility to debug memory leaks, double-frees, writing past allocated memory, etc.<br />
• Support for simulator and on-target debugging<br />
• CHT provides, in the instance of detected heap corruption, a stack trace of the offending function<br />
• Also supports high watermark tracking<br />
45
46<br />
Conclusion
Summary<br />
• Performance is comprised of many components<br />
– Execution time<br />
– Response time<br />
– Memory utilization<br />
– Power consumption<br />
• Optimal performance is achieved through making tradeoffs<br />
• These tradeoffs are most effectively made through<br />
understanding the application specific considerations and<br />
general system considerations<br />
• Ultimately the application user experience and overall user<br />
experience of the system define optimal performance<br />
47
48<br />
Additional Information
Resource Usage: Miscellaneous<br />
• Images that are reused several times should be cached<br />
and not re-loaded from the file system<br />
– Make the application a singleton if it is expected to be<br />
used by multiple applications, e.g. a virtual keypad that is<br />
used by messaging, email, and dialer applications.<br />
• If the device performs slowly when flicking,<br />
you may have to increase the Event Expiration Timer<br />
(IObserver_SetPointerEvtExpirationTimer) to improve<br />
the performance of list flicks (perhaps a value closer to<br />
1 second instead of the default 150ms) to get a longer<br />
sampling period for the pointer events<br />
49
50<br />
CHT Example: Double-free